TimesVector (v1.0)
August 24, 2016
User Manual
1. Pre-requisites
Required Python modules:
1) Scipy
2) Numpy
Required R libraries:
1) skmeans (https://cran.r-project.org/web/packages/skmeans/index.html)
2) ggplot2 (https://cran.r-project.org/web/packages/ggplot2)
2. Installing TimesVector
Download TimesVector at http://biohealth.snu.ac.kr/TimesVector
Uncompress the file and export the TIMESVECTOR path.
$ tar -xzvf TimesVector_v1.0.tar.gz
$ export TIMESVECTOR=/<path to TimesVector>/bin
$ export PATH=$PATH:$TIMESVECTOR
3. Running TimesVector
$ TimesVector
usage: bin/TimesVector [ h | gctdko ]
This script runs TimesVector.
Parameters (all mandatory):
-g The path to the gene expression file
-c Number of classes (INT)
-t Number of time points per class (INT)
-d Type of data ['m': Microarray, 'n': NGS]
-k K number of clusters (INT)
-o Output directory for results
-h Show this message
All parameters are mandatory.
The gene expression file is the only required input file. The format of the gene expression file is shown in Section 4.
-c is the number of sample conditions (or phenotypes) in the gene expression file (INTEGER).
-t is the number of time points in each sample condition (INTEGER).
-d is the type of the data (CHARACTER): 'm' if the gene expression data is from microarrays, 'n' if it is from high-throughput sequencing (i.e., RNA-seq).
-k is the number of clusters to detect (INTEGER). We recommend choosing a K close to the value given by the following equation:
K = −85.71 + 28.57x,
where x is the product of C (# of conditions) and T (# of time points).
-o is the output directory for the clustering results.
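As a quick check of the recommendation above, the following Python sketch evaluates the equation for a given C and T. The function name recommended_k is hypothetical and is not part of TimesVector. Rounding to the nearest integer and clamping to at least 1 are assumptions of this sketch, made only because -k expects an integer.

```python
def recommended_k(num_conditions, num_time_points):
    """Evaluate K = -85.71 + 28.57x, where x = C * T.

    Hypothetical helper for illustration; not shipped with TimesVector.
    """
    x = num_conditions * num_time_points
    k = -85.71 + 28.57 * x
    # Assumption: round to the nearest integer and never go below 1,
    # because the equation yields non-positive values for small x.
    return max(1, round(k))


if __name__ == "__main__":
    # Five conditions with three time points each (x = 15).
    print(recommended_k(5, 3))
```

For C = 5 and T = 3 the equation gives about 343. The value is only a starting point, and a nearby K can be chosen instead.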
Example
The gene expression file of GSE11651, 'GSE11651_data.txt', is included in the 'example' directory. The command line for executing TimesVector on the example data is as follows:
$ TimesVector -g example/GSE11651_data.txt -c 5 -t 3 -d m -k 300 -o results
4. Gene Expression File Format
The gene expression file is a TAB-delimited gene expression matrix.
Header
The first line of the file serves as a header. The first column of the header, 'GeneID', is mandatory and must be used as is.
The following columns represent the sample conditions and their associated time points. Both the conditions and the time points need to be in order. Each column represents a single time point of a condition.
The name of a column follows this syntax:
'Condition'_'Time Point' (e.g., DV10_Day2)
The condition and time point are separated by an underscore character ("_"). The names of the condition and time point can be any string of characters.
For example, for time-series data with three conditions (A, B and C), each with three time points (20min, 40min and 60min), the header will look as follows:
GeneID A_20min A_40min A_60min B_20min B_40min B_60min C_20min C_40min C_60min
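The header rules above can be checked before running TimesVector. The following Python sketch is one possible check and is not part of TimesVector. The function name parse_header is hypothetical. It assumes that the columns of one condition are adjacent (as in the example header) and that the condition name ends at the first underscore.

```python
def parse_header(header_line, num_conditions, num_time_points):
    """Validate a TAB-delimited header and group time points by condition.

    Hypothetical helper for illustration; not shipped with TimesVector.
    Returns a list of (condition, [time points]) tuples in column order.
    """
    columns = header_line.rstrip("\n").split("\t")
    if columns[0] != "GeneID":
        raise ValueError("the first column must be 'GeneID'")

    names = columns[1:]
    expected = num_conditions * num_time_points
    if len(names) != expected:
        raise ValueError(
            f"expected {expected} sample columns, found {len(names)}"
        )

    conditions = []
    for start in range(0, expected, num_time_points):
        block = names[start:start + num_time_points]
        # Assumption: split at the first underscore, 'A_20min' -> ['A', '20min'].
        pairs = [name.split("_", 1) for name in block]
        if any(len(pair) != 2 for pair in pairs):
            raise ValueError(f"column name without '_' in {block}")
        # Assumption: all time points of one condition sit in adjacent columns.
        if len({pair[0] for pair in pairs}) != 1:
            raise ValueError(f"mixed conditions in adjacent columns {block}")
        conditions.append((pairs[0][0], [pair[1] for pair in pairs]))
    return conditions


if __name__ == "__main__":
    header = "GeneID\tA_20min\tA_40min\tA_60min\tB_20min\tB_40min\tB_60min\tC_20min\tC_40min\tC_60min"
    for condition, time_points in parse_header(header, 3, 3):
        print(condition, time_points)
```

Splitting at the first underscore is a design choice of this sketch. A condition name that itself contains an underscore would need a different rule.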
Gene expression values
Each row following the header represents a gene. The first column contains the gene ID. The remaining columns contain the gene expression values associated with the condition and time point of each column.
A toy example file of four genes with three conditions (i.e., A, B, C), each having three time points (i.e., 20min, 40min, 60min), is shown below:
GeneID | A_20min | A_40min | A_60min | B_20min | B_40min | B_60min | C_20min | C_40min | C_60min
P53    | 5       | 9       | 10      | 6       | 8       | 9       | 8       | 4       | 2
bZIP   | 3       | 13      | 18      | 4       | 15      | 21      | 5       | 1       | 1
WRKY   | 25      | 27      | 28      | 24      | 25      | 26      | 25      | 20      | 15
ERF    | 8       | 11      | 12      | 9       | 10      | 11      | 9       | 8       | 5
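To make the layout concrete, the following Python sketch reads the toy data above in its TAB-delimited form and groups each gene's values by condition. It is only an illustration and does not show how TimesVector itself parses the file. The names TOY_FILE and read_expression are hypothetical.

```python
import csv
import io

# The toy example above, written as the TAB-delimited text TimesVector expects.
TOY_FILE = (
    "GeneID\tA_20min\tA_40min\tA_60min\tB_20min\tB_40min\tB_60min\tC_20min\tC_40min\tC_60min\n"
    "P53\t5\t9\t10\t6\t8\t9\t8\t4\t2\n"
    "bZIP\t3\t13\t18\t4\t15\t21\t5\t1\t1\n"
    "WRKY\t25\t27\t28\t24\t25\t26\t25\t20\t15\n"
    "ERF\t8\t11\t12\t9\t10\t11\t9\t8\t5\n"
)


def read_expression(handle, num_conditions, num_time_points):
    """Return {gene ID: [values of condition 1, values of condition 2, ...]}.

    Hypothetical helper for illustration; not shipped with TimesVector.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    width = num_conditions * num_time_points
    if header[0] != "GeneID" or len(header) != width + 1:
        raise ValueError("unexpected header")

    genes = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != width + 1:
            raise ValueError(f"wrong number of columns for gene {row[0]}")
        values = [float(value) for value in row[1:]]
        # One inner list per condition, each holding its time points in order.
        genes[row[0]] = [
            values[i:i + num_time_points]
            for i in range(0, width, num_time_points)
        ]
    return genes


if __name__ == "__main__":
    genes = read_expression(io.StringIO(TOY_FILE), 3, 3)
    for gene_id, profile in genes.items():
        print(gene_id, profile)
```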
5. Output files
There are a total of three sets of output files. Assuming that K was set to 300:
1) The first set of output files is K300.cluster and K300.prototype. These files are the result files output by skmeans. The 300 represents the number of clusters to be detected (the input parameter '-k').
The K300.cluster file shows the full list of genes and their assigned cluster IDs.
The K300.prototype file shows the centroid values of each cluster.
2) The second set of output files is the result of cluster classification: the DEP, SEP and NEP cluster files.
DEP_clusters.dat: The list of clusters identified as DEP type clusters.
DEP_genes.dat: The list of genes in the DEP type clusters.
SEP_clusters.dat: The list of clusters identified as SEP type clusters.
SEP_genes.dat: The list of genes in the SEP type clusters.
NEP_clusters.dat: The list of clusters that failed to be classified as DEP or SEP.
NEP_genes.dat: The list of genes that failed to be assigned to a DEP or SEP cluster.
3) The third set of output files is the visualized plots of the DEP and SEP clusters and their genes. These are located in the 'plots' directory within the output directory.
The plots for the cluster representatives are located in the DEP_clusters and SEP_clusters directories.
e.g., results/plots/DEP_clusters/cluster_2_repr.pdf
The plots for the genes in each cluster are located in the DEP_genes and SEP_genes directories.
e.g., results/plots/DEP_genes/cluster_2_genes.pdf
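After a run, the file and directory names listed above can be checked with a short script. The following Python sketch is an illustration only. The function names are hypothetical, and it assumes that the .cluster, .prototype and .dat files are written directly into the output directory, which this manual does not state explicitly.

```python
import os


def expected_outputs(output_dir, k):
    """List the output paths named in this section for a given K.

    Hypothetical helper for illustration; not shipped with TimesVector.
    """
    files = [f"K{k}.cluster", f"K{k}.prototype"]
    for kind in ("DEP", "SEP", "NEP"):
        files += [f"{kind}_clusters.dat", f"{kind}_genes.dat"]
    plot_dirs = [
        os.path.join("plots", name)
        for name in ("DEP_clusters", "DEP_genes", "SEP_clusters", "SEP_genes")
    ]
    # Assumption: result files sit directly in the output directory.
    return [os.path.join(output_dir, path) for path in files + plot_dirs]


def missing_outputs(output_dir, k):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(output_dir, k) if not os.path.exists(p)]


if __name__ == "__main__":
    for path in missing_outputs("results", 300):
        print("missing:", path)
```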