I've started writing a tutorial on how to use
XCluster. Feedback is appreciated.Simply double-click the application, and answer the questions as they come up. Your input file must conform to the expected file format, described below.
UNIXYou can either simply type cluster, and answer the questions as they come up, or use the following command line arguments:
* -f filename * -g 0|1|2 ; 0 indicates no gene clustering, 1 indicates non-centered metric, 2 indicates centered metric when clustering genes. 1 is default. * -e 0|1|2 ; 0 indicates no experiment clustering, see above for 1 and 2. 0 is the default. * -p 0|1 ; whether to use pearson correlation (1), or Euclidean distance (0). 1 is the default. * -s 0|1 ; whether to make a SOM (1). 0 is the default. * -x specify x dimension of SOM * -y specify y dimension of SOM * -r 0|1 ; whether to seed the random number generator with the time when making a SOM. 1 is the default. * -k num ; how many k-means clusters to make. num is an integer, greater than 1, and preferably less than a gazillion. * -l 0|1 ; whether to log transform the data. * -u string ; a unique identifier by which to name the output files, instead of basing their names on the input file. eg -u 888 will generate 888.cdt as an output file.
- The file format is very inflexible. I'm toying with making it allow
a variable number of columns prior to the experimental data. Don't hold
your breath though. Would really like to transition to a flexible XML
format, which Paul Spellman is (supposedly) defining. Will require
a loty of new code though....
- sometimes it can happen that a k-means clustering goes on forever.
Obviously not a very useful feature!. I'll investigate this when I
have time.
- future plans include:
1. Making the program automatically optimize the number of
clusters for both k-means and SOMs. I've actually implemented
a couple of methods for this in my development version, but
neither of them was satisfactory - Tibshirani has come up with
a new method, that I'll check out implementing as time permits.
2. Additional optimizations to CalculateCorrelation function.
I've actually done this, but whilst it does make the code about
15-20% faster, it also makes it unreadable, so I've backed off
from this for the moment.
3. Add in non-parametric distance measure, such as Spearman Rank
Correlation coefficient.
4. Add in data filtering, based on Orly's work.
This has also been done, but is still considered as 'in
testing', and will have to wait until interpolation of missing
data is done, so that the full data set can be used.
5. Split the source up into several source files
6. Add in the jacknife correlation, and the 'Quality Cluster'
Method of Heyer et al, which is a pretty cool method.
7. Add in the divisive clustering method of Alon et al.
Haven't yet investigated thoroughly how easy/hard this is.
8. Add in autoweighting, as already implemented in Mike Eisen's
clustering program.
History:
New (as of 9-21-99)
- Finished version 2.51
Minor changes to code, to streamline parsing of user input.
- Made it so that genes don't have to be clustered.
Old (as of 7-19-99)
- Finished version 2.5
Has the ability to generate k-means clusters.
- Added in code to log transform the data (in base 2) in case it
wasn't already log transformed.
- Optimized SOM code somewhat.
- Changed parsing code so that it will compile and run unmodified on
all platforms.
Old (as of 5-13-99)
- Finished Version 2.
Has the ability to generate Self Organizing Maps, which can then be
viewed at http://genome-www.stanford.edu/~sherlock/SOMview.html
- The input data is filtered when making a SOM, such that all
expression profiles that do not vary at least 2 fold from their max
to min are removed. These are printed out into a file with a .fil
extension
- The user has the ability to decide whether to not seed the random
number generator with the current time. If they want to compare
SOMs with and without clustering by experiment, this is useful,
because you can guarantee the same organization if you don't seed
the random number generator.
- Fixed bug where using relative pathways gave the wrong output
filenames. Now can use:
1. /home/users/sherlock/sample.txt OR,
2. ../../sample.txt
equally well.
- Did some optimisations to the CalculateCorrelation function to speed
it up.
Old (as of 5-6-99)
- Added in the ability to choose between using Euclidean distance, or
pearson correlation for use as a similarity metric. Can be set via
the command line option -p, where 0 means use Euclidean, and 1 means
use pearson. 1 is the default.
Old (as of 5-1-99)
- Found stupid and nasty bug such that even though the code was
correctly switching the node orders to bring together similar
genes/experiments, these switches were not reflected in the tree
files. This bug was fixed, and the fix tested, and it's now better!
Old (as of 4-16-99)
- Was storing each correlation in two places, which Paul pointed out
was unnecessary, so now only store it in one place. Makes it
slightly faster.
Old (as of 4-12-99)
- Fixed a couple of small bugs eg. forgetting to malloc the extra
byte for the '\0' in ** all ** the strings!
- made the code somewhat more readable, since:
*(darray+x*y) is equivalent to darray[x*y]
the second version is a lot more readable I think.
- Made linked list implementation more efficient, such that when
a correlation is to be inserted into a full list, then the memory
of the lowest correlationRec is reused, and the pointers manipulated
accordingly. Saves a call to malloc and free.
- Added the ability to take command line arguments, so can make automatable:
Usage: cluster -f filename /* filename (duh!) */
-g 1|2 /* genes: 2=use centered metric
1 is default */
-e 0|1|2 /* experiments: 0=don't cluster
1=cluster
2=use centered metric
0 is default */
eg. cluster -f sample.txt -g 2 -e 2
cluster -f sample.txt
cluster -e 2 -f sample.txt ...you get the idea
If you supply any command line arguments, you must supply the filename.
Old ! (as of 4-12-99)
The clustering program is implemented in C. To use resources more
efficiently it only stores the 13 highest correlations for any
particular gene (I played around with this number on a set of ~6200
genes by 80 experiments, and 13 was the fastest). Also, since sqrt(a) *
sqrt(b)=sqrt(a*b), I was able to save a call to sqrt. The program,
was compiled using:
gcc -Wall latest_cluster.c -o cluster -lm -O3 -ansi
As a couple of benchmarking exercises, cluster takes:
11 seconds: 800 genes x 82 experiments (alberich)
8 minutes: 6200 genes x 82 experiments (alberich)
1 hour: 16384 genes x 250 experiments (daisy)
The program prompts the user for an input file, asks whether the data
should be clustered by experiments, in addition to genes, and asks
whether to cluster the data using a centered correlation metric. The
progress is given by indicating every 100 correlations that have been
calculated, and every 100 nodes that have been clustered (for genes)
or every node that has been clustered (for experiments). The program
then outputs 2 or 3 files, identically named as the input file, but
with the extensions .cdt, .gtr, and .atr (if experiments were
clustered).
File formats:
The program expects a specific input file format. If this is violated
the program will give unexpected results. If the input format
specified here is unsatisfactory, or too limiting, we can discuss
making changes.
Line 1: UID\tNAME\tGWEIGHT\tExp1\tExp2...Expn
Line 2: EWEIGHT\t\t\t1\t1\t...1
Line 3: ORF1\tNAME1\t1\t0.23\t0.45\t...0.43
Line 4: ORF2\tNAME2\t1\t0.56\t-0.10\t..0.34
ie: The first line contains three columns (they can be named
anything), followed by the names of all the experiments (tab
delimited). The second lines contains 3 columns (they can be named
anything), followed by the EWEIGHTS for each experiment (usually have
the value 1, but treated internally as a float). The third, and
subsequent lines have the ORF name as the first column, the gene name
as the second column, and the GWEIGHT as the third column (again,
whose value is usually 1, but is treated internally as a float). The
ORF name and GENE name fields can be up to 1024 characters in length,
so can be descriptive data instead. The remainder of the line is made
up of tab delimited experimental data. Where there is a null data
point, there will be two tabs next to each other.
an example file is :
UID NAME GWEIGHT spo0 spo30 spo2 spo5 spo7 spo9 spo11
EWEIGHT 1 1 1 1 1 1 1
YAL003W EFB1 1 0.23 -1.79 -1.29 -1.56 -0.27
YAL004W YAL004W 1 0.41 -0.38 -0.89 -1.06 -1.6 -1.84 -1.6
YAL005C SSA1 1 0.61 -0.07 -1.29 -1.29 -2 -1.84 -2.25
YAL010C MDM10 1 0.16 -0.15 -0.76 -1.25 -1.89 -1.74 -1.6
YAL012W CYS3 1 0.03 1.39 -0.84 -1.64 -2.84 -2.47 -2.4
YAL015C NTG1 1 -0.18 -0.18 -0.62 -1.32 -1.69 -1.43 -1.79
...etc
Ouput files:
The .cdt file looks like this:
GID YORF NAME GWEIGHT spo0 spo2 spo30 spo11 spo9 spo7 spo5
AID ARRY0X ARRY2X ARRY1X ARRY6X ARRY5X ARRY4X ARRY3X
GENE37X YBR025C YBR025C 1 0.50 -1.94 -2.64 0.73 0.51 -0.18 -1.00
GENE26X YBL042C YBL042C 1 0.26 -1.06 -2.06 1.58 1.33 0.54 -0.43
...etc
where everything is tab delimited. The second line is only printed if
experiments, as well as genes were clustered. The ARRY0X, and GENE37X
are used by the TreeViewer program to determine how to draw the binary
trees, in conjunction with the .gtr and .atr files (below):
.gtr:
NODE1X GENE51X GENE24X 0.997965
NODE2X GENE41X GENE40X 0.997813
NODE3X GENE90X GENE89X 0.997782
NODE4X GENE84X GENE32X 0.996293
NODE5X GENE8X GENE3X 0.995884
NODE6X GENE91X GENE54X 0.995747
NODE7X GENE80X GENE42X 0.995286
NODE8X GENE70X GENE31X 0.994681
NODE9X GENE63X GENE36X 0.994233
...etc
.atr
NODE1X ARRY4X ARRY3X 0.948027
NODE2X ARRY6X ARRY5X 0.940996
NODE3X NODE2X NODE1X 0.902631
NODE4X ARRY2X ARRY1X 0.785001
NODE5X NODE4X NODE3X 0.662225
NODE6X NODE5X ARRY0X -0.082621
If there's any questions over: