[Go] Composition of the generic GO slim

Valerie Wood val at sanger.ac.uk
Fri May 2 08:46:18 PDT 2008


I think it is a good idea for the consortium to provide an official 'GO 
slim', and to advise people on how they may want to alter the slim to fit 
their individual purpose.

A useful generic GO slim has a number of qualities, which I have tried to 
list below. Please suggest any additional ones: I hadn't really thought 
before about the rules I use for making a slim, so this is the first time 
I have documented them. Following these 'guidelines', I have suggested a 
set of processes which I think should make up the generic process slim.

Perhaps we could use this as a starting point, and people can suggest 
additional terms (with reasons) or terms which should be removed. This 
provides good coverage of basic cellular processes but would need 
extending to cover multicellular processes.

GO Slim criteria

1. The generic slim should be as organism independent as possible 
(although clearly some terms will not be applicable to single-celled 
eukaryotes and some eukaryotic terms will not be applicable to prokaryotes).

2. The slim should cover AS MANY genes with annotated processes as possible

3. The slim should cover AS MANY genes with annotated processes as 
possible using the smallest number of leaf node terms (if you include too 
many terms, the slim becomes too large and you start to lose the 
advantages of a slim).

4. It might be useful to try to avoid terms with an excessively small or 
excessively large number of annotations (i.e. ideally the histogram of 
annotations per term will not have an extreme distribution).

5. Preferably the slim should avoid sibling terms with a large 
overlap between them. If you choose two siblings with 200 genes 
annotated to each, and the majority of the annotations overlap, it is 
usually better to select the parent node (i.e. replace the two terms 
with a single term). Conversely, if the child terms of a node fall into 
distinct non-overlapping subsets, it might be more informative to 
include both child terms in your slim (see also point 7 below).

6. For most purposes you need to include a representative term for all 
biologically relevant processes, by including terms which are meaningful 
to biologists.

7. If you are using your slim for data analysis (and not just for 
visualization) you need to include terms which will allow you to 
distinguish genes based on their biological properties.
For example, it is not good to lump all genes involved in transport 
under transport, because the genes annotated to the distinct child 
terms (vesicle-mediated transport, protein targeting, transmembrane 
transport) are VERY different in terms of their i) viability ii) species 
distribution iii) number of interaction partners iv) copy number v) 
expression pattern, so it does not make sense to lump them together in 
your slim set.
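
The sibling-overlap check in point 5 can be sketched in a few lines of 
Python. This is only an illustration: the function names, the 0.5 
threshold and the toy gene identifiers are assumptions made for the 
example, not part of any GO tool, and the annotation sets are assumed to 
already include indirect annotations.

```python
def overlap_fraction(genes_a, genes_b):
    """Fraction of the smaller gene set that is shared with the other set."""
    a, b = set(genes_a), set(genes_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def choose_terms(parent, child_a, child_b, annotations, threshold=0.5):
    """Return the parent if the two siblings overlap heavily, else both children.

    The threshold is an arbitrary illustrative cut-off, not a recommended value.
    """
    frac = overlap_fraction(annotations[child_a], annotations[child_b])
    return [parent] if frac > threshold else [child_a, child_b]


# Toy data: term names and gene identifiers are made up for illustration.
annotations = {
    "child_1": {"g1", "g2", "g3", "g4"},
    "child_2": {"g2", "g3", "g4", "g5"},
    "child_3": {"g6", "g7"},
}

# child_1 and child_2 share 3 of 4 genes, so the parent is preferred.
print(choose_terms("parent", "child_1", "child_2", annotations))
# child_1 and child_3 share nothing, so both children are kept.
print(choose_terms("parent", "child_1", "child_3", annotations))
```

In practice the cut-off would be a judgment call made alongside point 7, 
since siblings with similar gene sets may still be worth separating if 
they differ biologically.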

Using these criteria  this is the basic cellular process eukaryotic slim 
I use (or slight variations of): The number of annotations (for pombe 
obviously) is in parentheses (protein coding only).

GO:0055085 transmembrane transport (278)
GO:0006913 nucleocytoplasmic transport (114)
GO:0006605 protein targeting (162)
GO:0016192 vesicle-mediated transport (266)
GO:0051186 cofactor metabolic process (139)
GO:0006766 vitamin metabolic process (57)
GO:0006790 sulfur metabolic process (45)
GO:0006807 nitrogen compound metabolic process (224)
GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (118)
GO:0005975 carbohydrate metabolic process (199)
GO:0006629 lipid metabolic process (201)
GO:0006399 tRNA metabolic process (125)
GO:0006520 amino acid metabolic process (187)
GO:0006412 translation (357)
GO:0006259 DNA metabolic process (296)
GO:0006508 proteolysis (223)
GO:0016071 mRNA metabolic process (204)
GO:0043413 biopolymer glycosylation (65) possibly drop?
GO:0006464 protein modification process (585)
GO:0007059 chromosome segregation (186)
GO:0007049 cell cycle (552)
GO:0007010 cytoskeletal organization and biogenesis (236)
GO:0000910 cytokinesis (145)
GO:0007165 signal transduction (362)
GO:0006457 protein folding (80)
GO:0042254 ribosome biogenesis and assembly (223)
GO:0045229 external encapsulating structure organization and biogenesis (124)
GO:xxxxxxxx general transcription (see note *1 below)
GO:0032569 specific transcription from RNA polymerase II promoter (102)
(total 424 for all transcription)
GO:0000902 cell morphogenesis (86)
GO:0006338 establishment and/or maintenance of chromatin architecture (231)
GO:reproductive process (182)
GO:0007005 mitochondrion organization and biogenesis (251)
GO:0006091 generation of precursor metabolites and energy (113)
GO:0007031 peroxisome organization and biogenesis (20)

At this point there are ~100 pombe genes (out of the 3960 with an 
annotated process term) which aren't included in the slim.
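
As a rough illustration of how such a coverage figure could be computed 
(with the numbers above, ~100 uncovered out of 3960 is roughly 97% 
coverage), here is a minimal Python sketch. The function names and toy 
data are assumptions made for the example; `term_to_genes` is assumed to 
map each slim term to all genes annotated to it directly or indirectly.

```python
def uncovered_genes(slim_terms, term_to_genes, annotated_genes):
    """Return annotated genes that no slim term accounts for."""
    covered = set()
    for term in slim_terms:
        covered |= set(term_to_genes.get(term, ()))
    return set(annotated_genes) - covered


def coverage(slim_terms, term_to_genes, annotated_genes):
    """Fraction of annotated genes reached by at least one slim term."""
    annotated = set(annotated_genes)
    if not annotated:
        return 0.0
    missing = uncovered_genes(slim_terms, term_to_genes, annotated)
    return 1 - len(missing) / len(annotated)


# Toy data, invented for illustration only.
term_to_genes = {
    "translation": {"g1", "g2"},
    "cell cycle": {"g2", "g3"},
}
annotated_genes = {"g1", "g2", "g3", "g4"}
slim = ["translation", "cell cycle"]

print(sorted(uncovered_genes(slim, term_to_genes, annotated_genes)))  # ['g4']
print(coverage(slim, term_to_genes, annotated_genes))  # 0.75
```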

I could also include....
vacuolar transport (91): reduces the uncovered set by 6 (most are also 
annotated to protein targeting)
telomere maintenance (54): reduces it by 6 (most are also annotated to 
DNA metabolic process)
snoRNA metabolic process (10): reduces it by 2
...to improve coverage (very slightly)
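
The "reduces by" figures above amount to counting, for each candidate 
term, the genes it would add that no existing slim term already covers. 
A minimal Python sketch of that count follows; the function name and the 
toy gene sets are assumptions for illustration and do not reproduce the 
pombe numbers.

```python
def marginal_gain(candidates, slim_terms, term_to_genes):
    """For each candidate term, count genes not already covered by the slim."""
    covered = set()
    for term in slim_terms:
        covered |= set(term_to_genes[term])
    return {t: len(set(term_to_genes[t]) - covered) for t in candidates}


# Toy data, invented for illustration only.
term_to_genes = {
    "protein targeting": {"g1", "g2", "g3"},
    "vacuolar transport": {"g2", "g3", "g4"},
    "snoRNA metabolic process": {"g5", "g6"},
}

gains = marginal_gain(
    ["vacuolar transport", "snoRNA metabolic process"],
    ["protein targeting"],
    term_to_genes,
)
print(gains)  # {'vacuolar transport': 1, 'snoRNA metabolic process': 2}
```

A term with many annotations can still have a small gain when most of 
its genes are already reached through another slim term, which is the 
situation described for vacuolar transport.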

Finally I include
GO:0006950 response to stress (444)
This term overlaps with most other processes, so it is largely 
redundant, but it is useful.

This leaves ~30 pombe genes with a process annotation unassigned to the 
GO slim; these are often annotated to terms like homeostasis and its 
children, or to otherwise uninformative terms.

For some purposes I would also include
GO:0065007 biological regulation  (1021)
but I don't know if this is a good term to include in a generic slim

To make this work for multicellular eukaryotes, we would probably 
want to add non-cellular process terms like:

developmental process
immune system process


* Note 1: it is not currently possible to retrieve genes involved in 
general transcription (i.e. RNA polymerases I, II and III etc.), as 
opposed to gene-specific transcription, with a single query. This is 
also important for enrichment, as the genes in these 2 sets are very 
different in terms of species distribution, copy number and viability. 
I requested a grouping term for these processes a while ago and 
hopefully this will be implemented shortly.

See:
https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764


Val






Ben Hitz wrote:
> Emily -
> I have an interest in working on the generic GO slim; I need it (or
> something similar) to define graphics for an interaction network.
>
> Ben
>
>
> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>
>   
>> Hi,
>>
>> From replying to a user request, I've just been having a quick look at
>> the composition of the generic GO slim, and relating the GO terms
>> included to the number of annotations displayed by AmiGO.
>>
>> Should, for instance, the 'cell recognition' term still be included in
>> the generic GO slim? It has only been annotated to 182 gene products,
>> whereas its sibling terms 'cell division', 'cell cycle' and 'cell
>> motility' have not been included even though they (directly or
>> indirectly) have been annotated to more than 1,200 gene products each.
>> Similarly, the term 'cytoplasm organization and biogenesis' is in the
>> GO slim but only has 113 gps annotated, whereas the 'membrane
>> organisation and biogenesis' term has been annotated to 1,509 gps.
>>
>> I was just wondering what the goal of the generic GO slim is... if
>> terms are selected on the basis that as many annotated gene products
>> from different organisms should get mapped to descriptive GO terms
>> before they are caught by the BP, MF, CC root terms (while also
>> providing a full selection of terms across the whole GO vocabulary),
>> should we think of reviewing some of its composition in relation to
>> overall annotation frequency? Or should the GO slim be kept as stable
>> as possible?
>>
>> Cheers,
>> Emily
>>
>> -- 
>>
>>
>>
>> ------------------------------------------------------------------
>>
>>    Emily Dimmer Ph.D.
>>    GOA Coordinator
>>    EMBL-EBI
>>    Wellcome Trust Genome Campus
>>    Hinxton
>>    Cambridge CB10 1SD, U.K.
>>    Tel:     +44 1223 494654
>>    Fax:    +44 1223 494468
>>    email:  edimmer at ebi.ac.uk
>>    URL:    http://www.ebi.ac.uk/goa
>>
>>
>> _______________________________________________
>> Go mailing list
>> Go at geneontology.org
>> http://fafner.stanford.edu/mailman/listinfo/go
>>     
>
> --
> Ben Hitz
> Senior Scientific Programmer ** Saccharomyces Genome Database ** GO  
> Consortium
> Stanford University ** hitz at genome.stanford.edu
>


-- 
---------------------------------------------------------------------------
Valerie Wood			 Tel: 01223 496909
S. pombe Genome Project		 Fax: 01223 494919 		       
Wellcome Trust Sanger Institute	 email: val at sanger.ac.uk
Wellcome Trust Genome Campus	 http://www.genedb.org/genedb/pombe 
Hinxton, Cambridge, CB10 1HH	 http://www.sanger.ac.uk/Projects/S_pombe




