[Go] Composition of the generic GO slim

Judith Blake jblake at informatics.jax.org
Fri May 2 11:15:49 PDT 2008


Val,
I still maintain that users need to be able to generate grouping 
criteria based on their usage. I think we could go back to the fly 
genome paper and see the primary molecular divisions that seemed most 
useful to describe the genome properties, like 'reproduction' and 
'metabolism'. Anything more granular is specific to the user. A 
mapping on this basis would likely include fewer than 20 terms and would 
subdivide trees.

judy

Valerie Wood wrote:
> I think it is a good idea for the consortium to provide an official 'GO 
> slim', and advise people how they may want to alter the slim to fit 
> their individual purpose.
>
> A useful generic GO slim has a number of qualities. I have tried to list 
> these below; please suggest any additional ones. I hadn't really thought 
> before about what rules I used for making a slim, so this is the 
> first time I have documented them. Following the 'guidelines' below, I 
> have suggested a set of processes which I think should make up the generic 
> process slim.
>
> Perhaps we could use this as a starting point, and people can suggest 
> additional terms (with reasons) or terms which should be removed. This 
> provides good coverage of basic cellular processes but would need 
> extending to cover multicellular processes.
>
> GO Slim criteria
>
> 1. The generic slim should be as organism independent as possible 
> (although clearly some terms will not be applicable to single-celled 
> eukaryotes and some eukaryotic terms will not be applicable to prokaryotes).
>
> 2. The slim should cover AS MANY genes with annotated processes as possible
>
> 3. The slim should cover AS MANY genes with annotated processes as 
> possible with the smallest number of leaf node terms (if you include too 
> many terms, it becomes too large and you start to lose the advantages of 
> a slim).
>
> 4. It might be useful to try to avoid terms with an excessively small or 
> excessively large number of annotations (i.e. ideally the histogram of 
> annotations per term will not have an extreme distribution).
>
> 5. Preferably the slim should not include sibling terms with a large 
> overlap between them. If you choose two siblings with 200 genes 
> annotated to each, and the majority of the annotations overlap, it is 
> usually better to select the parent node (i.e. replace two terms with a 
> single term). Conversely, if the child terms of a node fall into 
> distinct non-overlapping subsets, it might be more informative to 
> include both child terms in your slim (see also point 7 below).
>
> 6. For most purposes you need to include a representative term for all 
> biologically relevant processes, by including terms which are meaningful 
> to biologists.
>
> 7. If you are using your slim for data analysis (and not just for 
> visualization) you need to include terms which will allow you to 
> distinguish genes based on their biological properties.
> For example, it is not good to lump all genes involved in transport 
> under 'transport', because the genes annotated to distinct child terms 
> (vesicle-mediated transport, protein targeting, transmembrane transport) 
> are VERY different in terms of their i) viability ii) species 
> distribution iii) number of interaction partners iv) copy number v) 
> expression pattern, so it does not make sense to lump 
> them together in your slim set.
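
Criterion 5 in the list above (pick the parent when two siblings mostly share their genes, keep both when they do not) can be illustrated with a small sketch. This is not part of the original message: the gene identifiers are made up, the Jaccard index is just one possible way to measure overlap, and the 0.5 threshold is an arbitrary assumption.

```python
def jaccard(genes_a, genes_b):
    """Share of genes common to both sets, relative to their union."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefer_parent(sibling_a, sibling_b, threshold=0.5):
    """True if the two siblings overlap enough that the parent term
    would probably serve better in the slim (threshold is an assumption)."""
    return jaccard(sibling_a, sibling_b) >= threshold


# Hypothetical gene sets, for illustration only.
overlapping_a = {"g1", "g2", "g3", "g4"}
overlapping_b = {"g2", "g3", "g4", "g5"}
distinct_a = {"g1", "g2"}
distinct_b = {"g3", "g4"}

print(prefer_parent(overlapping_a, overlapping_b))  # siblings share 3 of 5 genes
print(prefer_parent(distinct_a, distinct_b))        # siblings share no genes
```

In practice the threshold would need tuning per organism, and point 7 may still argue for keeping both siblings when they differ biologically.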
>
> Using these criteria, this is the basic cellular process eukaryotic slim 
> I use (or slight variations of it). The number of annotations (for pombe, 
> obviously) is in parentheses (protein coding only).
>
> GO:0055085 transmembrane transport (278)
> GO:0006913 nucleocytoplasmic transport (114)
> GO:0006605 protein targeting (162)
> GO:0016192 vesicle-mediated transport (266)
> GO:0051186 cofactor metabolic process (139)
> GO:0006766 vitamin metabolic process (57)
> GO:0006790 sulfur metabolic process (45)
> GO:0006807 nitrogen compound metabolic process (224)
> GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (118)
> GO:0005975 carbohydrate metabolic process (199)
> GO:0006629 lipid metabolic process (201)
> GO:0006399 tRNA metabolic process (125)
> GO:0006520 amino acid metabolic process (187)
> GO:0006412 translation (357)
> GO:0006259 DNA metabolic process (296)
> GO:0006508 proteolysis (223)
> GO:0016071 mRNA metabolic process (204)
> GO:0043413 biopolymer glycosylation (65) possibly drop?
> GO:0006464 protein modification process (585)
> GO:0007059 chromosome segregation (186)
> GO:0007049 cell cycle (552)
> GO:0007010 cytoskeletal organization and biogenesis (236)
> GO:0000910 cytokinesis (145)
> GO:0007165 signal transduction (362)
> GO:0006457 protein folding (80)
> GO:0042254 ribosome biogenesis and assembly (223)
> GO:0045229 external encapsulating structure organization and biogenesis (124)
> GO:xxxxxxxx general transcription (see note *1 below)
> GO:0032569 specific transcription from RNA polymerase II promoter (102)
> (total 424 for all transcription)
> GO:0000902 cell morphogenesis (86)
> GO:0006338 establishment and/or maintenance of chromatin architecture (231)
> GO:reproductive process (182)
> GO:0007005 mitochondrion organization and biogenesis (251)
> GO:0006091 generation of precursor metabolites and energy (113)
> GO:0007031 peroxisome organization and biogenesis (20)
>
> At this point there are about 100 pombe genes (out of the 3960 with an 
> annotated process term) which aren't included in the slim.
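
A coverage figure like the one above (genes with a process annotation that no slim term picks up) could be computed along the following lines. This is a minimal sketch, not the procedure actually used: it assumes each gene's term set already includes all ancestor terms (i.e. annotations have been propagated up the graph), and the gene names and the "OTHER:0000001" identifier are hypothetical placeholders.

```python
def slim_coverage(gene_to_terms, slim_terms):
    """Split genes into those covered by at least one slim term and the rest.

    gene_to_terms: dict mapping gene name -> iterable of term IDs,
                   assumed to include ancestor terms already.
    """
    slim = set(slim_terms)
    covered = {g for g, terms in gene_to_terms.items() if slim & set(terms)}
    uncovered = set(gene_to_terms) - covered
    return covered, uncovered


# Toy data: two slim terms taken from the list above, three made-up genes.
slim = {"GO:0006412", "GO:0007049"}          # translation, cell cycle
annotations = {
    "geneA": {"GO:0006412"},
    "geneB": {"GO:0007049", "GO:0000910"},
    "geneC": {"OTHER:0000001"},              # placeholder for a non-slim term
}

covered, uncovered = slim_coverage(annotations, slim)
print(len(covered), "covered;", sorted(uncovered), "not in slim")
```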
>
> I could also include....
> vacuolar transport (91) reduces by 6 (most also annotated to protein 
> targeting)
> telomere maintenance (54) reduces by 6 (most also annotated to DNA 
> metabolic process)
> snoRNA metabolic process (10) reduces by 2
> ...to improve coverage (very slightly)
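
The "reduces by N" figures above amount to the marginal gain of a candidate term: how many still-unassigned genes it would pick up. A rough sketch of that calculation follows; the gene sets are invented for illustration and the function names are hypothetical.

```python
def marginal_gain(term_genes, already_covered):
    """Number of genes a candidate term adds beyond those already covered."""
    return len(set(term_genes) - set(already_covered))


def rank_candidates(candidates, already_covered):
    """Sort candidate terms by marginal gain, largest first."""
    gains = {term: marginal_gain(genes, already_covered)
             for term, genes in candidates.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Invented example: genes g1..g5 are already covered by the slim.
already_covered = {"g1", "g2", "g3", "g4", "g5"}
candidates = {
    "snoRNA metabolic process": {"g8"},
    "vacuolar transport": {"g1", "g2", "g6", "g7"},
}

for term, gain in rank_candidates(candidates, already_covered):
    print(term, "reduces the unassigned set by", gain)
```

A term with many annotations but a small gain, as with vacuolar transport in the message, mostly duplicates what the slim already covers.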
>
> Finally I include
> GO:0006950 response to stress (444)
> This term overlaps with most other processes, so it is largely 
> redundant but still useful.
>
> This leaves ~30 pombe genes with a process annotation unassigned to the GO 
> slim; these are often annotated to terms like homeostasis and its 
> children, or otherwise uninformative terms.
>
> For some purposes I would also include
> GO:0065007 biological regulation  (1021)
> but I don't know if this is a good term to include in a generic slim
>
> To make this work for multicellular eukaryotes, we would probably 
> want to add non-cellular process terms like:
>
> developmental process
> immune system process
>
>
> * Note 1: it is not currently possible to retrieve genes involved in 
> general transcription (i.e. RNA polymerases I, II and III etc.), as 
> opposed to gene-specific transcription, with a single query. This is also 
> important for enrichment as the genes in these 2 sets are very different 
> in terms of species distribution, copy number and viability. I requested 
> a grouping term for these processes a while ago and hopefully this will 
> be implemented shortly.
>
> See:
> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764
>
>
> Val
>
> Ben Hitz wrote:
>   
>> Emily -
>> I am interested in working on the generic GO slim; I need it (or 
>> something similar) to define graphics for an interaction network.
>>
>> Ben
>>
>>
>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>>
>>   
>>     
>>> Hi,
>>>
>>> From replying to a user request, I've just been having a quick look at
>>> the composition of the generic GO slim, and relating the GO terms
>>> included to the number of annotations displayed by AmiGO.
>>>
>>> Should, for instance, the 'cell recognition' term still be included in
>>> the generic GO slim? It has only been annotated to 182 gene products,
>>> whereas its sibling terms 'cell division', 'cell cycle' and 'cell
>>> motility' have not been included even though they (directly or
>>> indirectly) have been annotated to more than 1,200 gene products each.
>>> Similarly, the term 'cytoplasm organization and biogenesis' is in the GO
>>> slim but only has 113 gps annotated, whereas the 'membrane organisation
>>> and biogenesis' term has been annotated to 1,509 gps.
>>>
>>> I was just wondering what the goal of the generic GO slim is... if terms
>>> are selected on the basis that as many annotated gene products from
>>> different organisms should get mapped to descriptive GO terms before
>>> they are caught by the BP, MF, CC root terms (while also providing a
>>> full selection of terms across the whole GO vocabulary), should we think
>>> of reviewing some of its composition in relation to overall
>>> annotation frequency? Or should the GO slim be kept as stable as possible?
>>>
>>> Cheers,
>>> Emily
>>>
>>> -- 
>>>
>>>
>>>
>>> ------------------------------------------------------------------
>>>
>>>    Emily Dimmer Ph.D.
>>>    GOA Coordinator
>>>    EMBL-EBI
>>>    Wellcome Trust Genome Campus
>>>    Hinxton
>>>    Cambridge CB10 1SD, U.K.
>>>    Tel:     +44 1223 494654
>>>    Fax:    +44 1223 494468
>>>    email:  edimmer at ebi.ac.uk
>>>    URL:    http://www.ebi.ac.uk/goa
>>>
>>>
>>> _______________________________________________
>>> Go mailing list
>>> Go at geneontology.org
>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>     
>>>       
>> --
>> Ben Hitz
>> Senior Scientific Programmer ** Saccharomyces Genome Database ** GO  
>> Consortium
>> Stanford University ** hitz at genome.stanford.edu
>>
>>
>>

