[Go] Composition of the generic GO slim

Judith Blake jblake at informatics.jax.org
Mon May 5 11:40:36 PDT 2008


agreed,
we should remove or change the text to reflect reality.
judy

Valerie Wood wrote:
> The GO website makes the following statement, which is a bit misleading if we don't intend to provide any comprehensive slims....(as Emily pointed out earlier in this thread, this isn't a comprehensive slim....)
>
> "GO provides a generic GO slim which, like the GO itself, is not species specific, and which should be suitable for most purposes."
>
> So maybe this slim should not be described as such?
>
>
>
>
> Judith Blake <jblake at informatics.jax.org> wrote: 
>   
>> Val,
>> My point really is that experiments are done in context.  A person 
>> studying metabolism may want to break out those terms by particular 
>> sub-divisions and lump other things.  One of the roles of collaborating 
>> GO people would be to aid in the construction of particular slims if 
>> requested.
>>
>> For example, when I have done this, the researcher provided a list of 
>> 12-16 subdivisions that made sense for their purpose, and we constructed 
>> a GO_slim that subdivided the GO appropriately.  I think of it as part 
>> of the data analysis process.  A researcher using a generic GO_slim 
>> without understanding the vagaries of the annotations or of the ontology 
>> subtrees will not understand the results either.
>>
>> my opinion.
>> judy
>>
>> Valerie Wood wrote:
>>     
>>> Judy,
>>>
>>> You are correct that no one slim is going to fit all organisms or all
>>> uses. However, it isn't simple to create an informative slim which
>>> gives complete (or nearly complete) coverage of all of an organism's
>>> annotations (and complete coverage of the annotation space is an
>>> important feature of a robust slim). Does the Drosophila slim set
>>> cover all of the annotated genes?
>>>
>>> The slim I suggested will give complete coverage for single-celled
>>> eukaryotes (it may need additional high level terms to cover
>>> multicellular eukaryotes). This particular slim is useful for evaluating
>>> an organism's "cell biology". Perhaps a very generic slim, which only
>>> includes very high level terms, would be useful for multicellular
>>> organisms, but it would not be so useful for single-celled organisms.
>>>
>>> One suggested criterion (6 in my previous message) was that terms be
>>> meaningful to biologists. What I meant here was that the terms should
>>> be 'biologically informative'. For cellular roles, or for a
>>> single-celled organism, 'metabolism' isn't so useful as a 'direct' slim
>>> term (metabolic processes include transcription, translation, DNA
>>> replication, mRNA processing etc., in addition to primary and
>>> secondary metabolism). For pombe, 3102 of 4194 process-annotated gene
>>> products are annotated to metabolism, so this term in a slim does not
>>> tell you very much.
>>>
>>> In addition, if metabolism is included as a 'direct' slim term, and
>>> you have a gene product which is annotated ONLY to "metabolic process",
>>> then you really know very little about its biological role. This can
>>> occur frequently, as it is possible to predict that a protein has
>>> catalytic activity, and is involved in a 'metabolic process', but not
>>> to say anything more specific; there are many direct InterPro mappings
>>> to these two terms. If I were trying to assess the 'real biological
>>> roles' of my organism's gene products, I would wish to exclude direct
>>> annotations to 'metabolic process' from the slim.
>>>
>>> A GO slim provides a mechanism to filter out annotations to high level,
>>> relatively uninformative (with respect to the biological role) nodes
>>> like 'metabolism', 'cellular process' and 'localization' (in the slim,
>>> gene products will be assigned to 'unknown' if there is no annotation
>>> to one of your slim terms or their children).
>>>
>>> Once you exclude a term like metabolism it becomes necessary
>>> to include all of the child terms (or a combination of child terms)
>>> to give complete coverage of the parent term (NOTE: once the slimmed
>>> terms are mapped to the slim ontology the high level terms will be
>>> included, but their totals will only reflect the total of the gene
>>> products annotated via the terms in the slim).
>>>
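A minimal sketch of the slim mapping described above, assuming a toy
ontology with is_a links only. The term names, the PARENTS table and the
gene identifiers are hypothetical illustrations, not real GO data or any
existing GO tool. Each gene product is rolled up to every slim term at or
above one of its annotations, and falls into 'unknown' when none applies.

```python
from collections import defaultdict

# Hypothetical toy ontology: term -> set of direct parents (is_a only).
PARENTS = {
    "translation": {"metabolic process"},
    "lipid metabolic process": {"metabolic process"},
    "metabolic process": {"biological_process"},
    "cell cycle": {"cellular process"},
    "cellular process": {"biological_process"},
    "biological_process": set(),
}


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def map_to_slim(annotations, slim_terms):
    """Group gene products by the slim terms their annotations roll up to.

    Gene products with no annotation at or below a slim term go to 'unknown'.
    """
    buckets = defaultdict(set)
    for gene, terms in annotations.items():
        hits = set()
        for term in terms:
            hits |= ancestors(term) & slim_terms
        for slim_term in hits or {"unknown"}:
            buckets[slim_term].add(gene)
    return dict(buckets)


# Hypothetical annotations: gene product -> directly annotated terms.
ANNOTATIONS = {
    "geneA": {"translation"},
    "geneB": {"lipid metabolic process", "cell cycle"},
    "geneC": {"metabolic process"},  # annotated only to the broad parent
}
# 'metabolic process' is deliberately left out of the slim.
SLIM = {"translation", "lipid metabolic process", "cell cycle"}

result = map_to_slim(ANNOTATIONS, SLIM)
```

With this toy data, geneC lands in 'unknown' because its only annotation
is to the excluded parent term, while geneB is counted under two slim
terms.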
>>> The difficult part in building a slim is identifying the set of terms
>>> which provides complete coverage; this is the tricky step for most
>>> biologists, who are not so familiar with the ontologies. It would be
>>> useful to provide a starting slim which gives complete coverage of all
>>> annotations (using biologically relevant terms for common
>>> applications) which they can change as necessary. Maybe we should
>>> provide a set of 'complete coverage' slims for common applications.
>>>
>>> i.e.
>>> suitable for multicellular organisms and very general biological roles
>>> suitable for single-celled eukaryotes, or evaluating basic cellular 
>>> processes
>>>
>>> Val
>>>
>>>
>>>
>>>
>>> Judith Blake wrote:
>>>       
>>>> Val,
>>>> I still maintain that users need to be able to generate grouping 
>>>> criteria based on their usage.  I think we could go back to the fly 
>>>> genome paper and see the primary molecular divisions that seemed most 
>>>> useful to describe the genome properties, like 'reproduction' and 
>>>> 'metabolism'.  Anything more granular is specific to the user.  A 
>>>> mapping on this basis would likely include fewer than 20 terms and 
>>>> would subdivide trees.
>>>>
>>>> judy
>>>>
>>>> Valerie Wood wrote:
>>>>         
>>>>> I think it is a good idea for the consortium to provide an official
>>>>> 'GO slim', and advise people how they may want to alter the slim to
>>>>> fit their individual purpose.
>>>>>
>>>>> A useful generic GO slim has a number of qualities (I have tried to
>>>>> list these below; please suggest any additional ones. I hadn't
>>>>> really thought before about what rules I used for making a slim,
>>>>> so this is the first time I have documented them). Following
>>>>> the 'guidelines' below I have suggested a set of processes which I
>>>>> think should make up the generic process slim.
>>>>>
>>>>> Perhaps we could use this as a starting point, and people can 
>>>>> suggest additional terms (with reasons) or terms which should be 
>>>>> removed. This provides good coverage of basic cellular processes but 
>>>>> would need extending to cover multicellular processes.
>>>>>
>>>>> GO Slim criteria
>>>>>
>>>>> 1. The generic slim should be  as organism independent as possible 
>>>>> (although clearly some terms will not be applicable to single celled 
>>>>> eukaryotes and some eukaryotic terms will not be applicable to 
>>>>> prokaryotes)
>>>>>
>>>>> 2. The slim should cover AS MANY genes with annotated processes as 
>>>>> possible
>>>>>
>>>>> 3. The slim should cover AS MANY genes with annotated processes as
>>>>> possible with the smallest number of leaf node terms (if you include
>>>>> too many terms, it becomes too large and you start to lose the
>>>>> advantages of a slim).
>>>>>
>>>>> 4. It might be useful to try to avoid terms with an excessively
>>>>> small or large number of annotations (i.e. ideally your terms will
>>>>> not give an extreme distribution in your histogram)
>>>>>
>>>>> 5. Preferably the slim should not include sibling terms with a large
>>>>> overlap between them. If you choose two siblings with 200 genes
>>>>> annotated to each, and the majority of the annotations overlap, it
>>>>> is usually better to select the parent node (i.e. replace 2 terms by
>>>>> one single term). Conversely, if the child terms of a node fall
>>>>> into distinct non-overlapping subsets, it might be more informative
>>>>> to include both child terms in your slim (see also point 7 below)
>>>>>
>>>>> 6. For most purposes you need to include a representative term for 
>>>>> all biologically relevant processes, by including terms which are 
>>>>> meaningful to biologists.
>>>>>
>>>>> 7. If you are using your slim for data analysis (and not just for
>>>>> visualization) you need to include terms which will allow you to
>>>>> distinguish genes based on their biological properties.
>>>>> For example, it is not good to lump all genes involved in transport
>>>>> under transport, because the genes annotated to distinct child terms
>>>>> (vesicle-mediated transport, protein targeting, transmembrane
>>>>> transport) are VERY different in terms of their i) viability ii)
>>>>> species distribution iii) number of interaction partners iv) copy
>>>>> number v) expression pattern, so it does not make sense to lump them
>>>>> together in your slim set.
>>>>>
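A small sketch of the sibling-overlap check in criterion 5, under stated
assumptions: the gene sets are synthetic, the function names are
hypothetical, and the 0.5 cut-off is just one reading of "the majority of
the annotations overlap", not a value taken from this thread.

```python
def overlap_fraction(genes_a, genes_b):
    """Fraction of the smaller gene set that is shared with the other set."""
    smaller = min(len(genes_a), len(genes_b))
    if smaller == 0:
        return 0.0
    return len(genes_a & genes_b) / smaller


def prefer_parent(genes_a, genes_b, threshold=0.5):
    """Suggest the parent term when the siblings mostly share their genes.

    The threshold is an assumed cut-off for 'majority overlap'.
    """
    return overlap_fraction(genes_a, genes_b) > threshold


# Two hypothetical siblings with 200 genes each, 150 of them shared.
sibling_1 = set(range(0, 200))
sibling_2 = set(range(50, 250))

# Two hypothetical siblings with 200 genes each and no genes in common.
sibling_3 = set(range(0, 200))
sibling_4 = set(range(200, 400))

heavily_overlapping = prefer_parent(sibling_1, sibling_2)
distinct_subsets = prefer_parent(sibling_3, sibling_4)
```

Here the first pair shares 150 of 200 genes, so the parent would be
suggested; the second pair shares none, so both children would be kept.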
>>>>> Using these criteria, this is the basic cellular process eukaryotic
>>>>> slim I use (or slight variations of it). The number of annotations
>>>>> (for pombe obviously) is in parentheses (protein coding only).
>>>>>
>>>>> GO:0055085 transmembrane transport (278)
>>>>> GO:0006913 nucleocytoplasmic transport (114)
>>>>> GO:0006605 protein targeting (162)
>>>>> GO:0016192 vesicle-mediated transport (266)
>>>>> GO:0051186 cofactor metabolic process (139)
>>>>> GO:0006766 vitamin metabolic process (57)
>>>>> GO:0006790 sulfur metabolic process (45)
>>>>> GO:0006807 nitrogen compound metabolic process (224)
>>>>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (118)
>>>>> GO:0005975 carbohydrate metabolic process (199)
>>>>> GO:0006629 lipid metabolic process (201)
>>>>> GO:0006399 tRNA metabolic process (125)
>>>>> GO:0006520 amino acid metabolic process (187)
>>>>> GO:0006412 translation (357)
>>>>> GO:0006259 DNA metabolic process (296)
>>>>> GO:0006508 proteolysis (223)
>>>>> GO:0016071 mRNA metabolic process (204)
>>>>> GO:0043413 biopolymer glycosylation (65) possibly drop?
>>>>> GO:0006464 protein modification process (585)
>>>>> GO:0007059 chromosome segregation (186)
>>>>> GO:0007049 cell cycle (552)
>>>>> GO:0007010 cytoskeletal organization and biogenesis (236)
>>>>> GO:0000910 cytokinesis (145)
>>>>> GO:0007165 signal transduction (362)
>>>>> GO:0006457 protein folding (80)
>>>>> GO:0042254 ribosome biogenesis and assembly (223)
>>>>> GO:0045229 external encapsulating structure organization and biogenesis (124)
>>>>> GO:xxxxxxxx general transcription (see note *1 below)
>>>>> GO:0032569 specific transcription from RNA polymerase II promoter (102)
>>>>> (total 424 for all transcription)
>>>>> GO:0000902 cell morphogenesis (86)
>>>>> GO:0006338 establishment and/or maintenance of chromatin architecture (231)
>>>>> GO:reproductive process (182)
>>>>> GO:0007005 mitochondrion organization and biogenesis (251)
>>>>> GO:0006091 generation of precursor metabolites and energy (113)
>>>>> GO:0007031 peroxisome organization and biogenesis (20)
>>>>>
>>>>> At this point there are ~100 pombe genes (out of the 3960 with
>>>>> an annotated process term) which aren't included in the slim
>>>>>
>>>>> I could also include....
>>>>> vacuolar transport (91) reduces the uncovered genes by 6 (most also
>>>>> annotated to protein targeting)
>>>>> telomere maintenance (54) reduces by 6 (most also annotated to DNA
>>>>> metabolic process)
>>>>> snoRNA metabolic process (10) reduces by 2
>>>>> ...to improve coverage (very slightly)
>>>>>
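A rough sketch of the coverage bookkeeping above, assuming each slim term
has already been resolved to the set of gene products annotated to it or
its children. All gene identifiers, set contents and function names are
hypothetical. It shows how to count the genes a slim leaves out, and how
few of those a candidate term recovers when most of its genes are already
covered by another slim term.

```python
def uncovered(annotated_genes, slim_gene_sets):
    """Genes with a process annotation that no slim term accounts for."""
    covered = set().union(*slim_gene_sets.values())
    return annotated_genes - covered


def marginal_gain(annotated_genes, slim_gene_sets, candidate_genes):
    """Number of currently uncovered genes a candidate term would pick up."""
    return len(uncovered(annotated_genes, slim_gene_sets) & candidate_genes)


# Hypothetical data: ten annotated genes and a two-term slim.
annotated = {f"g{i}" for i in range(10)}
slim_sets = {
    "protein targeting": {"g0", "g1", "g2"},
    "translation": {"g2", "g3", "g4", "g5"},
}

missing = uncovered(annotated, slim_sets)
coverage = (len(annotated) - len(missing)) / len(annotated)

# Candidate term with two genes, one of them already covered by the slim.
candidate = {"g1", "g6"}
gain = marginal_gain(annotated, slim_sets, candidate)
```

In this toy case the slim covers 6 of 10 genes, and the two-gene candidate
reduces the uncovered set by only 1 because its other gene is already
under 'protein targeting'.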
>>>>> Finally I include
>>>>> GO:0006950 response to stress (444)
>>>>> this term overlaps with most other processes, so it is largely
>>>>> redundant but is useful.
>>>>>
>>>>> This leaves ~30 pombe genes with a process annotation unassigned to
>>>>> the GO slim; these are often annotated to terms like homeostasis and
>>>>> its children, or otherwise uninformative terms
>>>>>
>>>>> For some purposes I would also include
>>>>> GO:0065007 biological regulation  (1021)
>>>>> but I don't know if this is a good term to include in a generic slim
>>>>>
>>>>> To make this work for multicellular eukaryotes, we would probably 
>>>>> want to add non-cellular process terms like:
>>>>>
>>>>> developmental process
>>>>> immune system process
>>>>>
>>>>>
>>>>> * Note 1: it is not currently possible to retrieve genes involved in
>>>>> general transcription as opposed to gene specific transcription (i.e.
>>>>> RNA polymerases I, II and III etc.) with a single query. This is
>>>>> also important for enrichment as the genes in these 2 sets are very 
>>>>> different in terms of species distribution, copy number and 
>>>>> viability. I requested a grouping term for these processes a while 
>>>>> ago and hopefully this will be implemented shortly.
>>>>>
>>>>> See:
>>>>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764 
>>>>>
>>>>>
>>>>>
>>>>> Val
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Ben Hitz wrote:
>>>>>  
>>>>>           
>>>>>> Emily -
>>>>>> I have interest in working on the generic go slim; I need it (or  
>>>>>> something similar) to define graphics for an interaction network.
>>>>>>
>>>>>> Ben
>>>>>>
>>>>>>
>>>>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>>>>>>
>>>>>>     
>>>>>>             
>>>>>>> Hi,
>>>>>>>
>>>>>>> From replying to a user request, I've just been having a quick look
>>>>>>> at the composition of the generic GO slim, and relating the GO terms
>>>>>>> included to the number of annotations displayed by AmiGO.
>>>>>>>
>>>>>>> Should, for instance, the 'cell recognition' term still be included
>>>>>>> in the generic GO slim? It has only been annotated to 182 gene
>>>>>>> products, whereas its sibling terms 'cell division', 'cell cycle'
>>>>>>> and 'cell motility' have not been included even though they
>>>>>>> (directly or indirectly) have been annotated to more than 1,200
>>>>>>> gene products each. Similarly, the term 'cytoplasm organization and
>>>>>>> biogenesis' is in the GO slim but only has 113 gps annotated,
>>>>>>> whereas the 'membrane organization and biogenesis' term has been
>>>>>>> annotated to 1,509 gps.
>>>>>>>
>>>>>>> I was just wondering what the goal of the generic GO slim is... if
>>>>>>> terms are selected on the basis that as many annotated gene products
>>>>>>> from different organisms as possible should get mapped to
>>>>>>> descriptive GO terms before they are caught by the BP, MF, CC root
>>>>>>> terms (while also providing a full selection of terms across the
>>>>>>> whole GO vocabulary), should we think of reviewing some of its
>>>>>>> composition in relation to overall annotation frequency? Or should
>>>>>>> the GO slim be kept as stable as possible?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Emily
>>>>>>>
>>>>>>> -- 
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------------------------------------------
>>>>>>>
>>>>>>>    Emily Dimmer Ph.D.
>>>>>>>    GOA Coordinator
>>>>>>>    EMBL-EBI
>>>>>>>    Wellcome Trust Genome Campus
>>>>>>>    Hinxton
>>>>>>>    Cambridge CB10 1SD, U.K.
>>>>>>>    Tel:     +44 1223 494654
>>>>>>>    Fax:    +44 1223 494468
>>>>>>>    email:  edimmer at ebi.ac.uk
>>>>>>>    URL:    http://www.ebi.ac.uk/goa
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Go mailing list
>>>>>>> Go at geneontology.org
>>>>>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>>>>>           
>>>>>>>               
>>>>>> -- 
>>>>>> Ben Hitz
>>>>>> Senior Scientific Programmer ** Saccharomyces Genome Database ** 
>>>>>> GO  Consortium
>>>>>> Stanford University ** hitz at genome.stanford.edu
>>>>>>
>>>>>>
>>>>>>

