[Go] Composition of the generic GO slim

Jane Lomax jane at ebi.ac.uk
Fri May 9 05:10:10 PDT 2008


I don't have any strong feelings about how the generic GO slim is 
generated, just as long as it's up-to-date and we have some documented, 
logical basis to how we do it.

Let's not forget about this - it's important...

Jane

Judith Blake wrote:
> ahhhh not another WG :)
>
> I think it might be accomplished by taking the 12-16 subdivisions in 
> either the human or fly genome papers that subdivide cellular roles, 
> and look for similar sets in a text book for CC and MF by chapter 
> titles.  This number of subdivisions is the most useful for general 
> overview.  I think the single-cell concerns may not be so important at 
> this level of 'genericism'; some subdivisions might be more or less 
> devoid of annotations...or maybe it is useful to have two..but we 
> could start with one.
>
> Then figure out how to sum GO to those terms.
>
> Start with the biology, not the ontology.
>
> Of course, I am biased, but I think the MGI go-slim accomplishes this to
> some extent.
>
> I think a draft of this could be done in a week by a dedicated 
> curator.  but who?  I'll think about this.
>
> Judy
>
>
> Jane Lomax wrote:
>> Hi - sorry, only just got to this thread...
>>
>> From an advocacy point of view I think it's crucial for us to provide 
>> a generic GO slim that's up to date with the ontologies. As others 
>> have said, most naive users are not going to have the knowledge to 
>> create their own tailored slims in the beginning, so we need to 
>> provide something general for them to start from, especially as the 
>> pre-built slims are now part of the AmiGO GO slim mapper. Users can 
>> then trim or expand as they see fit for their own purposes as they 
>> become more familiar with the technology.
>>
>> Users blindly using the generic slim in a formal analysis without an 
>> understanding of the underlying mechanism are, quite frankly, not 
>> performing good science. This should be weeded out at the level of 
>> peer review, just the same as with any other misuse of bioinformatics 
>> apps.
>>
>> Perhaps the documentation for the generic GO slim might say something 
>> like:
>>
>> "GO provides a generic GO slim which, like the GO itself, is not 
>> species specific. This should be a suitable starting point for most 
>> investigations as it has broad coverage over most annotations. Users 
>> should tailor this GO slim according to the specific requirements of 
>> their own research".
>>
>> I like Val's suggestions for creating the generic GO slim - how about 
>> we set up a WG?
>>
>> Jane
>>
>> Judith Blake wrote:
>>> agreed,
>>> we should remove or change the text to reflect reality.
>>> judy
>>>
>>> Valerie Wood wrote:
>>>  
>>>> The GO website makes the following statement, which is a bit 
>>>> misleading if we don't intend to provide any comprehensive 
>>>> slims....(as Emily pointed out earlier in this thread, this isn't a 
>>>> comprehensive slim....)
>>>>
>>>> "GO provides a generic GO slim which, like the GO itself, is not 
>>>> species specific, and which should be suitable for most purposes."
>>>>
>>>> So maybe this slim should not be described as such?
>>>>
>>>>
>>>>
>>>>
>>>> Judith Blake <jblake at informatics.jax.org> wrote:     
>>>>> Val,
>>>>> My point really is that experiments are done in context.  A person 
>>>>> studying metabolism may want to break out those terms by 
>>>>> particular sub-divisions and lump other things.  One of the roles 
>>>>> of collaborating GO people would be to add in the construction of 
>>>>> particular slims if requested.
>>>>>
>>>>> For example, when I have done this, the researcher provided a list 
>>>>> of 12-16 subdivisions that made sense for their purpose, and we 
>>>>> constructed a GO_slim that subdivided the GO appropriately.  I 
>>>>> think of it as part of the data analysis process.  A researcher 
>>>>> using a generic GO_slim without understanding the vagaries of the 
>>>>> annotations or of the ontology subtrees will not understand the 
>>>>> results either.
>>>>>
>>>>> my opinion.
>>>>> judy
>>>>>
>>>>> Valerie Wood wrote:
>>>>>         
>>>>>> Judy,
>>>>>>
>>>>>> You are correct  that no one slim is going to fit all organisms 
>>>>>> or all uses.
>>>>>> However, it isn't simple to create an informative slim which
>>>>>> gives complete (or nearly complete) coverage of all of an
>>>>>> organism's annotations (and complete coverage of the annotation
>>>>>> space is an important feature of a robust slim). Does the
>>>>>> Drosophila slim set cover all of the annotated genes?
>>>>>>
>>>>>> The slim I suggested will give complete coverage for single-celled
>>>>>> eukaryotes (it may need additional high level terms to cover
>>>>>> multicellular eukaryotes). This particular slim is useful for
>>>>>> evaluating an organism's "cell biology". Perhaps a very generic
>>>>>> slim, which only includes very high level terms, would be useful
>>>>>> for multicellular organisms, but it would not be so useful for
>>>>>> single-celled organisms.
>>>>>>
>>>>>> One suggested criterion (6 in the previous message) was that terms
>>>>>> be meaningful to biologists. What I meant here was that the terms
>>>>>> should be 'biologically informative'. For cellular roles, or for a
>>>>>> single-celled organism, 'metabolism' isn't so useful as a 'direct'
>>>>>> slim term (metabolic processes include transcription, translation,
>>>>>> DNA replication, mRNA processing etc., in addition to primary and
>>>>>> secondary metabolism). For pombe, 3102 of 4194 process-annotated
>>>>>> gene products are annotated to metabolism, so this term in a slim
>>>>>> does not tell you very much.
>>>>>>
>>>>>> In addition, if metabolism is included as a 'direct' slim term,
>>>>>> and you have a gene product which is annotated ONLY to "metabolic
>>>>>> process", then you really know very little about its biological
>>>>>> role. This can occur frequently, as it is possible to predict that
>>>>>> a protein has catalytic activity, and is involved in a 'metabolic
>>>>>> process', but not to say anything more specific; there are many
>>>>>> direct InterPro mappings to these two terms.  If I were trying to
>>>>>> assess the 'real biological roles' of my organism's gene products,
>>>>>> I would wish to exclude direct annotations to 'metabolic process'
>>>>>> from the slim.
>>>>>>
>>>>>> A GO slim provides a mechanism to filter out annotations to high
>>>>>> level, relatively uninformative (with respect to the biological
>>>>>> role) nodes like 'metabolism', 'cellular process' and
>>>>>> 'localization' (in the slim, the gene products will be assigned to
>>>>>> 'unknown' if there is no annotation to one of your slim terms or
>>>>>> their children).
>>>>>>
>>>>>> Once you exclude a term like metabolism it becomes necessary
>>>>>> to include all of the child terms (or a combination of child 
>>>>>> terms ) to give complete
>>>>>> coverage of the parent term ( NOTE: once the slimmed terms are 
>>>>>> mapped
>>>>>> to the slim ontology the  high level terms will be
>>>>>> included, but their totals will only reflect the  total of the 
>>>>>> gene products
>>>>>> annotated via the terms in the slim).
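
The filtering behaviour described above can be illustrated with a minimal Python sketch. It is not the AmiGO slim mapper or any official GO tool; the toy `PARENTS` graph, the gene names and the function names `ancestors` and `map_to_slim` are all hypothetical, and only is_a-style parent links are assumed.

```python
from collections import defaultdict

# Toy parent graph (child -> parents); term names stand in for GO IDs.
# This graph is an illustrative assumption, not real GO structure.
PARENTS = {
    "metabolic process": ["biological_process"],
    "lipid metabolic process": ["metabolic process"],
    "translation": ["metabolic process"],
    "transport": ["biological_process"],
    "vesicle-mediated transport": ["transport"],
}

def ancestors(term):
    """Return the term itself plus every term reachable via parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def map_to_slim(annotations, slim_terms):
    """Bucket each gene under every slim term that is an
    ancestor-or-self of one of its annotations; genes that reach
    no slim term go to 'unknown'."""
    buckets = defaultdict(set)
    for gene, terms in annotations.items():
        hits = set()
        for term in terms:
            hits |= ancestors(term) & slim_terms
        for slim_term in hits or {"unknown"}:
            buckets[slim_term].add(gene)
    return dict(buckets)

ANNOTATIONS = {
    "geneA": {"lipid metabolic process"},
    "geneB": {"translation", "vesicle-mediated transport"},
    "geneC": {"metabolic process"},  # annotated only to the high level term
}
# 'metabolic process' is deliberately left out of the slim.
SLIM = {"lipid metabolic process", "translation", "vesicle-mediated transport"}

result = map_to_slim(ANNOTATIONS, SLIM)
for slim_term in sorted(result):
    print(slim_term, sorted(result[slim_term]))
```

In this toy data, geneC is annotated only to 'metabolic process', which is not in the slim, so it ends up in the 'unknown' bucket rather than under an uninformative high level term.
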
>>>>>>
>>>>>> The difficult part in building a slim is identifying the set 
>>>>>> of terms which
>>>>>> provides complete coverage; this is the tricky step for most 
>>>>>> biologists,
>>>>>> who are not so familiar with the ontologies. It would be useful 
>>>>>> to provide a
>>>>>> starting slim which gives complete coverage of all annotations 
>>>>>> (using
>>>>>> biologically relevant terms for common applications) which they 
>>>>>> can change as necessary.
>>>>>> Maybe we should provide a set of 'complete coverage' slims for 
>>>>>> common
>>>>>> applications.
>>>>>>
>>>>>> i.e.
>>>>>> suitable for multicellular organisms and very general biological 
>>>>>> roles
>>>>>> suitable for single-celled eukaryotes, or evaluating basic 
>>>>>> cellular processes
>>>>>>
>>>>>> Val
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Judith Blake wrote:
>>>>>>             
>>>>>>> Val,
>>>>>>> I still maintain that users need to be able to generate grouping 
>>>>>>> criteria based on their usage.    I think we could go back to 
>>>>>>> the fly genome paper and see the primary molecular divisions 
>>>>>>> that seemed most useful to describe the genome properties.  like 
>>>>>>> 'reproduction' and 'metabolism'.  Anything more granular is 
>>>>>>> specific to the user.  A mapping on this basis would likely 
>>>>>>> include fewer than 20 terms and would subdivide trees.
>>>>>>>
>>>>>>> judy
>>>>>>>
>>>>>>> Valerie Wood wrote:
>>>>>>>                 
>>>>>>>> I think it is a good idea for the consortium to provide an 
>>>>>>>> official 'GO slim', and advise people how they may want to 
>>>>>>>> alter the slim to fit their individual purpose.
>>>>>>>>
>>>>>>>> A useful generic GO slim has a number of qualities (I have 
>>>>>>>> tried to list these below, please suggest any additional ones, 
>>>>>>>> I hadn't really thought before about what the rules were I used 
>>>>>>>> for making a slim so this is the first time I have documented 
>>>>>>>> them). Following the 'guidelines' below I have suggested a set 
>>>>>>>> of processes which I think should make up the generic process slim.
>>>>>>>>
>>>>>>>> Perhaps we could use this as a starting point, and people can 
>>>>>>>> suggest additional terms (with reasons) or terms which should 
>>>>>>>> be removed. This provides good coverage of basic cellular 
>>>>>>>> processes but would need extending to cover multicellular 
>>>>>>>> processes.
>>>>>>>>
>>>>>>>> GO Slim criteria
>>>>>>>>
>>>>>>>> 1. The generic slim should be  as organism independent as 
>>>>>>>> possible (although clearly some terms will not be applicable to 
>>>>>>>> single celled eukaryotes and some eukaryotic terms will not be 
>>>>>>>> applicable to prokaryotes)
>>>>>>>>
>>>>>>>> 2. The slim should cover AS MANY genes with annotated processes 
>>>>>>>> as possible
>>>>>>>>
>>>>>>>> 3. The slim should cover AS MANY genes with annotated processes 
>>>>>>>> as possible with the smallest number of leaf node terms (if you
>>>>>>>> include too many terms, it becomes too large and you start to
>>>>>>>> lose the advantages of a slim).
>>>>>>>>
>>>>>>>> 4. It might be useful to try to avoid terms with an excessively 
>>>>>>>> small or large number of annotations (i.e. ideally your terms
>>>>>>>> will not have an extreme distribution in your histogram)
>>>>>>>>
>>>>>>>> 5. Preferably the slim should not include sibling terms with a 
>>>>>>>> large overlap between them. If you choose two siblings with 
>>>>>>>> 200 genes annotated to each, and the majority of the 
>>>>>>>> annotations  overlap, it is usually better to select the parent 
>>>>>>>> node (i.e replace 2 terms by one single term). Conversely, if 
>>>>>>>> the child terms of a  node fall into distinct non-overlapping 
>>>>>>>> subsets, it might be more informative to include both child 
>>>>>>>> terms in your slim (see also point 7 below)
>>>>>>>>
>>>>>>>> 6. For most purposes you need to include a representative term 
>>>>>>>> for all biologically relevant processes, by including terms 
>>>>>>>> which are meaningful to biologists.
>>>>>>>>
>>>>>>>> 7. If you are using your slim for data analysis (and not just 
>>>>>>>> for visualization) you need to include terms which will allow 
>>>>>>>> you to distinguish genes based on their biological properties.
>>>>>>>> For example, it is not good to lump all genes involved in 
>>>>>>>> transport under 'transport', because the genes annotated to
>>>>>>>> distinct child terms (vesicle-mediated transport, protein
>>>>>>>> targeting, transmembrane transport) are VERY different in terms
>>>>>>>> of their i) viability ii) species distribution iii) number of 
>>>>>>>> interaction partners iv) copy number v) expression pattern, so 
>>>>>>>> it does not make sense to lump them together in your slim set.
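
One way to make criterion 5 concrete is to measure how much two sibling terms overlap before deciding whether to keep both or to use their parent. The Python sketch below is only an illustration: the function names, the gene identifiers and the 0.5 "majority" threshold are assumptions, not part of any agreed GO slim procedure.

```python
def overlap_fraction(genes_a, genes_b):
    """Share of the smaller gene set that is also in the other set."""
    smaller = min(len(genes_a), len(genes_b))
    if smaller == 0:
        return 0.0
    return len(genes_a & genes_b) / smaller

def prefer_parent(genes_a, genes_b, threshold=0.5):
    """Suggest the parent term when most annotations are shared.
    The threshold is a hypothetical cut-off for 'the majority'."""
    return overlap_fraction(genes_a, genes_b) > threshold

# Two siblings with 200 genes each, 150 of them shared.
sibling_1 = {f"gene{i}" for i in range(0, 200)}
sibling_2 = {f"gene{i}" for i in range(50, 250)}

# Two siblings with 200 genes each and no genes in common.
sibling_3 = {f"gene{i}" for i in range(0, 200)}
sibling_4 = {f"gene{i}" for i in range(200, 400)}

print(overlap_fraction(sibling_1, sibling_2), prefer_parent(sibling_1, sibling_2))
print(overlap_fraction(sibling_3, sibling_4), prefer_parent(sibling_3, sibling_4))
```

An overlap score like this only addresses criterion 5. As criterion 7 points out, siblings may still deserve separate slim terms when their genes differ in biological properties, whatever the overlap.
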
>>>>>>>>
>>>>>>>> Using these criteria  this is the basic cellular process 
>>>>>>>> eukaryotic slim I use (or slight variations of): The number of 
>>>>>>>> annotations (for pombe obviously) is in parentheses (protein 
>>>>>>>> coding only).
>>>>>>>>
>>>>>>>> GO:0055085 transmembrane transport (278)
>>>>>>>> GO:0006913 nucleocytoplasmic transport (114)
>>>>>>>> GO:0006605 protein targeting (162)
>>>>>>>> GO:0016192 vesicle-mediated transport (266)
>>>>>>>> GO:0051186 cofactor metabolic process (139)
>>>>>>>> GO:0006766 vitamin metabolic process (57)
>>>>>>>> GO:0006790 sulfur metabolic process (45)
>>>>>>>> GO:0006807 nitrogen compound metabolic process (224)
>>>>>>>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic 
>>>>>>>> process (118)
>>>>>>>> GO:0005975 carbohydrate metabolic process (199)
>>>>>>>> GO:0006629 lipid metabolic process (201)
>>>>>>>> GO:0006399 tRNA metabolic process (125)
>>>>>>>> GO:0006520 amino acid metabolic process (187)
>>>>>>>> GO:0006412 translation (357)
>>>>>>>> GO:0006259 DNA metabolic process (296)
>>>>>>>> GO:0006508 proteolysis (223)
>>>>>>>> GO:0016071 mRNA metabolic process (204)
>>>>>>>> GO:0043413 biopolymer glycosylation (65) possibly drop?
>>>>>>>> GO:0006464 protein modification process (585)
>>>>>>>> GO:0007059 chromosome segregation (186)
>>>>>>>> GO:0007049 cell cycle (552)
>>>>>>>> GO:0007010 cytoskeletal organization and biogenesis (236)
>>>>>>>> GO:0000910 cytokinesis (145)
>>>>>>>> GO:0007165 signal transduction (362)
>>>>>>>> GO:0006457 protein folding (80)
>>>>>>>> GO:0042254 ribosome biogenesis and assembly (223)
>>>>>>>> GO:0045229 external encapsulating structure organization and 
>>>>>>>> biogenesis (124)
>>>>>>>> GO:xxxxxxxx general transcription (see note *1 below)
>>>>>>>> GO:0032569 specific transcription from RNA polymerase II 
>>>>>>>> promoter (102)
>>>>>>>> (total 424 for all transcription)
>>>>>>>> GO:0000902 cell morphogenesis (86)
>>>>>>>> GO:0006338 establishment and/or maintenance of chromatin 
>>>>>>>> architecture (231)
>>>>>>>> GO:reproductive process (182)
>>>>>>>> GO:0007005 mitochondrion organization and biogenesis (251)
>>>>>>>> GO:0006091 generation of precursor metabolites and energy (113)
>>>>>>>> GO:0007031 peroxisome organization and biogenesis (20)
>>>>>>>>
>>>>>>>> At this point there are ~100 pombe genes (out of the 3960 
>>>>>>>> with an annotated process term) which aren't included in the slim.
>>>>>>>>
>>>>>>>> I could also include....
>>>>>>>> vacuolar transport (91) reduces by 6 (most also annotated to 
>>>>>>>> protein targeting)
>>>>>>>> telomere maintenance (54) reduces by 6 (most also annotated to 
>>>>>>>> DNA met)
>>>>>>>> snoRNA metabolic process (10) reduces by 2
>>>>>>>> ...to improve coverage (very slightly)
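
The "reduces by 6" figures above are marginal coverage gains: how many currently uncovered genes a candidate term would pick up, as opposed to how many genes are annotated to it in total. A minimal Python sketch of that calculation follows; the gene sets are made-up toy data (not pombe annotations) and the function names are hypothetical.

```python
def uncovered(all_genes, slim_gene_sets):
    """Genes with a process annotation that no slim term covers."""
    covered = set().union(*slim_gene_sets.values())
    return all_genes - covered

def marginal_gain(all_genes, slim_gene_sets, candidate_genes):
    """Number of currently uncovered genes a candidate term would add."""
    return len(uncovered(all_genes, slim_gene_sets) & candidate_genes)

# Toy data: ten annotated genes and a two-term slim.
all_genes = {f"g{i}" for i in range(1, 11)}
slim_gene_sets = {
    "protein targeting": {"g1", "g2", "g3", "g4"},
    "DNA metabolic process": {"g5", "g6"},
}
# The candidate has three genes, but two are already covered
# through 'protein targeting'.
vacuolar_transport = {"g3", "g4", "g7"}

print(len(uncovered(all_genes, slim_gene_sets)))
print(marginal_gain(all_genes, slim_gene_sets, vacuolar_transport))
```

Here the candidate term has three annotated genes but only improves coverage by one, the same pattern as a term with 91 annotations that reduces the uncovered set by 6.
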
>>>>>>>>
>>>>>>>> Finally I include
>>>>>>>> GO:0006950 response to stress (444)
>>>>>>>> this term overlaps with most other processes so is largely 
>>>>>>>> redundant, but is useful.
>>>>>>>>
>>>>>>>> This leaves ~30 pombe genes with a process annotation unassigned
>>>>>>>> to the GO slim; these are often annotated to terms like
>>>>>>>> homeostasis and its children, or otherwise uninformative terms
>>>>>>>>
>>>>>>>> For some purposes I would also include
>>>>>>>> GO:0065007 biological regulation  (1021)
>>>>>>>> but I don't know if this is a good term to include in a generic 
>>>>>>>> slim
>>>>>>>>
>>>>>>>> To make this work for multicellular eukaryotes, we would 
>>>>>>>> probably want to add non-cellular process terms like:
>>>>>>>>
>>>>>>>> developmental process
>>>>>>>> immune system process
>>>>>>>>
>>>>>>>>
>>>>>>>> * Note 1: it is not currently possible to retrieve genes involved 
>>>>>>>> in general transcription as opposed to gene specific 
>>>>>>>> transcription (i.e. RNA polymerases I, II and III etc.), with a 
>>>>>>>> single query. This is also important for enrichment as the 
>>>>>>>> genes in these 2 sets are very different in terms of species 
>>>>>>>> distribution, copy number and viability. I requested a grouping 
>>>>>>>> term for these processes a while ago and hopefully this will be 
>>>>>>>> implemented shortly.
>>>>>>>>
>>>>>>>> See:
>>>>>>>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Val
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Ben Hitz wrote:
>>>>>>>>  
>>>>>>>>                     
>>>>>>>>> Emily -
>>>>>>>>> I have interest in working on the generic go slim; I need it 
>>>>>>>>> (or  something similar) to define graphics for an interaction 
>>>>>>>>> network.
>>>>>>>>>
>>>>>>>>> Ben
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>>>>>>>>>
>>>>>>>>>                             
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> From replying to a user request, I've just been having a 
>>>>>>>>>> quick look at
>>>>>>>>>> the composition of the generic GO slim, and relating the GO 
>>>>>>>>>> terms
>>>>>>>>>> included to the number of annotations displayed by AmiGO.
>>>>>>>>>>
>>>>>>>>>> Should, for instance, the 'cell recognition' term still be 
>>>>>>>>>> included in
>>>>>>>>>> the generic GO slim? - it has only been annotated to 182 
>>>>>>>>>> gene  products,
>>>>>>>>>> whereas its sibling terms: 'cell division', 'cell cycle' and 
>>>>>>>>>> 'cell
>>>>>>>>>> motility', have not been included even though they (directly or
>>>>>>>>>> indirectly) have been annotated to more than 1,200 gene 
>>>>>>>>>> products each.
>>>>>>>>>> Similarly, the term 'cytoplasm organization and biogenesis' 
>>>>>>>>>> is in  the GO
>>>>>>>>>> slim but only has 113 gps annotated, whereas the 'membrane  
>>>>>>>>>> organisation
>>>>>>>>>> and biogenesis' term has been annotated to 1,509 gps.
>>>>>>>>>>
>>>>>>>>>> I was just wondering what the goal of the generic GO slim 
>>>>>>>>>> is... if  terms
>>>>>>>>>> are selected on the basis that as many annotated gene 
>>>>>>>>>> products from
>>>>>>>>>> different organisms should get mapped to descriptive GO terms 
>>>>>>>>>> before
>>>>>>>>>> they are caught by the BP, MF, CC root terms (while also 
>>>>>>>>>> providing a
>>>>>>>>>> full selection of terms across the whole GO vocabulary), 
>>>>>>>>>> should we  think
>>>>>>>>>> of reviewing some of its composition in relation to overall
>>>>>>>>>> annotation frequency? Or should the GO slim be kept as stable 
>>>>>>>>>> as  possible?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Emily
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------------ 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    Emily Dimmer Ph.D.
>>>>>>>>>>    GOA Coordinator
>>>>>>>>>>    EMBL-EBI
>>>>>>>>>>    Wellcome Trust Genome Campus
>>>>>>>>>>    Hinxton
>>>>>>>>>>    Cambridge CB10 1SD, U.K.
>>>>>>>>>>    Tel:     +44 1223 494654
>>>>>>>>>>    Fax:    +44 1223 494468
>>>>>>>>>>    email:  edimmer at ebi.ac.uk
>>>>>>>>>>    URL:    http://www.ebi.ac.uk/goa
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Go mailing list
>>>>>>>>>> Go at geneontology.org
>>>>>>>>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>>>>>>>>                                         
>>>>>>>>> -- 
>>>>>>>>> Ben Hitz
>>>>>>>>> Senior Scientific Programmer ** Saccharomyces Genome Database 
>>>>>>>>> ** GO  Consortium
>>>>>>>>> Stanford University ** hitz at genome.stanford.edu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>


