[Go] Composition of the generic GO slim

Valerie Wood val at sanger.ac.uk
Mon May 5 09:51:02 PDT 2008


Judy,

You are correct that no one slim is going to fit all organisms or all
uses. However, it isn't simple to create an informative slim which gives
complete (or nearly complete) coverage of all of an organism's annotations
(and complete coverage of the annotation space is an important feature
of a robust slim). Does the Drosophila slim set cover all of the
annotated genes?

The slim I suggested will give complete coverage for single-celled
eukaryotes (it may need additional high level terms to cover
multicellular eukaryotes). This particular slim is useful for evaluating
an organism's "cell biology". Perhaps a very generic slim, which only
includes very high level terms, would be useful for multicellular
organisms, but it would not be so useful for single-celled organisms.

One suggested criterion (number 6 in my previous message) was that terms
be meaningful to biologists. What I meant here was that the terms should
be 'biologically informative'. For cellular roles, or for a single-celled
organism, 'metabolism' isn't so useful as a 'direct' slim term (metabolic
processes include transcription, translation, DNA replication, mRNA
processing etc., in addition to primary and secondary metabolism).
For pombe, 3102 of 4194 process-annotated gene products are annotated to
metabolism, so this term in a slim does not tell you very much.
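
To put a number on this, here is a minimal sketch in Python. The function
name is only illustrative; the counts are the pombe figures quoted above.

```python
def annotated_share(term_count, total_annotated):
    """Fraction of process-annotated gene products that fall under one term."""
    return term_count / total_annotated

# pombe counts quoted in the paragraph above
metabolism_share = annotated_share(3102, 4194)
print(f"metabolism: {metabolism_share:.1%} of process-annotated gene products")
```

A term that captures roughly three quarters of all annotated gene products
separates them very little.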

In addition, if metabolism is included as a 'direct' slim term, and you
have a gene product which is annotated ONLY to "metabolic process", then
you really know very little about its biological role. This can occur
frequently, as it is often possible to predict that a protein has
catalytic activity and is involved in a 'metabolic process' but not to
say anything more specific; there are many direct InterPro mappings
to these two terms. If I were trying to assess the 'real biological
roles' of my organism's gene products, I would wish to exclude direct
annotations to 'metabolic process' from the slim.

A GO slim provides a mechanism to filter out annotations to high level,
relatively uninformative (with respect to the biological role) nodes like
'metabolism', 'cellular process' and 'localization' (in the slim, gene
products will be assigned to 'unknown' if there is no annotation to one
of your slim terms or their children).
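
The filtering described above can be sketched roughly as follows. This is
a toy illustration only: the small is_a graph, the gene names and the
function names are all made up for the example, and it is not meant to
show how any existing slimming tool is implemented.

```python
# Toy is_a graph, child -> parents (hypothetical subset, labels instead of IDs)
PARENTS = {
    "metabolic process": ["biological_process"],
    "translation": ["metabolic process"],
    "lipid metabolic process": ["metabolic process"],
    "transport": ["biological_process"],
    "transmembrane transport": ["transport"],
}

def ancestors_and_self(term):
    """Return the term plus every term reachable by following parents."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def map_to_slim(annotations, slim_terms):
    """Map each gene to the slim terms at or above its direct annotations.

    Genes that reach no slim term are placed in an 'unknown' bucket.
    """
    mapped = {}
    for gene, terms in annotations.items():
        hits = set()
        for term in terms:
            hits |= ancestors_and_self(term) & slim_terms
        mapped[gene] = hits or {"unknown"}
    return mapped

# 'metabolic process' is deliberately left out of the slim
slim = {"translation", "lipid metabolic process", "transmembrane transport"}

# Made-up direct annotations
annotations = {
    "geneA": {"translation"},
    "geneB": {"metabolic process"},  # annotated only to the high level term
    "geneC": {"transmembrane transport", "lipid metabolic process"},
}

slimmed = map_to_slim(annotations, slim)
for gene in sorted(slimmed):
    print(gene, sorted(slimmed[gene]))
```

Here geneB, which is annotated only to 'metabolic process', ends up in
'unknown' because that term is not in the slim and is not a child of
any slim term.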

Once you exclude a term like metabolism it becomes necessary to include
all of the child terms (or a combination of child terms) to give complete
coverage of the parent term (NOTE: once the slimmed terms are mapped
to the slim ontology the high level terms will be included, but their
totals will only reflect the total of the gene products annotated via
the terms in the slim).

The difficult part in building a slim is identifying the set of terms
which provides complete coverage; this is the tricky step for most
biologists, who are not so familiar with the ontologies. It would be
useful to provide a starting slim which gives complete coverage of all
annotations (using biologically relevant terms for common applications)
which they can change as necessary.
Maybe we should provide a set of 'complete coverage' slims for common
applications.
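
As a rough sketch of how coverage of a candidate slim could be checked:
the example below assumes each gene's annotations have already been
propagated up to all ancestor terms, and the gene names, term sets and
function name are hypothetical.

```python
def slim_coverage(propagated, slim_terms):
    """Return (fraction of genes covered, sorted list of uncovered genes).

    `propagated` maps each gene to its annotated terms plus all ancestors.
    A gene counts as covered if any of those terms is in the slim.
    """
    covered = {gene for gene, terms in propagated.items() if terms & slim_terms}
    uncovered = sorted(set(propagated) - covered)
    return len(covered) / len(propagated), uncovered

# Made-up, already-propagated annotations
propagated = {
    "gene1": {"translation", "metabolic process"},
    "gene2": {"lipid metabolic process", "metabolic process"},
    "gene3": {"homeostatic process"},
    "gene4": {"vesicle-mediated transport", "transport"},
}

candidate_slim = {
    "translation",
    "lipid metabolic process",
    "vesicle-mediated transport",
}

fraction, missing = slim_coverage(propagated, candidate_slim)
print(f"coverage: {fraction:.0%}, uncovered: {missing}")
```

The uncovered list shows which genes would need an additional slim term
(or would be left as 'unknown') before the slim could be called a
'complete coverage' slim for that annotation set.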

i.e.
suitable for multicellular organisms and very general biological roles
suitable for single-celled eukaryotes, or evaluating basic cellular processes

Val




Judith Blake wrote:
> Val,
> I still maintain that users need to be able to generate grouping 
> criteria based on their usage.    I think we could go back to the fly 
> genome paper and see the primary molecular divisions that seemed most 
> useful to describe the genome properties.  like 'reproduction' and 
> 'metabolism'.  Anything more granular is specific to the user.  A 
> mapping on this basis would likely include fewer than 20 terms and 
> would subdivide trees.
>
> judy
>
> Valerie Wood wrote:
>> I think it is a good idea for the consortium to provide an official 'GO 
>> slim', and advise people how they may want to alter the slim to fit 
>> their individual purpose.
>>
>> A useful generic GO slim has a number of qualities (I have tried to 
>> list these below; please suggest any additional ones. I hadn't really 
>> thought before about what rules I used for making a slim, so 
>> this is the first time I have documented them). Following the 
>> 'guidelines' below I have suggested a set of processes which I think 
>> should make up the generic process slim.
>>
>> Perhaps we could use this as a starting point, and people can suggest 
>> additional terms (with reasons) or terms which should be removed. 
>> This provides good coverage of basic cellular processes but would 
>> need extending to cover multicellular processes.
>>
>> GO Slim criteria
>>
>> 1. The generic slim should be  as organism independent as possible 
>> (although clearly some terms will not be applicable to single celled 
>> eukaryotes and some eukaryotic terms will not be applicable to 
>> prokaryotes)
>>
>> 2. The slim should cover AS MANY genes with annotated processes as 
>> possible
>>
>> 3. The slim should cover AS MANY genes with annotated processes as 
>> possible with the smallest number of leaf node terms (if you include 
>> too many terms the slim becomes too large and you start to lose the 
>> advantages of a slim).
>>
>> 4. It might be useful to try to avoid terms with an excessively small 
>> or large number of annotations (i.e. ideally your terms will not have 
>> an extreme distribution in your histogram)
>>
>> 5. Preferably the slim should avoid sibling terms with a large 
>> overlap between them. If you choose two siblings with 200 genes 
>> annotated to each, and the majority of the annotations overlap, it 
>> is usually better to select the parent node (i.e. replace 2 terms by 
>> one single term). Conversely, if the child terms of a node fall into 
>> distinct non-overlapping subsets, it might be more informative to 
>> include both child terms in your slim (see also point 7 below)
>>
>> 6. For most purposes you need to include a representative term for 
>> all biologically relevant processes, by including terms which are 
>> meaningful to biologists.
>>
>> 7. If you are using your slim for data analysis (and not just for 
>> visualization) you need to include terms which will allow you to 
>> distinguish genes based on their biological properties.
>> For example, it is not good to lump all genes involved in transport 
>> under transport, because the genes annotated to distinct child terms 
>> (vesicle-mediated transport, protein targeting, transmembrane 
>> transport) are VERY different in terms of their i) viability ii) 
>> species distribution iii) number of interaction partners iv) copy 
>> number v) expression pattern, so it does not make sense to lump them 
>> together in your slim set.
>>
>> Using these criteria, this is the basic cellular process eukaryotic 
>> slim I use (or slight variations of it). The number of annotations (for 
>> pombe, obviously) is in parentheses (protein coding only).
>>
>> GO:0055085 transmembrane transport (278)
>> GO:0006913 nucleocytoplasmic transport (114)
>> GO:0006605 protein targeting (162)
>> GO:0016192 vesicle-mediated transport (266)
>> GO:0051186 cofactor metabolic process (139)
>> GO:0006766 vitamin metabolic process (57)
>> GO:0006790 sulfur metabolic process (45)
>> GO:0006807 nitrogen compound metabolic process (224)
>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (118)
>> GO:0005975 carbohydrate metabolic process (199)
>> GO:0006629 lipid metabolic process (201)
>> GO:0006399 tRNA metabolic process (125)
>> GO:0006520 amino acid metabolic process (187)
>> GO:0006412 translation (357)
>> GO:0006259 DNA metabolic process (296)
>> GO:0006508 proteolysis (223)
>> GO:0016071 mRNA metabolic process (204)
>> GO:0043413 biopolymer glycosylation (65) possibly drop?
>> GO:0006464 protein modification process (585)
>> GO:0007059 chromosome segregation (186)
>> GO:0007049 cell cycle (552)
>> GO:0007010 cytoskeletal organization and biogenesis (236)
>> GO:0000910 cytokinesis (145)
>> GO:0007165 signal transduction (362)
>> GO:0006457 protein folding (80)
>> GO:0042254 ribosome biogenesis and assembly (223)
>> GO:0045229 external encapsulating structure organization and 
>> biogenesis (124)
>> GO:xxxxxxxx general transcription (see note *1 below)
>> GO:0032569 specific transcription from RNA polymerase II promoter (102)
>> (total 424 for all transcription)
>> GO:0000902 cell morphogenesis (86)
>> GO:0006338 establishment and/or maintenance of chromatin architecture 
>> (231)
>> GO:reproductive process (182)
>> GO:0007005 mitochondrion organization and biogenesis (251)
>> GO:0006091 generation of precursor metabolites and energy (113)
>> GO:0007031 peroxisome organization and biogenesis (20)
>>
>> At this point there are ~100 pombe genes (out of the 3960 with 
>> an annotated process term) which aren't included in the slim
>>
>> I could also include....
>> vacuolar transport (91) reduces by 6 (most also annotated to protein 
>> targeting)
>> telomere maintenance (54) reduces by 6 (most also annotated to DNA met)
>> snoRNA metabolic process (10) reduces by 2
>> ...to improve coverage (very slightly)
>>
>> Finally I include
>> GO:0006950 response to stress (444)
>> this term overlaps with most other processes, so it is largely 
>> redundant, but it is useful.
>>
>> This leaves ~30 pombe genes with a process annotation unassigned to 
>> the GO slim; these are often annotated to terms like homeostasis and 
>> its children, or otherwise uninformative terms
>>
>> For some purposes I would also include
>> GO:0065007 biological regulation  (1021)
>> but I don't know if this is a good term to include in a generic slim
>>
>> To make this work for multicellular eukaryotes, we would probably 
>> want to add non-cellular process terms like:
>>
>> developmental process
>> immune system process
>>
>>
>> * Note 1: it is not currently possible to retrieve genes involved in 
>> general transcription as opposed to gene-specific transcription (i.e. 
>> RNA polymerases I, II and III etc.) with a single query. This is also 
>> important for enrichment as the genes in these 2 sets are very 
>> different in terms of species distribution, copy number and 
>> viability. I requested a grouping term for these processes a while 
>> ago and hopefully this will be implemented shortly.
>>
>> See:
>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764 
>>
>>
>>
>> Val
>>
>>
>>
>>
>>
>>
>> Ben Hitz wrote:
>>  
>>> Emily -
>>> I have interest in working on the generic go slim; I need it (or  
>>> something similar) to define graphics for an interaction network.
>>>
>>> Ben
>>>
>>>
>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>>>
>>>      
>>>> Hi,
>>>>
>>>> From replying to a user request, I've just been having a quick look at
>>>> the composition of the generic GO slim, and relating the GO terms
>>>> included to the number of annotations displayed by AmiGO.
>>>>
>>>> Should, for instance, the 'cell recognition' term still be included in
>>>> the generic GO slim? - it has only been annotated to 182 gene  
>>>> products,
>>>> whereas its sibling terms: 'cell division', 'cell cycle' and 'cell
>>>> motility', have not been included even though they (directly or
>>>> indirectly) have been annotated to more than 1,200 gene products each.
>>>> Similarly, the term 'cytoplasm organization and biogenesis' is in  
>>>> the GO
>>>> slim but only has 113 gps annotated, whereas the 'membrane  
>>>> organisation
>>>> and biogenesis' term has been annotated to 1,509 gps.
>>>>
>>>> I was just wondering what the goal of the generic GO slim is... if  
>>>> terms
>>>> are selected on the basis that as many annotated gene products from
>>>> different organisms should get mapped to descriptive GO terms before
>>>> they are caught by the BP, MF, CC root terms (while also providing a
>>>> full selection of terms across the whole GO vocabulary), should we  
>>>> think
>>>> of reviewing some of its composition in relation to overall
>>>> annotation frequency? Or should the GO slim be kept as stable as  
>>>> possible?
>>>>
>>>> Cheers,
>>>> Emily
>>>>
>>>> -- 
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>>    Emily Dimmer Ph.D.
>>>>    GOA Coordinator
>>>>    EMBL-EBI
>>>>    Wellcome Trust Genome Campus
>>>>    Hinxton
>>>>    Cambridge CB10 1SD, U.K.
>>>>    Tel:     +44 1223 494654
>>>>    Fax:    +44 1223 494468
>>>>    email:  edimmer at ebi.ac.uk
>>>>    URL:    http://www.ebi.ac.uk/goa
>>>>
>>>>
>>>> _______________________________________________
>>>> Go mailing list
>>>> Go at geneontology.org
>>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>>           
>>> -- 
>>> Ben Hitz
>>> Senior Scientific Programmer ** Saccharomyces Genome Database ** GO  
>>> Consortium
>>> Stanford University ** hitz at genome.stanford.edu
>>>
>>>
>>>


-- 
---------------------------------------------------------------------------
Valerie Wood			 Tel: 01223 496909
S. pombe Genome Project		 Fax: 01223 494919 		       
Wellcome Trust Sanger Institute	 email: val at sanger.ac.uk
Wellcome Trust Genome Campus	 http://www.genedb.org/genedb/pombe 
Hinxton, Cambridge, CB10 1HH	 http://www.sanger.ac.uk/Projects/S_pombe



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

