[Go] Composition of the generic GO slim
Judith Blake
jblake at informatics.jax.org
Mon May 5 10:41:15 PDT 2008
Val,
My point really is that experiments are done in context. A person
studying metabolism may want to break out those terms by particular
sub-divisions and lump other things. One of the roles of collaborating
GO people would be to aid in the construction of particular slims if
requested.
For example, when I have done this, the researcher provided a list of
12-16 subdivisions that made sense for their purpose, and we constructed
a GO_slim that subdivided the GO appropriately. I think of it as part
of the data analysis process. A researcher using a generic GO_slim
without understanding the vagaries of the annotations or of the ontology
subtrees will not understand the results either.
my opinion.
judy
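
A minimal sketch of the kind of mapping described above, assuming a toy
ontology and made-up gene names (PARENTS, ANNOTATIONS, SLIM and both
function names are hypothetical, not part of any GO tool). Each gene is
counted under every slim term that lies at or above one of its annotations,
and genes that reach no slim term fall into 'unknown':

```python
from collections import defaultdict

# Toy graph: child term -> set of direct parents (hypothetical subset of GO).
PARENTS = {
    "lipid metabolic process": {"metabolic process"},
    "carbohydrate metabolic process": {"metabolic process"},
    "translation": {"metabolic process"},
    "vesicle-mediated transport": {"transport"},
    "transport": {"localization"},
    "metabolic process": {"biological_process"},
    "localization": {"biological_process"},
}

def ancestors(term):
    """Return the term itself plus every term reachable through PARENTS."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen

def slim_counts(annotations, slim_terms):
    """Count genes per slim term; genes reaching no slim term go to 'unknown'."""
    buckets = defaultdict(set)
    for gene, terms in annotations.items():
        hits = set()
        for term in terms:
            hits |= ancestors(term) & slim_terms
        for slim_term in hits or {"unknown"}:
            buckets[slim_term].add(gene)
    return {term: len(genes) for term, genes in buckets.items()}

# Made-up annotations for illustration only.
ANNOTATIONS = {
    "geneA": {"lipid metabolic process"},
    "geneB": {"translation", "vesicle-mediated transport"},
    "geneC": {"metabolic process"},  # annotated only to the high-level term
    "geneD": {"carbohydrate metabolic process"},
}
# A researcher-specific slim that deliberately leaves out 'metabolic process'.
SLIM = {"lipid metabolic process", "carbohydrate metabolic process",
        "translation", "transport"}

counts = slim_counts(ANNOTATIONS, SLIM)
```

Because 'metabolic process' is not in this slim, geneC lands in 'unknown',
while geneB is counted under both 'translation' and 'transport'.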
Valerie Wood wrote:
> Judy,
>
> You are correct that no one slim is going to fit all organisms or all
> uses.
> However, it isn't simple to create an informative slim which gives complete
> (or nearly complete) coverage of all of an organism's annotations (and
> complete coverage of the annotation space is an important feature of a
> robust slim). Does the Drosophila slim set cover all of the annotated genes?
>
> The slim I suggested will give complete coverage for single-celled
> eukaryotes (it may need additional high-level terms to cover
> multicellular eukaryotes). This particular slim is useful for evaluating
> an organism's "cell biology". Perhaps a very generic slim, which only
> includes very high-level terms, would be useful for multicellular organisms,
> but it would not be so useful for single-celled organisms.
>
> One suggested criterion (6 in the previous message) was that terms be
> meaningful to biologists.
> What I meant here was that the terms should be 'biologically informative'.
> For cellular roles, or for a single-celled organism, 'metabolism' isn't so
> useful as a 'direct' slim term (metabolic processes include transcription,
> translation, DNA replication, mRNA processing etc., in addition to primary
> and secondary metabolism). For pombe, 3102 of 4194 process-annotated gene
> products are annotated to metabolism, so this term in a slim does not tell
> you very much.
>
> In addition, if metabolism is included as a 'direct' slim term, and you
> have a gene product which is annotated ONLY to "metabolic process", then
> you really know very little about its biological role. This can occur
> frequently, as it is possible to predict that a protein has catalytic
> activity, and is involved in a 'metabolic process', but not to say anything
> more specific; there are many direct InterPro mappings to these two terms.
> If I were trying to assess the 'real biological roles' of my organism's
> gene products, I would wish to exclude direct annotations to
> 'metabolic process' from the slim.
>
> A GO slim provides a mechanism to filter out annotations to high-level,
> relatively uninformative (with respect to the biological role) nodes like
> 'metabolism', 'cellular process' and 'localization' (in the slim, gene
> products will be assigned to 'unknown' if they have no annotation to one
> of your slim terms or their children).
>
> Once you exclude a term like metabolism it becomes necessary to include
> all of the child terms (or a combination of child terms) to give complete
> coverage of the parent term (NOTE: once the slimmed terms are mapped to the
> slim ontology the high-level terms will be included, but their totals will
> only reflect the total of the gene products annotated via the terms in the
> slim).
>
> The difficult part in building a slim is identifying the set of terms
> which provides complete coverage; this is the tricky step for most
> biologists, who are not so familiar with the ontologies. It would be useful
> to provide a starting slim which gives complete coverage of all annotations
> (using biologically relevant terms for common applications) which they can
> change as necessary.
> Maybe we should provide a set of 'complete coverage' slims for common
> applications.
>
> i.e.
> suitable for multicellular organisms and very general biological roles
> suitable for single-celled eukaryotes, or evaluating basic cellular
> processes
>
> Val
>
>
>
>
> Judith Blake wrote:
>> Val,
>> I still maintain that users need to be able to generate grouping
>> criteria based on their usage. I think we could go back to the fly
>> genome paper and see the primary molecular divisions that seemed most
>> useful to describe the genome properties, like 'reproduction' and
>> 'metabolism'. Anything more granular is specific to the user. A
>> mapping on this basis would likely include fewer than 20 terms and
>> would subdivide trees.
>>
>> judy
>>
>> Valerie Wood wrote:
>>> I think it is a good idea for the consortium to provide an official
>>> 'GO slim', and advise people how they may want to alter the slim to
>>> fit their individual purpose.
>>>
>>> A useful generic GO slim has a number of qualities (I have tried to
>>> list these below, please suggest any additional ones; I hadn't
>>> really thought before about what rules I used for making a
>>> slim, so this is the first time I have documented them). Following
>>> the 'guidelines' below I have suggested a set of processes which I
>>> think should make up the generic process slim.
>>>
>>> Perhaps we could use this as a starting point, and people can
>>> suggest additional terms (with reasons) or terms which should be
>>> removed. This provides good coverage of basic cellular processes but
>>> would need extending to cover multicellular processes.
>>>
>>> GO Slim criteria
>>>
>>> 1. The generic slim should be as organism-independent as possible
>>> (although clearly some terms will not be applicable to single-celled
>>> eukaryotes and some eukaryotic terms will not be applicable to
>>> prokaryotes)
>>>
>>> 2. The slim should cover AS MANY genes with annotated processes as
>>> possible
>>>
>>> 3. The slim should cover AS MANY genes with annotated processes as
>>> possible with the smallest number of leaf node terms (if you include
>>> too many terms it becomes too large and you start to lose the
>>> advantages of a slim).
>>>
>>> 4. It might be useful to try to avoid terms with an excessively
>>> small or large number of annotations (i.e. ideally your terms will
>>> not have an extreme distribution in your histogram)
>>>
>>> 5. Preferably the slim should not include sibling terms with a large
>>> overlap between them. If you choose two siblings with 200 genes
>>> annotated to each, and the majority of the annotations overlap, it
>>> is usually better to select the parent node (i.e. replace 2 terms by
>>> one single term). Conversely, if the child terms of a node fall
>>> into distinct non-overlapping subsets, it might be more informative
>>> to include both child terms in your slim (see also point 7 below)
>>>
>>> 6. For most purposes you need to include a representative term for
>>> all biologically relevant processes, by including terms which are
>>> meaningful to biologists.
>>>
>>> 7. If you are using your slim for data analysis (and not just for
>>> visualization) you need to include terms which will allow you to
>>> distinguish genes based on their biological properties.
>>> For example, it is not good to lump all genes involved in transport
>>> under transport, because the genes annotated to distinct child terms
>>> (vesicle-mediated transport, protein targeting, transmembrane
>>> transport) are VERY different in terms of their i) viability ii)
>>> species distribution iii) number of interaction partners iv) copy
>>> number v) expression pattern, so it does not make sense to lump them
>>> together in your slim set.
>>>
>>> Using these criteria, this is the basic cellular process eukaryotic
>>> slim I use (or slight variations of it). The number of annotations (for
>>> pombe obviously) is in parentheses (protein coding only).
>>>
>>> GO:0055085 transmembrane transport (278)
>>> GO:0006913 nucleocytoplasmic transport (114)
>>> GO:0006605 protein targeting (162)
>>> GO:0016192 vesicle-mediated transport (266)
>>> GO:0051186 cofactor metabolic process (139)
>>> GO:0006766 vitamin metabolic process (57)
>>> GO:0006790 sulfur metabolic process (45)
>>> GO:0006807 nitrogen compound metabolic process (224)
>>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (118)
>>> GO:0005975 carbohydrate metabolic process (199)
>>> GO:0006629 lipid metabolic process (201)
>>> GO:0006399 tRNA metabolic process (125)
>>> GO:0006520 amino acid metabolic process (187)
>>> GO:0006412 translation (357)
>>> GO:0006259 DNA metabolic process (296)
>>> GO:0006508 proteolysis (223)
>>> GO:0016071 mRNA metabolic process (204)
>>> GO:0043413 biopolymer glycosylation (65) possibly drop?
>>> GO:0006464 protein modification process (585)
>>> GO:0007059 chromosome segregation (186)
>>> GO:0007049 cell cycle (552)
>>> GO:0007010 cytoskeletal organization and biogenesis (236)
>>> GO:0000910 cytokinesis (145)
>>> GO:0007165 signal transduction (362)
>>> GO:0006457 protein folding (80)
>>> GO:0042254 ribosome biogenesis and assembly (223)
>>> GO:0045229 external encapsulating structure organization and biogenesis (124)
>>> GO:xxxxxxxx general transcription (see note *1 below)
>>> GO:0032569 specific transcription from RNA polymerase II promoter (102)
>>> (total 424 for all transcription)
>>> GO:0000902 cell morphogenesis (86)
>>> GO:0006338 establishment and/or maintenance of chromatin architecture (231)
>>> GO:reproductive process (182)
>>> GO:0007005 mitochondrion organization and biogenesis (251)
>>> GO:0006091 generation of precursor metabolites and energy (113)
>>> GO:0007031 peroxisome organization and biogenesis (20)
>>>
>>> At this point there are ~100 pombe genes (out of the 3960 with
>>> an annotated process term) which aren't included in the slim.
>>>
>>> I could also include....
>>> vacuolar transport (91) reduces by 6 (most also annotated to protein
>>> targeting)
>>> telomere maintenance (54) reduces by 6 (most also annotated to DNA met)
>>> snoRNA metabolic process (10) reduces by 2
>>> ...to improve coverage (very slightly)
>>>
>>> Finally I include
>>> GO:0006950 response to stress (444)
>>> This term overlaps with most other processes, so it is largely
>>> redundant, but it is useful.
>>>
>>> This leaves ~30 pombe genes with a process annotation unassigned to the
>>> GO slim; these are often annotated to terms like homeostasis and its
>>> children, or to otherwise uninformative terms.
>>>
>>> For some purposes I would also include
>>> GO:0065007 biological regulation (1021)
>>> but I don't know if this is a good term to include in a generic slim
>>>
>>> To make this work for multicellular eukaryotes, we would probably
>>> want to add non-cellular process terms like:
>>>
>>> developmental process
>>> immune system process
>>>
>>>
>>> * Note 1: it is not currently possible to retrieve genes involved in
>>> general transcription as opposed to gene-specific transcription (i.e.
>>> RNA polymerases I, II and III etc.) with a single query. This is
>>> also important for enrichment, as the genes in these 2 sets are very
>>> different in terms of species distribution, copy number and
>>> viability. I requested a grouping term for these processes a while
>>> ago and hopefully this will be implemented shortly.
>>>
>>> See:
>>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764
>>>
>>>
>>>
>>> Val
>>>
>>>
>>>
>>>
>>>
>>>
>>> Ben Hitz wrote:
>>>
>>>> Emily -
>>>> I have interest in working on the generic go slim; I need it (or
>>>> something similar) to define graphics for an interaction network.
>>>>
>>>> Ben
>>>>
>>>>
>>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> From replying to a user request, I've just been having a quick look
>>>>> at the composition of the generic GO slim, and relating the GO terms
>>>>> included to the number of annotations displayed by AmiGO.
>>>>>
>>>>> Should, for instance, the 'cell recognition' term still be included
>>>>> in the generic GO slim? - it has only been annotated to 182 gene
>>>>> products, whereas its sibling terms: 'cell division', 'cell cycle'
>>>>> and 'cell motility', have not been included even though they
>>>>> (directly or indirectly) have been annotated to more than 1,200 gene
>>>>> products each.
>>>>> Similarly, the term 'cytoplasm organization and biogenesis' is in the
>>>>> GO slim but only has 113 gps annotated, whereas the 'membrane
>>>>> organisation and biogenesis' term has been annotated to 1,509 gps.
>>>>>
>>>>> I was just wondering what the goal of the generic GO slim is... if
>>>>> terms are selected on the basis that as many annotated gene products
>>>>> from different organisms should get mapped to descriptive GO terms
>>>>> before they are caught by the BP, MF, CC root terms (while also
>>>>> providing a full selection of terms across the whole GO vocabulary),
>>>>> should we think of reviewing some of its composition in relation to
>>>>> overall annotation frequency? Or should the GO slim be kept as stable
>>>>> as possible?
>>>>>
>>>>> Cheers,
>>>>> Emily
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------
>>>>>
>>>>> Emily Dimmer Ph.D.
>>>>> GOA Coordinator
>>>>> EMBL-EBI
>>>>> Wellcome Trust Genome Campus
>>>>> Hinxton
>>>>> Cambridge CB10 1SD, U.K.
>>>>> Tel: +44 1223 494654
>>>>> Fax: +44 1223 494468
>>>>> email: edimmer at ebi.ac.uk
>>>>> URL: http://www.ebi.ac.uk/goa
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Go mailing list
>>>>> Go at geneontology.org
>>>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>>>
>>>> --
>>>> Ben Hitz
>>>> Senior Scientific Programmer ** Saccharomyces Genome Database **
>>>> GO Consortium
>>>> Stanford University ** hitz at genome.stanford.edu
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
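
Criteria 2, 3 and 5 in Val's list, and her "reduces by 6" figures for
candidate terms, can be checked numerically once each slim term has been
resolved to a set of genes. The following is only a rough sketch under that
assumption; the gene identifiers, the per-term gene sets and the function
names (coverage, overlap, gain) are invented for illustration and are not
pombe data or an existing API:

```python
def coverage(slim_sets, annotated_genes):
    """Return (fraction of annotated genes covered, set of uncovered genes)."""
    covered = set().union(*slim_sets.values()) & annotated_genes
    return len(covered) / len(annotated_genes), annotated_genes - covered

def overlap(genes_a, genes_b):
    """Jaccard overlap between the gene sets of two sibling terms."""
    union = genes_a | genes_b
    return len(genes_a & genes_b) / len(union) if union else 0.0

def gain(candidate_genes, slim_sets, annotated_genes):
    """Number of currently uncovered genes a candidate term would add."""
    _, uncovered = coverage(slim_sets, annotated_genes)
    return len(candidate_genes & uncovered)

# Invented example: slim term -> genes annotated to it or to its children.
slim_sets = {
    "protein targeting": {"g1", "g2", "g3"},
    "vacuolar transport": {"g2", "g3", "g4"},
    "translation": {"g5"},
}
annotated = {"g1", "g2", "g3", "g4", "g5", "g6"}

fraction, missing = coverage(slim_sets, annotated)
sibling_overlap = overlap(slim_sets["protein targeting"],
                          slim_sets["vacuolar transport"])
# A candidate term that shares g1 with the slim but also reaches g6.
extra = gain({"g1", "g6"}, slim_sets, annotated)
```

A high overlap between two siblings suggests replacing them with their
parent, and a low gain suggests a candidate term adds little coverage for
the extra slim entry.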