[Go] Composition of the generic GO slim
Eurie Hong
eurie at genome.stanford.edu
Mon May 12 15:03:10 PDT 2008
Because metagenomics is an emerging field, I do think we need to keep
in mind issues of the single-cell organism if a single generic GO slim
is to be relevant. But maybe that is another argument for having
separate multi-cellular and "cell biology" slims.
eurie
On May 9, 2008, at 5:10 AM, Jane Lomax wrote:
> I don't have any strong feelings about how the generic GO slim is
> generated, just as long as it's up-to-date and we have some
> documented,
> logical basis to how we do it.
>
> Let's not forget about this - it's important...
>
> Jane
>
> Judith Blake wrote:
>> ahhhh not another WG :)
>>
>> I think it might be accomplished by taking the 12-16 subdivisions in
>> either the human or fly genome papers that subdivide cellular roles,
>> and look for similar sets in a text book for CC and MF by chapter
>> titles. This number of subdivisions is the most useful for general
>> overview. I think the single-cell concerns may not be so important
>> at
>> this level of 'genericism'; some subdivisions might be more or less
>> devoid of annotations...or maybe it is useful to have two..but we
>> could start with one.
>>
>> Then figure out how to sum GO to those terms.
>>
>> Start with the biology, not the ontology.
>>
>> Of course, I am biased in thinking that the MGI go-slim accomplishes
>> this to some extent.
>>
>> I think a draft of this could be done in a week by a dedicated
>> curator. but who? I'll think about this.
>>
>> Judy
>>
>>
>> Jane Lomax wrote:
>>> Hi - sorry, only just got to this thread...
>>>
>>> From an advocacy point of view I think it's crucial for us to
>>> provide
>>> a generic GO slim that's up to date with the ontologies. As others
>>> have said, most naive users are not going to have the knowledge to
>>> create their own tailored slims in the beginning, so we need to
>>> provide something general for them to start from, especially as the
>>> pre-built slims are now part of the AmiGO GO slim mapper. Users can
>>> then trim or expand as they see fit for their own purposes as they
>>> become more familiar with the technology.
>>>
>>> Users blindly using the generic slim in a formal analysis without an
>>> understanding of the underlying mechanism are, quite frankly, not
>>> performing good science. This should be weeded out at the level of
>>> peer review, just the same as with any other misuse of
>>> bioinformatics
>>> apps.
>>>
>>> Perhaps the documentation for the generic GO slim might say
>>> something
>>> like:
>>>
>>> "GO provides a generic GO slim which, like the GO itself, is not
>>> species specific. This should be a suitable starting point for most
>>> investigations as it has broad coverage over most annotations. Users
>>> should tailor this GO slim according to the specific requirements of
>>> their own research".
>>>
>>> I like Val's suggestions for creating the generic GO slim - how
>>> about
>>> we set up a WG?
>>>
>>> Jane
>>>
>>> Judith Blake wrote:
>>>> agreed,
>>>> we should remove or change the text to reflect reality.
>>>> judy
>>>>
>>>> Valerie Wood wrote:
>>>>
>>>>> The GO website makes the following statement, which is a bit
>>>>> misleading if we don't intend to provide any comprehensive
>>>>> slims....(as Emily pointed out earlier in this thread, this
>>>>> isn't a
>>>>> comprehensive slim....)
>>>>>
>>>>> "GO provides a generic GO slim which, like the GO itself, is not
>>>>> species specific, and which should be suitable for most purposes."
>>>>>
>>>>> So maybe this slim should not be described as such?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Judith Blake <jblake at informatics.jax.org> wrote:
>>>>>> Val,
>>>>>> My point really is that experiments are done in context. A
>>>>>> person
>>>>>> studying metabolism may want to break out those terms by
>>>>>> particular sub-divisions and lump other things. One of the roles
>>>>>> of collaborating GO people would be to aid in the construction of
>>>>>> particular slims if requested.
>>>>>>
>>>>>> For example, when I have done this, the researcher provided a
>>>>>> list
>>>>>> of 12-16 subdivisions that made sense for their purpose, and we
>>>>>> constructed a GO_slim that subdivided the GO appropriately. I
>>>>>> think of it as part of the data analysis process. A researcher
>>>>>> using a generic GO_slim without understanding the vagaries of the
>>>>>> annotations or of the ontology subtrees will not understand
>>>>>> the results either.
>>>>>>
>>>>>> my opinion.
>>>>>> judy
>>>>>>
>>>>>> Valerie Wood wrote:
>>>>>>
>>>>>>> Judy,
>>>>>>>
>>>>>>> You are correct that no one slim is going to fit all organisms
>>>>>>> or all uses.
>>>>>>> However it isn't simple to create an informative slim which
>>>>>>> gives complete (or nearly complete) coverage of all of an
>>>>>>> organism's annotations (and complete coverage of the annotation
>>>>>>> space is an important feature of a robust slim). Does the
>>>>>>> Drosophila slim set cover all of the annotated genes?
>>>>>>>
>>>>>>> The slim I suggested will give complete coverage for
>>>>>>> single-celled eukaryotes (it may need additional high level terms
>>>>>>> to cover multicellular eukaryotes). This particular slim is useful
>>>>>>> for evaluating an organism's "cell biology". Perhaps a very generic
>>>>>>> slim, which only includes very high level terms, would be useful
>>>>>>> for multicellular organisms, but it would not be so useful for
>>>>>>> single-celled organisms.
>>>>>>>
>>>>>>> One suggested criterion (6 in my previous message) was that terms
>>>>>>> be meaningful to biologists. What I meant here was that the terms
>>>>>>> should be 'biologically informative'. For cellular roles, or for a
>>>>>>> single-celled organism, 'metabolism' isn't so useful as a 'direct'
>>>>>>> slim term (metabolic processes include transcription, translation,
>>>>>>> DNA replication, mRNA processing etc., in addition to primary and
>>>>>>> secondary metabolism). For pombe, 3102 of 4194 process-annotated
>>>>>>> gene products are annotated to metabolism, so this term in a slim
>>>>>>> does not tell you very much.
>>>>>>>
>>>>>>> In addition, if metabolism is included as a 'direct' slim term,
>>>>>>> and you have a gene product which is annotated ONLY to "metabolic
>>>>>>> process", then you really know very little about its biological
>>>>>>> role. This can occur because it is frequently possible to predict
>>>>>>> that a protein has catalytic activity, and is involved in a
>>>>>>> 'metabolic process', but not to say anything more specific; there
>>>>>>> are many direct InterPro mappings to these two terms. If I were
>>>>>>> trying to assess the 'real biological roles' of my organism's gene
>>>>>>> products, I would wish to exclude direct annotations to 'metabolic
>>>>>>> process' from the slim.
>>>>>>>
>>>>>>> A GO slim provides a mechanism to filter out annotations to high
>>>>>>> level, relatively uninformative (with respect to the biological
>>>>>>> role) nodes like 'metabolism', 'cellular process' and
>>>>>>> 'localization' (in the slim, gene products will be assigned to
>>>>>>> 'unknown' if they have no annotation to one of your slim terms or
>>>>>>> their children).
>>>>>>>
>>>>>>> Once you exclude a term like metabolism it becomes necessary
>>>>>>> to include all of the child terms (or a combination of child
>>>>>>> terms) to give complete coverage of the parent term (NOTE: once
>>>>>>> the slimmed terms are mapped to the slim ontology the high level
>>>>>>> terms will be included, but their totals will only reflect the
>>>>>>> total of the gene products annotated via the terms in the slim).
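The mapping step described above can be sketched roughly as follows. This is a minimal illustration, not the AmiGO slim mapper or any consortium tool: the function names, the toy ontology fragment and the gene names are all made up for the example, and term names stand in for GO IDs.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Return `term` plus every term reachable by following parent links."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def map_to_slim(annotations, parents, slim_terms):
    """Assign each gene to every slim term that is an ancestor (or self)
    of one of its direct annotations. Genes that reach no slim term are
    placed in an 'unknown' bucket."""
    buckets = defaultdict(set)
    for gene, terms in annotations.items():
        hits = set()
        for term in terms:
            hits |= ancestors(term, parents) & slim_terms
        for slim in hits or {"unknown"}:
            buckets[slim].add(gene)
    return dict(buckets)


# Toy, hypothetical ontology fragment and annotations (not real GO data).
parents = {
    "translation": {"metabolic process"},
    "lipid metabolic process": {"metabolic process"},
    "metabolic process": {"biological_process"},
}
annotations = {
    "geneA": {"translation"},
    "geneB": {"lipid metabolic process", "translation"},
    "geneC": {"metabolic process"},  # annotated only to the high-level term
}
# 'metabolic process' is deliberately left out of the slim.
slim = {"translation", "lipid metabolic process"}

result = map_to_slim(annotations, parents, slim)
for term in sorted(result):
    print(term, sorted(result[term]))
```

In this toy case geneC, which is annotated only to the excluded high-level term, ends up in 'unknown', which is the filtering effect described in the message.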
>>>>>>>
>>>>>>> The difficult part in building a slim is identifying the set of
>>>>>>> terms which provides complete coverage; this is the tricky step
>>>>>>> for most biologists, who are not so familiar with the ontologies.
>>>>>>> It would be useful to provide a starting slim which gives complete
>>>>>>> coverage of all annotations (using biologically relevant terms for
>>>>>>> common applications) which they can change as necessary.
>>>>>>> Maybe we should provide a set of 'complete coverage' slims for
>>>>>>> common applications.
>>>>>>>
>>>>>>> i.e.
>>>>>>> suitable for multicellular organisms and very general biological
>>>>>>> roles
>>>>>>> suitable for single-celled eukaryotes, or evaluating basic
>>>>>>> cellular processes
>>>>>>>
>>>>>>> Val
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Judith Blake wrote:
>>>>>>>
>>>>>>>> Val,
>>>>>>>> I still maintain that users need to be able to generate grouping
>>>>>>>> criteria based on their usage. I think we could go back to
>>>>>>>> the fly genome paper and see the primary molecular divisions
>>>>>>>> that seemed most useful to describe the genome properties,
>>>>>>>> like 'reproduction' and 'metabolism'. Anything more granular is
>>>>>>>> specific to the user. A mapping on this basis would likely
>>>>>>>> include fewer than 20 terms and would subdivide trees.
>>>>>>>>
>>>>>>>> judy
>>>>>>>>
>>>>>>>> Valerie Wood wrote:
>>>>>>>>
>>>>>>>>> I think it is a good idea for the consortium to provide an
>>>>>>>>> official 'GO slim', and advise people how they may want to
>>>>>>>>> alter the slim to fit their individual purpose.
>>>>>>>>>
>>>>>>>>> A useful generic GO slim has a number of qualities. I have
>>>>>>>>> tried to list these below; please suggest any additional ones.
>>>>>>>>> I hadn't really thought before about what rules I used for
>>>>>>>>> making a slim, so this is the first time I have documented
>>>>>>>>> them. Following the 'guidelines' below I have suggested a set
>>>>>>>>> of processes which I think should make up the generic process
>>>>>>>>> slim.
>>>>>>>>>
>>>>>>>>> Perhaps we could use this as a starting point, and people can
>>>>>>>>> suggest additional terms (with reasons) or terms which should
>>>>>>>>> be removed. This provides good coverage of basic cellular
>>>>>>>>> processes but would need extending to cover multicellular
>>>>>>>>> processes.
>>>>>>>>>
>>>>>>>>> GO Slim criteria
>>>>>>>>>
>>>>>>>>> 1. The generic slim should be as organism independent as
>>>>>>>>> possible (although clearly some terms will not be applicable
>>>>>>>>> to
>>>>>>>>> single celled eukaryotes and some eukaryotic terms will not be
>>>>>>>>> applicable to prokaryotes)
>>>>>>>>>
>>>>>>>>> 2. The slim should cover AS MANY genes with annotated
>>>>>>>>> processes
>>>>>>>>> as possible
>>>>>>>>>
>>>>>>>>> 3. The slim should cover AS MANY genes with annotated
>>>>>>>>> processes as possible with the smallest number of leaf node
>>>>>>>>> terms (if you include too many terms the slim becomes too large
>>>>>>>>> and you start to lose the advantages of a slim).
>>>>>>>>>
>>>>>>>>> 4. It might be useful to try to avoid terms with an
>>>>>>>>> excessively small or large number of annotations (i.e.
>>>>>>>>> ideally your terms will not produce an extreme distribution in
>>>>>>>>> your histogram)
>>>>>>>>>
>>>>>>>>> 5. Preferably the slim should not include sibling terms with
>>>>>>>>> large overlaps between them. If you choose two siblings with
>>>>>>>>> 200 genes annotated to each, and the majority of the
>>>>>>>>> annotations overlap, it is usually better to select the parent
>>>>>>>>> node (i.e. replace 2 terms by one single term). Conversely, if
>>>>>>>>> the child terms of a node fall into distinct non-overlapping
>>>>>>>>> subsets, it might be more informative to include both child
>>>>>>>>> terms in your slim (see also point 7 below)
>>>>>>>>>
>>>>>>>>> 6. For most purposes you need to include a representative term
>>>>>>>>> for all biologically relevant processes, by including terms
>>>>>>>>> which are meaningful to biologists.
>>>>>>>>>
>>>>>>>>> 7. If you are using your slim for data analysis (and not just
>>>>>>>>> for visualization) you need to include terms which will allow
>>>>>>>>> you to distinguish genes based on their biological properties.
>>>>>>>>> For example, it is not good to lump all genes involved in
>>>>>>>>> transport under transport because the genes annotated to
>>>>>>>>> distinct child terms (vesicle-mediated transport, protein
>>>>>>>>> targeting, transmembrane transport) are VERY different in terms
>>>>>>>>> of their i) viability ii) species distribution iii) number of
>>>>>>>>> interaction partners iv) copy number v) expression pattern, so
>>>>>>>>> it does not make sense to lump them together in your slim set.
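The sibling-overlap check in criterion 5 could be sketched like this. It is only an illustration under stated assumptions: the function names are hypothetical, the overlap is measured against the smaller gene set, and the 0.5 cut-off is just one reading of "the majority of the annotations overlap"; the thread does not specify either choice.

```python
def overlap_fraction(genes_a, genes_b):
    """Fraction of the smaller gene set that is shared with the other set."""
    smaller = min(len(genes_a), len(genes_b))
    if smaller == 0:
        return 0.0
    return len(genes_a & genes_b) / smaller


def suggest(genes_a, genes_b, threshold=0.5):
    """Suggest 'parent' when two sibling terms mostly overlap, else 'both'.

    The threshold is an assumed cut-off, not a value from the thread.
    """
    return "parent" if overlap_fraction(genes_a, genes_b) > threshold else "both"


# Made-up gene sets: three sibling terms with 200 genes each.
sib1 = {f"g{i}" for i in range(0, 200)}
sib2 = {f"g{i}" for i in range(50, 250)}    # shares 150 genes with sib1
sib3 = {f"g{i}" for i in range(190, 390)}   # shares 10 genes with sib1

print(suggest(sib1, sib2))  # heavily overlapping siblings
print(suggest(sib1, sib3))  # nearly disjoint siblings
```

Note that criterion 7 can override this: siblings with different biological properties may be worth keeping separate even if a simple count like this one says otherwise.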
>>>>>>>>>
>>>>>>>>> Using these criteria this is the basic cellular process
>>>>>>>>> eukaryotic slim I use (or slight variations of): The number of
>>>>>>>>> annotations (for pombe obviously) is in parentheses (protein
>>>>>>>>> coding only).
>>>>>>>>>
>>>>>>>>> GO:0055085 transmembrane transport (278)
>>>>>>>>> GO:0006913 nucleocytoplasmic transport (114)
>>>>>>>>> GO:0006605 protein targeting (162)
>>>>>>>>> GO:0016192 vesicle-mediated transport (266)
>>>>>>>>> GO:0051186 cofactor metabolic process (139)
>>>>>>>>> GO:0006766 vitamin metabolic process (57)
>>>>>>>>> GO:0006790 sulfur metabolic process (45)
>>>>>>>>> GO:0006807 nitrogen compound metabolic process (224)
>>>>>>>>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic
>>>>>>>>> process (118)
>>>>>>>>> GO:0005975 carbohydrate metabolic process (199)
>>>>>>>>> GO:0006629 lipid metabolic process (201)
>>>>>>>>> GO:0006399 tRNA metabolic process (125)
>>>>>>>>> GO:0006520 amino acid metabolic process (187)
>>>>>>>>> GO:0006412 translation (357)
>>>>>>>>> GO:0006259 DNA metabolic process (296)
>>>>>>>>> GO:0006508 proteolysis (223)
>>>>>>>>> GO:0016071 mRNA metabolic process (204)
>>>>>>>>> GO:0043413 biopolymer glycosylation (65) possibly drop?
>>>>>>>>> GO:0006464 protein modification process (585)
>>>>>>>>> GO:0007059 chromosome segregation (186)
>>>>>>>>> GO:0007049 cell cycle (552)
>>>>>>>>> GO:0007010 cytoskeletal organization and biogenesis (236)
>>>>>>>>> GO:0000910 cytokinesis (145)
>>>>>>>>> GO:0007165 signal transduction (362)
>>>>>>>>> GO:0006457 protein folding (80)
>>>>>>>>> GO:0042254 ribosome biogenesis and assembly (223)
>>>>>>>>> GO:0045229 external encapsulating structure organization and
>>>>>>>>> biogenesis (124)
>>>>>>>>> GO:xxxxxxxx general transcription (see note *1 below)
>>>>>>>>> GO:0032569 specific transcription from RNA polymerase II
>>>>>>>>> promoter (102)
>>>>>>>>> (total 424 for all transcription)
>>>>>>>>> GO:0000902 cell morphogenesis (86)
>>>>>>>>> GO:0006338 establishment and/or maintenance of chromatin
>>>>>>>>> architecture (231)
>>>>>>>>> GO:reproductive process (182)
>>>>>>>>> GO:0007005 mitochondrion organization and biogenesis (251)
>>>>>>>>> GO:0006091 generation of precursor metabolites and energy
>>>>>>>>> (113)
>>>>>>>>> GO:0007031 peroxisome organization and biogenesis (20)
>>>>>>>>>
>>>>>>>>> At this point there are about 100 pombe genes (out of the 3960
>>>>>>>>> with an annotated process term) which aren't included in the
>>>>>>>>> slim
>>>>>>>>>
>>>>>>>>> I could also include....
>>>>>>>>> vacuolar transport (91) reduces by 6 (most also annotated to
>>>>>>>>> protein targeting)
>>>>>>>>> telomere maintenance (54) reduces by 6 (most also annotated to
>>>>>>>>> DNA met)
>>>>>>>>> snoRNA metabolic process (10) reduces by 2
>>>>>>>>> ...to improve coverage (very slightly)
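The "reduces by N" figures above are the number of otherwise uncovered genes that a candidate term would pick up. A rough sketch of that calculation, using made-up gene sets and hypothetical function names (none of this comes from an existing tool):

```python
def uncovered(annotated_genes, slim_gene_sets):
    """Genes with a process annotation that no slim term currently covers."""
    covered = set().union(*slim_gene_sets.values())
    return annotated_genes - covered


def marginal_gain(annotated_genes, slim_gene_sets, candidate_genes):
    """How many currently uncovered genes the candidate term would cover."""
    return len(uncovered(annotated_genes, slim_gene_sets) & candidate_genes)


# Toy data: ten annotated genes and a two-term slim (illustrative only).
annotated = {f"g{i}" for i in range(1, 11)}
slim_sets = {
    "protein targeting": {"g1", "g2", "g3"},
    "DNA metabolic process": {"g4", "g5"},
}
# Candidate term whose genes are mostly already covered via protein targeting.
vacuolar_transport = {"g2", "g3", "g6"}

print(len(uncovered(annotated, slim_sets)))
print(marginal_gain(annotated, slim_sets, vacuolar_transport))
```

A candidate with many annotations can still have a small gain if most of its genes are already covered by another slim term, which is the situation described for vacuolar transport and telomere maintenance.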
>>>>>>>>>
>>>>>>>>> Finally I include
>>>>>>>>> GO:0006950 response to stress (444)
>>>>>>>>> this term overlaps with most other processes so is largely
>>>>>>>>> redundant, but is useful.
>>>>>>>>>
>>>>>>>>> This leaves ~30 pombe genes with a process annotation unassigned
>>>>>>>>> to the GO slim; these are often annotated to terms like
>>>>>>>>> homeostasis and its children, or otherwise uninformative terms
>>>>>>>>>
>>>>>>>>> For some purposes I would also include
>>>>>>>>> GO:0065007 biological regulation (1021)
>>>>>>>>> but I don't know if this is a good term to include in a
>>>>>>>>> generic
>>>>>>>>> slim
>>>>>>>>>
>>>>>>>>> To make this work for multicellular eukaryotes, we would
>>>>>>>>> probably want to add non-cellular process terms like:
>>>>>>>>>
>>>>>>>>> developmental process
>>>>>>>>> immune system process
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> * Note1 it is not currently possible to retrieve genes involved
>>>>>>>>> in general transcription as opposed to gene specific
>>>>>>>>> transcription (i.e. RNA polymerases I, II and III etc.), with a
>>>>>>>>> single query. This is also important for enrichment as the
>>>>>>>>> genes in these 2 sets are very different in terms of species
>>>>>>>>> distribution, copy number and viability. I requested a grouping
>>>>>>>>> term for these processes a while ago and hopefully this will be
>>>>>>>>> implemented shortly.
>>>>>>>>>
>>>>>>>>> See:
>>>>>>>>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Val
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ben Hitz wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Emily -
>>>>>>>>>> I have interest in working on the generic go slim; I need it
>>>>>>>>>> (or something similar) to define graphics for an interaction
>>>>>>>>>> network.
>>>>>>>>>>
>>>>>>>>>> Ben
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> From replying to a user request, I've just been having a
>>>>>>>>>>> quick look at
>>>>>>>>>>> the composition of the generic GO slim, and relating the GO
>>>>>>>>>>> terms
>>>>>>>>>>> included to the number of annotations displayed by AmiGO.
>>>>>>>>>>>
>>>>>>>>>>> Should, for instance, the 'cell recognition' term still be
>>>>>>>>>>> included in
>>>>>>>>>>> the generic GO slim? - it has only been annotated to 182
>>>>>>>>>>> gene products,
>>>>>>>>>>> whereas its sibling terms: 'cell division', 'cell cycle' and
>>>>>>>>>>> 'cell
>>>>>>>>>>> motility', have not been included even though they
>>>>>>>>>>> (directly or
>>>>>>>>>>> indirectly) have been annotated to more than 1,200 gene
>>>>>>>>>>> products each.
>>>>>>>>>>> Similarly, the term 'cytoplasm organization and biogenesis'
>>>>>>>>>>> is in the GO
>>>>>>>>>>> slim but only has 113 gps annotated, whereas the 'membrane
>>>>>>>>>>> organisation
>>>>>>>>>>> and biogenesis' term has been annotated to 1,509 gps.
>>>>>>>>>>>
>>>>>>>>>>> I was just wondering what the goal of the generic GO slim
>>>>>>>>>>> is... if terms
>>>>>>>>>>> are selected on the basis that as many annotated gene
>>>>>>>>>>> products from
>>>>>>>>>>> different organisms should get mapped to descriptive GO
>>>>>>>>>>> terms
>>>>>>>>>>> before
>>>>>>>>>>> they are caught by the BP, MF, CC root terms (while also
>>>>>>>>>>> providing a
>>>>>>>>>>> full selection of terms across the whole GO vocabulary),
>>>>>>>>>>> should we think
>>>>>>>>>>> of reviewing some of its composition in relation to
>>>>>>>>>>> overall
>>>>>>>>>>> annotation frequency? Or should the GO slim be kept as
>>>>>>>>>>> stable
>>>>>>>>>>> as possible?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Emily
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Emily Dimmer Ph.D.
>>>>>>>>>>> GOA Coordinator
>>>>>>>>>>> EMBL-EBI
>>>>>>>>>>> Wellcome Trust Genome Campus
>>>>>>>>>>> Hinxton
>>>>>>>>>>> Cambridge CB10 1SD, U.K.
>>>>>>>>>>> Tel: +44 1223 494654
>>>>>>>>>>> Fax: +44 1223 494468
>>>>>>>>>>> email: edimmer at ebi.ac.uk
>>>>>>>>>>> URL: http://www.ebi.ac.uk/goa
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Go mailing list
>>>>>>>>>>> Go at geneontology.org
>>>>>>>>>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ben Hitz
>>>>>>>>>> Senior Scientific Programmer ** Saccharomyces Genome Database
>>>>>>>>>> ** GO Consortium
>>>>>>>>>> Stanford University ** hitz at genome.stanford.edu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>