[Go] Composition of the generic GO slim
Judith Blake
jblake at informatics.jax.org
Fri May 9 04:41:43 PDT 2008
ahhhh not another WG :)
I think it might be accomplished by taking the 12-16 subdivisions in
either the human or fly genome papers that subdivide cellular roles, and
look for similar sets in a text book for CC and MF by chapter titles.
This number of subdivisions is the most useful for general overview. I
think the single-cell concerns may not be so important at this level of
'genericism'; some subdivisions might be more or less devoid of
annotations...or maybe it is useful to have two...but we could start with
one.
Then figure out how to sum GO to those terms.
Start with the biology, not the ontology.
of course, with my bias, I think the MGI go-slim accomplishes this to some extent.
I think a draft of this could be done in a week by a dedicated curator.
but who? I'll think about this.
Judy
Jane Lomax wrote:
> Hi - sorry, only just got to this thread...
>
> From an advocacy point of view I think it's crucial for us to provide
> a generic GO slim that's up to date with the ontologies. As others
> have said, most naive users are not going to have the knowledge to
> create their own tailored slims in the beginning, so we need to
> provide something general for them to start from, especially as the
> pre-built slims are now part of the AmiGO GO slim mapper. Users can
> then trim or expand as they see fit for their own purposes as they
> become more familiar with the technology.
>
> Users blindly using the generic slim in a formal analysis without an
> understanding of the underlying mechanism are, quite frankly, not
> performing good science. This should be weeded out at the level of
> peer review, just the same as with any other misuse of bioinformatics
> apps.
>
> Perhaps the documentation for the generic GO slim might say something
> like:
>
> "GO provides a generic GO slim which, like the GO itself, is not
> species specific. This should be a suitable starting point for most
> investigations as it has broad coverage over most annotations. Users
> should tailor this GO slim according to the specific requirements of
> their own research".
>
> I like Val's suggestions for creating the generic GO slim - how about
> we set up a WG?
>
> Jane
>
> Judith Blake wrote:
>> agreed,
>> we should remove or change the text to reflect reality.
>> judy
>>
>> Valerie Wood wrote:
>>
>>> The GO website makes the following statement, which is a bit
>>> misleading if we don't intend to provide any comprehensive
>>> slims....(as Emily pointed out earlier in this thread, this isn't a
>>> comprehensive slim....)
>>>
>>> "GO provides a generic GO slim which, like the GO itself, is not
>>> species specific, and which should be suitable for most purposes."
>>>
>>> So maybe this slim should not be described as such?
>>>
>>>
>>>
>>>
>>> Judith Blake <jblake at informatics.jax.org> wrote:
>>>> Val,
>>>> My point really is that experiments are done in context. A person
>>>> studying metabolism may want to break out those terms by particular
>>>> sub-divisions and lump other things. One of the roles of
>>>> collaborating GO people would be to aid in the construction of
>>>> particular slims if requested.
>>>>
>>>> For example, when I have done this, the researcher provided a list
>>>> of 12-16 subdivisions that made sense for their purpose, and we
>>>> constructed a GO_slim that subdivided the GO appropriately. I
>>>> think of it as part of the data analysis process. A researcher
>>>> using a generic GO_slim without understanding the vagaries of the
>>>> annotations or of the ontology subtrees will not understand the
>>>> results either.
>>>>
>>>> my opinion.
>>>> judy
>>>>
>>>> Valerie Wood wrote:
>>>>
>>>>> Judy,
>>>>>
>>>>> You are correct that no one slim is going to fit all organisms or
>>>>> all uses. However, it isn't simple to create an informative slim
>>>>> which gives complete (or nearly complete) coverage of all of an
>>>>> organism's annotations (and complete coverage of the annotation
>>>>> space is an important feature of a robust slim). Does the
>>>>> Drosophila slim set cover all of the annotated genes?
>>>>>
>>>>> The slim I suggested will give complete coverage for single-celled
>>>>> eukaryotes (it may need additional high level terms to cover
>>>>> multicellular eukaryotes). This particular slim is useful for
>>>>> evaluating an organism's "cell biology". Perhaps a very generic
>>>>> slim, which only includes very high level terms, would be useful
>>>>> for multicellular organisms, but it would not be so useful for
>>>>> single-celled organisms.
>>>>>
>>>>> One suggested criterion (6 in my previous message) was that terms
>>>>> be meaningful to biologists. What I meant here was that the terms
>>>>> should be 'biologically informative'. For cellular roles, or for a
>>>>> single-celled organism, 'metabolism' isn't so useful as a 'direct'
>>>>> slim term (metabolic processes include transcription, translation,
>>>>> DNA replication, mRNA processing etc., in addition to primary and
>>>>> secondary metabolism). For pombe, 3102 of 4194 process-annotated
>>>>> gene products are annotated to metabolism, so this term in a slim
>>>>> does not tell you very much.
>>>>>
>>>>> In addition, if metabolism is included as a 'direct' slim term, and
>>>>> you have a gene product which is annotated ONLY to "metabolic
>>>>> process", then you really know very little about its biological
>>>>> role. This can occur frequently, as it is possible to predict that
>>>>> a protein has catalytic activity and is involved in a 'metabolic
>>>>> process' but not to say anything more specific; there are many
>>>>> direct InterPro mappings to these two terms. If I were trying to
>>>>> assess the 'real biological roles' of my organism's gene products,
>>>>> I would wish to exclude direct annotations to 'metabolic process'
>>>>> from the slim.
>>>>>
>>>>> A GO slim provides a mechanism to filter out annotations to high
>>>>> level, relatively uninformative (with respect to the biological
>>>>> role) nodes like 'metabolism', 'cellular process' and
>>>>> 'localization' (in the slim, gene products will be assigned to
>>>>> 'unknown' if there is no annotation to one of your slim terms or
>>>>> their children).
>>>>>
>>>>> Once you exclude a term like metabolism it becomes necessary to
>>>>> include all of the child terms (or a combination of child terms) to
>>>>> give complete coverage of the parent term (NOTE: once the slimmed
>>>>> terms are mapped to the slim ontology the high level terms will be
>>>>> included, but their totals will only reflect the total of the gene
>>>>> products annotated via the terms in the slim).
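
The mapping described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the AmiGO slim mapper or any GO Consortium tool: the toy ontology, the gene names and the function names (`ancestors_and_self`, `map_to_slim`) are all hypothetical. It shows a gene product annotated only to a high-level term outside the slim falling into 'unknown'.

```python
from collections import defaultdict

# Hypothetical toy ontology: child term -> list of parent terms.
PARENTS = {
    "metabolic process": ["biological_process"],
    "lipid metabolic process": ["metabolic process"],
    "translation": ["metabolic process"],
    "fatty acid metabolic process": ["lipid metabolic process"],
}

# Hypothetical slim: 'metabolic process' is deliberately left out.
SLIM = {"lipid metabolic process", "translation"}

# Hypothetical direct annotations: gene product -> set of terms.
ANNOTATIONS = {
    "geneA": {"fatty acid metabolic process"},
    "geneB": {"translation"},
    "geneC": {"metabolic process"},  # annotated only to a high-level term
}


def ancestors_and_self(term, parents):
    """Return the term plus every term reachable by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def map_to_slim(annotations, parents, slim):
    """Assign each gene product to the slim terms at or above its annotations."""
    slim_genes = defaultdict(set)
    for gene, terms in annotations.items():
        hits = set()
        for term in terms:
            hits |= ancestors_and_self(term, parents) & slim
        # No slim term reached: the gene product lands in 'unknown'.
        for slim_term in hits or {"unknown"}:
            slim_genes[slim_term].add(gene)
    return dict(slim_genes)


result = map_to_slim(ANNOTATIONS, PARENTS, SLIM)
```

Here `geneC` ends up under 'unknown' because its only annotation is to a term that sits above the slim terms rather than below them.
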
>>>>>
>>>>> The difficult part in building a slim is identifying the set of
>>>>> terms which provides complete coverage; this is the tricky step for
>>>>> most biologists, who are not so familiar with the ontologies. It
>>>>> would be useful to provide a starting slim which gives complete
>>>>> coverage of all annotations (using biologically relevant terms for
>>>>> common applications) which they can change as necessary. Maybe we
>>>>> should provide a set of 'complete coverage' slims for common
>>>>> applications.
>>>>>
>>>>> i.e.
>>>>> suitable for multicellular organisms and very general biological
>>>>> roles
>>>>> suitable for single-celled eukaryotes, or evaluating basic
>>>>> cellular processes
>>>>>
>>>>> Val
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Judith Blake wrote:
>>>>>
>>>>>> Val,
>>>>>> I still maintain that users need to be able to generate grouping
>>>>>> criteria based on their usage. I think we could go back to the
>>>>>> fly genome paper and see the primary molecular divisions that
>>>>>> seemed most useful to describe the genome properties, like
>>>>>> 'reproduction' and 'metabolism'. Anything more granular is
>>>>>> specific to the user. A mapping on this basis would likely
>>>>>> include fewer than 20 terms and would subdivide trees.
>>>>>>
>>>>>> judy
>>>>>>
>>>>>> Valerie Wood wrote:
>>>>>>
>>>>>>> I think it is a good idea for the consortium to provide an
>>>>>>> official 'GO slim', and advise people how they may want to alter
>>>>>>> the slim to fit their individual purpose.
>>>>>>>
>>>>>>> A useful generic GO slim has a number of qualities (I have tried
>>>>>>> to list these below, please suggest any additional ones, I
>>>>>>> hadn't really thought before about what the rules were I used
>>>>>>> for making a slim so this is the first time I have documented
>>>>>>> them). Following the 'guidelines' below I have suggested a set
>>>>>>> of processes which I think should make up the generic process slim.
>>>>>>>
>>>>>>> Perhaps we could use this as a starting point, and people can
>>>>>>> suggest additional terms (with reasons) or terms which should be
>>>>>>> removed. This provides good coverage of basic cellular processes
>>>>>>> but would need extending to cover multicellular processes.
>>>>>>>
>>>>>>> GO Slim criteria
>>>>>>>
>>>>>>> 1. The generic slim should be as organism independent as
>>>>>>> possible (although clearly some terms will not be applicable to
>>>>>>> single celled eukaryotes and some eukaryotic terms will not be
>>>>>>> applicable to prokaryotes)
>>>>>>>
>>>>>>> 2. The slim should cover AS MANY genes with annotated processes
>>>>>>> as possible
>>>>>>>
>>>>>>> 3. The slim should cover AS MANY genes with annotated processes
>>>>>>> as possible with the smallest number of leaf node terms (if you
>>>>>>> include too many terms it becomes too large and you start to lose
>>>>>>> the advantages of a slim).
>>>>>>>
>>>>>>> 4. It might be useful to try to avoid terms with an excessively
>>>>>>> small or large number of annotations (i.e. ideally your terms
>>>>>>> will not have an extreme distribution in your histogram)
>>>>>>>
>>>>>>> 5. Preferably the slim should not include sibling terms with a
>>>>>>> large overlap between them. If you choose two siblings with 200
>>>>>>> genes annotated to each, and the majority of the annotations
>>>>>>> overlap, it is usually better to select the parent node (i.e.
>>>>>>> replace 2 terms by one single term). Conversely, if the child
>>>>>>> terms of a node fall into distinct non-overlapping subsets, it
>>>>>>> might be more informative to include both child terms in your
>>>>>>> slim (see also point 7 below)
>>>>>>>
>>>>>>> 6. For most purposes you need to include a representative term
>>>>>>> for all biologically relevant processes, by including terms
>>>>>>> which are meaningful to biologists.
>>>>>>>
>>>>>>> 7. If you are using your slim for data analysis (and not just
>>>>>>> for visualization) you need to include terms which will allow
>>>>>>> you to distinguish genes based on their biological properties.
>>>>>>> For example, it is not good to lump all genes involved in
>>>>>>> transport under transport because the genes annotated to
>>>>>>> distinct child terms (vesicle-mediated transport, protein
>>>>>>> targeting, transmembrane transport) are VERY different in terms
>>>>>>> of their i) viability ii) species distribution iii) number of
>>>>>>> interaction partners iv) copy number v) expression pattern, so
>>>>>>> it does not make sense to lump them together in your slim set.
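
Two of the criteria above (coverage in points 2 and 3, sibling overlap in point 5) lend themselves to a small calculation. The sketch below is an illustration only: the function names are hypothetical, the gene identifiers are invented, and the Jaccard index is just one reasonable way to express "overlap"; the thread does not prescribe a particular measure.

```python
def coverage(slim_genes, annotated_genes):
    """Fraction of annotated genes that fall under at least one slim term."""
    if not annotated_genes:
        return 0.0
    covered = set().union(*slim_genes.values()) & annotated_genes
    return len(covered) / len(annotated_genes)


def overlap(genes_a, genes_b):
    """Jaccard overlap: shared genes divided by genes in either set."""
    union = genes_a | genes_b
    return len(genes_a & genes_b) / len(union) if union else 0.0


# Invented example data: slim term -> genes mapped to it.
slim_genes = {
    "term_a": {"g1", "g2"},
    "term_b": {"g2", "g3"},
}
annotated = {"g1", "g2", "g3", "g4"}

slim_coverage = coverage(slim_genes, annotated)

# Two siblings sharing most of their genes (a candidate for merging into the parent).
sibling_overlap = overlap({"g1", "g2", "g3", "g4"}, {"g2", "g3", "g4", "g5"})

# Two siblings with distinct gene sets (a candidate for keeping both).
distinct_overlap = overlap({"g1", "g2"}, {"g3", "g4"})
```

A high overlap between siblings suggests using the parent term instead, while a value near zero supports keeping both children, as in the transport example in point 7.
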
>>>>>>>
>>>>>>> Using these criteria, this is the basic cellular process
>>>>>>> eukaryotic slim I use (or slight variations of it). The number of
>>>>>>> annotations (for pombe obviously) is in parentheses (protein
>>>>>>> coding only).
>>>>>>>
>>>>>>> GO:0055085 transmembrane transport (278)
>>>>>>> GO:0006913 nucleocytoplasmic transport (114)
>>>>>>> GO:0006605 protein targeting (162)
>>>>>>> GO:0016192 vesicle-mediated transport (266)
>>>>>>> GO:0051186 cofactor metabolic process (139)
>>>>>>> GO:0006766 vitamin metabolic process (57)
>>>>>>> GO:0006790 sulfur metabolic process (45)
>>>>>>> GO:0006807 nitrogen compound metabolic process (224)
>>>>>>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic
>>>>>>> process (118)
>>>>>>> GO:0005975 carbohydrate metabolic process (199)
>>>>>>> GO:0006629 lipid metabolic process (201)
>>>>>>> GO:0006399 tRNA metabolic process (125)
>>>>>>> GO:0006520 amino acid metabolic process (187)
>>>>>>> GO:0006412 translation (357)
>>>>>>> GO:0006259 DNA metabolic process (296)
>>>>>>> GO:0006508 proteolysis (223)
>>>>>>> GO:0016071 mRNA metabolic process (204)
>>>>>>> GO:0043413 biopolymer glycosylation (65) possibly drop?
>>>>>>> GO:0006464 protein modification process (585)
>>>>>>> GO:0007059 chromosome segregation (186)
>>>>>>> GO:0007049 cell cycle (552)
>>>>>>> GO:0007010 cytoskeletal organization and biogenesis (236)
>>>>>>> GO:0000910 cytokinesis (145)
>>>>>>> GO:0007165 signal transduction (362)
>>>>>>> GO:0006457 protein folding (80)
>>>>>>> GO:0042254 ribosome biogenesis and assembly (223)
>>>>>>> GO:0045229 external encapsulating structure organization and
>>>>>>> biogenesis (124)
>>>>>>> GO:xxxxxxxx general transcription (see note *1 below)
>>>>>>> GO:0032569 specific transcription from RNA polymerase II
>>>>>>> promoter (102)
>>>>>>> (total 424 for all transcription)
>>>>>>> GO:0000902 cell morphogenesis (86)
>>>>>>> GO:0006338 establishment and/or maintenance of chromatin
>>>>>>> architecture (231)
>>>>>>> GO:reproductive process (182)
>>>>>>> GO:0007005 mitochondrion organization and biogenesis (251)
>>>>>>> GO:0006091 generation of precursor metabolites and energy (113)
>>>>>>> GO:0007031 peroxisome organization and biogenesis (20)
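
As a rough check against criterion 4 (avoiding terms with very small or very large annotation counts), the list above can be parsed and screened. The sketch below uses six lines copied from the list; the thresholds of 50 and 500 and the function names are arbitrary assumptions chosen for illustration, not values proposed in the thread.

```python
import re

# Matches lines of the form "GO:0000000 term name (count)".
LINE = re.compile(r"^(GO:\d{7})\s+(.+?)\s+\((\d+)\)$")

SLIM_LINES = [
    "GO:0055085 transmembrane transport (278)",
    "GO:0006790 sulfur metabolic process (45)",
    "GO:0006412 translation (357)",
    "GO:0006464 protein modification process (585)",
    "GO:0007049 cell cycle (552)",
    "GO:0007031 peroxisome organization and biogenesis (20)",
]


def parse_counts(lines):
    """Return {term name: annotation count} for lines matching the pattern."""
    counts = {}
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            _go_id, name, count = match.groups()
            counts[name] = int(count)
    return counts


def flag_extremes(counts, low, high):
    """List term names whose counts fall below `low` or above `high`."""
    too_small = sorted(name for name, c in counts.items() if c < low)
    too_large = sorted(name for name, c in counts.items() if c > high)
    return too_small, too_large


counts = parse_counts(SLIM_LINES)
# Assumed thresholds, for illustration only.
too_small, too_large = flag_extremes(counts, low=50, high=500)
```

Flagged terms are only candidates for review; a small term may still be worth keeping if it is biologically distinct.
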
>>>>>>>
>>>>>>> At this point there are about 100 pombe genes (out of the 3960
>>>>>>> with an annotated process term) which aren't included in the slim
>>>>>>>
>>>>>>> I could also include....
>>>>>>> vacuolar transport (91) reduces by 6 (most also annotated to
>>>>>>> protein targeting)
>>>>>>> telomere maintenance (54) reduces by 6 (most also annotated to
>>>>>>> DNA met)
>>>>>>> snoRNA metabolic process (10) reduces by 2
>>>>>>> ...to improve coverage (very slightly)
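
The "reduces by" figures above are marginal gains: the number of so-far-uncovered genes that a candidate term would add. A minimal sketch follows, with hypothetical function names and synthetic gene identifiers built only to mirror the counts quoted above (91 genes with 6 new ones, and 10 genes with 2 new ones); it does not use real pombe data.

```python
def marginal_gain(candidate_genes, covered_genes):
    """Genes the candidate term would add that no current slim term covers."""
    return candidate_genes - covered_genes


def rank_candidates(candidates, covered_genes):
    """Sort candidate terms by how many uncovered genes each would add."""
    gains = {
        name: len(marginal_gain(genes, covered_genes))
        for name, genes in candidates.items()
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Synthetic identifiers: 100 genes already covered by the slim.
covered = {f"g{i}" for i in range(100)}

candidates = {
    # 85 already-covered genes plus 6 new ones = 91 genes in total.
    "vacuolar transport": {f"g{i}" for i in range(15, 100)}
    | {f"vac{i}" for i in range(6)},
    # 8 already-covered genes plus 2 new ones = 10 genes in total.
    "snoRNA metabolic process": {f"g{i}" for i in range(8)} | {"sno0", "sno1"},
}

ranking = rank_candidates(candidates, covered)
```

Ranking candidates this way makes it easy to see that a large term can add very little coverage when most of its genes are already reached through another slim term.
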
>>>>>>>
>>>>>>> Finally I include
>>>>>>> GO:0006950 response to stress (444)
>>>>>>> this term overlaps with most other processes so is largely
>>>>>>> redundant, but it is useful.
>>>>>>>
>>>>>>> This leaves ~30 pombe genes with a process annotation unassigned
>>>>>>> to the GO slim; these are often annotated to terms like
>>>>>>> homeostasis and its children, or otherwise uninformative terms
>>>>>>>
>>>>>>> For some purposes I would also include
>>>>>>> GO:0065007 biological regulation (1021)
>>>>>>> but I don't know if this is a good term to include in a generic
>>>>>>> slim
>>>>>>>
>>>>>>> To make this work for multicellular eukaryotes, we would
>>>>>>> probably want to add non-cellular process terms like:
>>>>>>>
>>>>>>> developmental process
>>>>>>> immune system process
>>>>>>>
>>>>>>>
>>>>>>> * Note1 it is not currently possible to retrieve genes involved
>>>>>>> in general transcription as opposed to gene specific
>>>>>>> transcription (i.e. RNA polymerases I, II and III etc.), with a
>>>>>>> single query. This is also important for enrichment as the genes
>>>>>>> in these 2 sets are very different in terms of species
>>>>>>> distribution, copy number and viability. I requested a grouping
>>>>>>> term for these processes a while ago and hopefully this will be
>>>>>>> implemented shortly.
>>>>>>>
>>>>>>> See:
>>>>>>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Val
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Ben Hitz wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Emily -
>>>>>>>> I have interest in working on the generic go slim; I need it
>>>>>>>> (or something similar) to define graphics for an interaction
>>>>>>>> network.
>>>>>>>>
>>>>>>>> Ben
>>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> From replying to a user request, I've just been having a quick
>>>>>>>>> look at the composition of the generic GO slim, and relating
>>>>>>>>> the GO terms included to the number of annotations displayed by
>>>>>>>>> AmiGO.
>>>>>>>>>
>>>>>>>>> Should, for instance, the 'cell recognition' term still be
>>>>>>>>> included in the generic GO slim? - it has only been annotated
>>>>>>>>> to 182 gene products, whereas its sibling terms: 'cell
>>>>>>>>> division', 'cell cycle' and 'cell motility', have not been
>>>>>>>>> included even though they (directly or indirectly) have been
>>>>>>>>> annotated to more than 1,200 gene products each. Similarly, the
>>>>>>>>> term 'cytoplasm organization and biogenesis' is in the GO slim
>>>>>>>>> but only has 113 gps annotated, whereas the 'membrane
>>>>>>>>> organisation and biogenesis' term has been annotated to 1,509
>>>>>>>>> gps.
>>>>>>>>>
>>>>>>>>> I was just wondering what the goal of the generic GO slim is...
>>>>>>>>> if terms are selected on the basis that as many annotated gene
>>>>>>>>> products from different organisms should get mapped to
>>>>>>>>> descriptive GO terms before they are caught by the BP, MF, CC
>>>>>>>>> root terms (while also providing a full selection of terms
>>>>>>>>> across the whole GO vocabulary), should we think of reviewing
>>>>>>>>> some of its composition in relation to overall annotation
>>>>>>>>> frequency? Or should the GO slim be kept as stable as possible?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Emily
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Emily Dimmer Ph.D.
>>>>>>>>> GOA Coordinator
>>>>>>>>> EMBL-EBI
>>>>>>>>> Wellcome Trust Genome Campus
>>>>>>>>> Hinxton
>>>>>>>>> Cambridge CB10 1SD, U.K.
>>>>>>>>> Tel: +44 1223 494654
>>>>>>>>> Fax: +44 1223 494468
>>>>>>>>> email: edimmer at ebi.ac.uk
>>>>>>>>> URL: http://www.ebi.ac.uk/goa
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Go mailing list
>>>>>>>>> Go at geneontology.org
>>>>>>>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Ben Hitz
>>>>>>>> Senior Scientific Programmer ** Saccharomyces Genome Database
>>>>>>>> ** GO Consortium
>>>>>>>> Stanford University ** hitz at genome.stanford.edu
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>