From jdeegan at ebi.ac.uk Thu May 1 02:55:27 2008
From: jdeegan at ebi.ac.uk (Jennifer Deegan (nee Clark))
Date: Thu, 01 May 2008 10:55:27 +0100
Subject: [Go] advocacy and outreach
Message-ID: <4819938F.3030606@ebi.ac.uk>

Hi,

Jane and I are about to compile the outreach and user advocacy reports for April.

Could you write and tell us if, during April, you taught users about GO, represented their interests within GO, or helped bring on new annotation groups?

Thanks,

Jen

--
Jennifer Deegan (nee Clark)
EMBL-European Bioinformatics Institute
Gene Ontology Consortium

From midori at ebi.ac.uk Thu May 1 08:36:43 2008
From: midori at ebi.ac.uk (Midori Harris)
Date: Thu, 1 May 2008 16:36:43 +0100 (BST)
Subject: [Go] Ontology development - April highlights
Message-ID:

Dear GO,

The most recent monthly report on ontology content, for April 2008, is now available at:

http://gocwiki.geneontology.org/index.php/Apr2008_report

Ontology development highlights from April:

* We've finished renaming 'sensu' terms (main wiki page: http://gocwiki.geneontology.org/index.php/Sensu_Main_Page).

* Reorganization of electron transport terms has been implemented (http://wiki.geneontology.org/index.php/Electron_transport).

* At the GOC meeting, we reported on two pilot projects, one on electron transport and the other on glycolysis and the TCA cycle, to document links between function terms and process terms. We'll continue to collect these links for electron transport and metabolic pathways. Function-process main page: http://wiki.geneontology.org/index.php/Function-Process_Links.

* An overhaul of signal transduction process terms continues (http://wiki.geneontology.org/index.php/Signaling).

In May (and beyond), the function-process work will continue, as will the signaling overhaul and quality control work connected with the regulation terms. We'll also explore regulation links within the function ontology, and between the function and process ontologies.

We also have several new sets of possible cross-products that Chris has generated, and which we'll evaluate for implementation (see http://wiki.geneontology.org/index.php/Cross_Product_Guide and files in the go/scratch/xps/ directory).

A lot of work is planned for metabolism terms, to fill in missing "rungs" along paths, and to make the ontology structure more consistent with ChEBI (also looking ahead to GO-ChEBI cross-products).

As usual, details of small- and medium-scale changes are available in the SourceForge Curator Requests tracker. Please contact us if you want to help out with ontology work in a particular area, or if you have any comments or questions about what's going on.

Ontology Development wiki:
http://wiki.geneontology.org/index.php/Ontology_Development

SourceForge Curator Requests tracker:
https://sourceforge.net/tracker/?group_id=36855&atid=440764

Midori & David
on behalf of GO's ontology developers
_______________________________________________
Go mailing list
Go at geneontology.org
http://fafner.stanford.edu/mailman/listinfo/go

From midori at ebi.ac.uk Thu May 1 09:00:19 2008
From: midori at ebi.ac.uk (midori at ebi.ac.uk)
Date: Thu, 1 May 2008 16:00:19 UT
Subject: [Go] SourceForge Update
Message-ID: <200805011600.m41G0Jm1197116@mozart.ebi.ac.uk>

An HTML attachment was scrubbed...
URL: http://fafner.stanford.edu/pipermail/go/attachments/20080501/ed2df709/attachment.html
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: http://fafner.stanford.edu/pipermail/go/attachments/20080501/ed2df709/attachment.pl

From jblake at informatics.jax.org Thu May 1 09:17:11 2008
From: jblake at informatics.jax.org (Judith Blake)
Date: Thu, 01 May 2008 12:17:11 -0400
Subject: [Go] Ontology development - April highlights
In-Reply-To:
References:
Message-ID: <4819ED07.3060006@informatics.jax.org>

Thanks Midori. Lots of work...
Judy

From val at sanger.ac.uk Fri May 2 08:46:18 2008
From: val at sanger.ac.uk (Valerie Wood)
Date: Fri, 02 May 2008 16:46:18 +0100
Subject: [Go] Composition of the generic GO slim
In-Reply-To:
References: <4818A65A.8000301@ebi.ac.uk>
Message-ID: <481B374A.2060309@sanger.ac.uk>

I think it is a good idea for the consortium to provide an official 'GO slim', and to advise people how they may want to alter the slim to fit their individual purpose.

A useful generic GO slim has a number of qualities. I have tried to list these below; please suggest any additional ones. I hadn't really thought before about which rules I used for making a slim, so this is the first time I have documented them. Following the 'guidelines' below, I have suggested a set of process terms which I think should make up the generic process slim.

Perhaps we could use this as a starting point, and people can suggest additional terms (with reasons) or terms which should be removed. This provides good coverage of basic cellular processes but would need extending to cover multicellular processes.

GO Slim criteria

1. The generic slim should be as organism independent as possible (although clearly some terms will not be applicable to single-celled eukaryotes and some eukaryotic terms will not be applicable to prokaryotes).

2. The slim should cover AS MANY genes with annotated processes as possible.

3. The slim should cover AS MANY genes with annotated processes as possible with the smallest number of leaf node terms (if you include too many terms, the slim becomes too large and you start to lose the advantages of a slim).

4. It might be useful to try to avoid terms with an excessively small or large number of annotations (i.e. ideally your terms will not give an extreme distribution in your histogram).

5. Preferably the slim should not include sibling terms with a large overlap between them. If you choose two siblings with 200 genes annotated to each, and the majority of the annotations overlap, it is usually better to select the parent node (i.e. replace 2 terms by one single term). Conversely, if the child terms of a node fall into distinct non-overlapping subsets, it might be more informative to include both child terms in your slim (see also point 7 below).

6. For most purposes you need to include a representative term for all biologically relevant processes, by including terms which are meaningful to biologists.

7. If you are using your slim for data analysis (and not just for visualization) you need to include terms which will allow you to distinguish genes based on their biological properties. For example, it is not good to lump all genes involved in transport under transport, because the genes annotated to distinct child terms (vesicle-mediated transport, protein targeting, transmembrane transport) are VERY different in terms of their i) viability ii) species distribution iii) number of interaction partners iv) copy number v) expression pattern, so it does not make sense to lump them together in your slim set.

Using these criteria, this is the basic cellular process eukaryotic slim I use (or slight variations of it). The number of annotations (for pombe, obviously) is in parentheses (protein coding only).

GO:0055085 transmembrane transport (278)
GO:0006913 nucleocytoplasmic transport (114)
GO:0006605 protein targeting (162)
GO:0016192 vesicle-mediated transport (266)
GO:0051186 cofactor metabolic process (139)
GO:0006766 vitamin metabolic process (57)
GO:0006790 sulfur metabolic process (45)
GO:0006807 nitrogen compound metabolic process (224)
GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (118)
GO:0005975 carbohydrate metabolic process (199)
GO:0006629 lipid metabolic process (201)
GO:0006399 tRNA metabolic process (125)
GO:0006520 amino acid metabolic process (187)
GO:0006412 translation (357)
GO:0006259 DNA metabolic process (296)
GO:0006508 proteolysis (223)
GO:0016071 mRNA metabolic process (204)
GO:0043413 biopolymer glycosylation (65) possibly drop?
GO:0006464 protein modification process (585)
GO:0007059 chromosome segregation (186)
GO:0007049 cell cycle (552)
GO:0007010 cytoskeletal organization and biogenesis (236)
GO:0000910 cytokinesis (145)
GO:0007165 signal transduction (362)
GO:0006457 protein folding (80)
GO:0042254 ribosome biogenesis and assembly (223)
GO:0045229 external encapsulating structure organization and biogenesis (124)
GO:xxxxxxxx general transcription (see note *1 below)
GO:0032569 specific transcription from RNA polymerase II promoter (102)
  (total 424 for all transcription)
GO:0000902 cell morphogenesis (86)
GO:0006338 establishment and/or maintenance of chromatin architecture (231)
GO:reproductive process (182)
GO:0007005 mitochondrion organization and biogenesis (251)
GO:0006091 generation of precursor metabolites and energy (113)
GO:0007031 peroxisome organization and biogenesis (20)

At this point there are ~100 pombe genes (out of the 3960 with an annotated process term) which aren't included in the slim.

I could also include:

vacuolar transport (91) reduces by 6 (most also annotated to protein targeting)
telomere maintenance (54) reduces by 6 (most also annotated to DNA metabolic process)
snoRNA metabolic process (10) reduces by 2

...to improve coverage (very slightly).

Finally I include

GO:0006950 response to stress (444)

This term overlaps with most other processes, so it is largely redundant but still useful.

This leaves ~30 pombe genes with a process annotation unassigned to the GO slim; these are often annotated to terms like homeostasis and its children, or to otherwise uninformative terms.

For some purposes I would also include

GO:0065007 biological regulation (1021)

but I don't know if this is a good term to include in a generic slim.

To make this work for multicellular eukaryotes, we would probably want to add non-cellular process terms like:

developmental process
immune system process

* Note 1: it is not currently possible to retrieve genes involved in general transcription, as opposed to gene-specific transcription (i.e. RNA polymerases I, II and III etc.), with a single query. This is also important for enrichment, as the genes in these 2 sets are very different in terms of species distribution, copy number and viability. I requested a grouping term for these processes a while ago and hopefully this will be implemented shortly.

See:
https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764

Val

Ben Hitz wrote:
> Emily -
> I have interest in working on the generic go slim; I need it (or
> something similar) to define graphics for an interaction network.
>
> Ben
>
> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>
>> Hi,
>>
>> From replying to a user request, I've just been having a quick look at
>> the composition of the generic GO slim, and relating the GO terms
>> included to the number of annotations displayed by AmiGO.
>>
>> Should, for instance, the 'cell recognition' term still be included in
>> the generic GO slim? It has only been annotated to 182 gene products,
>> whereas its sibling terms 'cell division', 'cell cycle' and 'cell
>> motility' have not been included even though they (directly or
>> indirectly) have been annotated to more than 1,200 gene products each.
>> Similarly, the term 'cytoplasm organization and biogenesis' is in the GO
>> slim but only has 113 gps annotated, whereas the 'membrane organisation
>> and biogenesis' term has been annotated to 1,509 gps.
>>
>> I was just wondering what the goal of the generic GO slim is... if terms
>> are selected on the basis that as many annotated gene products from
>> different organisms should get mapped to descriptive GO terms before
>> they are caught by the BP, MF, CC root terms (while also providing a
>> full selection of terms across the whole GO vocabulary), should we think
>> of reviewing some of its composition in relation to overall
>> annotation frequency? Or should the GO slim be kept as stable as
>> possible?
>>
>> Cheers,
>> Emily
>>
>> --
>> ------------------------------------------------------------------
>> Emily Dimmer Ph.D.
>> GOA Coordinator
>> EMBL-EBI
>> Wellcome Trust Genome Campus
>> Hinxton
>> Cambridge CB10 1SD, U.K.
>> Tel: +44 1223 494654
>> Fax: +44 1223 494468
>> email: edimmer at ebi.ac.uk
>> URL: http://www.ebi.ac.uk/goa
>
> --
> Ben Hitz
> Senior Scientific Programmer ** Saccharomyces Genome Database ** GO Consortium
> Stanford University ** hitz at genome.stanford.edu

--
---------------------------------------------------------------------------
Valerie Wood                      Tel: 01223 496909
S. pombe Genome Project           Fax: 01223 494919
Wellcome Trust Sanger Institute   email: val at sanger.ac.uk
Wellcome Trust Genome Campus      http://www.genedb.org/genedb/pombe
Hinxton, Cambridge, CB10 1HH      http://www.sanger.ac.uk/Projects/S_pombe

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From midori at ebi.ac.uk Fri May 2 09:00:21 2008
From: midori at ebi.ac.uk (midori at ebi.ac.uk)
Date: Fri, 2 May 2008 16:00:21 UT
Subject: [Go] SourceForge Update
Message-ID: <200805021600.m42G0LD1463629@mozart.ebi.ac.uk>

An HTML attachment was scrubbed...
URL: http://fafner.stanford.edu/pipermail/go/attachments/20080502/64287254/attachment.html
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: http://fafner.stanford.edu/pipermail/go/attachments/20080502/64287254/attachment.pl

From jblake at informatics.jax.org Fri May 2 11:15:49 2008
From: jblake at informatics.jax.org (Judith Blake)
Date: Fri, 02 May 2008 14:15:49 -0400
Subject: [Go] Composition of the generic GO slim
In-Reply-To: <481B374A.2060309@sanger.ac.uk>
References: <4818A65A.8000301@ebi.ac.uk> <481B374A.2060309@sanger.ac.uk>
Message-ID: <481B5A55.4020001@informatics.jax.org>

Val,

I still maintain that users need to be able to generate grouping criteria based on their usage. I think we could go back to the fly genome paper and see the primary molecular divisions that seemed most useful to describe the genome properties, like 'reproduction' and 'metabolism'. Anything more granular is specific to the user. A mapping on this basis would likely include fewer than 20 terms and would subdivide trees.

judy

From val at sanger.ac.uk Mon May 5 09:51:02 2008
From: val at sanger.ac.uk (Valerie Wood)
Date: Mon, 05 May 2008 17:51:02 +0100
Subject: [Go] Composition of the generic GO slim
In-Reply-To: <481B5A55.4020001@informatics.jax.org>
References: <4818A65A.8000301@ebi.ac.uk> <481B374A.2060309@sanger.ac.uk> <481B5A55.4020001@informatics.jax.org>
Message-ID: <481F3AF6.4040700@sanger.ac.uk>

Judy,

You are correct that no one slim is going to fit all organisms or all uses. However, it isn't simple to create an informative slim which gives complete (or nearly complete) coverage of all of an organism's annotations (and complete coverage of the annotation space is an important feature of a robust slim). Does the Drosophila slim set cover all of the annotated genes?

The slim I suggested will give complete coverage for single-celled eukaryotes (it may need additional high-level terms to cover multicellular eukaryotes). This particular slim is useful for evaluating an organism's "cell biology". Perhaps a very generic slim, which only includes very high-level terms, would be useful for multicellular organisms, but it would not be so useful for single-celled organisms.

One suggested criterion (6 in my previous message) was that terms be meaningful to biologists. What I meant here was that the terms should be 'biologically informative'.
For cellular roles, or for a single-celled organism, 'metabolism' isn't so useful as a 'direct' slim term (metabolic processes include transcription, translation, DNA replication, mRNA processing etc., in addition to primary and secondary metabolism). For pombe, 3102 of 4194 process-annotated gene products are annotated to metabolism, so this term in a slim does not tell you very much.

In addition, if metabolism is included as a 'direct' slim term, and you have a gene product which is annotated ONLY to "metabolic process", then you really know very little about its biological role. This can occur frequently, as it is often possible to predict that a protein has catalytic activity and is involved in a 'metabolic process' without being able to say anything more specific; there are many direct InterPro mappings to these two terms. If I were trying to assess the 'real biological roles' of my organism's gene products, I would wish to exclude direct annotations to 'metabolic process' from the slim.

A GO slim provides a mechanism to filter out annotations to high-level nodes that are relatively uninformative with respect to the biological role, like 'metabolism', 'cellular process' and 'localization' (in the slim, gene products will be assigned to 'unknown' if they have no annotation to one of your slim terms or their children).

Once you exclude a term like metabolism, it becomes necessary to include all of the child terms (or a combination of child terms) to give complete coverage of the parent term. (NOTE: once the slimmed terms are mapped to the slim ontology, the high-level terms will be included, but their totals will only reflect the gene products annotated via the terms in the slim.) The difficult part in building a slim is identifying the set of terms which provides complete coverage; this is the tricky step for most biologists, who are not so familiar with the ontologies.
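
To make the mapping described above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the ontology is a tiny hand-written child-to-parents dictionary, the gene names and annotations are invented, and the helper functions `ancestors` and `map_to_slim` are hypothetical (they are not part of any GO tool). A gene counts towards a slim term if it is annotated to that term or to one of its descendants; genes annotated only to excluded high-level terms such as 'metabolic process' end up unmapped, which is what lowers coverage.

```python
from collections import defaultdict

# Hypothetical toy is_a graph: child term -> parent terms.
PARENTS = {
    "metabolic process": ["biological_process"],
    "transport": ["biological_process"],
    "lipid metabolic process": ["metabolic process"],
    "DNA metabolic process": ["metabolic process"],
    "telomere maintenance": ["DNA metabolic process"],
    "transmembrane transport": ["transport"],
    "vesicle-mediated transport": ["transport"],
}


def ancestors(term):
    """Return the term itself plus every term reachable through PARENTS."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def map_to_slim(annotations, slim):
    """Assign each gene to the slim terms its annotations equal or descend from.

    Returns (slim term -> set of genes, set of genes that hit no slim term).
    """
    slim_genes = defaultdict(set)
    unmapped = set()
    for gene, terms in annotations.items():
        hits = set()
        for term in terms:
            hits |= ancestors(term) & slim
        if hits:
            for slim_term in hits:
                slim_genes[slim_term].add(gene)
        else:
            unmapped.add(gene)
    return dict(slim_genes), unmapped


# Invented annotations, for illustration only.
ANNOTATIONS = {
    "geneA": {"telomere maintenance"},
    "geneB": {"lipid metabolic process", "transmembrane transport"},
    "geneC": {"vesicle-mediated transport"},
    "geneD": {"metabolic process"},  # direct annotation to a high-level term only
    "geneE": {"transport"},          # likewise, nothing more specific is known
}

# A slim that deliberately excludes 'metabolic process' and 'transport'.
SLIM = {
    "lipid metabolic process",
    "DNA metabolic process",
    "transmembrane transport",
    "vesicle-mediated transport",
}

slim_genes, unmapped = map_to_slim(ANNOTATIONS, SLIM)
coverage = (len(ANNOTATIONS) - len(unmapped)) / len(ANNOTATIONS)

for term in sorted(slim_genes):
    print(term, sorted(slim_genes[term]))
print("unmapped:", sorted(unmapped))
print("coverage:", coverage)
```

In this toy case the two genes annotated only to 'metabolic process' or 'transport' are left unmapped, so adding or removing slim terms and re-running the mapping is one way to check how close a candidate slim comes to complete coverage.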
It would be useful to provide a starting slim which gives complete coverage of all annotations (using biologically relevant terms for common applications) which they can change as necessary. Maybe we should provide a set of 'complete coverage' slims for common applications. i.e. suitable for multicellular organisms and very general biological roles suitable for single-celled eukaryotes, or evaluating basic cellular processes Val Judith Blake wrote: > Val, > I still maintain that users need to be able to generate grouping > criteria based on their usage. I think we could go back to the fly > genome paper and see the primary molecular divisions that seemed most > useful to describe the genome properties. like 'reproduction' and > 'metabolism'. Anything more granular is specific to the user. A > mapping on this basis would likely include fewer than 20 terms and > would subdivide trees. > > judy > > Valerie Wood wrote: >> I think it is good idea for the consortium to provide an official 'GO >> slim', and advise people how they may want to alter the slim to fit >> their individual purpose. >> >> A useful generic GO slim has a number of qualities (I have tried to >> list these below, please suggest any additional ones, I hadn't really >> thought before about what the rules were I used for making a slim so >> this is the first time I have documented them). Following the >> 'guidelines' below I have suggested a set of process which I think >> should make up the generic process slim. >> >> Perhaps we could use this as a starting point, and people can suggest >> additional terms (with reasons) or terms which should be removed. >> This provides good coverage of basic cellular processes but would >> need extending to cover multicellular processes. >> >> GO Slim criteria >> >> 1. 
The generic slim should be as organism independent as possible >> (although clearly some terms will not be applicable to single celled >> eukaryotes and some eukaryotic terms will not be applicable to >> prokaryotes) >> >> 2. The slim should cover AS MANY genes with annotated processes as >> possible >> >> 3. The slim should cover AS MANY genes with annotated processes with >> the smallest number of leaf node terms (if you include too many terms >> and it becomes too large and you start to loose the advantages of a >> slim). >> >> 4. It might be useful to try to avoid terms with an excessively small >> or large number of small number of annotations (i.e ideally your >> terms will not have an extreme distributions for your histogram) >> >> 5. Preferably the slim should include sibling terms with a large >> overlaps between them. If you choose two siblings with 200 genes >> annotated to each, and the majority of the annotations overlap, it >> is usually better to select the parent node (i.e replace 2 terms by >> one single term). Conversely, if the child terms of a node fall into >> distinct non-overlapping subsets, it might be more informative to >> include both child terms in your slim (see also point 7 below) >> >> 6. For most purposes you need to include a representative term for >> all biologically relevant processes, by including terms which are >> meaningful to biologists. >> >> 7. If you are using your slim for data analysis (and not just for >> vizualization) you need to include terms which will allow you to >> distinguish genes bases on their biological properties. 
>> For example, it is not good to lump all genes involved in transport >> under 'transport', because the genes annotated to distinct child terms >> (vesicle-mediated transport, protein targeting, transmembrane >> transport) are VERY different in terms of their i) viability ii) >> species distribution iii) number of interaction partners iv) copy >> number v) expression pattern, so it does not make sense to lump them >> together in your slim set. >> >> Using these criteria, this is the basic cellular process eukaryotic >> slim I use (or slight variations of it). The number of annotations (for >> pombe obviously) is in parentheses (protein coding only). >> >> GO:0055085 transmembrane transport (278) >> GO:0006913 nucleocytoplasmic transport (114) >> GO:0006605 protein targeting (162) >> GO:0016192 vesicle-mediated transport (266) >> GO:0051186 cofactor metabolic process (139) >> GO:0006766 vitamin metabolic process (57) >> GO:0006790 sulfur metabolic process (45) >> GO:0006807 nitrogen compound metabolic process (224) >> GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (118) >> GO:0005975 carbohydrate metabolic process (199) >> GO:0006629 lipid metabolic process (201) >> GO:0006399 tRNA metabolic process (125) >> GO:0006520 amino acid metabolic process (187) >> GO:0006412 translation (357) >> GO:0006259 DNA metabolic process (296) >> GO:0006508 proteolysis (223) >> GO:0005975 carbohydrate metabolic process (199) >> GO:0016071 mRNA metabolic process (204) >> GO:0043413 biopolymer glycosylation (65) possibly drop?
>> GO:0006464 protein modification process (585) >> GO:0007059 chromosome segregation (186) >> GO:0007049 cell cycle (552) >> GO:0007010 cytoskeletal organization and biogenesis (236) >> GO:0000910 cytokinesis (145) >> GO:0007165 signal transduction (362) >> GO:0006457 protein folding (80) >> GO:0042254 ribosome biogenesis and assembly (223) >> GO:0045229 external encapsulating structure organization and >> biogenesis (124) >> GO:xxxxxxxx general transcription (see note *1 below) >> GO:0032569 specific transcription from RNA polymerase II promoter (102) >> (total 424 for all transcription) >> GO:0000902 cell morphogenesis (86) >> GO:0006338 establishment and/or maintenance of chromatin architecture >> (231) >> GO:reproductive process (182) >> GO:0007005 mitochondrion organization and biogenesis (251) >> GO:0006091 generation of precursor metabolites and energy (113) >> GO:0007031 peroxisome organization and biogenesis (20) >> >> At this point there are about 100 pombe genes (out of the 3960 with >> an annotated process term) which aren't included in the slim. >> >> I could also include.... >> vacuolar transport (91) reduces by 6 (most also annotated to protein >> targeting) >> telomere maintenance (54) reduces by 6 (most also annotated to DNA met) >> snoRNA metabolic process (10) reduces by 2 >> ...to improve coverage (very slightly) >> >> Finally I include >> GO:0006950 response to stress (444) >> this term overlaps with most other processes, so it is largely >> redundant but is useful.
>> >> This leaves ~30 pombe genes with a process annotation unassigned to the GO >> slim; these are often annotated to terms like homeostasis and its children, or >> otherwise uninformative terms. >> >> For some purposes I would also include >> GO:0065007 biological regulation (1021) >> but I don't know if this is a good term to include in a generic slim. >> >> To make this work for multicellular eukaryotes, we would probably >> want to add non-cellular process terms like: >> >> developmental process >> immune system process >> >> >> * Note 1: it is not currently possible to retrieve genes involved in >> general transcription as opposed to gene-specific transcription (i.e. >> RNA polymerases I, II and III etc.) with a single query. This is also >> important for enrichment as the genes in these 2 sets are very >> different in terms of species distribution, copy number and >> viability. I requested a grouping term for these processes a while >> ago and hopefully this will be implemented shortly. >> >> See: >> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764 >> >> >> >> Val >> >> >> >> >> >> >> Ben Hitz wrote: >> >>> Emily - >>> I have interest in working on the generic go slim; I need it (or >>> something similar) to define graphics for an interaction network. >>> >>> Ben >>> >>> >>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote: >>> >>> >>>> Hi, >>>> >>>> From replying to a user request, I've just been having a quick look at >>>> the composition of the generic GO slim, and relating the GO terms >>>> included to the number of annotations displayed by AmiGO. >>>> >>>> Should, for instance, the 'cell recognition' term still be included in >>>> the generic GO slim? - it has only been annotated to 182 gene >>>> products, >>>> whereas its sibling terms: 'cell division', 'cell cycle' and 'cell >>>> motility', have not been included even though they (directly or >>>> indirectly) have been annotated to more than 1,200 gene products each.
>>>> Similarly, the term 'cytoplasm organization and biogenesis' is in >>>> the GO >>>> slim but only has 113 gps annotated, whereas the 'membrane >>>> organisation >>>> and biogenesis' term has been annotated to 1,509 gps. >>>> >>>> I was just wondering what the goal of the generic GO slim is... if >>>> terms >>>> are selected on the basis that as many annotated gene products from >>>> different organisms should get mapped to descriptive GO terms before >>>> they are caught by the BP, MF, CC root terms (while also providing a >>>> full selection of terms across the whole GO vocabulary), should we >>>> think >>>> of reviewing some of its composition in relation to overall >>>> annotation frequency? Or should the GO slim be kept as stable as >>>> possible? >>>> >>>> Cheers, >>>> Emily >>>> >>>> -- >>>> >>>> >>>> >>>> ------------------------------------------------------------------ >>>> >>>> Emily Dimmer Ph.D. >>>> GOA Coordinator >>>> EMBL-EBI >>>> Wellcome Trust Genome Campus >>>> Hinxton >>>> Cambridge CB10 1SD, U.K. >>>> Tel: +44 1223 494654 >>>> Fax: +44 1223 494468 >>>> email: edimmer at ebi.ac.uk >>>> URL: http://www.ebi.ac.uk/goa >>>> >>>> >>>> _______________________________________________ >>>> Go mailing list >>>> Go at geneontology.org >>>> http://fafner.stanford.edu/mailman/listinfo/go >>>> >>> -- >>> Ben Hitz >>> Senior Scientific Programmer ** Saccharomyces Genome Database ** GO >>> Consortium >>> Stanford University ** hitz at genome.stanford.edu >>> >>> >>> >>> _______________________________________________ >>> Go mailing list >>> Go at geneontology.org >>> http://fafner.stanford.edu/mailman/listinfo/go >>> >>> >>> >>> >> >> >> > > > -- --------------------------------------------------------------------------- Valerie Wood Tel: 01223 496909 S.
pombe Genome Project Fax: 01223 494919 Wellcome Trust Sanger Institute email: val at sanger.ac.uk Wellcome Trust Genome Campus http://www.genedb.org/genedb/pombe Hinxton, Cambridge, CB10 1HH http://www.sanger.ac.uk/Projects/S_pombe -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From jblake at informatics.jax.org Mon May 5 10:41:15 2008 From: jblake at informatics.jax.org (Judith Blake) Date: Mon, 05 May 2008 13:41:15 -0400 Subject: [Go] Composition of the generic GO slim In-Reply-To: <481F3AF6.4040700@sanger.ac.uk> References: <4818A65A.8000301@ebi.ac.uk> <481B374A.2060309@sanger.ac.uk> <481B5A55.4020001@informatics.jax.org> <481F3AF6.4040700@sanger.ac.uk> Message-ID: <481F46BB.5080801@informatics.jax.org> Val, My point really is that experiments are done in context. A person studying metabolism may want to break out those terms by particular sub-divisions and lump other things. One of the roles of collaborating GO people would be to aid in the construction of particular slims if requested. For example, when I have done this, the researcher provided a list of 12-16 subdivisions that made sense for their purpose, and we constructed a GO_slim that subdivided the GO appropriately. I think of it as part of the data analysis process. A researcher using a generic GO_slim without understanding the vagaries of the annotations or of the ontology subtrees will not understand the results either. my opinion. judy Valerie Wood wrote: > Judy, > > You are correct that no one slim is going to fit all organisms or all > uses. > However, it isn't simple to create an informative slim which gives > complete > (or nearly complete) coverage of all of an organism's annotations (and > complete > coverage of the annotation space is an important feature > of a robust slim).
Does the Drosophila slim set cover all of the > annotated genes? > > The slim I suggested will give complete coverage for single-celled > eukaryotes (it may need additional high-level terms to cover > multicellular eukaryotes). This particular slim is useful for evaluating > an organism's "cell biology". Perhaps a very generic slim, which only > includes > very high-level terms, would be useful for multicellular organisms, > but it would not be so useful for single-celled organisms. > > One suggested criterion (6 in the previous message) was that terms be > meaningful to biologists. > What I meant here was that the terms should > be 'biologically informative'. For cellular roles, or for a single-celled > organism, 'metabolism' isn't so useful as a 'direct' slim term ( > metabolic processes > include transcription, translation, DNA replication, mRNA processing > etc., > in addition to primary and secondary metabolism). For pombe, 3102 of > 4194 process-annotated gene products are annotated to metabolism, > so this term in a slim does not tell you very much. > > In addition, if metabolism is included as a 'direct' slim term, and > you have a gene product > which is annotated ONLY to "metabolic process", then you really know very > little about its biological role. This can occur frequently, as it > is possible to > predict that a protein has catalytic activity, and is involved in a > 'metabolic process', > but not to say anything more specific; there are many direct InterPro > mappings > to these two terms. If I was trying to assess the 'real biological > roles' of my organism's > gene products, I would wish to exclude direct annotations to > 'metabolic process' from the slim.
> > A GO slim provides a mechanism to filter out annotations to high-level, > relatively uninformative (with respect to the biological role) nodes > like > 'metabolism', 'cellular process' and 'localization' (in the slim, gene products will > be assigned > to 'unknown' if there is no annotation to one of your slim terms or > their children). > > Once you exclude a term like metabolism it becomes necessary > to include all of the child terms (or a combination of child terms) > to give complete > coverage of the parent term (NOTE: once the slimmed terms are mapped > to the slim ontology the high-level terms will be > included, but their totals will only reflect the total of the gene > products > annotated via the terms in the slim). > > The difficult part in building a slim is identifying the set of > terms which > provides complete coverage; this is the tricky step for most biologists, > who are not so familiar with the ontologies. It would be useful to > provide a > starting slim which gives complete coverage of all annotations (using > biologically relevant terms for common applications), which they can > change as necessary. > Maybe we should provide a set of 'complete coverage' slims for common > applications. > > i.e. > one suitable for multicellular organisms and very general biological roles; > one suitable for single-celled eukaryotes, or for evaluating basic cellular > processes > > Val > > > > > Judith Blake wrote: >> Val, >> I still maintain that users need to be able to generate grouping >> criteria based on their usage. I think we could go back to the fly >> genome paper and see the primary molecular divisions that seemed most >> useful to describe the genome properties, like 'reproduction' and >> 'metabolism'. Anything more granular is specific to the user. A >> mapping on this basis would likely include fewer than 20 terms and >> would subdivide trees.
>> >> judy >> >> Valerie Wood wrote: >>> I think it is good idea for the consortium to provide an official >>> 'GO slim', and advise people how they may want to alter the slim to >>> fit their individual purpose. >>> >>> A useful generic GO slim has a number of qualities (I have tried to >>> list these below, please suggest any additional ones, I hadn't >>> really thought before about what the rules were I used for making a >>> slim so this is the first time I have documented them). Following >>> the 'guidelines' below I have suggested a set of process which I >>> think should make up the generic process slim. >>> >>> Perhaps we could use this as a starting point, and people can >>> suggest additional terms (with reasons) or terms which should be >>> removed. This provides good coverage of basic cellular processes but >>> would need extending to cover multicellular processes. >>> >>> GO Slim criteria >>> >>> 1. The generic slim should be as organism independent as possible >>> (although clearly some terms will not be applicable to single celled >>> eukaryotes and some eukaryotic terms will not be applicable to >>> prokaryotes) >>> >>> 2. The slim should cover AS MANY genes with annotated processes as >>> possible >>> >>> 3. The slim should cover AS MANY genes with annotated processes with >>> the smallest number of leaf node terms (if you include too many >>> terms and it becomes too large and you start to loose the advantages >>> of a slim). >>> >>> 4. It might be useful to try to avoid terms with an excessively >>> small or large number of small number of annotations (i.e ideally >>> your terms will not have an extreme distributions for your histogram) >>> >>> 5. Preferably the slim should include sibling terms with a large >>> overlaps between them. If you choose two siblings with 200 genes >>> annotated to each, and the majority of the annotations overlap, it >>> is usually better to select the parent node (i.e replace 2 terms by >>> one single term). 
Conversely, if the child terms of a node fall >>> into distinct non-overlapping subsets, it might be more informative >>> to include both child terms in your slim (see also point 7 below) >>> >>> 6. For most purposes you need to include a representative term for >>> all biologically relevant processes, by including terms which are >>> meaningful to biologists. >>> >>> 7. If you are using your slim for data analysis (and not just for >>> vizualization) you need to include terms which will allow you to >>> distinguish genes bases on their biological properties. >>> For example, it is not good to lump all genes involved in transport >>> under transport because the genes annotated to distinct child terms; >>> vesicle -mediated transport, protein targeting, transmembrane >>> transport, are VERY different in term of their i) viability ii) >>> species distribution iii) number of interaction partners iv) copy >>> number v) expression pattern, so it does not make sense to lump them >>> together in your slim set. >>> >>> Using these criteria this is the basic cellular process eukaryotic >>> slim I use (or slight variations of): The number of annotations (for >>> pombe obviously) is in parentheses (protein coding only). 
>>> >>> GO:0055085 transmembrane transport (278) >>> GO:0006913 nucleocytoplasmic transport (114) >>> GO:0006605 protein targeting (162) >>> GO:0016192 vesicle-mediated transport (266) >>> GO:0051186 cofactor metabolic process (139) >>> GO:0006766 vitamin metabolic process (57) >>> GO:0006790 sulfur metabolic process (45) >>> GO:0006807 nitrogen compound metabolic process (224) >>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic process >>> (118) >>> GO:0005975 carbohydrate metabolic process (199) >>> GO:0006629 lipid metabolic process (201) >>> GO:0006399 tRNA metabolic process (125) >>> GO:0006520 amino acid metabolic process (187) >>> GO:0006412 translation (357) >>> GO:0006259 DNA metabolic process (296) >>> GO:0006508 protolysis (223) >>> GO:0005975 carbohydrate metabolic process (199) >>> GO:0016071 mRNA metabolic process (204) >>> GO:0043413 biopolymer glycosylation (65) possibly drop? >>> GO:0006464 protein modification process (585) >>> GO:0007059 chromosome segregation (186) >>> GO:0007049 cell cycle (552) >>> GO:0007010 cytoskeletal organization and biogenesis (236) >>> GO:0000910 cytokinesis (145) >>> GO:0007165 signal transduction (362) >>> GO:0006457 protein folding (80) >>> GO:0042254 ribosome biogenesis and assembly (223) >>> GO:0045229 external encapsulating structure organization and >>> biogenesis (124) >>> GO:xxxxxxxx general transcription (see note *1 below) >>> GO:0032569 specific transcription from RNA polymerase II promoter (102) >>> (total 424 for all transcription) >>> GO:0000902 cell morphogenesis (86) >>> GO:0006338 establishment and/or maintenance of chromatin >>> architecture (231) >>> GO:reproductive process (182) >>> GO:0007005 mitochondrion organization and biogenesis (251) >>> GO:0006091 generation of precursor metabolites and energy (113) >>> GO:0007031 peroxisome organization and biogenesis (20) >>> >>> At this point there are about ~100 pombe genes (out of the 3960 with >>> an annotated process term) which aren't 
included in the slim >>> >>> I could also include.... >>> vacuolar transport (91) reduces by 6 (most also annotated to protein >>> targeting) >>> telomere maintenance (54) reduces by 6 (most also annotated to DNA met) >>> snoRNA metabolic process (10) reduces by 2 >>> ...to improve coverage (very slightly) >>> >>> Finally I include >>> GO:0006950 response to stress (444) >>> this terms has overlaps with most other processes so is largely >>> redundant but are useful. >>> >>> This leaves ~30 pombe with a process annotation unassigned to the >>> GO slim; these are often to terms like homeostasis and its children, >>> or otherwise uniformative terms >>> >>> For some purposes I would also include >>> GO:0065007 biological regulation (1021) >>> but I don't know if this is a good term to include in a generic slim >>> >>> To make this work for multicellular eukaryotes, we would probably >>> want to add non-cellular process terms like: >>> >>> developmental process >>> immune system process >>> >>> >>> * Note1 it is not currently possible to retrieve genes involved in >>> general transcription as opposed to gene specific transcription (i.e >>> RNA I,II and III polymerases etc), with a single query. This is >>> also important for enrichment as the genes in these 2 sets are very >>> different in terms of species distribution, copy number and >>> viability. I requested a grouping term for these processes a while >>> ago and hopefully this will be implemented shortly. >>> >>> See: >>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764 >>> >>> >>> >>> Val >>> >>> >>> >>> >>> >>> >>> Ben Hitz wrote: >>> >>>> Emily - >>>> I have interest in working on the generic go slim; I need it (or >>>> something similar) to define graphics for an interaction network. 
>>>> >>>> Ben >>>> >>>> >>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote: >>>> >>>> >>>>> Hi, >>>>> >>>>> From replying to a user request, I've just been having a quick >>>>> look at >>>>> the composition of the generic GO slim, and relating the GO terms >>>>> included to the number of annotations displayed by AmiGO. >>>>> >>>>> Should, for instance, the 'cell recognition' term still be >>>>> included in >>>>> the generic GO slim? - it has only been annotated to 182 gene >>>>> products, >>>>> whereas its sibling terms: 'cell division', 'cell cycle' and 'cell >>>>> motility', have not been included even though they (directly or >>>>> indirectly) have been annotated to more than 1,200 gene products >>>>> each. >>>>> Similarly, the term 'cytoplasm organization and biogenesis' is in >>>>> the GO >>>>> slim but only has 113 gps annotated, whereas the 'membrane >>>>> organisation >>>>> and biogenesis' term has been annotated to 1,509 gps. >>>>> >>>>> I was just wondering what the goal of the generic GO slim is... >>>>> if terms >>>>> are selected on the basis that as many annotated gene products from >>>>> different organisms should get mapped to descriptive GO terms before >>>>> they are caught by the BP, MF, CC root terms (while also providing a >>>>> full selection of terms across the whole GO vocabulary), should >>>>> we think >>>>> of reviewing its some of its composition in relation to overall >>>>> annotation frequency? Or should the GO slim be kept as stable as >>>>> possible? >>>>> >>>>> Cheers, >>>>> Emily >>>>> >>>>> -- >>>>> >>>>> >>>>> >>>>> ------------------------------------------------------------------ >>>>> >>>>> Emily Dimmer Ph.D. >>>>> GOA Coordinator >>>>> EMBL-EBI >>>>> Wellcome Trust Genome Campus >>>>> Hinxton >>>>> Cambridge CB10 1SD, U.K. 
>>>>> Tel: +44 1223 494654 >>>>> Fax: +44 1223 494468 >>>>> email: edimmer at ebi.ac.uk >>>>> URL: http://www.ebi.ac.uk/goa >>>>> >>>>> >>>>> _______________________________________________ >>>>> Go mailing list >>>>> Go at geneontology.org >>>>> http://fafner.stanford.edu/mailman/listinfo/go >>>>> >>>> -- >>>> Ben Hitz >>>> Senior Scientific Programmer ** Saccharomyces Genome Database ** >>>> GO Consortium >>>> Stanford University ** hitz at genome.stanford.edu >>>> >>>> >>>> >>>> _______________________________________________ >>>> Go mailing list >>>> Go at geneontology.org >>>> http://fafner.stanford.edu/mailman/listinfo/go >>>> >>>> >>>> >>>> >>> >>> >>> >> >> >> > > From val at sanger.ac.uk Mon May 5 11:01:28 2008 From: val at sanger.ac.uk (Valerie Wood) Date: Mon, 5 May 2008 18:01:28 UT Subject: [Go] Composition of the generic GO slim Message-ID: An embedded and charset-unspecified text was scrubbed... Name: not available Url: http://fafner.stanford.edu/pipermail/go/attachments/20080505/0c5b5f34/attachment.ksh From jblake at informatics.jax.org Mon May 5 11:40:36 2008 From: jblake at informatics.jax.org (Judith Blake) Date: Mon, 05 May 2008 14:40:36 -0400 Subject: [Go] Composition of the generic GO slim In-Reply-To: References: Message-ID: <481F54A4.50504@informatics.jax.org> Agreed, we should remove or change the text to reflect reality. judy Valerie Wood wrote: > The GO website makes the following statement, which is a bit misleading if we don't intend to provide any comprehensive slims (as Emily pointed out earlier in this thread, this isn't a comprehensive slim): > > "GO provides a generic GO slim which, like the GO itself, is not species specific, and which should be suitable for most purposes." > > So maybe this slim should not be described as such? > > > > > Judith Blake wrote: > >> Val, >> My point really is that experiments are done in context.
A person >> studying metabolism may want to break out those terms by particular >> sub-divisions and lump other things. One of the roles of collaborating >> GO people would be to add in the construction of particular slims if >> requested. >> >> For example, when I have done this, the researcher provided a list of >> 12-16 subdivisions that made sense for their purpose, and we constructed >> a GO_slim that subdivided the GO appropriately. I think of it as part >> of the data analysis process. A researcher using a generic GO_slim >> without understanding the vagaries of the annotations or of the ontology >> subtrees will neither understand the results. >> >> my opinion. >> judy >> >> Valerie Wood wrote: >> >>> Judy, >>> >>> You are correct that no one slim is going to fit all organisms or all >>> uses. >>> However it isn't simple to create an informative slim which gives >>> complete >>> (or nearly complete) coverage of all of an organisms annotations (and >>> complete >>> coverage of the annotation space is an important feature >>> of a robust slim). Does the drosophila slim set cover all of the >>> annotated genes? >>> >>> The slim I suggested will give complete coverage for single-celled >>> eukaryotes (it may need additional high level terms to cover >>> muliticellular eukaryotes). This particular slim is useful for evaluating >>> an organisms "cell biology". Perhaps a very generic slim, which only >>> includes >>> very high level terms would be useful multicellular organisms, >>> but it would not be so useful for single-celled organisms. >>> >>> One suggested criteria (6 in previou) suggested that terms be >>> meaningful to biologists. >>> What I meant here was that the terms should be was that the terms should >>> be 'biologically informative'. 
For cellular roles, or for a single-celled >>> organism 'metabolism isn't so useful as a 'direct' slim term ( >>> metabolic processes >>> include transcription, translation, DNA replication, mRNA processing >>> etc., >>> in addition to primary and secondary metabolism). For pombe 3102 of >>> 4194 process annotated gene products are annotated to metabolism, >>> so this term in a slim does not tell you very much. >>> >>> In addition, if metabolism is included as a 'direct' slim term, and >>> you have a gene product >>> which is annotated ONLY to "metabolic process" then you really know very >>> little about its biological role. This can occur as frequently as it >>> is possible to >>> predict that a protein has catalytic activity, and is involved in a >>> 'metabolic process' >>> but not to say anything more specific; there are many direct Interpro >>> mappings >>> to these two terms. If I was trying to assess the 'real biological >>> roles' of my organisms >>> gene products, I would wish to exclude direct annotations to >>> 'metabolic process' from the slim. >>> >>> A GO slim provides a mechanism to filter out annotations to high level >>> relatively uninformative (with respect to the biological role) nodes >>> like >>> 'metabolism, cellular process, localization' (in the slim, they will >>> be annotated >>> to 'unknown' if there is no annotation to one of your slim terms or >>> their children). >>> >>> Once you exclude a term like metabolism it becomes necessary >>> to include all of the child terms (or a combination of child terms ) >>> to give complete >>> coverage of the parent term ( NOTE: once the slimmed terms are mapped >>> to the slim ontology the high level terms will be >>> included, but their totals will only reflect the total of the gene >>> products >>> annotated via the terms in the slim). 
>>> >>> The difficult part is in building a slim is identifying the set of >>> terms which >>> provides complete coverage; this is the tricky step for most biologists, >>> who are not so familiar with the ontologies. It would be useful to >>> provide a >>> starting slim which gives complete coverage of all annotations (using >>> biologically relevant terms for common applications) which they can >>> change as necessary. >>> Maybe we should provide a set of 'complete coverage' slims for common >>> applications. >>> >>> i.e. >>> suitable for multicellular organisms and very general biological roles >>> suitable for single-celled eukaryotes, or evaluating basic cellular >>> processes >>> >>> Val >>> >>> >>> >>> >>> Judith Blake wrote: >>> >>>> Val, >>>> I still maintain that users need to be able to generate grouping >>>> criteria based on their usage. I think we could go back to the fly >>>> genome paper and see the primary molecular divisions that seemed most >>>> useful to describe the genome properties. like 'reproduction' and >>>> 'metabolism'. Anything more granular is specific to the user. A >>>> mapping on this basis would likely include fewer than 20 terms and >>>> would subdivide trees. >>>> >>>> judy >>>> >>>> Valerie Wood wrote: >>>> >>>>> I think it is good idea for the consortium to provide an official >>>>> 'GO slim', and advise people how they may want to alter the slim to >>>>> fit their individual purpose. >>>>> >>>>> A useful generic GO slim has a number of qualities (I have tried to >>>>> list these below, please suggest any additional ones, I hadn't >>>>> really thought before about what the rules were I used for making a >>>>> slim so this is the first time I have documented them). Following >>>>> the 'guidelines' below I have suggested a set of process which I >>>>> think should make up the generic process slim. 
>>>>> >>>>> Perhaps we could use this as a starting point, and people can >>>>> suggest additional terms (with reasons) or terms which should be >>>>> removed. This provides good coverage of basic cellular processes but >>>>> would need extending to cover multicellular processes. >>>>> >>>>> GO Slim criteria >>>>> >>>>> 1. The generic slim should be as organism independent as possible >>>>> (although clearly some terms will not be applicable to single celled >>>>> eukaryotes and some eukaryotic terms will not be applicable to >>>>> prokaryotes) >>>>> >>>>> 2. The slim should cover AS MANY genes with annotated processes as >>>>> possible >>>>> >>>>> 3. The slim should cover AS MANY genes with annotated processes with >>>>> the smallest number of leaf node terms (if you include too many >>>>> terms and it becomes too large and you start to loose the advantages >>>>> of a slim). >>>>> >>>>> 4. It might be useful to try to avoid terms with an excessively >>>>> small or large number of small number of annotations (i.e ideally >>>>> your terms will not have an extreme distributions for your histogram) >>>>> >>>>> 5. Preferably the slim should include sibling terms with a large >>>>> overlaps between them. If you choose two siblings with 200 genes >>>>> annotated to each, and the majority of the annotations overlap, it >>>>> is usually better to select the parent node (i.e replace 2 terms by >>>>> one single term). Conversely, if the child terms of a node fall >>>>> into distinct non-overlapping subsets, it might be more informative >>>>> to include both child terms in your slim (see also point 7 below) >>>>> >>>>> 6. For most purposes you need to include a representative term for >>>>> all biologically relevant processes, by including terms which are >>>>> meaningful to biologists. >>>>> >>>>> 7. 
If you are using your slim for data analysis (and not just for >>>>> vizualization) you need to include terms which will allow you to >>>>> distinguish genes bases on their biological properties. >>>>> For example, it is not good to lump all genes involved in transport >>>>> under transport because the genes annotated to distinct child terms; >>>>> vesicle -mediated transport, protein targeting, transmembrane >>>>> transport, are VERY different in term of their i) viability ii) >>>>> species distribution iii) number of interaction partners iv) copy >>>>> number v) expression pattern, so it does not make sense to lump them >>>>> together in your slim set. >>>>> >>>>> Using these criteria this is the basic cellular process eukaryotic >>>>> slim I use (or slight variations of): The number of annotations (for >>>>> pombe obviously) is in parentheses (protein coding only). >>>>> >>>>> GO:0055085 transmembrane transport (278) >>>>> GO:0006913 nucleocytoplasmic transport (114) >>>>> GO:0006605 protein targeting (162) >>>>> GO:0016192 vesicle-mediated transport (266) >>>>> GO:0051186 cofactor metabolic process (139) >>>>> GO:0006766 vitamin metabolic process (57) >>>>> GO:0006790 sulfur metabolic process (45) >>>>> GO:0006807 nitrogen compound metabolic process (224) >>>>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic process >>>>> (118) >>>>> GO:0005975 carbohydrate metabolic process (199) >>>>> GO:0006629 lipid metabolic process (201) >>>>> GO:0006399 tRNA metabolic process (125) >>>>> GO:0006520 amino acid metabolic process (187) >>>>> GO:0006412 translation (357) >>>>> GO:0006259 DNA metabolic process (296) >>>>> GO:0006508 protolysis (223) >>>>> GO:0005975 carbohydrate metabolic process (199) >>>>> GO:0016071 mRNA metabolic process (204) >>>>> GO:0043413 biopolymer glycosylation (65) possibly drop? 
>>>>> GO:0006464 protein modification process (585) >>>>> GO:0007059 chromosome segregation (186) >>>>> GO:0007049 cell cycle (552) >>>>> GO:0007010 cytoskeletal organization and biogenesis (236) >>>>> GO:0000910 cytokinesis (145) >>>>> GO:0007165 signal transduction (362) >>>>> GO:0006457 protein folding (80) >>>>> GO:0042254 ribosome biogenesis and assembly (223) >>>>> GO:0045229 external encapsulating structure organization and >>>>> biogenesis (124) >>>>> GO:xxxxxxxx general transcription (see note *1 below) >>>>> GO:0032569 specific transcription from RNA polymerase II promoter (102) >>>>> (total 424 for all transcription) >>>>> GO:0000902 cell morphogenesis (86) >>>>> GO:0006338 establishment and/or maintenance of chromatin >>>>> architecture (231) >>>>> GO:reproductive process (182) >>>>> GO:0007005 mitochondrion organization and biogenesis (251) >>>>> GO:0006091 generation of precursor metabolites and energy (113) >>>>> GO:0007031 peroxisome organization and biogenesis (20) >>>>> >>>>> At this point there are about ~100 pombe genes (out of the 3960 with >>>>> an annotated process term) which aren't included in the slim >>>>> >>>>> I could also include.... >>>>> vacuolar transport (91) reduces by 6 (most also annotated to protein >>>>> targeting) >>>>> telomere maintenance (54) reduces by 6 (most also annotated to DNA met) >>>>> snoRNA metabolic process (10) reduces by 2 >>>>> ...to improve coverage (very slightly) >>>>> >>>>> Finally I include >>>>> GO:0006950 response to stress (444) >>>>> this terms has overlaps with most other processes so is largely >>>>> redundant but are useful. 
>>>>> >>>>> This leaves ~30 pombe with a process annotation unassigned to the >>>>> GO slim; these are often to terms like homeostasis and its children, >>>>> or otherwise uniformative terms >>>>> >>>>> For some purposes I would also include >>>>> GO:0065007 biological regulation (1021) >>>>> but I don't know if this is a good term to include in a generic slim >>>>> >>>>> To make this work for multicellular eukaryotes, we would probably >>>>> want to add non-cellular process terms like: >>>>> >>>>> developmental process >>>>> immune system process >>>>> >>>>> >>>>> * Note1 it is not currently possible to retrieve genes involved in >>>>> general transcription as opposed to gene specific transcription (i.e >>>>> RNA I,II and III polymerases etc), with a single query. This is >>>>> also important for enrichment as the genes in these 2 sets are very >>>>> different in terms of species distribution, copy number and >>>>> viability. I requested a grouping term for these processes a while >>>>> ago and hopefully this will be implemented shortly. >>>>> >>>>> See: >>>>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764 >>>>> >>>>> >>>>> >>>>> Val >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> Ben Hitz wrote: >>>>> >>>>> >>>>>> Emily - >>>>>> I have interest in working on the generic go slim; I need it (or >>>>>> something similar) to define graphics for an interaction network. >>>>>> >>>>>> Ben >>>>>> >>>>>> >>>>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> From replying to a user request, I've just been having a quick >>>>>>> look at >>>>>>> the composition of the generic GO slim, and relating the GO terms >>>>>>> included to the number of annotations displayed by AmiGO. >>>>>>> >>>>>>> Should, for instance, the 'cell recognition' term still be >>>>>>> included in >>>>>>> the generic GO slim? 
- it has only been annotated to 182 gene >>>>>>> products, >>>>>>> whereas its sibling terms: 'cell division', 'cell cycle' and 'cell >>>>>>> motility', have not been included even though they (directly or >>>>>>> indirectly) have been annotated to more than 1,200 gene products >>>>>>> each. >>>>>>> Similarly, the term 'cytoplasm organization and biogenesis' is in >>>>>>> the GO >>>>>>> slim but only has 113 gps annotated, whereas the 'membrane >>>>>>> organisation >>>>>>> and biogenesis' term has been annotated to 1,509 gps. >>>>>>> >>>>>>> I was just wondering what the goal of the generic GO slim is... >>>>>>> if terms >>>>>>> are selected on the basis that as many annotated gene products from >>>>>>> different organisms should get mapped to descriptive GO terms before >>>>>>> they are caught by the BP, MF, CC root terms (while also providing a >>>>>>> full selection of terms across the whole GO vocabulary), should >>>>>>> we think >>>>>>> of reviewing its some of its composition in relation to overall >>>>>>> annotation frequency? Or should the GO slim be kept as stable as >>>>>>> possible? >>>>>>> >>>>>>> Cheers, >>>>>>> Emily >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> >>>>>>> >>>>>>> ------------------------------------------------------------------ >>>>>>> >>>>>>> Emily Dimmer Ph.D. >>>>>>> GOA Coordinator >>>>>>> EMBL-EBI >>>>>>> Wellcome Trust Genome Campus >>>>>>> Hinxton >>>>>>> Cambridge CB10 1SD, U.K. 
>>>>>>> Tel: +44 1223 494654 >>>>>>> Fax: +44 1223 494468 >>>>>>> email: edimmer at ebi.ac.uk >>>>>>> URL: http://www.ebi.ac.uk/goa >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Go mailing list >>>>>>> Go at geneontology.org >>>>>>> http://fafner.stanford.edu/mailman/listinfo/go >>>>>>> >>>>>>> >>>>>> -- >>>>>> Ben Hitz >>>>>> Senior Scientific Programmer ** Saccharomyces Genome Database ** >>>>>> GO Consortium >>>>>> Stanford University ** hitz at genome.stanford.edu >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Go mailing list >>>>>> Go at geneontology.org >>>>>> http://fafner.stanford.edu/mailman/listinfo/go >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>> >> > > From aji at ebi.ac.uk Mon May 5 16:06:24 2008 From: aji at ebi.ac.uk (Amelia Ireland) Date: Tue, 6 May 2008 00:06:24 +0100 (BST) Subject: [Go] Composition of the generic GO slim In-Reply-To: <481F3AF6.4040700@sanger.ac.uk> Message-ID: Back in Gotham City, Valerie Wood wrote: [snip Val's original GO slim] Val, this slim looks very useful and the principles that you used to make it are good. Perhaps it might be useful if a multicellular organism person (or persons) and a prokaryote person had a quick look to see if the slim would fit their organism, too, using the same criteria? The slim should also be expanded to cover the component and function ontologies, too. It would also be useful to have documented the criteria used for construction of the GO slim, not only to inform users of how the slim was created, but also to give potential GO slim creators some guidelines for how to create a maximally useful slim. [cutting to Val's response to Judy] >You are correct that no one slim is going to fit all organisms or all >uses. 
However it isn't simple to create an informative slim which gives >complete (or nearly complete) coverage of all of an organism's annotations >(and complete coverage of the annotation space is an important feature >of a robust slim). I would also add to this that constructing a good slim is a time-consuming process, and for those users who eventually want to create their own custom slim, a good generic slim would be an invaluable time saver. It would be far easier to tinker with a complete slim than it would be to create one from scratch. I also think that a lot of users don't necessarily want to expend time and effort on creating their own slim if they can use a pre-built one. Val knows the ins and outs of the GO pretty well (to judge by the number of SF items she's submitted, anyway! ;) ), and so she probably had a rough idea of what would go in before she started. A new user would have to learn the GO paradigm and examine the ontologies in some detail before they could even start constructing their own custom slim. This might prove too daunting or time-consuming a task, and hence put people off. I think providing an up-to-date generic slim that people can use as a starting point for further slimmage is essential. Cheers, Amelia. -- Amelia Ireland GO Editorial Office http://www.ebi.ac.uk || http://www.berkeleybop.org Carbon neutral driving: http://www.targetneutral.com/TONIC/index.jsp From midori at ebi.ac.uk Tue May 6 02:50:02 2008 From: midori at ebi.ac.uk (Midori Harris) Date: Tue, 6 May 2008 10:50:02 +0100 (BST) Subject: [Go] Alert: proposal to obsolete GO:0034423 (no annotations affected) Message-ID: The proposal has been made to obsolete autophagic vacuole lumen ; GO:0034423 This term is not used in annotations, and is not included in any subset maintained in GO. The reason is that this term was requested, but the request has been rescinded upon further discussion.
The autophagic vacuole does have a lumen, but no gene products are known to reside and act there normally; rather, the lumen is more like a "trash can" where molecules temporarily reside before being degraded. SF link: https://sourceforge.net/tracker/?func=detail&atid=440764&aid=1954235&group_id=36855 Because the term was added only a few days ago, and has been used only for one mapping (that will be removed, as agreed in the SF discussion), it will be made obsolete in two days. *** Unless objections are received by May 8, we will assume that you agree to this change. *** Midori _______________________________________________ Go mailing list Go at geneontology.org http://fafner.stanford.edu/mailman/listinfo/go From midori at ebi.ac.uk Tue May 6 06:53:21 2008 From: midori at ebi.ac.uk (Midori Harris) Date: Tue, 6 May 2008 14:53:21 +0100 (BST) Subject: [Go] seeking comment on GO:0004772 (SF 1953281) Message-ID: Dear GO, SF 1953281 reports a problem with GO:0004772. At present the term name is 'sterol O-acyltransferase activity', but the definition is specific for cholesterol, not generally applicable to any sterol:

id: GO:0004772
name: sterol O-acyltransferase activity
namespace: molecular_function
def: "Catalysis of the reaction: acyl-CoA + cholesterol = CoA + cholesterol ester." [EC:2.3.1.26]
xref: EC:2.3.1.26

The definition is derived from the cross-referenced EC entry, and is also consistent with some, but not all, of the synonyms. It is clear that we will have to diverge from EC slightly, because the current term name is their recommended name, and for GO we want the name and definition to be consistent. We intend to implement this structure:

sterol O-acyltransferase activity GO:term1
--[i] cholesterol O-acyltransferase activity GO:term2
--[i] ergosterol O-acyltransferase activity GO:term3
--[i] lanosterol O-acyltransferase activity GO:term4

The question is whether GO:0004772 will become term1 or term2 in the new structure, i.e.
whether to change its name or its definition. - If we change the name to match the definition, SGD, CGD and GeneDB S. pombe will have to change their annotations. - If we change the definition (and remove the EC dbxref) to match the name, MGI, RGD, UniProt and (I think) Flybase annotations will be less specific, and should be changed. - If we make the term obsolete, everyone will have to change annotations. This option is perhaps the most correct given the name/definition mismatch, but would mean the most work for annotators. SGD and CGD curators have expressed an understandable preference for the first option, renaming GO:0004772. Does anyone -- especially annotators who would have to change existing annotations to maintain specificity -- object? SF link: https://sourceforge.net/tracker/index.php?func=detail&aid=1953281&group_id=36855&atid=440764 Thanks, Midori From midori at ebi.ac.uk Tue May 6 09:01:13 2008 From: midori at ebi.ac.uk (midori at ebi.ac.uk) Date: Tue, 6 May 2008 16:01:13 UT Subject: [Go] SourceForge Update Message-ID: <200805061601.m46G1DN1462847@mozart.ebi.ac.uk> An HTML attachment was scrubbed... URL: http://fafner.stanford.edu/pipermail/go/attachments/20080506/34ce8ff2/attachment.html -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: not available Url: http://fafner.stanford.edu/pipermail/go/attachments/20080506/34ce8ff2/attachment.pl From midori at ebi.ac.uk Wed May 7 05:06:00 2008 From: midori at ebi.ac.uk (Midori Harris) Date: Wed, 7 May 2008 13:06:00 +0100 (BST) Subject: [Go] decision on Re: Alert: proposal to obsolete GO:0034423 In-Reply-To: References: Message-ID: Dear GO, Comments made in response to this proposal have led to the decision that GO:0034423 will not be made obsolete. Thanks to Chris for stating the case for its inclusion. I've added a comment to help avoid misleading annotations. 
Midori On Tue, 6 May 2008, Midori Harris wrote: > The proposal has been made to obsolete > > autophagic vacuole lumen ; GO:0034423 > > This term is not used in annotations, and is not included in any subset > maintained in GO. > > The reason is that this term was requested, but the request has been > rescinded upon further discussion. The autophagic vacuole does have a > lumen, but no gene products are known to reside and act there normally; > rather, the lumen is more like a "trash can" where molecules temporarily > reside before being degraded. > > SF link: > https://sourceforge.net/tracker/?func=detail&atid=440764&aid=1954235&group_id=36855 > > Because the term was added only a few days ago, and has been used only for > one mapping (that will be removed, as agreed in the SF discussion), it > will be made obsolete in two days. > > *** Unless objections are received by May 8, > we will assume that you agree to this change. *** > > Midori > > > > > > > _______________________________________________ > Go mailing list > Go at geneontology.org > http://fafner.stanford.edu/mailman/listinfo/go > From midori at ebi.ac.uk Wed May 7 09:01:19 2008 From: midori at ebi.ac.uk (midori at ebi.ac.uk) Date: Wed, 7 May 2008 16:01:19 UT Subject: [Go] SourceForge Update Message-ID: <200805071601.m47G1JH1205656@mozart.ebi.ac.uk> An HTML attachment was scrubbed... URL: http://fafner.stanford.edu/pipermail/go/attachments/20080507/4fc45eeb/attachment.html -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: not available Url: http://fafner.stanford.edu/pipermail/go/attachments/20080507/4fc45eeb/attachment.pl From midori at ebi.ac.uk Thu May 8 07:21:21 2008 From: midori at ebi.ac.uk (Midori Harris) Date: Thu, 8 May 2008 15:21:21 +0100 (BST) Subject: [Go] Alert: proposal to obsolete two biological process terms that impacts existing annotations Message-ID: Dear GO, The proposal has been made to obsolete nutrient import ; GO:0009935 nutrient export ; GO:0032524 Annotations to GO:0009935 exist as listed below: DB total GeneDB_Spombe 4 (no IEA) GeneDB_Tbrucei 2 (TAS) GOA_human 1 (TAS) GOA_pdb 6 (all IEA) GOA_uniprot 1048 (all IEA) MGI 1 (IMP) RGD 2 (1 ISS, 1 IEA) TAIR 1 (IMP) TIGR_Gsulfurreducens 1 (ISS) WB 2 (IMP) The reason for obsoleting this term is that "nutrient" is not defined, and does not have a consistent meaning. SourceForge link: https://sourceforge.net/tracker/index.php?func=detail&aid=1834028&group_id=36855&atid=440764 Comment period ends on May 27, 2008. *** Unless objections are received by May 27, we will assume that you agree to this change. *** Thanks, Midori From midori at ebi.ac.uk Thu May 8 09:03:00 2008 From: midori at ebi.ac.uk (Midori Harris) Date: Thu, 8 May 2008 17:03:00 +0100 (BST) Subject: [Go] Alert: proposal to obsolete GO:0030508 that impacts existing annotations In-Reply-To: References: Message-ID: Apologies for typo just before annotation counts; don't worry, I did search for GO:0030508 to get the numbers! Corrected below anyway. 
m On Thu, 8 May 2008, Midori Harris wrote: > Dear GO, > > The proposal has been made to obsolete > > thiol-disulfide exchange intermediate activity ; GO:0030508 > > Annotations to GO:0030508 exist as listed below: > > DB total > GeneDB_Pfalciparum 4 (1 TAS, 3 ISS) > GeneDB_Spombe 10 (several codes; no IEA) > GeneDB_Tbrucei 5 (all ISS) > CGD 5 (2 IEA, 3 NAS) > dictyBase 12 (several codes; no IEA) > FB 22 (several codes; no IEA)) > GOA_human 5 (2 IDA, 3 TAS) > MGI 1 (IDA) > Pseudocap 4 (all RCA) > RGD 5 (several codes; 1 IEA) > SGD 16 (several codes; no IEA) > TAIR 92 (1 ISS; rest RCA) > tigr_Aphagocytophilum 3 (all ISS) > tigr_Banthracis 5 (all ISS) > tigr_Cburnetii 4 (all ISS) > tigr_Chydrogenoformans 6 (all ISS) > tigr_Cjejuni 6 (all ISS) > tigr_Cperfringens 5 (all ISS) > tigr_Cpsychrerythraea 13 (all ISS) > tigr_Dethenogenes 3 (all ISS) > tigr_Echaffeensis 2 (all ISS) > tigr_Gsulfurreducens 7 (all ISS) > tigr_Hneptunium 4 (all ISS) > tigr_Lmonocytogenes 5 (all ISS) > tigr_Mcapsulatus 4 (all ISS) > tigr_Nsennetsu 2 (all ISS) > tigr_Pfluorescens 6 (all ISS) > tigr_Psyringae 5 (all ISS) > tigr_Psyringae_phaseolicola 4 (all ISS) > tigr_Soneidensis 7 (all ISS) > tigr_Spomeroyi 5 (all ISS) > tigr_Vcholerae 6 (all ISS) > > The reason for obsoleting this term is that it represents a combination of > gene product features and involvement in a biological process. Terms > suggested for annotation updates: > > protein thiol-disulfide exchange ; GO:0006467 (BP) > disulfide oxidoreductase activity ; GO:0015036 (MF) > > This term is in the prokaryote subset, but not in any GO slims maintained in > the gene_ontology_write.obo file. 
It is used in two external mappings: > > tigrfams2go:TIGR_TIGRFAMS:TIGR01068 thioredoxin > GO:thiol-disulfide exchange > intermediate activity ; GO:0030508 > > tigrfams2go:TIGR_TIGRFAMS:TIGR02181 glutaredoxin 3 > GO:thiol-disulfide > exchange intermediate activity ; GO:0030508 > > SourceForge link: > https://sourceforge.net/tracker/index.php?func=detail&aid=1036091&group_id=36855&atid=440764 > > Comment period ends on May 27, 2008. > > *** Unless objections are received by May 27, > we will assume that you agree to this change. *** > > Thanks, > Midori > > > From midori at ebi.ac.uk Thu May 8 08:59:53 2008 From: midori at ebi.ac.uk (Midori Harris) Date: Thu, 8 May 2008 16:59:53 +0100 (BST) Subject: [Go] Alert: proposal to obsolete GO:0030508 that impacts existing annotations Message-ID: Dear GO, The proposal has been made to obsolete thiol-disulfide exchange intermediate activity ; GO:0030508 Annotations to GO:0009935 exist as listed below: DB total GeneDB_Pfalciparum 4 (1 TAS, 3 ISS) GeneDB_Spombe 10 (several codes; no IEA) GeneDB_Tbrucei 5 (all ISS) CGD 5 (2 IEA, 3 NAS) dictyBase 12 (several codes; no IEA) FB 22 (several codes; no IEA)) GOA_human 5 (2 IDA, 3 TAS) MGI 1 (IDA) Pseudocap 4 (all RCA) RGD 5 (several codes; 1 IEA) SGD 16 (several codes; no IEA) TAIR 92 (1 ISS; rest RCA) tigr_Aphagocytophilum 3 (all ISS) tigr_Banthracis 5 (all ISS) tigr_Cburnetii 4 (all ISS) tigr_Chydrogenoformans 6 (all ISS) tigr_Cjejuni 6 (all ISS) tigr_Cperfringens 5 (all ISS) tigr_Cpsychrerythraea 13 (all ISS) tigr_Dethenogenes 3 (all ISS) tigr_Echaffeensis 2 (all ISS) tigr_Gsulfurreducens 7 (all ISS) tigr_Hneptunium 4 (all ISS) tigr_Lmonocytogenes 5 (all ISS) tigr_Mcapsulatus 4 (all ISS) tigr_Nsennetsu 2 (all ISS) tigr_Pfluorescens 6 (all ISS) tigr_Psyringae 5 (all ISS) tigr_Psyringae_phaseolicola 4 (all ISS) tigr_Soneidensis 7 (all ISS) tigr_Spomeroyi 5 (all ISS) tigr_Vcholerae 6 (all ISS) The reason for obsoleting this term is that it represents a combination of gene product 
features and involvement in a biological process. Terms suggested for annotation updates: protein thiol-disulfide exchange ; GO:0006467 (BP) disulfide oxidoreductase activity ; GO:0015036 (MF) This term is in the prokaryote subset, but not in any GO slims maintained in the gene_ontology_write.obo file. It is used in two external mappings: tigrfams2go:TIGR_TIGRFAMS:TIGR01068 thioredoxin > GO:thiol-disulfide exchange intermediate activity ; GO:0030508 tigrfams2go:TIGR_TIGRFAMS:TIGR02181 glutaredoxin 3 > GO:thiol-disulfide exchange intermediate activity ; GO:0030508 SourceForge link: https://sourceforge.net/tracker/index.php?func=detail&aid=1036091&group_id=36855&atid=440764 Comment period ends on May 27, 2008. *** Unless objections are received by May 27, we will assume that you agree to this change. *** Thanks, Midori From jane at ebi.ac.uk Fri May 9 03:14:17 2008 From: jane at ebi.ac.uk (Jane Lomax) Date: Fri, 09 May 2008 11:14:17 +0100 Subject: [Go] Composition of the generic GO slim In-Reply-To: <481F54A4.50504@informatics.jax.org> References: <481F54A4.50504@informatics.jax.org> Message-ID: <482423F9.5010105@ebi.ac.uk> Hi - sorry, only just got to this thread... From an advocacy point of view I think it's crucial for us to provide a generic GO slim that's up to date with the ontologies. As others have said, most naive users are not going to have the knowledge to create their own tailored slims in the beginning, so we need to provide something general for them to start from, especially as the pre-built slims are now part of the AmiGO GO slim mapper. Users can then trim or expand as they see fit for their own purposes as they become more familiar with the technology. Users blindly using the generic slim in a formal analysis without an understanding of the underlying mechanism are, quite frankly, not performing good science. This should be weeded out at the level of peer review, just the same as with any other misuse of bioinformatics apps. 
Perhaps the documentation for the generic GO slim might say something like: "GO provides a generic GO slim which, like the GO itself, is not species specific. This should be a suitable starting point for most investigations as it has broad coverage over most annotations. Users should tailor this GO slim according to the specific requirements of their own research". I like Val's suggestions for creating the generic GO slim - how about we set up a WG? Jane Judith Blake wrote: > agreed, > we should remove or change the text to reflect reality. > judy > > Valerie Wood wrote: > >> The GO website makes the following statement, which is a bit misleading if we don't intend to provide any comprehensive slims....(as Emily pointed out earlier in this thread, this isn't a comprehensive slim....) >> >> "GO provides a generic GO slim which, like the GO itself, is not species specific, and which should be suitable for most purposes." >> >> So maybe this slim should not be described as such? >> >> >> >> >> Judith Blake wrote: >> >> >>> Val, >>> My point really is that experiments are done in context. A person >>> studying metabolism may want to break out those terms by particular >>> sub-divisions and lump other things. One of the roles of collaborating >>> GO people would be to aid in the construction of particular slims if >>> requested. >>> >>> For example, when I have done this, the researcher provided a list of >>> 12-16 subdivisions that made sense for their purpose, and we constructed >>> a GO_slim that subdivided the GO appropriately. I think of it as part >>> of the data analysis process. A researcher using a generic GO_slim >>> without understanding the vagaries of the annotations or of the ontology >>> subtrees will not understand the results either. >>> >>> my opinion. >>> judy >>> >>> Valerie Wood wrote: >>> >>> >>>> Judy, >>>> >>>> You are correct that no one slim is going to fit all organisms or all >>>> uses.
>>>> However it isn't simple to create an informative slim which gives >>>> complete >>>> (or nearly complete) coverage of all of an organism's annotations (and >>>> complete >>>> coverage of the annotation space is an important feature >>>> of a robust slim). Does the Drosophila slim set cover all of the >>>> annotated genes? >>>> >>>> The slim I suggested will give complete coverage for single-celled >>>> eukaryotes (it may need additional high level terms to cover >>>> multicellular eukaryotes). This particular slim is useful for evaluating >>>> an organism's "cell biology". Perhaps a very generic slim, which only >>>> includes >>>> very high level terms would be useful for multicellular organisms, >>>> but it would not be so useful for single-celled organisms. >>>> >>>> One suggested criterion (6 in my previous message) suggested that terms be >>>> meaningful to biologists. >>>> What I meant here was that the terms should >>>> be 'biologically informative'. For cellular roles, or for a single-celled >>>> organism, 'metabolism' isn't so useful as a 'direct' slim term ( >>>> metabolic processes >>>> include transcription, translation, DNA replication, mRNA processing >>>> etc., >>>> in addition to primary and secondary metabolism). For pombe 3102 of >>>> 4194 process annotated gene products are annotated to metabolism, >>>> so this term in a slim does not tell you very much. >>>> >>>> In addition, if metabolism is included as a 'direct' slim term, and >>>> you have a gene product >>>> which is annotated ONLY to "metabolic process" then you really know very >>>> little about its biological role. This can occur frequently, as it >>>> is possible to >>>> predict that a protein has catalytic activity, and is involved in a >>>> 'metabolic process' >>>> but not to say anything more specific; there are many direct InterPro >>>> mappings >>>> to these two terms.
If I was trying to assess the 'real biological >>>> roles' of my organism's >>>> gene products, I would wish to exclude direct annotations to >>>> 'metabolic process' from the slim. >>>> >>>> A GO slim provides a mechanism to filter out annotations to high level >>>> relatively uninformative (with respect to the biological role) nodes >>>> like >>>> 'metabolism, cellular process, localization' (in the slim, they will >>>> be annotated >>>> to 'unknown' if there is no annotation to one of your slim terms or >>>> their children). >>>> >>>> Once you exclude a term like metabolism it becomes necessary >>>> to include all of the child terms (or a combination of child terms) >>>> to give complete >>>> coverage of the parent term (NOTE: once the slimmed terms are mapped >>>> to the slim ontology the high level terms will be >>>> included, but their totals will only reflect the total of the gene >>>> products >>>> annotated via the terms in the slim). >>>> >>>> The difficult part in building a slim is identifying the set of >>>> terms which >>>> provides complete coverage; this is the tricky step for most biologists, >>>> who are not so familiar with the ontologies. It would be useful to >>>> provide a >>>> starting slim which gives complete coverage of all annotations (using >>>> biologically relevant terms for common applications) which they can >>>> change as necessary. >>>> Maybe we should provide a set of 'complete coverage' slims for common >>>> applications. >>>> >>>> i.e. >>>> suitable for multicellular organisms and very general biological roles >>>> suitable for single-celled eukaryotes, or evaluating basic cellular >>>> processes >>>> >>>> Val >>>> >>>> >>>> >>>> >>>> Judith Blake wrote: >>>> >>>> >>>>> Val, >>>>> I still maintain that users need to be able to generate grouping >>>>> criteria based on their usage.
I think we could go back to the fly >>>>> genome paper and see the primary molecular divisions that seemed most >>>>> useful to describe the genome properties. like 'reproduction' and >>>>> 'metabolism'. Anything more granular is specific to the user. A >>>>> mapping on this basis would likely include fewer than 20 terms and >>>>> would subdivide trees. >>>>> >>>>> judy >>>>> >>>>> Valerie Wood wrote: >>>>> >>>>> >>>>>> I think it is good idea for the consortium to provide an official >>>>>> 'GO slim', and advise people how they may want to alter the slim to >>>>>> fit their individual purpose. >>>>>> >>>>>> A useful generic GO slim has a number of qualities (I have tried to >>>>>> list these below, please suggest any additional ones, I hadn't >>>>>> really thought before about what the rules were I used for making a >>>>>> slim so this is the first time I have documented them). Following >>>>>> the 'guidelines' below I have suggested a set of process which I >>>>>> think should make up the generic process slim. >>>>>> >>>>>> Perhaps we could use this as a starting point, and people can >>>>>> suggest additional terms (with reasons) or terms which should be >>>>>> removed. This provides good coverage of basic cellular processes but >>>>>> would need extending to cover multicellular processes. >>>>>> >>>>>> GO Slim criteria >>>>>> >>>>>> 1. The generic slim should be as organism independent as possible >>>>>> (although clearly some terms will not be applicable to single celled >>>>>> eukaryotes and some eukaryotic terms will not be applicable to >>>>>> prokaryotes) >>>>>> >>>>>> 2. The slim should cover AS MANY genes with annotated processes as >>>>>> possible >>>>>> >>>>>> 3. The slim should cover AS MANY genes with annotated processes with >>>>>> the smallest number of leaf node terms (if you include too many >>>>>> terms and it becomes too large and you start to loose the advantages >>>>>> of a slim). >>>>>> >>>>>> 4. 
It might be useful to try to avoid terms with an excessively >>>>>> small or large number of small number of annotations (i.e ideally >>>>>> your terms will not have an extreme distributions for your histogram) >>>>>> >>>>>> 5. Preferably the slim should include sibling terms with a large >>>>>> overlaps between them. If you choose two siblings with 200 genes >>>>>> annotated to each, and the majority of the annotations overlap, it >>>>>> is usually better to select the parent node (i.e replace 2 terms by >>>>>> one single term). Conversely, if the child terms of a node fall >>>>>> into distinct non-overlapping subsets, it might be more informative >>>>>> to include both child terms in your slim (see also point 7 below) >>>>>> >>>>>> 6. For most purposes you need to include a representative term for >>>>>> all biologically relevant processes, by including terms which are >>>>>> meaningful to biologists. >>>>>> >>>>>> 7. If you are using your slim for data analysis (and not just for >>>>>> vizualization) you need to include terms which will allow you to >>>>>> distinguish genes bases on their biological properties. >>>>>> For example, it is not good to lump all genes involved in transport >>>>>> under transport because the genes annotated to distinct child terms; >>>>>> vesicle -mediated transport, protein targeting, transmembrane >>>>>> transport, are VERY different in term of their i) viability ii) >>>>>> species distribution iii) number of interaction partners iv) copy >>>>>> number v) expression pattern, so it does not make sense to lump them >>>>>> together in your slim set. >>>>>> >>>>>> Using these criteria this is the basic cellular process eukaryotic >>>>>> slim I use (or slight variations of): The number of annotations (for >>>>>> pombe obviously) is in parentheses (protein coding only). 
>>>>>> >>>>>> GO:0055085 transmembrane transport (278) >>>>>> GO:0006913 nucleocytoplasmic transport (114) >>>>>> GO:0006605 protein targeting (162) >>>>>> GO:0016192 vesicle-mediated transport (266) >>>>>> GO:0051186 cofactor metabolic process (139) >>>>>> GO:0006766 vitamin metabolic process (57) >>>>>> GO:0006790 sulfur metabolic process (45) >>>>>> GO:0006807 nitrogen compound metabolic process (224) >>>>>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic process >>>>>> (118) >>>>>> GO:0005975 carbohydrate metabolic process (199) >>>>>> GO:0006629 lipid metabolic process (201) >>>>>> GO:0006399 tRNA metabolic process (125) >>>>>> GO:0006520 amino acid metabolic process (187) >>>>>> GO:0006412 translation (357) >>>>>> GO:0006259 DNA metabolic process (296) >>>>>> GO:0006508 protolysis (223) >>>>>> GO:0005975 carbohydrate metabolic process (199) >>>>>> GO:0016071 mRNA metabolic process (204) >>>>>> GO:0043413 biopolymer glycosylation (65) possibly drop? >>>>>> GO:0006464 protein modification process (585) >>>>>> GO:0007059 chromosome segregation (186) >>>>>> GO:0007049 cell cycle (552) >>>>>> GO:0007010 cytoskeletal organization and biogenesis (236) >>>>>> GO:0000910 cytokinesis (145) >>>>>> GO:0007165 signal transduction (362) >>>>>> GO:0006457 protein folding (80) >>>>>> GO:0042254 ribosome biogenesis and assembly (223) >>>>>> GO:0045229 external encapsulating structure organization and >>>>>> biogenesis (124) >>>>>> GO:xxxxxxxx general transcription (see note *1 below) >>>>>> GO:0032569 specific transcription from RNA polymerase II promoter (102) >>>>>> (total 424 for all transcription) >>>>>> GO:0000902 cell morphogenesis (86) >>>>>> GO:0006338 establishment and/or maintenance of chromatin >>>>>> architecture (231) >>>>>> GO:reproductive process (182) >>>>>> GO:0007005 mitochondrion organization and biogenesis (251) >>>>>> GO:0006091 generation of precursor metabolites and energy (113) >>>>>> GO:0007031 peroxisome organization and biogenesis (20) 
>>>>>>
>>>>>> At this point there are ~100 pombe genes (out of the 3960 with an annotated process term) which aren't included in the slim
>>>>>>
>>>>>> I could also include....
>>>>>> vacuolar transport (91) reduces the uncovered genes by 6 (most also annotated to protein targeting)
>>>>>> telomere maintenance (54) reduces by 6 (most also annotated to DNA metabolic process)
>>>>>> snoRNA metabolic process (10) reduces by 2
>>>>>> ...to improve coverage (very slightly)
>>>>>>
>>>>>> Finally I include
>>>>>> GO:0006950 response to stress (444)
>>>>>> this term overlaps with most other processes so it is largely redundant, but it is useful.
>>>>>>
>>>>>> This leaves ~30 pombe genes with a process annotation unassigned to the GO slim; these are often annotated to terms like homeostasis and its children, or otherwise uninformative terms
>>>>>>
>>>>>> For some purposes I would also include
>>>>>> GO:0065007 biological regulation (1021)
>>>>>> but I don't know if this is a good term to include in a generic slim
>>>>>>
>>>>>> To make this work for multicellular eukaryotes, we would probably want to add non-cellular process terms like:
>>>>>>
>>>>>> developmental process
>>>>>> immune system process
>>>>>>
>>>>>> * Note1 it is not currently possible to retrieve genes involved in general transcription as opposed to gene specific transcription (i.e. RNA I, II and III polymerases etc.), with a single query. This is also important for enrichment as the genes in these 2 sets are very different in terms of species distribution, copy number and viability. I requested a grouping term for these processes a while ago and hopefully this will be implemented shortly.
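As an illustration of the coverage bookkeeping described in the quoted message above (genes left unassigned by a slim, the small gain from adding a term such as vacuolar transport, and the sibling overlap in criterion 5), here is a minimal Python sketch. It is not taken from the thread: the gene names, the toy annotation table and the helper functions are all hypothetical, and it assumes annotations have already been propagated up to ancestor terms.

```python
from typing import Dict, Set

# Hypothetical toy data: gene -> GO term names it is annotated to,
# assumed to be already propagated to ancestor terms.
ANNOTATIONS: Dict[str, Set[str]] = {
    "geneA": {"transmembrane transport", "transport"},
    "geneB": {"vesicle-mediated transport", "transport"},
    "geneC": {"protein targeting", "vacuolar transport", "transport"},
    "geneD": {"vacuolar transport", "transport"},
    "geneE": {"homeostasis"},
}


def genes_for(term: str, annotations: Dict[str, Set[str]]) -> Set[str]:
    """Genes annotated (directly or via propagation) to a term."""
    return {gene for gene, terms in annotations.items() if term in terms}


def uncovered(slim: Set[str], annotations: Dict[str, Set[str]]) -> Set[str]:
    """Genes with annotations but none to any slim term."""
    return {gene for gene, terms in annotations.items() if not (terms & slim)}


def coverage_gain(candidate: str, slim: Set[str],
                  annotations: Dict[str, Set[str]]) -> int:
    """How many currently uncovered genes a candidate term would pick up."""
    before = len(uncovered(slim, annotations))
    after = len(uncovered(slim | {candidate}, annotations))
    return before - after


def overlap_fraction(term_a: str, term_b: str,
                     annotations: Dict[str, Set[str]]) -> float:
    """Shared genes as a fraction of the smaller of the two gene sets."""
    a = genes_for(term_a, annotations)
    b = genes_for(term_b, annotations)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


SLIM = {"transmembrane transport", "vesicle-mediated transport",
        "protein targeting"}

if __name__ == "__main__":
    print(sorted(uncovered(SLIM, ANNOTATIONS)))
    print(coverage_gain("vacuolar transport", SLIM, ANNOTATIONS))
    print(overlap_fraction("protein targeting", "vacuolar transport",
                           ANNOTATIONS))
```

A high overlap fraction between two siblings would argue for using their parent instead, and a small coverage gain corresponds to the "reduces by 6" style of remark in the message; the choice of "fraction of the smaller set" as the overlap measure is only one possible convention.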
>>>>>>
>>>>>> See:
>>>>>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764
>>>>>>
>>>>>> Val
>>>>>>
>>>>>> Ben Hitz wrote:
>>>>>>
>>>>>>> Emily -
>>>>>>> I have interest in working on the generic go slim; I need it (or something similar) to define graphics for an interaction network.
>>>>>>>
>>>>>>> Ben
>>>>>>>
>>>>>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> From replying to a user request, I've just been having a quick look at the composition of the generic GO slim, and relating the GO terms included to the number of annotations displayed by AmiGO.
>>>>>>>>
>>>>>>>> Should, for instance, the 'cell recognition' term still be included in the generic GO slim? - it has only been annotated to 182 gene products, whereas its sibling terms: 'cell division', 'cell cycle' and 'cell motility', have not been included even though they (directly or indirectly) have been annotated to more than 1,200 gene products each. Similarly, the term 'cytoplasm organization and biogenesis' is in the GO slim but only has 113 gps annotated, whereas the 'membrane organisation and biogenesis' term has been annotated to 1,509 gps.
>>>>>>>>
>>>>>>>> I was just wondering what the goal of the generic GO slim is... if terms are selected on the basis that as many annotated gene products from different organisms should get mapped to descriptive GO terms before they are caught by the BP, MF, CC root terms (while also providing a full selection of terms across the whole GO vocabulary), should we think of reviewing some of its composition in relation to overall annotation frequency?
>>>>>>>> Or should the GO slim be kept as stable as possible?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Emily
>>>>>>>>
>>>>>>>> --
>>>>>>>> ------------------------------------------------------------------
>>>>>>>> Emily Dimmer Ph.D.
>>>>>>>> GOA Coordinator
>>>>>>>> EMBL-EBI
>>>>>>>> Wellcome Trust Genome Campus
>>>>>>>> Hinxton
>>>>>>>> Cambridge CB10 1SD, U.K.
>>>>>>>> Tel: +44 1223 494654
>>>>>>>> Fax: +44 1223 494468
>>>>>>>> email: edimmer at ebi.ac.uk
>>>>>>>> URL: http://www.ebi.ac.uk/goa
>>>>>>>>
>>>>>>> --
>>>>>>> Ben Hitz
>>>>>>> Senior Scientific Programmer ** Saccharomyces Genome Database ** GO Consortium
>>>>>>> Stanford University ** hitz at genome.stanford.edu

From jblake at informatics.jax.org Fri May 9 04:41:43 2008
From: jblake at informatics.jax.org (Judith Blake)
Date: Fri, 09 May 2008 07:41:43 -0400
Subject: [Go] Composition of the generic GO slim
In-Reply-To: <482423F9.5010105@ebi.ac.uk>
References: <481F54A4.50504@informatics.jax.org> <482423F9.5010105@ebi.ac.uk>
Message-ID: <48243877.3040603@informatics.jax.org>

ahhhh not another WG :)

I think it might be accomplished by taking the 12-16 subdivisions in either the human or fly genome papers that subdivide cellular roles, and looking for similar sets in a textbook for CC and MF by chapter titles.
This number of subdivisions is the most useful for general overview. I think the single-cell concerns may not be so important at this level of 'genericism'; some subdivisions might be more or less devoid of annotations...or maybe it is useful to have two..but we could start with one.

Then figure out how to sum GO to those terms.

Start with the biology, not the ontology.

of course, I'm biased in thinking the MGI go-slim accomplishes this to some extent.

I think a draft of this could be done in a week by a dedicated curator. but who? I'll think about this.

Judy

Jane Lomax wrote:
> Hi - sorry, only just got to this thread...
>
> From an advocacy point of view I think it's crucial for us to provide a generic GO slim that's up to date with the ontologies. As others have said, most naive users are not going to have the knowledge to create their own tailored slims in the beginning, so we need to provide something general for them to start from, especially as the pre-built slims are now part of the AmiGO GO slim mapper. Users can then trim or expand as they see fit for their own purposes as they become more familiar with the technology.
>
> Users blindly using the generic slim in a formal analysis without an understanding of the underlying mechanism are, quite frankly, not performing good science. This should be weeded out at the level of peer review, just the same as with any other misuse of bioinformatics apps.
>
> Perhaps the documentation for the generic GO slim might say something like:
>
> "GO provides a generic GO slim which, like the GO itself, is not species specific. This should be a suitable starting point for most investigations as it has broad coverage over most annotations. Users should tailor this GO slim according to the specific requirements of their own research".
>
> I like Val's suggestions for creating the generic GO slim - how about we set up a WG?
>
> Jane
>
> Judith Blake wrote:
>> agreed,
>> we should remove or change the text to reflect reality.
>> judy
>>
>> Valerie Wood wrote:
>>
>>> The GO website makes the following statement, which is a bit misleading if we don't intend to provide any comprehensive slims....(as Emily pointed out earlier in this thread, this isn't a comprehensive slim....)
>>>
>>> "GO provides a generic GO slim which, like the GO itself, is not species specific, and which should be suitable for most purposes."
>>>
>>> So maybe this slim should not be described as such?
>>>
>>> Judith Blake wrote:
>>>> Val,
>>>> My point really is that experiments are done in context. A person studying metabolism may want to break out those terms by particular sub-divisions and lump other things. One of the roles of collaborating GO people would be to aid in the construction of particular slims if requested.
>>>>
>>>> For example, when I have done this, the researcher provided a list of 12-16 subdivisions that made sense for their purpose, and we constructed a GO_slim that subdivided the GO appropriately. I think of it as part of the data analysis process. A researcher using a generic GO_slim without understanding the vagaries of the annotations or of the ontology subtrees will not understand the results either.
>>>>
>>>> my opinion.
>>>> judy
>>>>
>>>> Valerie Wood wrote:
>>>>
>>>>> Judy,
>>>>>
>>>>> You are correct that no one slim is going to fit all organisms or all uses. However it isn't simple to create an informative slim which gives complete (or nearly complete) coverage of all of an organism's annotations (and complete coverage of the annotation space is an important feature of a robust slim). Does the drosophila slim set cover all of the annotated genes?
>>>>>
>>>>> The slim I suggested will give complete coverage for single-celled eukaryotes (it may need additional high level terms to cover multicellular eukaryotes). This particular slim is useful for evaluating an organism's "cell biology". Perhaps a very generic slim, which only includes very high level terms, would be useful for multicellular organisms, but it would not be so useful for single-celled organisms.
>>>>>
>>>>> One suggested criterion (6 in my previous message) was that terms be meaningful to biologists. What I meant here was that the terms should be 'biologically informative'. For cellular roles, or for a single-celled organism, 'metabolism' isn't so useful as a 'direct' slim term (metabolic processes include transcription, translation, DNA replication, mRNA processing etc., in addition to primary and secondary metabolism). For pombe 3102 of 4194 process annotated gene products are annotated to metabolism, so this term in a slim does not tell you very much.
>>>>>
>>>>> In addition, if metabolism is included as a 'direct' slim term, and you have a gene product which is annotated ONLY to "metabolic process" then you really know very little about its biological role. This can occur frequently, as it is possible to predict that a protein has catalytic activity, and is involved in a 'metabolic process', but not to say anything more specific; there are many direct InterPro mappings to these two terms. If I was trying to assess the 'real biological roles' of my organism's gene products, I would wish to exclude direct annotations to 'metabolic process' from the slim.
>>>>>
>>>>> A GO slim provides a mechanism to filter out annotations to high level nodes that are relatively uninformative (with respect to the biological role), like 'metabolism, cellular process, localization' (in the slim, they will be annotated to 'unknown' if there is no annotation to one of your slim terms or their children).
>>>>>
>>>>> Once you exclude a term like metabolism it becomes necessary to include all of the child terms (or a combination of child terms) to give complete coverage of the parent term (NOTE: once the slimmed terms are mapped to the slim ontology the high level terms will be included, but their totals will only reflect the total of the gene products annotated via the terms in the slim).
>>>>>
>>>>> The difficult part in building a slim is identifying the set of terms which provides complete coverage; this is the tricky step for most biologists, who are not so familiar with the ontologies. It would be useful to provide a starting slim which gives complete coverage of all annotations (using biologically relevant terms for common applications) which they can change as necessary. Maybe we should provide a set of 'complete coverage' slims for common applications.
>>>>>
>>>>> i.e.
>>>>> suitable for multicellular organisms and very general biological roles
>>>>> suitable for single-celled eukaryotes, or evaluating basic cellular processes
>>>>>
>>>>> Val
>>>>>
>>>>> Judith Blake wrote:
>>>>>
>>>>>> Val,
>>>>>> I still maintain that users need to be able to generate grouping criteria based on their usage. I think we could go back to the fly genome paper and see the primary molecular divisions that seemed most useful to describe the genome properties, like 'reproduction' and 'metabolism'. Anything more granular is specific to the user.
>>>>>> A mapping on this basis would likely include fewer than 20 terms and would subdivide trees.
>>>>>>
>>>>>> judy
>>>>>>
>>>>>> Valerie Wood wrote:
>>>>>>
>>>>>>> I think it is a good idea for the consortium to provide an official 'GO slim', and advise people how they may want to alter the slim to fit their individual purpose.
>>>>>>>
>>>>>>> A useful generic GO slim has a number of qualities (I have tried to list these below; please suggest any additional ones. I hadn't really thought before about what rules I used for making a slim, so this is the first time I have documented them). Following the 'guidelines' below I have suggested a set of process terms which I think should make up the generic process slim.
>>>>>>>
>>>>>>> Perhaps we could use this as a starting point, and people can suggest additional terms (with reasons) or terms which should be removed. This provides good coverage of basic cellular processes but would need extending to cover multicellular processes.
>>>>>>>
>>>>>>> GO Slim criteria
>>>>>>>
>>>>>>> 1. The generic slim should be as organism independent as possible (although clearly some terms will not be applicable to single celled eukaryotes and some eukaryotic terms will not be applicable to prokaryotes)
>>>>>>>
>>>>>>> 2. The slim should cover AS MANY genes with annotated processes as possible
>>>>>>>
>>>>>>> 3. The slim should cover AS MANY genes with annotated processes as possible with the smallest number of leaf node terms (if you include too many terms it becomes too large and you start to lose the advantages of a slim).
>>>>>>>
>>>>>>> 4. It might be useful to try to avoid terms with an excessively small or large number of annotations (i.e. ideally your terms will not show an extreme distribution in your histogram)
>>>>>>>
>>>>>>> 5. Preferably the slim should avoid sibling terms with a large overlap between them. If you choose two siblings with 200 genes annotated to each, and the majority of the annotations overlap, it is usually better to select the parent node (i.e. replace 2 terms by one single term). Conversely, if the child terms of a node fall into distinct non-overlapping subsets, it might be more informative to include both child terms in your slim (see also point 7 below)
>>>>>>>
>>>>>>> 6. For most purposes you need to include a representative term for all biologically relevant processes, by including terms which are meaningful to biologists.
>>>>>>>
>>>>>>> 7. If you are using your slim for data analysis (and not just for visualization) you need to include terms which will allow you to distinguish genes based on their biological properties. For example, it is not good to lump all genes involved in transport under transport because the genes annotated to distinct child terms (vesicle-mediated transport, protein targeting, transmembrane transport) are VERY different in terms of their i) viability ii) species distribution iii) number of interaction partners iv) copy number v) expression pattern, so it does not make sense to lump them together in your slim set.
>>>>>>>
>>>>>>> Using these criteria this is the basic cellular process eukaryotic slim I use (or slight variations of it). The number of annotations (for pombe obviously) is in parentheses (protein coding only).
>>>>>>>
>>>>>>> GO:0055085 transmembrane transport (278)
>>>>>>> GO:0006913 nucleocytoplasmic transport (114)
>>>>>>> GO:0006605 protein targeting (162)
>>>>>>> GO:0016192 vesicle-mediated transport (266)
>>>>>>> GO:0051186 cofactor metabolic process (139)
>>>>>>> GO:0006766 vitamin metabolic process (57)
>>>>>>> GO:0006790 sulfur metabolic process (45)
>>>>>>> GO:0006807 nitrogen compound metabolic process (224)
>>>>>>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (118)
>>>>>>> GO:0005975 carbohydrate metabolic process (199)
>>>>>>> GO:0006629 lipid metabolic process (201)
>>>>>>> GO:0006399 tRNA metabolic process (125)
>>>>>>> GO:0006520 amino acid metabolic process (187)
>>>>>>> GO:0006412 translation (357)
>>>>>>> GO:0006259 DNA metabolic process (296)
>>>>>>> GO:0006508 proteolysis (223)
>>>>>>> GO:0016071 mRNA metabolic process (204)
>>>>>>> GO:0043413 biopolymer glycosylation (65) possibly drop?
>>>>>>> GO:0006464 protein modification process (585)
>>>>>>> GO:0007059 chromosome segregation (186)
>>>>>>> GO:0007049 cell cycle (552)
>>>>>>> GO:0007010 cytoskeletal organization and biogenesis (236)
>>>>>>> GO:0000910 cytokinesis (145)
>>>>>>> GO:0007165 signal transduction (362)
>>>>>>> GO:0006457 protein folding (80)
>>>>>>> GO:0042254 ribosome biogenesis and assembly (223)
>>>>>>> GO:0045229 external encapsulating structure organization and biogenesis (124)
>>>>>>> GO:xxxxxxxx general transcription (see note *1 below)
>>>>>>> GO:0032569 specific transcription from RNA polymerase II promoter (102)
>>>>>>> (total 424 for all transcription)
>>>>>>> GO:0000902 cell morphogenesis (86)
>>>>>>> GO:0006338 establishment and/or maintenance of chromatin architecture (231)
>>>>>>> GO:reproductive process (182)
>>>>>>> GO:0007005 mitochondrion organization and biogenesis (251)
>>>>>>> GO:0006091 generation of precursor metabolites and energy (113)
>>>>>>> GO:0007031 peroxisome organization and biogenesis (20)
>>>>>>>
>>>>>>> At this point there are ~100 pombe genes (out of the 3960 with an annotated process term) which aren't included in the slim
>>>>>>>
>>>>>>> I could also include....
>>>>>>> vacuolar transport (91) reduces the uncovered genes by 6 (most also annotated to protein targeting)
>>>>>>> telomere maintenance (54) reduces by 6 (most also annotated to DNA metabolic process)
>>>>>>> snoRNA metabolic process (10) reduces by 2
>>>>>>> ...to improve coverage (very slightly)
>>>>>>>
>>>>>>> Finally I include
>>>>>>> GO:0006950 response to stress (444)
>>>>>>> this term overlaps with most other processes so it is largely redundant, but it is useful.
>>>>>>>
>>>>>>> This leaves ~30 pombe genes with a process annotation unassigned to the GO slim; these are often annotated to terms like homeostasis and its children, or otherwise uninformative terms
>>>>>>>
>>>>>>> For some purposes I would also include
>>>>>>> GO:0065007 biological regulation (1021)
>>>>>>> but I don't know if this is a good term to include in a generic slim
>>>>>>>
>>>>>>> To make this work for multicellular eukaryotes, we would probably want to add non-cellular process terms like:
>>>>>>>
>>>>>>> developmental process
>>>>>>> immune system process
>>>>>>>
>>>>>>> * Note1 it is not currently possible to retrieve genes involved in general transcription as opposed to gene specific transcription (i.e. RNA I, II and III polymerases etc.), with a single query. This is also important for enrichment as the genes in these 2 sets are very different in terms of species distribution, copy number and viability. I requested a grouping term for these processes a while ago and hopefully this will be implemented shortly.
>>>>>>>
>>>>>>> See:
>>>>>>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764
>>>>>>>
>>>>>>> Val
>>>>>>>
>>>>>>> Ben Hitz wrote:
>>>>>>>
>>>>>>>> Emily -
>>>>>>>> I have interest in working on the generic go slim; I need it (or something similar) to define graphics for an interaction network.
>>>>>>>>
>>>>>>>> Ben
>>>>>>>>
>>>>>>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> From replying to a user request, I've just been having a quick look at the composition of the generic GO slim, and relating the GO terms included to the number of annotations displayed by AmiGO.
>>>>>>>>>
>>>>>>>>> Should, for instance, the 'cell recognition' term still be included in the generic GO slim? - it has only been annotated to 182 gene products, whereas its sibling terms: 'cell division', 'cell cycle' and 'cell motility', have not been included even though they (directly or indirectly) have been annotated to more than 1,200 gene products each. Similarly, the term 'cytoplasm organization and biogenesis' is in the GO slim but only has 113 gps annotated, whereas the 'membrane organisation and biogenesis' term has been annotated to 1,509 gps.
>>>>>>>>>
>>>>>>>>> I was just wondering what the goal of the generic GO slim is...
>>>>>>>>> if terms are selected on the basis that as many annotated gene products from different organisms should get mapped to descriptive GO terms before they are caught by the BP, MF, CC root terms (while also providing a full selection of terms across the whole GO vocabulary), should we think of reviewing some of its composition in relation to overall annotation frequency? Or should the GO slim be kept as stable as possible?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Emily
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> ------------------------------------------------------------------
>>>>>>>>> Emily Dimmer Ph.D.
>>>>>>>>> GOA Coordinator
>>>>>>>>> EMBL-EBI
>>>>>>>>> Wellcome Trust Genome Campus
>>>>>>>>> Hinxton
>>>>>>>>> Cambridge CB10 1SD, U.K.
>>>>>>>>> Tel: +44 1223 494654
>>>>>>>>> Fax: +44 1223 494468
>>>>>>>>> email: edimmer at ebi.ac.uk
>>>>>>>>> URL: http://www.ebi.ac.uk/goa
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Ben Hitz
>>>>>>>> Senior Scientific Programmer ** Saccharomyces Genome Database ** GO Consortium
>>>>>>>> Stanford University ** hitz at genome.stanford.edu

From jane at ebi.ac.uk Fri May 9 05:10:10 2008
From: jane at ebi.ac.uk (Jane Lomax)
Date: Fri, 09 May 2008 13:10:10 +0100
Subject: [Go] Composition of the generic GO slim
In-Reply-To: <48243877.3040603@informatics.jax.org>
References: <481F54A4.50504@informatics.jax.org> <482423F9.5010105@ebi.ac.uk> <48243877.3040603@informatics.jax.org>
Message-ID: <48243F22.5040308@ebi.ac.uk>

I don't have any strong feelings about how the generic GO slim is generated, just as long as it's up-to-date and we have some documented, logical basis for how we do it.

Let's not forget about this - it's important...

Jane

Judith Blake wrote:
> ahhhh not another WG :)
>
> I think it might be accomplished by taking the 12-16 subdivisions in either the human or fly genome papers that subdivide cellular roles, and looking for similar sets in a textbook for CC and MF by chapter titles. This number of subdivisions is the most useful for general overview. I think the single-cell concerns may not be so important at this level of 'genericism'; some subdivisions might be more or less devoid of annotations...or maybe it is useful to have two..but we could start with one.
>
> Then figure out how to sum GO to those terms.
>
> Start with the biology, not the ontology.
>
> of course, I'm biased in thinking the MGI go-slim accomplishes this to some extent.
>
> I think a draft of this could be done in a week by a dedicated curator. but who? I'll think about this.
>
> Judy
>
> Jane Lomax wrote:
>> Hi - sorry, only just got to this thread...
>>
>> From an advocacy point of view I think it's crucial for us to provide a generic GO slim that's up to date with the ontologies. As others have said, most naive users are not going to have the knowledge to create their own tailored slims in the beginning, so we need to provide something general for them to start from, especially as the pre-built slims are now part of the AmiGO GO slim mapper. Users can then trim or expand as they see fit for their own purposes as they become more familiar with the technology.
>>
>> Users blindly using the generic slim in a formal analysis without an understanding of the underlying mechanism are, quite frankly, not performing good science. This should be weeded out at the level of peer review, just the same as with any other misuse of bioinformatics apps.
>>
>> Perhaps the documentation for the generic GO slim might say something like:
>>
>> "GO provides a generic GO slim which, like the GO itself, is not species specific. This should be a suitable starting point for most investigations as it has broad coverage over most annotations. Users should tailor this GO slim according to the specific requirements of their own research".
>>
>> I like Val's suggestions for creating the generic GO slim - how about we set up a WG?
>>
>> Jane
>>
>> Judith Blake wrote:
>>> agreed,
>>> we should remove or change the text to reflect reality.
>>> judy
>>>
>>> Valerie Wood wrote:
>>>
>>>> The GO website makes the following statement, which is a bit misleading if we don't intend to provide any comprehensive slims....(as Emily pointed out earlier in this thread, this isn't a comprehensive slim....)
>>>>
>>>> "GO provides a generic GO slim which, like the GO itself, is not species specific, and which should be suitable for most purposes."
>>>>
>>>> So maybe this slim should not be described as such?
>>>>
>>>> Judith Blake wrote:
>>>>> Val,
>>>>> My point really is that experiments are done in context. A person studying metabolism may want to break out those terms by particular sub-divisions and lump other things. One of the roles of collaborating GO people would be to aid in the construction of particular slims if requested.
>>>>>
>>>>> For example, when I have done this, the researcher provided a list of 12-16 subdivisions that made sense for their purpose, and we constructed a GO_slim that subdivided the GO appropriately.
>>>>> I think of it as part of the data analysis process. A researcher using a generic GO_slim without understanding the vagaries of the annotations or of the ontology subtrees will not understand the results either.
>>>>>
>>>>> my opinion.
>>>>> judy
>>>>>
>>>>> Valerie Wood wrote:
>>>>>
>>>>>> Judy,
>>>>>>
>>>>>> You are correct that no one slim is going to fit all organisms or all uses. However it isn't simple to create an informative slim which gives complete (or nearly complete) coverage of all of an organism's annotations (and complete coverage of the a