The GO website makes the following statement, which is a bit misleading
if we don't intend to provide any comprehensive slims (as Emily pointed
out earlier in this thread, this isn't a comprehensive slim):

"GO provides a generic GO slim which, like the GO itself, is not species
specific, and which should be suitable for most purposes."

So maybe this slim should not be described as such?

Judith Blake wrote:
> Val,
> My point really is that experiments are done in context. A person
> studying metabolism may want to break out those terms by particular
> sub-divisions and lump other things. One of the roles of collaborating
> GO people would be to aid in the construction of particular slims if
> requested.
>
> For example, when I have done this, the researcher provided a list of
> 12-16 subdivisions that made sense for their purpose, and we constructed
> a GO_slim that subdivided the GO appropriately. I think of it as part
> of the data analysis process. A researcher using a generic GO_slim
> without understanding the vagaries of the annotations or of the ontology
> subtrees will not understand the results either.
>
> my opinion.
> judy
>
> Valerie Wood wrote:
> > Judy,
> >
> > You are correct that no one slim is going to fit all organisms or all
> > uses. However, it isn't simple to create an informative slim which
> > gives complete (or nearly complete) coverage of all of an organism's
> > annotations (and complete coverage of the annotation space is an
> > important feature of a robust slim). Does the Drosophila slim set
> > cover all of the annotated genes?
> >
> > The slim I suggested will give complete coverage for single-celled
> > eukaryotes (it may need additional high level terms to cover
> > multicellular eukaryotes). This particular slim is useful for
> > evaluating an organism's "cell biology".
> > Perhaps a very generic slim, which only includes very high level
> > terms, would be useful for multicellular organisms, but it would not
> > be so useful for single-celled organisms.
> >
> > One suggested criterion (6 in my previous message) was that terms be
> > meaningful to biologists.
> > What I meant here was that the terms should be 'biologically
> > informative'. For cellular roles, or for a single-celled organism,
> > 'metabolism' isn't so useful as a 'direct' slim term (metabolic
> > processes include transcription, translation, DNA replication, mRNA
> > processing etc., in addition to primary and secondary metabolism).
> > For pombe, 3102 of 4194 process-annotated gene products are annotated
> > to metabolism, so this term in a slim does not tell you very much.
> >
> > In addition, if metabolism is included as a 'direct' slim term, and
> > you have a gene product which is annotated ONLY to "metabolic
> > process", then you really know very little about its biological role.
> > This can occur frequently, as it is possible to predict that a protein
> > has catalytic activity, and is involved in a 'metabolic process', but
> > not to say anything more specific; there are many direct InterPro
> > mappings to these two terms. If I was trying to assess the 'real
> > biological roles' of my organism's gene products, I would wish to
> > exclude direct annotations to 'metabolic process' from the slim.
> >
> > A GO slim provides a mechanism to filter out annotations to high
> > level, relatively uninformative (with respect to the biological role)
> > nodes like 'metabolism', 'cellular process' and 'localization' (in the
> > slim, gene products will be annotated to 'unknown' if there is no
> > annotation to one of your slim terms or their children).
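The filtering described in the paragraph above can be sketched in a few lines. This is only an illustration under stated assumptions: the `PARENTS` graph, the gene names and the helper functions `ancestors` and `map_to_slim` are hypothetical, and the term labels are a toy is_a hierarchy rather than real GO data.

```python
from collections import defaultdict

# Toy is_a graph (child -> parents). Illustrative only, not real GO data.
PARENTS = {
    "metabolic process": ["biological_process"],
    "lipid metabolic process": ["metabolic process"],
    "translation": ["metabolic process"],
    "transport": ["biological_process"],
    "transmembrane transport": ["transport"],
}

def ancestors(term):
    """Return the term itself plus all of its ancestors in the toy graph."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen

def map_to_slim(annotations, slim):
    """Group gene products under the slim terms their annotations reach.

    A gene product whose annotations reach no slim term (for example one
    annotated only to a high-level term left out of the slim) is placed
    in the 'unknown' bucket.
    """
    buckets = defaultdict(set)
    for gene, terms in annotations.items():
        hits = set()
        for term in terms:
            hits |= ancestors(term) & slim
        for slim_term in hits or {"unknown"}:
            buckets[slim_term].add(gene)
    return dict(buckets)

# Hypothetical annotations: geneB is annotated ONLY to the high-level term.
annotations = {
    "geneA": {"lipid metabolic process"},
    "geneB": {"metabolic process"},
    "geneC": {"transmembrane transport", "translation"},
}
slim = {"lipid metabolic process", "translation", "transmembrane transport"}
result = map_to_slim(annotations, slim)
```

Because 'metabolic process' itself is not in this slim, `geneB` ends up under 'unknown', while `geneC` is counted under both of the slim terms it reaches.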
> >
> > Once you exclude a term like metabolism it becomes necessary to
> > include all of the child terms (or a combination of child terms) to
> > give complete coverage of the parent term (NOTE: once the slimmed
> > terms are mapped to the slim ontology the high level terms will be
> > included, but their totals will only reflect the total of the gene
> > products annotated via the terms in the slim).
> >
> > The difficult part in building a slim is identifying the set of terms
> > which provides complete coverage; this is the tricky step for most
> > biologists, who are not so familiar with the ontologies. It would be
> > useful to provide a starting slim which gives complete coverage of all
> > annotations (using biologically relevant terms for common
> > applications) which they can change as necessary.
> > Maybe we should provide a set of 'complete coverage' slims for common
> > applications.
> >
> > i.e.
> > suitable for multicellular organisms and very general biological roles
> > suitable for single-celled eukaryotes, or evaluating basic cellular
> > processes
> >
> > Val
> >
> > Judith Blake wrote:
> >> Val,
> >> I still maintain that users need to be able to generate grouping
> >> criteria based on their usage. I think we could go back to the fly
> >> genome paper and see the primary molecular divisions that seemed most
> >> useful to describe the genome properties, like 'reproduction' and
> >> 'metabolism'. Anything more granular is specific to the user. A
> >> mapping on this basis would likely include fewer than 20 terms and
> >> would subdivide trees.
> >>
> >> judy
> >>
> >> Valerie Wood wrote:
> >>> I think it is a good idea for the consortium to provide an official
> >>> 'GO slim', and advise people how they may want to alter the slim to
> >>> fit their individual purpose.
> >>>
> >>> A useful generic GO slim has a number of qualities (I have tried to
> >>> list these below, please suggest any additional ones; I hadn't
> >>> really thought before about what rules I used for making a slim, so
> >>> this is the first time I have documented them). Following the
> >>> 'guidelines' below I have suggested a set of processes which I
> >>> think should make up the generic process slim.
> >>>
> >>> Perhaps we could use this as a starting point, and people can
> >>> suggest additional terms (with reasons) or terms which should be
> >>> removed. This provides good coverage of basic cellular processes but
> >>> would need extending to cover multicellular processes.
> >>>
> >>> GO Slim criteria
> >>>
> >>> 1. The generic slim should be as organism independent as possible
> >>> (although clearly some terms will not be applicable to single-celled
> >>> eukaryotes and some eukaryotic terms will not be applicable to
> >>> prokaryotes)
> >>>
> >>> 2. The slim should cover AS MANY genes with annotated processes as
> >>> possible
> >>>
> >>> 3. The slim should cover AS MANY genes with annotated processes as
> >>> possible with the smallest number of leaf node terms (if you include
> >>> too many terms the slim becomes too large and you start to lose the
> >>> advantages of a slim).
> >>>
> >>> 4. It might be useful to try to avoid terms with an excessively
> >>> small or large number of annotations (i.e. ideally your terms will
> >>> not give an extreme distribution in your histogram)
> >>>
> >>> 5. Preferably the slim should avoid sibling terms with a large
> >>> overlap between them. If you choose two siblings with 200 genes
> >>> annotated to each, and the majority of the annotations overlap, it
> >>> is usually better to select the parent node (i.e. replace 2 terms by
> >>> one single term).
> >>> Conversely, if the child terms of a node fall
> >>> into distinct non-overlapping subsets, it might be more informative
> >>> to include both child terms in your slim (see also point 7 below)
> >>>
> >>> 6. For most purposes you need to include a representative term for
> >>> all biologically relevant processes, by including terms which are
> >>> meaningful to biologists.
> >>>
> >>> 7. If you are using your slim for data analysis (and not just for
> >>> visualization) you need to include terms which will allow you to
> >>> distinguish genes based on their biological properties.
> >>> For example, it is not good to lump all genes involved in transport
> >>> under 'transport', because the genes annotated to distinct child
> >>> terms (vesicle-mediated transport, protein targeting, transmembrane
> >>> transport) are VERY different in terms of their i) viability ii)
> >>> species distribution iii) number of interaction partners iv) copy
> >>> number v) expression pattern, so it does not make sense to lump them
> >>> together in your slim set.
> >>>
> >>> Using these criteria, this is the basic cellular process eukaryotic
> >>> slim I use (or slight variations of it). The number of annotations
> >>> (for pombe, obviously) is in parentheses (protein coding only).
> >>>
> >>> GO:0055085 transmembrane transport (278)
> >>> GO:0006913 nucleocytoplasmic transport (114)
> >>> GO:0006605 protein targeting (162)
> >>> GO:0016192 vesicle-mediated transport (266)
> >>> GO:0051186 cofactor metabolic process (139)
> >>> GO:0006766 vitamin metabolic process (57)
> >>> GO:0006790 sulfur metabolic process (45)
> >>> GO:0006807 nitrogen compound metabolic process (224)
> >>> GO:0055086 nucleobase, nucleoside and nucleotide metabolic process (118)
> >>> GO:0005975 carbohydrate metabolic process (199)
> >>> GO:0006629 lipid metabolic process (201)
> >>> GO:0006399 tRNA metabolic process (125)
> >>> GO:0006520 amino acid metabolic process (187)
> >>> GO:0006412 translation (357)
> >>> GO:0006259 DNA metabolic process (296)
> >>> GO:0006508 proteolysis (223)
> >>> GO:0016071 mRNA metabolic process (204)
> >>> GO:0043413 biopolymer glycosylation (65) possibly drop?
> >>> GO:0006464 protein modification process (585)
> >>> GO:0007059 chromosome segregation (186)
> >>> GO:0007049 cell cycle (552)
> >>> GO:0007010 cytoskeletal organization and biogenesis (236)
> >>> GO:0000910 cytokinesis (145)
> >>> GO:0007165 signal transduction (362)
> >>> GO:0006457 protein folding (80)
> >>> GO:0042254 ribosome biogenesis and assembly (223)
> >>> GO:0045229 external encapsulating structure organization and
> >>> biogenesis (124)
> >>> GO:xxxxxxxx general transcription (see note *1 below)
> >>> GO:0032569 specific transcription from RNA polymerase II promoter (102)
> >>> (total 424 for all transcription)
> >>> GO:0000902 cell morphogenesis (86)
> >>> GO:0006338 establishment and/or maintenance of chromatin
> >>> architecture (231)
> >>> GO:reproductive process (182)
> >>> GO:0007005 mitochondrion organization and biogenesis (251)
> >>> GO:0006091 generation of precursor metabolites and energy (113)
> >>> GO:0007031 peroxisome organization and biogenesis (20)
> >>>
> >>> At this point there are about
> >>> 100 pombe genes (out of the 3960 with an annotated process term)
> >>> which aren't included in the slim.
> >>>
> >>> I could also include....
> >>> vacuolar transport (91) reduces by 6 (most also annotated to protein
> >>> targeting)
> >>> telomere maintenance (54) reduces by 6 (most also annotated to DNA
> >>> metabolic process)
> >>> snoRNA metabolic process (10) reduces by 2
> >>> ...to improve coverage (very slightly)
> >>>
> >>> Finally I include
> >>> GO:0006950 response to stress (444)
> >>> This term overlaps with most other processes, so it is largely
> >>> redundant, but it is useful.
> >>>
> >>> This leaves ~30 pombe genes with a process annotation unassigned to
> >>> the GO slim; these are often annotated to terms like homeostasis and
> >>> its children, or otherwise uninformative terms.
> >>>
> >>> For some purposes I would also include
> >>> GO:0065007 biological regulation (1021)
> >>> but I don't know if this is a good term to include in a generic slim.
> >>>
> >>> To make this work for multicellular eukaryotes, we would probably
> >>> want to add non-cellular process terms like:
> >>>
> >>> developmental process
> >>> immune system process
> >>>
> >>> * Note 1: it is not currently possible to retrieve genes involved in
> >>> general transcription as opposed to gene-specific transcription (i.e.
> >>> RNA polymerases I, II and III etc.) with a single query. This is
> >>> also important for enrichment, as the genes in these 2 sets are very
> >>> different in terms of species distribution, copy number and
> >>> viability. I requested a grouping term for these processes a while
> >>> ago and hopefully this will be implemented shortly.
> >>>
> >>> See:
> >>> https://sourceforge.net/tracker/?func=detail&aid=1590000&group_id=36855&atid=440764
> >>>
> >>> Val
> >>>
> >>> Ben Hitz wrote:
> >>>
> >>>> Emily -
> >>>> I have interest in working on the generic go slim; I need it (or
> >>>> something similar) to define graphics for an interaction network.
> >>>>
> >>>> Ben
> >>>>
> >>>>
> >>>> On Apr 30, 2008, at 10:03 AM, Emily Dimmer wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> From replying to a user request, I've just been having a quick
> >>>>> look at the composition of the generic GO slim, and relating the
> >>>>> GO terms included to the number of annotations displayed by AmiGO.
> >>>>>
> >>>>> Should, for instance, the 'cell recognition' term still be
> >>>>> included in the generic GO slim? It has only been annotated to 182
> >>>>> gene products, whereas its sibling terms 'cell division', 'cell
> >>>>> cycle' and 'cell motility' have not been included even though they
> >>>>> (directly or indirectly) have been annotated to more than 1,200
> >>>>> gene products each.
> >>>>> Similarly, the term 'cytoplasm organization and biogenesis' is in
> >>>>> the GO slim but only has 113 gps annotated, whereas the 'membrane
> >>>>> organisation and biogenesis' term has been annotated to 1,509 gps.
> >>>>>
> >>>>> I was just wondering what the goal of the generic GO slim is...
> >>>>> if terms are selected on the basis that as many annotated gene
> >>>>> products from different organisms as possible should get mapped to
> >>>>> descriptive GO terms before they are caught by the BP, MF, CC root
> >>>>> terms (while also providing a full selection of terms across the
> >>>>> whole GO vocabulary), should we think of reviewing some of its
> >>>>> composition in relation to overall annotation frequency? Or should
> >>>>> the GO slim be kept as stable as possible?
> >>>>>
> >>>>> Cheers,
> >>>>> Emily
> >>>>>
> >>>>> --
> >>>>> Emily Dimmer Ph.D.
> >>>>> GOA Coordinator
> >>>>> EMBL-EBI
> >>>>> Wellcome Trust Genome Campus
> >>>>> Hinxton
> >>>>> Cambridge CB10 1SD, U.K.
> >>>>> Tel: +44 1223 494654
> >>>>> Fax: +44 1223 494468
> >>>>> email: edimmer@ebi.ac.uk
> >>>>> URL: http://www.ebi.ac.uk/goa
> >>>>>
> >>>>> _______________________________________________
> >>>>> Go mailing list
> >>>>> Go@geneontology.org
> >>>>> http://fafner.stanford.edu/mailman/listinfo/go
> >>>>
> >>>> --
> >>>> Ben Hitz
> >>>> Senior Scientific Programmer ** Saccharomyces Genome Database **
> >>>> GO Consortium
> >>>> Stanford University ** hitz@genome.stanford.edu
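
The coverage figures quoted in the thread (how many annotated genes fall outside a slim, and by how much an extra term "reduces" that number) come down to simple set arithmetic. The following is a minimal sketch, not anyone's actual pipeline: the function names `coverage` and `marginal_gain`, the gene identifiers and the gene sets are all hypothetical, and `genes_by_term` is assumed to already contain every gene annotated to a term directly or via its descendants.

```python
def coverage(slim_terms, genes_by_term, annotated_genes):
    """Return (covered, uncovered) gene sets for a candidate slim."""
    covered = set()
    for term in slim_terms:
        covered |= genes_by_term.get(term, set())
    covered &= annotated_genes
    return covered, annotated_genes - covered

def marginal_gain(candidate, slim_terms, genes_by_term, annotated_genes):
    """Number of currently uncovered genes that adding `candidate` would cover."""
    _, uncovered = coverage(slim_terms, genes_by_term, annotated_genes)
    return len(genes_by_term.get(candidate, set()) & uncovered)

# Hypothetical toy data: g1..g6 stand in for annotated genes.
genes_by_term = {
    "protein targeting": {"g1", "g2", "g3"},
    "DNA metabolic process": {"g4"},
    "vacuolar transport": {"g2", "g3", "g5"},
}
annotated_genes = {"g1", "g2", "g3", "g4", "g5", "g6"}
slim_terms = ["protein targeting", "DNA metabolic process"]

covered, uncovered = coverage(slim_terms, genes_by_term, annotated_genes)
gain = marginal_gain("vacuolar transport", slim_terms, genes_by_term, annotated_genes)
```

In this toy data 'vacuolar transport' has three genes but adds only one newly covered gene, because the other two are already reached through 'protein targeting'. That is the same pattern as a term with many annotations that only slightly improves coverage.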