From gberriz at hms.harvard.edu Mon Sep 8 14:49:03 2008
From: gberriz at hms.harvard.edu (Gabriel Berriz)
Date: Mon, 8 Sep 2008 17:49:03 -0400
Subject: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
Message-ID: <31552965-46E2-46A9-9C76-92C7EE3D179F@hms.harvard.edu>

Dear GO friends,

For some species, the info given in the dbxref table includes IDs from
multiple databases, which raises the possibility of "cryptic
redundancies", i.e. associations that appear distinct because they are
assigned IDs from different databases but that in fact refer to the same
underlying gene product. For example, if I compare the sets of rat
dbxrefs that have database names RGD and ENSEMBL respectively, I find
that the overlap (redundancy) of these two sets of IDs consists of about
1000 IDs, which is over 5% of all the possible rat gene products. (To
compute this overlap, I first mapped the RGD IDs to ENSEMBL IDs using
the mappings provided by Ensembl version 49.)

Would it help to avoid these "cryptic redundancies" if a single
database (Ensembl, RGD, whatever) were used for each species?

Thanks for your comments,

Gabriel Berriz

=============================================================
Gabriel F. Berriz, PhD
Bioinformatics Developer
Roth Lab
Biological Chemistry and Molecular Pharmacology -- Harvard Medical School
Seeley G. Mudd Building 322B
Boston, MA 02115-5701
Telephone: 617.432.3555
Fax: 617.432.3557

From jblake at informatics.jax.org Mon Sep 8 15:09:22 2008
From: jblake at informatics.jax.org (Judith Blake)
Date: Mon, 08 Sep 2008 18:09:22 -0400
Subject: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
In-Reply-To: <31552965-46E2-46A9-9C76-92C7EE3D179F@hms.harvard.edu>
References: <31552965-46E2-46A9-9C76-92C7EE3D179F@hms.harvard.edu>
Message-ID: <48C5A292.9030005@informatics.jax.org>

Gabriel,

The gene association files are non-redundant.
Primary model organisms have responsibility for integrating annotations
from multiple sources and submitting a non-redundant file to the GOdb.
QC checks on the files also remove redundancies.

The details of annotation files are here...
http://www.geneontology.org/GO.format.annotation.shtml

Further details can be obtained from the GO help desk
http://www.geneontology.org/GO.contacts.shtml

Judy

From FMcCarthy at cvm.msstate.edu Mon Sep 8 15:28:01 2008
From: FMcCarthy at cvm.msstate.edu (Fiona McCarthy)
Date: Mon, 08 Sep 2008 17:28:01 -0500
Subject: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
In-Reply-To: <31552965-46E2-46A9-9C76-92C7EE3D179F@hms.harvard.edu>
References: <31552965-46E2-46A9-9C76-92C7EE3D179F@hms.harvard.edu>
Message-ID:

Hi Gabriel,

I am not sure about rat, but I did notice earlier in the year that
Ensembl was reporting GO in a very strange way: it was directly
attributing GO annotations from human orthologs to the chicken & cow
genes. This was causing a lot of apparent redundancies when I scanned
the GO annotations, but when I checked them more carefully, many of the
Ensembl GO annotations made no sense. This happened in version 49 and
was supposed to be fixed in version 50.

Does this sound like what you are seeing?

Fiona

The AgBase Databases
Department of Basic Sciences
Box 6100
MS 39762-6100
Mississippi State University
USA
Tel: (+ 1) 662 325 5859
Fax: (+ 1) 662 325 1031
http://www.agbase.msstate.edu/

From d.m.a.martin at dundee.ac.uk Tue Sep 9 05:55:34 2008
From: d.m.a.martin at dundee.ac.uk (David Martin)
Date: Tue, 09 Sep 2008 13:55:34 +0100
Subject: [Gofriends] Error uploading MySQL dump
Message-ID: <48C67246.6040108@dundee.ac.uk>

Trying to import the term_synonym table (term_synonym.sql) I get the
following MySQL error:

ERROR 1170 (42000) at line 16: BLOB/TEXT column 'term_synonym' used in
key specification without a key length

MySQL version is 4.1.20-log

..d

From gberriz at hms.harvard.edu Tue Sep 9 10:22:49 2008
From: gberriz at hms.harvard.edu (Gabriel Berriz)
Date: Tue, 9 Sep 2008 13:22:49 -0400
Subject: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
In-Reply-To: <48C5A292.9030005@informatics.jax.org>
References: <31552965-46E2-46A9-9C76-92C7EE3D179F@hms.harvard.edu> <48C5A292.9030005@informatics.jax.org>
Message-ID: <557C69FA-5A9D-4AAE-B1CA-74822E8D3C8E@hms.harvard.edu>

On 2008.09.08 Mon, at 18:09, Judith Blake wrote:

> Gabriel,
>
> The gene association files are non-redundant. Primary model organisms
> have responsibility for integrating annotations from multiple sources
> and submitting a non-redundant file to the GOdb. QC checks on the
> files also remove redundancies.

Hi, Judy. My word choice was not a very good one when I wrote of
"redundancies", so let me give an example of what I meant. It comes
from the latest gene_association.rgd.gz file. (This example is the
first one I followed up on of the 1000 or so that I mentioned in my
previous email.)

The latest gene_association.rgd.gz file contains 15 associations for
RGD ID 1302948, and 4 associations for ENSEMBL ID ENSRNOP00000034933.
In fact, according to both Ensembl and RGD
(http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948) these two
identifiers both refer to the same entity (transforming acidic
coiled-coil containing protein 3, aka Tacc3). Hence, the file uses two
names for the same thing. Why?

The reason I bring this problem up is that, in our work, we compute
statistics that are very sensitive to how many genes have a particular
GO attribute; therefore it is crucial for us to count the associations
in this example as 19 associations belonging to the same protein,
rather than 15 belonging to one and 4 belonging to another. This
accounting task is made significantly more difficult by the fact that
the association file uses two different names for the same thing.

Maybe I'm wrong here, but this looks to me like a bug rather than a
feature: I can't see that any good could come of using multiple names
for the same thing in a document like this.

If it is indeed a bug, would it be too difficult to fix? I.e. would it
be too difficult for GO and the purveyors of association files to use a
consistent nomenclature whenever possible?

If it's of any help with this, we have a tool, called Synergizer, for
bulk mapping of identifiers from one namespace to another, and it is a
simple matter to set up a pipeline to do it automatically (see
http://llama.med.harvard.edu/synergizer/doc). We'd be happy to help
with this in any way we can. (Although I imagine that the organizations
that generate such association files are the ultimate experts for
resolving such nomenclature issues.)

Also, as I said earlier, the example above is not isolated. For R.
norvegicus alone there are about 1000, and that's only focusing on RGD
vs. ENSEMBL IDs. And the problem is not limited to R. norvegicus.
Among the organisms that I have analyzed, I found similar nomenclature
inconsistencies with several others, including B. taurus, G. gallus,
C. elegans, and H. sapiens.

Thanks for your comments!

Gabriel Berriz

=============================================================
Gabriel F. Berriz, PhD
Bioinformatics Developer
Roth Lab
Biological Chemistry and Molecular Pharmacology -- Harvard Medical School
Seeley G. Mudd Building 322B
Boston, MA 02115-5701
Telephone: 617.432.3555
Fax: 617.432.3557

From hitz at genome.stanford.edu Tue Sep 9 12:13:13 2008
From: hitz at genome.stanford.edu (Benjamin Hitz)
Date: Tue, 9 Sep 2008 12:13:13 -0700
Subject: [Gofriends] Error uploading MySQL dump
In-Reply-To: <48C67246.6040108@dundee.ac.uk>
References: <48C67246.6040108@dundee.ac.uk>
Message-ID:

Our DBA recommends upgrading to MySQL 5+. Please let us know if this
doesn't fix the problem.

Ben

--
Ben Hitz
Senior Scientific Programmer ** Saccharomyces Genome Database ** GO Consortium
Stanford University ** hitz at genome.stanford.edu

From gail at genome.stanford.edu Tue Sep 9 12:30:50 2008
From: gail at genome.stanford.edu (Gail Binkley)
Date: Tue, 9 Sep 2008 12:30:50 -0700 (PDT)
Subject: [Gofriends] Error uploading MySQL dump
In-Reply-To:
References: <48C67246.6040108@dundee.ac.uk>
Message-ID:

According to the GO schema documentation
(http://www.geneontology.org/GO.database.schema.shtml#go-meta.table.term-synonym)
the term_synonym column is a
VARCHAR(996), not a TEXT or BLOB.

The reason you are getting error 1170 is that MySQL doesn't support a
TEXT or BLOB column in an index unless a key length is given. The
solution is to give the term_synonym column a size limit and then retry
your import.

Be sure you have the latest version of the GO schema, and it is always
good to use the latest stable version of MySQL (which is 5.0.67).

Gail Binkley
Stanford

From cherry at stanford.edu Tue Sep 9 12:44:26 2008
From: cherry at stanford.edu (Mike Cherry)
Date: Tue, 9 Sep 2008 12:44:26 -0700
Subject: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
In-Reply-To: <557C69FA-5A9D-4AAE-B1CA-74822E8D3C8E@hms.harvard.edu>
References: <31552965-46E2-46A9-9C76-92C7EE3D179F@hms.harvard.edu> <48C5A292.9030005@informatics.jax.org> <557C69FA-5A9D-4AAE-B1CA-74822E8D3C8E@hms.harvard.edu>
Message-ID: <54D08BE8-3FEF-4551-9863-52A0F7DB75D1@stanford.edu>

Gabriel,

I wouldn't say this is a bug. The 1302948 ID is used by RGD when the
annotations have been created by the RGD project.
Those annotations that have the ENSEMBL ID ENSRNOP00000034933 have been
created by ENSEMBL. RGD is just passing the ENSEMBL annotations through
in their file.

The gene association file is created by RGD. While some groups do map
all the external IDs to internal IDs, this is not done by all.

One suggestion for your example is to filter out the IEA annotations.
That would remove the ENSEMBL associations for this example. You would
likely want to do that anyway, or at least compare your statistics with
and without the computationally defined annotations.

-Mike

From cherry at stanford.edu Tue Sep 9 13:22:24 2008
From: cherry at stanford.edu (Mike Cherry)
Date: Tue, 9 Sep 2008 13:22:24 -0700
Subject: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
Message-ID: <88773FF0-4940-4AB8-B0E2-564E84338BE3@stanford.edu>

There is a change coming to the format of the gene association file
which will solve this problem. Annotations to proteins, genes,
transcripts, etc. for a particular locus will be identified as such.
The change should occur in 2009.

-Mike

> From: "Quaid Morris"
> To: "Gabriel Berriz"
> Subject: Re: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
> Cc: gofriends at genome.stanford.edu
>
> Hi Gabriel,
>
> It looks like in the example that you gave RGD ID 1302948 is a gene
> ID and ENSRNOP00000034933 is a protein ID. Are all your examples like
> this? Maybe there are circumstances when it's possible to annotate a
> specific isoform and others when only the gene can be annotated.
>
> Q

From vpetri at mcw.edu Tue Sep 9 13:27:31 2008
From: vpetri at mcw.edu (Petri, Victoria)
Date: Tue, 9 Sep 2008 15:27:31 -0500
Subject: [Gofriends] [Fwd: Re: Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt]
In-Reply-To: <48C6BCE6.2050204@informatics.jax.org>
References: <48C6BCE6.2050204@informatics.jax.org>
Message-ID: <1448A38A42714048B9C53E473E13CCF0010A6476@davis.hmgc.mcw.edu>

Hi Gabriel,

The gene association files are non-redundant. The RGD GO annotations
come from two sources: manual annotation of genes and annotations that
are brought in electronically from MGI and GOA via QC-based pipelines.
GOA data is appended 'as is' at the end of the gene association file in
two cases: when no match for it is found in RGD, or when a match is
found but the annotation is already in the database for that gene. It
is important to keep in mind that GOA annotates proteins rather than
genes (which we and other MODs do), and if multiple protein transcripts
get the same annotation - which is not a redundancy - one could/would be
loaded into the database and the others would be appended at the end of
the GAF.

As Mike has already suggested, I would filter out IEAs, which would
1) remove the Ensembl IDs in question and 2) keep the annotations that
have been experimentally determined either for rat or for an
orthologous gene. If possible, I would also compare protein IDs
associated with one gene versus Ensembl IDs at the end of the gene
association file, because of the one-to-many gene-to-protein
relationship.

Victoria

Victoria Petri, Ph.D.
Research Scientist
Rat Genome Database (http://rgd.mcw.edu)
Bioinformatics Program
Human and Molecular Genetics Center
Medical College of Wisconsin
8701 Watertown Plank Road, Milwaukee, WI 53226
(414) 456-8871
Fax (414) 456-6595
vpetri at mcw.edu
vpetri at mail.brc.mcw.edu

-----Original Message-----
From: Judith Blake [mailto:jblake at informatics.jax.org]
Sent: Tuesday, September 09, 2008 1:14 PM
To: Shimoyama, Mary
Cc: Petri, Victoria
Subject: [Fwd: Re: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt]

Hi Mary,

Can you respond here? Is this a curation issue for these organisms? Is
mouse not on this list because of the substantial resources we can
bring to this project?

Judy

From val at sanger.ac.uk Fri Sep 12 01:53:23 2008
From: val at sanger.ac.uk (Valerie Wood)
Date: Fri, 12 Sep 2008 09:53:23 +0100
Subject: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
In-Reply-To: <54D08BE8-3FEF-4551-9863-52A0F7DB75D1@stanford.edu>
References: <31552965-46E2-46A9-9C76-92C7EE3D179F@hms.harvard.edu> <48C5A292.9030005@informatics.jax.org> <557C69FA-5A9D-4AAE-B1CA-74822E8D3C8E@hms.harvard.edu> <54D08BE8-3FEF-4551-9863-52A0F7DB75D1@stanford.edu>
Message-ID: <48CA2E03.50801@sanger.ac.uk>

All,

Some other points may be worth considering here:

1.
Ensembl appear to derive their primary GO data from UniProt; UniProt
only include a subset of evidence codes, which excludes some of the
curator-assigned annotations from the MODs (including ND, ISS, IC).
Wouldn't it be preferable for Ensembl to use the MOD-derived curated
data, removing the need to create many of the IEA mappings?

2. Could UniProt import all of the curated data for the MODs, rather
than just a subset, especially for the reference genomes?

3. The Ensembl entry has IEA to DNA binding but Tacc3 does not appear
to have DNA binding domains. What is the source of the Ensembl IEA
data for Tacc3 (it isn't recorded, and the source would be useful)?

Val

--
---------------------------------------------------------------------------
Valerie Wood                       Tel: 01223 496909
S. pombe Genome Project            Fax: 01223 494919
Wellcome Trust Sanger Institute    email: val at sanger.ac.uk
Wellcome Trust Genome Campus       http://www.genedb.org/genedb/pombe
Hinxton, Cambridge, CB10 1HH       http://www.sanger.ac.uk/Projects/S_pombe

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
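
Two suggestions made in this thread (collapse identifiers from different databases onto one canonical ID before counting, and compare the statistics with and without IEA annotations) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not anyone's actual pipeline: it assumes the tab-delimited gene association layout with DB, object ID, GO ID and evidence code in columns 1, 2, 5 and 7, and the function name `genes_per_term`, the `id_map` dictionary, and the sample rows with their placeholder GO IDs are all hypothetical.

```python
from collections import defaultdict


def genes_per_term(lines, id_map, drop_evidence=()):
    """Count distinct gene products per GO term.

    lines: iterable of tab-delimited gene association rows.
    id_map: hypothetical mapping {(db, object_id): canonical_id}; IDs
            missing from the map are kept under their own "DB:ID" name.
    drop_evidence: evidence codes to skip, e.g. ("IEA",).
    """
    genes = defaultdict(set)
    for line in lines:
        # Skip blank lines and "!" comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        db, obj_id, go_id, evidence = cols[0], cols[1], cols[4], cols[6]
        if evidence in drop_evidence:
            continue
        # Collapse synonymous identifiers onto one canonical name.
        canonical = id_map.get((db, obj_id), f"{db}:{obj_id}")
        genes[go_id].add(canonical)
    return {go_id: len(ids) for go_id, ids in genes.items()}


# Made-up rows for illustration only; the GO IDs and references are
# placeholders, not real annotations of Tacc3.
rows = [
    "RGD\t1302948\tTacc3\t\tGO:0000001\tRGD:1\tIDA",
    "ENSEMBL\tENSRNOP00000034933\tTacc3\t\tGO:0000001\tGO_REF:1\tIEA",
    "ENSEMBL\tENSRNOP00000034933\tTacc3\t\tGO:0000002\tGO_REF:1\tIEA",
]
# Hypothetical mapping, e.g. produced by an ID-mapping tool.
id_map = {("ENSEMBL", "ENSRNOP00000034933"): "RGD:1302948"}

naive = genes_per_term(rows, {})                  # IDs counted as given
merged = genes_per_term(rows, id_map)             # IDs collapsed first
no_iea = genes_per_term(rows, id_map, drop_evidence=("IEA",))
```

In this toy input the uncollapsed count treats the RGD and ENSEMBL identifiers as two gene products for the first term, while the collapsed count treats them as one; dropping IEA rows removes the ENSEMBL-only term entirely, which is why comparing the statistics both ways may be worthwhile.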

From edimmer at ebi.ac.uk Fri Sep 12 03:36:07 2008
From: edimmer at ebi.ac.uk (Emily Dimmer)
Date: Fri, 12 Sep 2008 11:36:07 +0100
Subject: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
In-Reply-To: <48CA2E03.50801@sanger.ac.uk>
References: <31552965-46E2-46A9-9C76-92C7EE3D179F@hms.harvard.edu> <48C5A292.9030005@informatics.jax.org> <557C69FA-5A9D-4AAE-B1CA-74822E8D3C8E@hms.harvard.edu> <54D08BE8-3FEF-4551-9863-52A0F7DB75D1@stanford.edu> <48CA2E03.50801@sanger.ac.uk>
Message-ID: <48CA4617.6050709@ebi.ac.uk>

I have just spoken to Ensembl. They do generally take annotations from
MOD files on the GO Consortium site and then supplement these
annotations with those that GOA provides. They also appear to take
annotations for all evidence codes. However, for the Ensembl Compara
IEA method, which makes use of the 1:1 and apparent 1:1 orthology
information, annotations are projected using the same kinds of criteria
that we use to project annotations via ISS, i.e. only IDA, IMP, IEP,
IGI and IPI annotations are transferred. Further information is located
here: http://www.ebi.ac.uk/GOA/compara_go_annotations.html

However, in the case of rat, it does appear that Ensembl have not been
taking the RGD association file, only the GOA rat file. This is
probably because Ensembl relies on UniProtKB-to-RGD ID mappings, and
currently UniProtKB does not have an entry for Tacc3. Therefore the
only annotations that Ensembl is displaying are those generated from
the Ensembl Compara projection method, so these annotations will have
originated from the human or mouse orthologs.

Please also note that there can be quite a long gap between GO
cross-reference updates at Ensembl. They are not able to update on a
monthly basis, so the annotation sets you are seeing could be a number
of months old.

On the GOA front: we take all MOD annotations which map to UniProtKB
accessions and which have an evidence code other than IEA or ISS (so we
do take ND and IC coded annotations).
The ISS exclusion is a decision one we are revisiting, historically it was decided to exclude these to avoid any potential circular ISS annotations, however I think that there ISS annotation sets we should now be taking in and with which we shouldn't have any problems. I do agree that Ensembl should be displaying additional information in their GO cross-references, (including references, sources etc). They are intending to to revise their cross-references shortly, and will look into this further. Emily Valerie Wood wrote: > > All, > > Some other points maybe worth considering here, > > > 1. Ensembl appear to derive their primary GO data from Uniprot; > Uniprot only include a subset of evidence codes which excludes some of > the curator assigned annotations from the MODs (including ND, ISS, > IC). Wouldn't it be preferable for Ensembl to use the MOD derived > curated data removing the need to create many of the IEA mappings? > > 2. Could UniProt import all of the curated data for the MODs, rather > than just a subset, especially for the reference genomes? > > 3. The Ensembl entry has IEA to DNA binding but Tacc3 does not appear > to have DNA binding domains. What is the source of the Ensembl IEA > data for Tacc3 (it isn't recorded, the source of this would be useful)? > > Val > > > Mike Cherry wrote: >> Gabriel, >> >> I wouldn't say this is a bug. The 1302948 ID is used by RGD when the >> annotations have been created by the RGD project. Those annotations >> that have the ENSEMBL ID ENSRNOP00000034933 have been created by >> ENSEMBL. RGD is just passing the ENSEMBL annotations through in >> their file. >> >> The gene association file is created by RGD. While some groups do >> map all the external IDs to internal IDs this is not done by all. >> >> One suggestion for your example is to filter out the IEA >> annotations. That would remove the ENSEMBL associations for this >> example. 
>> You would likely want to do that anyway, or at least
>> compare your statistics with and without the computationally defined
>> annotations.
>>
>> -Mike
>>
>>
>> On Sep 9, 2008, at 10:22 AM, Gabriel Berriz wrote:
>>
>>> On 2008.09.08 Mon, at 18:09, Judith Blake wrote:
>>>> Gabriel,
>>>>
>>>> The gene association files are non-redundant. Primary model organisms
>>>> have responsibility for integrating annotations from multiple sources
>>>> and submitting a non-redundant file to the GOdb. QC checks on the files
>>>> also remove redundancies.
>>>
>>>
>>>
>>> Hi, Judy. My word choice was not a very good one when I wrote of
>>> "redundancies", so let me give an example of what I meant. It comes
>>> from the latest gene_association.rgd.gz file. (This example is the
>>> first one I followed up on of the 1000 or so that I mentioned in my
>>> previous email.)
>>>
>>> The latest gene_association.rgd.gz file contains 15 associations for
>>> RGD ID 1302948, and 4 associations for ENSEMBL ID
>>> ENSRNOP00000034933. In fact, according to both Ensembl and RGD
>>> (http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948) these two
>>> identifiers both refer to the same entity (transforming acidic
>>> coiled-coil containing protein 3, aka Tacc3). Hence, the file uses
>>> two names for the same thing. Why?
>>>
>>> The reason why I bring this problem up is that, in our work, we
>>> compute statistics that are very sensitive to how many genes have a
>>> particular GO attribute, therefore it is crucial for us to count the
>>> associations in this example as being 19 belonging to the same
>>> protein, rather than 15 belonging to one and 4 belonging to
>>> another. This accounting task is made significantly more difficult
>>> by the fact that the association file uses two different names for
>>> the same thing.
>>>
>>> Maybe I'm wrong here, but this looks to me like a bug rather than a
>>> feature: I can't see that any good could come of using multiple
>>> names for the same thing in a document like this.
>>>
>>> If it is indeed a bug, would it be too difficult to fix? I.e. would
>>> it be too difficult for GO and the purveyors of associations files
>>> to use a consistent nomenclature whenever possible?
>>>
>>> If it's of any help with this, we have a tool, called Synergizer,
>>> for bulk mapping of identifiers from one namespace to another, and
>>> it is a simple matter to set up a pipeline to do it automatically
>>> (see http://llama.med.harvard.edu/synergizer/doc). We'd be happy to
>>> help with this in any way we can. (Although I imagine that the
>>> organizations that generate such associations files are the ultimate
>>> experts for resolving such nomenclature issues.)
>>>
>>> Also, as I said earlier, the example above is not isolated. For R.
>>> norvegicus alone there are about 1000, and that's only focusing on
>>> RGD vs. ENSEMBL IDs. And the problem is not limited to R.
>>> norvegicus. Among the organisms that I have analyzed, I found
>>> similar nomenclature inconsistencies with several others, including
>>> B. taurus, G. gallus, C. elegans, and H. sapiens.
>>>
>>> Thanks for your comments!
>>>
>>> Gabriel Berriz
>>> =============================================================
>>> Gabriel F. Berriz, PhD
>>> Bioinformatics Developer
>>> Roth Lab
>>> Biological Chemistry and Molecular Pharmacology -- Harvard Medical
>>> School
>>> Seeley G.
>>> Mudd Building 322B
>>> Boston, MA 02115-5701
>>> Telephone: 617.432.3555
>>> Fax: 617.432.3557
>>>
>>>
>>>
>>> _______________________________________________
>>> Gofriends mailing list
>>> Gofriends at geneontology.org
>>> http://fafner.stanford.edu/mailman/listinfo/gofriends
>>
>> _______________________________________________
>> Gofriends mailing list
>> Gofriends at geneontology.org
>> http://fafner.stanford.edu/mailman/listinfo/gofriends
>>
>>
>>
>
>

--
Do you need any additional GO annotation resources?
Which proteins would you like annotated with GO?
Let us know in the GOA User Survey, available at:
http://www.ebi.ac.uk/GOA/contactus.html
------------------------------------------------------------------
Emily Dimmer Ph.D.
GOA Coordinator
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD, U.K.
Tel: +44 1223 494654
Fax: +44 1223 494468
email: edimmer at ebi.ac.uk
URL: http://www.ebi.ac.uk/goa

From dbarrell at ebi.ac.uk Thu Sep 18 08:06:40 2008
From: dbarrell at ebi.ac.uk (Daniel Barrell)
Date: Thu, 18 Sep 2008 16:06:40 +0100
Subject: [Gofriends] September 2008 GOA release
In-Reply-To: <42690FC8.8010409@ebi.ac.uk>
References: <42690FC8.8010409@ebi.ac.uk>
Message-ID: <48D26E80.2000909@ebi.ac.uk>

GOA releases: September 2008
============================

GOA (GO Annotation at EBI) is a project run by the European Bioinformatics Institute that aims to provide assignments of gene products to the Gene Ontology (GO) resource.

The data can be obtained via:

EBI FTP: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/
EBI SRS: http://srs.ebi.ac.uk. Search GOA data library
GO FTP: ftp://ftp.geneontology.org/pub/go/gene-associations/
GO CVS: http://www.geneontology.org/GO.CVS.help.html
(the last two will be updated overnight)

For further information read: http://www.ebi.ac.uk/GOA or contact goa at ebi.ac.uk.
NEWS:
=====

**GOA gene association file format changes**

The GOA group has made changes to the contents of columns 3, 10 and 11 in all GOA gene association files except gene_association.goa_pdb.gz. The changes are outlined below and will ensure that the format of the affected columns is in line with GO Consortium requirements. While the ordering of identifiers in these columns will change, no identifiers will be removed.

*Column 3 (DB_Object_Symbol)*
Previous content: UniProt identifier (e.g. PRG4_HUMAN)
New content: Primary gene symbol when available (e.g. PRG4); otherwise the locus name, or a repeat of the value present in column 2 (either a UniProtKB accession, IPI, Ensembl, VEGA, HINV, TAIR or RefSeq peptide identifier).

*Column 10 (DB_Object_Name)*
Previous content: A list of gene symbols and protein name (e.g. PRG4, MSF, SZP: Proteoglycan-4 precursor)
New content: Protein name only, when available (e.g. Proteoglycan-4 precursor); otherwise left blank.

*Column 11 (Synonym)*
Previous content: IPI identifiers when available (e.g. IPI00024825)
New content: A pipe-delimited list of alternative gene symbol synonyms, IPI and UniProtKB identifiers (e.g. MSF|SZP|IPI00024825|PRG4_HUMAN).

***New annotation source: Human Protein Atlas***

GOA now contains human protein subcellular location experimental data from immunofluorescence studies. See the Human Protein Atlas website: http://www.proteinatlas.org/data/go_if_loc.php

Regards
The UniProt GOA Team

From mgiglio at som.umaryland.edu Wed Sep 24 10:59:33 2008
From: mgiglio at som.umaryland.edu (Gwinn Giglio, Michelle)
Date: Wed, 24 Sep 2008 13:59:33 -0400
Subject: [Gofriends] Dual-Taxon annotations in GO gene association files
In-Reply-To:
Message-ID:

Dear GO Friends,

Symbiotic interactions include a wide range of relationships between organisms; some are beneficial (e.g. mutualism) and some are not (e.g. parasitism).
During symbiotic interactions many gene products are required to carry out the functions and processes required for initiating and maintaining the interaction. It is often the case that one organism may be able to interact with many different species and that specific gene products in that organism are required for interactions with one symbiotic partner, but not others.

In order to fully capture the complexities of the interactions between species, it is desirable to be able to store not only the taxon id of the organism the gene product comes from but also the taxon id of the organism the gene product interacts with. Therefore, the GO adopted "Dual-Taxon" annotations for annotating interactions between organisms. In a Dual-Taxon annotation, two taxon ids are placed in the taxon column of the gene association file and separated by a pipe. The first id represents the organism encoding the gene product and the second id represents the organism with which the gene product interacts.

Although the ability to make such annotations has existed in the GO for several years, it is only in the last 6 months that dual-taxon annotations have begun to appear in GO gene association files. There are now 3 association files which contain dual-taxon annotations: Magnaporthe grisea, Oomycetes, and Pseudomonas syringae. These annotations are the result of efforts by the PAMGO (Plant-Associated Microbe Gene Ontology) Project.

More information on Dual-Taxon annotations can be found here:
http://www.geneontology.org/GO.annotation.conventions.shtml#interactions

More information on the PAMGO project can be found here:
http://pamgo.vbi.vt.edu

The dual-taxon annotations are not yet visible in the GO database or AmiGO, but they will be arriving soon.

Please let us know if you have any questions,
Michelle
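As an illustration of the Dual-Taxon convention described above, here is a minimal Python sketch for reading a taxon column that may hold one id or two ids separated by a pipe. It assumes each id is written with a "taxon:" prefix; the names parse_taxon_column and TaxonField, and the ids 1111 and 2222, are made up for this example and do not come from any GO tool or real annotation.

```python
from typing import NamedTuple, Optional


class TaxonField(NamedTuple):
    """Parsed taxon column (hypothetical helper type for this sketch)."""
    encoding_taxon: str               # organism encoding the gene product
    interacting_taxon: Optional[str]  # interaction partner, if dual-taxon


def parse_taxon_column(value: str) -> TaxonField:
    """Split a taxon column into its one or two taxon ids.

    Assumption: ids look like "taxon:<digits>" and, in a dual-taxon
    annotation, the two ids are separated by a single pipe.
    """
    parts = [part.strip() for part in value.split("|")]
    if not 1 <= len(parts) <= 2 or not all(parts):
        raise ValueError(f"expected one or two taxon ids, got {value!r}")

    ids = []
    for part in parts:
        prefix, _, number = part.partition(":")
        if prefix != "taxon" or not number.isdigit():
            raise ValueError(f"malformed taxon id: {part!r}")
        ids.append(number)

    # First id: organism encoding the gene product.
    # Second id (optional): organism the gene product interacts with.
    return TaxonField(ids[0], ids[1] if len(ids) == 2 else None)


if __name__ == "__main__":
    # Made-up ids, for illustration only.
    print(parse_taxon_column("taxon:1111|taxon:2222"))
    print(parse_taxon_column("taxon:1111"))
```

Returning None for the second id keeps ordinary single-taxon rows and dual-taxon rows on the same code path, so a script that counts annotations per organism can keep using the first id and look at the second only when it is present.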
From gopinath at cshl.edu Mon Sep 22 10:15:31 2008
From: gopinath at cshl.edu (Gopal Gopinathrao)
Date: Mon, 22 Sep 2008 13:15:31 -0400
Subject: [Gofriends] [Reactome-announce] Reactome curator position at EBI
Message-ID: <4E4C6F0F-88B6-4738-9E23-95CAF6370B86@cshl.edu>

Interested in working as a Reactome curator?

The Reactome project (www.reactome.org) is a collaboration among Cold Spring Harbor Laboratory, The European Bioinformatics Institute, and The Gene Ontology Consortium to develop a curated resource of core pathways and reactions in human biology. The information in this database is authored by biological researchers with expertise in their fields, maintained by the Reactome editorial staff, and cross-referenced with the sequence databases at NCBI, Ensembl and UniProt, the UCSC Genome Browser, HapMap, KEGG (Gene and Compound), ChEBI, PubMed and GO.

We currently have a job opening at the EBI, Cambridge, UK. For more information and application details, see:
http://www-db.embl.de/jss/servlet/de.embl.bk.emblGroups.JobsPage/08062EBI.html

Please forward this to anyone who might be interested in this position.

Many thanks,
The Reactome Team

_______________________________________________
Reactome-announce mailing list
Reactome-announce at reactome.org
http://mail.reactome.org/mailman/listinfo/reactome-announce