[Go] [Gofriends] [Fwd: Re: Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt]
Chris Mungall
cjm at berkeleybop.org
Tue Sep 9 15:34:22 PDT 2008
[redirecting again]
This is one of the things motivating the proposed change here:
http://wiki.geneontology.org/index.php/GAF_Spliceform_Column_Proposal
Whilst technically there is no redundancy, the current format forces
GAF consumers to perform additional transformations on the GAF to get
a set of annotations in which no gene is represented redundantly.
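
As a rough illustration of the kind of transformation a GAF consumer
currently has to do, here is a minimal Python sketch. It is not an
official GO tool. It assumes the standard tab-delimited GAF layout
(DB in column 1, DB Object ID in column 2, GO ID in column 5, evidence
code in column 7, comment lines starting with "!"). The function name
collapse_annotations, the alias table, the "ENSEMBL" DB prefix, and
the sample rows (placeholder GO IDs and references, not real Tacc3
annotations) are all hypothetical and only there for illustration.

```python
from collections import defaultdict

# Zero-based positions of the GAF columns used here.
COL_DB, COL_ID, COL_GO, COL_EVIDENCE = 0, 1, 4, 6


def collapse_annotations(gaf_lines, alias_to_gene, drop_iea=False):
    """Group GO IDs under one canonical identifier per gene.

    alias_to_gene maps "DB:ID" strings (e.g. a protein ID) to the
    canonical "DB:ID" of the gene; the caller has to supply it.
    Identifiers that are not in the map are kept unchanged.
    """
    go_by_gene = defaultdict(set)
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= COL_EVIDENCE:
            continue  # skip rows too short to carry an evidence code
        if drop_iea and cols[COL_EVIDENCE] == "IEA":
            continue  # optionally ignore electronic annotations
        obj = f"{cols[COL_DB]}:{cols[COL_ID]}"
        gene = alias_to_gene.get(obj, obj)
        go_by_gene[gene].add(cols[COL_GO])
    return dict(go_by_gene)


# Toy input: placeholder GO IDs and references, for illustration only.
SAMPLE_LINES = [
    "!gaf-version: 1.0",
    "RGD\t1302948\tTacc3\t\tGO:0000001\tRGD:0\tIDA",
    "RGD\t1302948\tTacc3\t\tGO:0000002\tRGD:0\tISS",
    "ENSEMBL\tENSRNOP00000034933\tTacc3\t\tGO:0000002\tRGD:0\tIEA",
    "ENSEMBL\tENSRNOP00000034933\tTacc3\t\tGO:0000003\tRGD:0\tIEA",
]

# Hypothetical alias table: the protein ID maps to its gene ID.
ALIASES = {"ENSEMBL:ENSRNOP00000034933": "RGD:1302948"}

if __name__ == "__main__":
    for gene, terms in collapse_annotations(SAMPLE_LINES, ALIASES).items():
        print(gene, len(terms), sorted(terms))
```

With the alias table supplied, both identifiers are counted as a
single gene. Passing drop_iea=True additionally discards the IEA rows
before counting. Building the alias table is the hard part, and the
sketch deliberately leaves that to the caller.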
On Sep 9, 2008, at 1:27 PM, Petri, Victoria wrote:
>
>
> Hi Gabriel,
>
> The gene association files are non-redundant.
>
> The RGD GO annotations come from two sources: manual annotation of
> genes and annotations that are brought in electronically from MGI
> and GOA via QC_based pipelines.
>
> When GOA data has no match in RGD, or a match is found but the
> annotation is already in the database for that gene, the
> information is appended 'as is' at the end of the gene association
> file. It is important to keep in mind that GOA annotates proteins
> rather than genes (we and other MODs annotate genes), and if
> multiple protein transcripts get the same annotation - which is not
> a redundancy - one could/would be loaded into the database and the
> others would be appended at the end of the GAF.
>
> As Mike has already suggested, I would filter out IEAs, which would
> 1) remove the Ensembl IDs in question and 2) keep the annotations
> that have been experimentally determined either for rat or for an
> orthologous gene. If possible I would also compare protein IDs
> associated with one gene versus Ensembl IDs at the end of the gene
> association file because of the one-to-many gene-to-protein
> relationship.
>
> Victoria
>
> Victoria Petri, Ph.D.
> Research Scientist
> Rat Genome Database
> (http://rgd.mcw.edu)
> Bioinformatics Program
> Human and Molecular Genetics Center
> Medical College of Wisconsin
> 8701 Watertown Plank Road, Milwaukee, WI 53226
> (414) 456-8871
> Fax (414) 456-6595
> vpetri at mcw.edu
> vpetri at mail.brc.mcw.edu
>
>
> -----Original Message-----
> From: Judith Blake [mailto:jblake at informatics.jax.org]
> Sent: Tuesday, September 09, 2008 1:14 PM
> To: Shimoyama, Mary
> Cc: Petri, Victoria
> Subject: [Fwd: Re: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt]
>
> Hi Mary,
>
> Can you respond here? Is this a curation issue for these organisms?
> Is mouse not on this list because of the substantial resources we can
> bring to this project?
>
> Judy
>
> -------- Original Message --------
> Subject: Re: [Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
> Date: Tue, 9 Sep 2008 13:22:49 -0400
> From: Gabriel Berriz <gberriz at hms.harvard.edu>
> To: Judith Blake <jblake at informatics.jax.org>
> CC: <gofriends at genome.stanford.edu>
> References: <31552965-46E2-46A9-9C76-92C7EE3D179F at hms.harvard.edu>
> <48C5A292.9030005 at informatics.jax.org>
>
>
>
> On 2008.09.08 Mon, at 18:09, Judith Blake wrote:
> > Gabriel,
> >
> > The gene association files are non-redundant. Primary model organisms
> > have responsibility for integrating annotations from multiple sources
> > and submitting a non-redundant file to the GOdb. QC checks on the files
> > also remove redundancies.
>
>
> Hi, Judy. My word choice was not a very good one when I wrote of
> "redundancies", so let me give an example of what I meant. It comes
> from the latest gene_association.rgd.gz file. (This example is the
> first one I followed up on of the 1000 or so that I mentioned in my
> previous email.)
>
> The latest gene_association.rgd.gz file contains 15 associations for
> RGD ID 1302948, and 4 associations for ENSEMBL ID ENSRNOP00000034933.
> In fact, according to both Ensembl and RGD
> (http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948) these
> two identifiers both refer to the same entity (transforming acidic
> coiled-coil containing protein 3, aka Tacc3). Hence, the file uses
> two names for the same thing. Why?
>
> The reason why I bring this problem up is that, in our work, we
> compute statistics that are very sensitive to how many genes have a
> particular GO attribute, therefore it is crucial for us to count the
> associations in this example as being 19 belonging to the same
> protein, rather than 15 belonging to one and 4 belonging to another.
> This accounting task is made significantly more difficult by the
> fact that the association file uses two different names for the same
> thing.
>
> Maybe I'm wrong here, but this looks to me like a bug rather than a
> feature: I can't see that any good could come of using multiple names
> for the same thing in a document like this.
>
> If it is indeed a bug, would it be too difficult to fix? I.e. would
> it be too difficult for GO and the purveyors of association files to
> use a consistent nomenclature whenever possible?
>
> If it's of any help with this, we have a tool, called Synergizer, for
> bulk mapping of identifiers from one namespace to another, and it is a
> simple matter to set up a pipeline to do it automatically (see
> http://llama.med.harvard.edu/synergizer/doc). We'd be happy to help
> with this in any way we can. (Although I imagine that the
> organizations that generate such association files are the ultimate
> experts for resolving such nomenclature issues.)
>
> Also, as I said earlier, the example above is not isolated. For R.
> norvegicus alone there are about 1000, and that's only focusing on RGD
> vs. ENSEMBL IDs. And the problem is not limited to R. norvegicus.
> Among the organisms that I have analyzed, I found similar
> nomenclature inconsistencies with several others, including B. taurus,
> G. gallus, C. elegans, and H. sapiens.
>
> Thanks for your comments!
>
> Gabriel Berriz
> =============================================================
> Gabriel F. Berriz, PhD
> Bioinformatics Developer
> Roth Lab
> Biological Chemistry and Molecular Pharmacology -- Harvard Medical School
> Seeley G. Mudd Building 322B
> Boston, MA 02115-5701
> Telephone: 617.432.3555
> Fax: 617.432.3557
>
>
>
>
> _______________________________________________
> Gofriends mailing list
> Gofriends at geneontology.org
> http://fafner.stanford.edu/mailman/listinfo/gofriends