[Go] [Gofriends] [Fwd: Re: Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt]

Chris Mungall cjm at berkeleybop.org
Tue Sep 9 15:34:22 PDT 2008


[redirecting again]

This is one of the things motivating the proposed change here:

	http://wiki.geneontology.org/index.php/GAF_Spliceform_Column_Proposal

Whilst technically there is no redundancy, the file as it stands forces
GAF consumers to perform additional transformations on the GAF to get a
set of annotations in which no gene is represented more than once.
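
A minimal sketch of the kind of transformation a consumer ends up
writing, assuming a tab-separated GAF with the DB in column 1, the DB
object ID in column 2, the GO ID in column 5 and the evidence code in
column 7. The function name, the id_to_gene mapping and the sample rows
are all hypothetical; in particular the mapping has to come from
somewhere else (the MOD itself, or an ID-mapping tool), and the GO IDs
below are placeholders rather than real annotations.

```python
from collections import defaultdict


def genes_per_go_term(gaf_lines, id_to_gene, skip_evidence=()):
    """Return {GO ID: set of gene IDs}, counting each gene once per term.

    id_to_gene is a hypothetical mapping such as
    {"ENSEMBL:ENSRNOP00000034933": "RGD:1302948"}; identifiers that are
    not in the mapping are kept unchanged.
    """
    genes_by_term = defaultdict(set)
    for line in gaf_lines:
        # Header/comment lines in a GAF start with "!".
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # too short to hold an evidence code; skip it
        db, obj_id, go_id, evidence = cols[0], cols[1], cols[4], cols[6]
        if evidence in skip_evidence:
            continue
        key = f"{db}:{obj_id}"
        # Collapse protein-level IDs onto the gene where a mapping exists.
        genes_by_term[go_id].add(id_to_gene.get(key, key))
    return genes_by_term


# Illustrative rows only (placeholder GO IDs and references).
sample = [
    "!gaf-version: 1.0\n",
    "RGD\t1302948\tTacc3\t\tGO:XXXXXXX\tREF:0\tIDA\n",
    "ENSEMBL\tENSRNOP00000034933\tTacc3\t\tGO:XXXXXXX\tREF:0\tIEA\n",
    "ENSEMBL\tENSRNOP00000034933\tTacc3\t\tGO:YYYYYYY\tREF:0\tIEA\n",
]
mapping = {"ENSEMBL:ENSRNOP00000034933": "RGD:1302948"}
collapsed = genes_per_go_term(sample, mapping)
```

Passing skip_evidence=("IEA",) drops the electronically inferred rows
instead, which is the other workaround suggested in this thread.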

On Sep 9, 2008, at 1:27 PM, Petri, Victoria wrote:

>
>
> Hi Gabriel,
>
> The gene association files are non-redundant.
>
> The RGD GO annotations come from two sources: manual annotation of
> genes and annotations that are brought in electronically from MGI
> and GOA via QC-based pipelines.
>
> Data from GOA is appended 'as is' at the end of the gene association
> file in two cases: when no match is found in RGD, or when a match is
> found but the annotation is already in the database for that gene.
> It is important to keep in mind that GOA annotates proteins, whereas
> we and other MODs annotate genes. If multiple protein transcripts
> get the same annotation - which is not a redundancy - one
> could/would be loaded into the database and the others would be
> appended at the end of the GAF.
>
> As Mike has already suggested, I would filter out IEAs, which would
> 1) remove the Ensembl IDs in question and 2) keep annotations
> that have been experimentally determined either for rat or for an
> orthologous gene. If possible, I would also compare the protein IDs
> associated with one gene against the Ensembl IDs at the end of the
> gene association file, because of the one-to-many gene-to-protein
> relationship.
>
> Victoria
>
> Victoria Petri, Ph.D.
> Research Scientist
> Rat Genome Database
> (http://rgd.mcw.edu)
> Bioinformatics Program
> Human and Molecular Genetics Center
> Medical College of Wisconsin
> 8701 Watertown Plank Road, Milwaukee, WI 53226
> (414) 456-8871
> Fax (414) 456-6595
> vpetri at mcw.edu
> vpetri at mail.brc.mcw.edu
>
>
> -----Original Message-----
> From: Judith Blake [mailto:jblake at informatics.jax.org]
> Sent: Tuesday, September 09, 2008 1:14 PM
> To: Shimoyama, Mary
> Cc: Petri, Victoria
> Subject: [Fwd: Re: [Gofriends] Redundancy in
> go_XXXXXX-assocdb-tables/dbxref.txt]
>
> Hi Mary,
>
> Can you respond here?  Is this a curation issue for these organisms?
> Is mouse not on this list because of the substantial resources we can
> bring to this project?
>
> Judy
>
> -------- Original Message --------
> Subject:    Re: [Gofriends] Redundancy in
> go_XXXXXX-assocdb-tables/dbxref.txt
> Date:       Tue, 9 Sep 2008 13:22:49 -0400
> From:       Gabriel Berriz <gberriz at hms.harvard.edu>
> To:   Judith Blake <jblake at informatics.jax.org>
> CC:   <gofriends at genome.stanford.edu>
> References:       <31552965-46E2-46A9-9C76-92C7EE3D179F at hms.harvard.edu>
> <48C5A292.9030005 at informatics.jax.org>
>
>
>
> On 2008.09.08 Mon, at 18:09, Judith Blake wrote:
> > Gabriel,
> >
> > The gene association files are non-redundant.  Primary model organisms
> > have responsibility for integrating annotations from multiple sources
> > and submitting a non-redundant file to the GOdb.  QC checks on the files
> > also remove redundancies.
>
>
> Hi, Judy.  My word choice was not a very good one when I wrote of
> "redundancies", so let me give an example of what I meant.  It comes
> from the latest gene_association.rgd.gz file.  (This example is the
> first one I followed up on of the 1000 or so that I mentioned in my
> previous email.)
>
> The latest gene_association.rgd.gz file contains 15 associations for RGD
> ID 1302948, and 4 associations for ENSEMBL ID ENSRNOP00000034933.  In
> fact, according to both Ensembl and RGD
> (http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948) these
> two identifiers both refer to the same entity (transforming acidic
> coiled-coil containing protein 3, aka Tacc3).  Hence, the file uses two
> names for the same thing.  Why?
>
> The reason why I bring this problem up is that, in our work, we compute
> statistics that are very sensitive to how many genes have a particular
> GO attribute.  It is therefore crucial for us to count the associations
> in this example as being 19 belonging to the same protein, rather than
> 15 belonging to one and 4 belonging to another.  This accounting task is
> made significantly more difficult by the fact that the association file
> uses two different names for the same thing.
>
> Maybe I'm wrong here, but this looks to me like a bug rather than a
> feature:  I can't see that any good could come of using multiple names
> for the same thing in a document like this.
>
> If it is indeed a bug, would it be too difficult to fix?  I.e. would it
> be too difficult for GO and the purveyors of association files to use a
> consistent nomenclature whenever possible?
>
> If it's of any help with this, we have a tool, called Synergizer, for
> bulk mapping of identifiers from one namespace to another, and it is a
> simple matter to set up a pipeline to do it automatically (see
> http://llama.med.harvard.edu/synergizer/doc).  We'd be happy to help
> with this in any way we can.  (Although I imagine that the organizations
> that generate such association files are the ultimate experts for
> resolving such nomenclature issues.)
>
> Also, as I said earlier, the example above is not isolated.  For R.
> norvegicus alone there are about 1000, and that's only focusing on RGD
> vs. ENSEMBL IDs.  And the problem is not limited to R. norvegicus.
> Among the organisms that I have analyzed, I found similar
> nomenclature inconsistencies with several others, including B. taurus,
> G. gallus, C. elegans, and H. sapiens.
>
> Thanks for your comments!
>
> Gabriel Berriz
> =============================================================
> Gabriel F. Berriz, PhD
> Bioinformatics Developer
> Roth Lab
> Biological Chemistry and Molecular Pharmacology -- Harvard Medical School
> Seeley G. Mudd Building 322B
> Boston, MA 02115-5701
> Telephone: 617.432.3555
> Fax: 617.432.3557
>
>
>
>
> _______________________________________________
> Gofriends mailing list
> Gofriends at geneontology.org
> http://fafner.stanford.edu/mailman/listinfo/gofriends


