[Gofriends] Redundancy in go_XXXXXX-assocdb-tables/dbxref.txt
Emily Dimmer
edimmer at ebi.ac.uk
Fri Sep 12 03:36:07 PDT 2008
I have just spoken to Ensembl: they do generally take annotations from
MOD files on the GO Consortium site and then supplement these
annotations with those that GOA provides. They also appear to take
annotations for all evidence codes. However, for the Ensembl Compara IEA
method, which makes use of the 1:1 and apparent 1:1 orthology
information, annotations are projected using the same kinds of criteria
that we use to project annotations via ISS - i.e. only IDA, IMP, IEP,
IGI and IPI annotations are transferred. Further information is located
here: http://www.ebi.ac.uk/GOA/compara_go_annotations.html
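
To make the evidence-code restriction concrete, here is a minimal Python
sketch of that kind of filter. It is only an illustration, not Ensembl's
or GOA's actual pipeline. It assumes tab-separated gene association rows
with the object ID in column 2, the GO ID in column 5 and the evidence
code in column 7; the function name and the sample rows (including the
GO IDs) are made up.

```python
# Evidence codes eligible for projection, as listed in the paragraph above.
PROJECTABLE = {"IDA", "IMP", "IEP", "IGI", "IPI"}

def projectable_annotations(gaf_lines):
    """Yield (object_id, go_id, evidence) for rows whose evidence code is projectable.

    Hypothetical helper: assumes tab-separated gene association rows.
    """
    for line in gaf_lines:
        # Skip blank lines and "!" comment/header lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # malformed row, too few columns
        object_id, go_id, evidence = cols[1], cols[4], cols[6]
        if evidence in PROJECTABLE:
            yield object_id, go_id, evidence

# Made-up rows for illustration only (not real annotations).
sample = [
    "!made-up comment line",
    "DB\tGENE1\tsym1\t\tGO:9999901\tREF:1\tIDA\t\tP",
    "DB\tGENE1\tsym1\t\tGO:9999902\tREF:2\tIEA\t\tF",
    "DB\tGENE2\tsym2\t\tGO:9999903\tREF:3\tISS\t\tC",
    "DB\tGENE2\tsym2\t\tGO:9999904\tREF:4\tIPI\t\tF",
]
kept = list(projectable_annotations(sample))
print(kept)
```

Only the IDA and IPI rows survive; the IEA and ISS rows are dropped
before any projection would take place.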
However, in the case of rat, it does appear that Ensembl has not been
taking the RGD association file, only the GOA rat file. This is probably
because Ensembl relies on UniProtKB to RGD id mappings, and currently
UniProtKB does not have an entry for Tacc3. Therefore the only
annotations that Ensembl is displaying are those generated from the
Ensembl Compara projection method - so these annotations will have
originated from the human or mouse orthologs. Please also note that
there can be quite a long gap between GO cross-reference updates at
Ensembl - they are not able to update on a monthly basis, so the
annotation sets you are seeing could be a number of months old.
On the GOA front - we take all MOD annotations which map to UniProtKB
accessions, and which have an evidence code other than IEA or ISS (so we
do take ND and IC coded annotations). The ISS exclusion is a decision
we are revisiting. Historically it was decided to exclude these to
avoid any potential circular ISS annotations; however, I think that there
are ISS annotation sets we should now be taking in, and with which we
shouldn't have any problems.
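
As a rough illustration of the import rule just described, the sketch
below keeps an annotation only if its MOD identifier maps to a UniProtKB
accession and its evidence code is neither IEA nor ISS. This is an
assumption-laden sketch, not GOA's real code: the function name, the
mapping dictionary, the identifiers and the GO IDs are all hypothetical
placeholders.

```python
# Evidence codes excluded from the import, per the paragraph above.
EXCLUDED = {"IEA", "ISS"}

def importable(annotations, mod_to_uniprot):
    """Return annotations re-keyed to UniProtKB accessions.

    annotations: iterable of (mod_id, go_id, evidence) tuples.
    mod_to_uniprot: dict from MOD identifier to accession (hypothetical mapping).
    """
    kept = []
    for mod_id, go_id, evidence in annotations:
        accession = mod_to_uniprot.get(mod_id)
        if accession is None:
            continue  # no UniProtKB entry, so the annotation cannot be imported
        if evidence in EXCLUDED:
            continue  # IEA and ISS are left out
        kept.append((accession, go_id, evidence))
    return kept

# Placeholder data for illustration only.
mapping = {"MOD:1": "ACC_A", "MOD:2": "ACC_B"}
rows = [
    ("MOD:1", "GO:9999901", "IDA"),
    ("MOD:1", "GO:9999902", "ISS"),
    ("MOD:2", "GO:9999903", "ND"),
    ("MOD:2", "GO:9999904", "IC"),
    ("MOD:3", "GO:9999905", "IMP"),  # unmapped identifier
]
result = importable(rows, mapping)
print(result)
```

Note that the ND and IC rows are kept, while the IMP row for the
unmapped identifier is lost even though its evidence code is acceptable.
That is the same situation as Tacc3, which has no UniProtKB entry.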
I do agree that Ensembl should be displaying additional information in
their GO cross-references (including references, sources etc). They are
intending to revise their cross-references shortly, and will look
into this further.
Emily
Valerie Wood wrote:
>
> All,
>
> Some other points may be worth considering here:
>
>
> 1. Ensembl appear to derive their primary GO data from UniProt;
> UniProt only include a subset of evidence codes which excludes some of
> the curator-assigned annotations from the MODs (including ND, ISS,
> IC). Wouldn't it be preferable for Ensembl to use the MOD-derived
> curated data, removing the need to create many of the IEA mappings?
>
> 2. Could UniProt import all of the curated data for the MODs, rather
> than just a subset, especially for the reference genomes?
>
> 3. The Ensembl entry has IEA to DNA binding but Tacc3 does not appear
> to have DNA binding domains. What is the source of the Ensembl IEA
> data for Tacc3 (it isn't recorded, and the source would be useful)?
>
> Val
>
>
> Mike Cherry wrote:
>> Gabriel,
>>
>> I wouldn't say this is a bug. The 1302948 ID is used by RGD when the
>> annotations have been created by the RGD project. Those annotations
>> that have the ENSEMBL ID ENSRNOP00000034933 have been created by
>> ENSEMBL. RGD is just passing the ENSEMBL annotations through in
>> their file.
>>
>> The gene association file is created by RGD. While some groups do
>> map all the external IDs to internal IDs this is not done by all.
>>
>> One suggestion for your example is to filter out the IEA
>> annotations. That would remove the ENSEMBL associations for this
>> example. You would likely want to do that anyway, or at least
>> compare your statistics with and without the computationally defined
>> annotations.
>>
>> -Mike
>>
>>
>> On Sep 9, 2008, at 10:22 AM, Gabriel Berriz wrote:
>>
>>> On 2008.09.08 Mon, at 18:09, Judith Blake wrote:
>>>> Gabriel,
>>>>
>>>> The gene association files are non-redundant. Primary model organisms
>>>> have responsibility for integrating annotations from multiple sources
>>>> and submitting a non-redundant file to the GOdb. QC checks on the
>>>> files also remove redundancies.
>>>
>>>
>>>
>>> Hi, Judy. My word choice was not a very good one when I wrote of
>>> "redundancies", so let me give an example of what I meant. It comes
>>> from the latest gene_association.rgd.gz file. (This example is the
>>> first one I followed up on of the 1000 or so that I mentioned in my
>>> previous email.)
>>>
>>> The latest gene_association.rgd.gz file contains 15 associations for
>>> RGD ID 1302948, and 4 associations for ENSEMBL ID
>>> ENSRNOP00000034933. In fact, according to both Ensembl and RGD
>>> (http://rgd.mcw.edu/tools/genes/genes_view.cgi?id=1302948) these two
>>> identifiers both refer to the same entity (transforming acidic
>>> coiled-coil containing protein 3, aka Tacc3). Hence, the file uses
>>> two names for the same thing. Why?
>>>
>>> The reason why I bring this problem up is that, in our work, we
>>> compute statistics that are very sensitive to how many genes have a
>>> particular GO attribute; it is therefore crucial for us to count the
>>> associations in this example as being 19 belonging to the same
>>> protein, rather than 15 belonging to one and 4 belonging to
>>> another. This accounting task is made significantly more difficult
>>> by the fact that the association file uses two different names for
>>> the same thing.
>>>
>>> Maybe I'm wrong here, but this looks to me like a bug rather than a
>>> feature: I can't see that any good could come of using multiple
>>> names for the same thing in a document like this.
>>>
>>> If it is indeed a bug, would it be too difficult to fix? I.e. would
>>> it be too difficult for GO and the purveyors of association files
>>> to use a consistent nomenclature whenever possible?
>>>
>>> If it's of any help with this, we have a tool, called Synergizer,
>>> for bulk mapping of identifiers from one namespace to another, and
>>> it is a simple matter to set up a pipeline to do it automatically
>>> (see http://llama.med.harvard.edu/synergizer/doc). We'd be happy to
>>> help with this in any way we can. (Although I imagine that the
>>> organizations that generate such association files are the ultimate
>>> experts for resolving such nomenclature issues.)
>>>
>>> Also, as I said earlier, the example above is not isolated. For R.
>>> norvegicus alone there are about 1000, and that's only focusing on
>>> RGD vs. ENSEMBL IDs. And the problem is not limited to R.
>>> norvegicus. Among the organisms that I have analyzed, I found
>>> similar nomenclature inconsistencies in several others, including
>>> B. taurus, G. gallus, C. elegans, and H. sapiens.
>>>
>>> Thanks for your comments!
>>>
>>> Gabriel Berriz
>>> =============================================================
>>> Gabriel F. Berriz, PhD
>>> Bioinformatics Developer
>>> Roth Lab
>>> Biological Chemistry and Molecular Pharmacology -- Harvard Medical
>>> School
>>> Seeley G. Mudd Building 322B
>>> Boston, MA 02115-5701
>>> Telephone: 617.432.3555
>>> Fax: 617.432.3557
>>>
>>>
>>>
>>> _______________________________________________
>>> Gofriends mailing list
>>> Gofriends at geneontology.org
>>> http://fafner.stanford.edu/mailman/listinfo/gofriends
>>
>>
>>
>>
>
>
--
Do you need any additional GO annotation resources?
Which proteins would you like annotated with GO?
Let us know in the GOA User Survey, available at: http://www.ebi.ac.uk/GOA/contactus.html
------------------------------------------------------------------
Emily Dimmer Ph.D.
GOA Coordinator
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD, U.K.
Tel: +44 1223 494654
Fax: +44 1223 494468
email: edimmer at ebi.ac.uk
URL: http://www.ebi.ac.uk/goa