[Go] Change for GOA UniProtKB GAF file./ A suggestion

Benjamin Hitz hitz at genome.stanford.edu
Fri May 8 10:52:31 PDT 2009


I would recommend getting it directly from EBI.

Ben

On May 8, 2009, at 8:52 AM, Jim Hu wrote:

> Hi Mike,
>
> We were planning to use the UniProtKB file from the submissions  
> directory to pull some IEAs into GONUTS.  Will there still be a way  
> to do that?
>
> Thanks
>
> Jim
>
>
> On May 8, 2009, at 8:49 AM, Mike Cherry wrote:
>
>> This change is all about allowing IEAs from all the GAF files into  
>> AmiGO, except the IEAs that are in the UniProtKB file.  There are  
>> too many IEAs in the UniProtKB file for AmiGO and the GO database to  
>> provide a reasonable return.  Actually this is not all about the  
>> IEAs.  A big part of this is getting the UniProtKB file out of CVS,  
>> as it's too big for that system.  With the change, the GO DB loading  
>> can use all the GAFs in CVS and load all their annotations,  
>> including IEAs.
>>
>> -Mike
>>
>>
>> On May 8, 2009, at 5:12 AM, Valerie Wood wrote:
>>
>>> Correction, there are only 44104 IEA mappings for pombe not 55939  
>>> but all of the other numbers are correct (my taxon ID is a  
>>> substring of other taxon IDs....).
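The substring problem is easy to hit when counting rows with a plain text search. A rough sketch of an exact-match count, assuming tab-separated GAF rows with the evidence code in column 7 and the taxon in column 13. The function names and sample rows are made up for illustration, and taxon:48960 is just a hypothetical ID that happens to contain 4896:

```python
def make_row(evidence, taxon):
    """Build a fake 15-column GAF row (illustrative values only)."""
    cols = ["DB", "ID1", "sym", "", "GO:0005524", "REF:1", evidence,
            "", "F", "", "", "protein", taxon, "20090508", "SRC"]
    return "\t".join(cols)

def count_iea(gaf_lines, taxon_id):
    """Count IEA rows whose taxon column matches taxon_id exactly."""
    wanted = f"taxon:{taxon_id}"
    count = 0
    for line in gaf_lines:
        if line.startswith("!"):  # header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue
        evidence = cols[6]
        # Column 13 may hold "taxon:A|taxon:B"; compare the first entry only.
        taxon = cols[12].split("|")[0]
        if evidence == "IEA" and taxon == wanted:
            count += 1
    return count

rows = [
    make_row("IEA", "taxon:4896"),
    make_row("IEA", "taxon:48960"),  # hypothetical ID containing "4896"
    make_row("IDA", "taxon:4896"),
]
# A naive substring search counts the second row as well.
naive = sum("4896" in r and "IEA" in r for r in rows)
exact = count_iea(rows, "4896")
print(naive, exact)
```

Comparing whole column values rather than searching the line avoids over-counting when one taxon ID is a prefix of another.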
>>>
>>> Valerie Wood wrote:
>>>
>>>>
>>>> Slightly related, what is the long term strategy for getting IEA  
>>>> data into AmiGO?
>>>> As the main problem is the volume of annotations, I have a  
>>>> suggestion:
>>>>
>>>> For pombe we only include the IEA mappings in the data set  
>>>> provided to GO when they are non-redundant with existing  
>>>> annotations.
>>>> In 2006 there were ~30000 electronic mappings, and ~15000 were  
>>>> retained.
>>>> Today there are 55939 mappings and 4686 are retained.
>>>>
>>>> For example tim44 has the following mappings:
>>>> From IPR007379
>>>> Process       GO:0006886 intracellular protein transport
>>>> Function     GO:0015450 P-P-bond-hydrolysis-driven protein  
>>>> transmembrane transporter activity
>>>> Component     GO:0005744 mitochondrial inner membrane presequence  
>>>> translocase complex
>>>> From IPR005682
>>>> Process       GO:0006886 intracellular protein transport
>>>> Function     GO:0015450 P-P-bond-hydrolysis-driven protein  
>>>> transmembrane transporter activity
>>>> Component     GO:0005744 mitochondrial inner membrane presequence  
>>>> translocase complex
>>>> From SP-KW
>>>> intracellular protein transmembrane transport
>>>> ATP binding
>>>>
>>>> Only the mapping to ATP binding is retained, as all of the others  
>>>> are covered by the manual annotation.
>>>>
>>>> Other genes have many more redundant mappings; for example, top2  
>>>> has 60 mappings, including 7 InterPro domains mapping to the same  
>>>> GO:0003677. These 60 mappings are fully represented by the 12  
>>>> manual experimental annotations.
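The redundancy rule described here could be sketched roughly as follows: an IEA mapping is dropped when its term is the same as, or an ancestor of, a manually annotated term. This is only my reading of the procedure, not the actual pombe pipeline. The function names and the toy hierarchy (all "toy:" terms) are invented for illustration and do not reflect real GO relationships:

```python
def ancestors(term, parents):
    """Return all ancestors of term in a child -> parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def non_redundant(iea_terms, manual_terms, parents):
    """Keep IEA terms not already implied by the manual annotations."""
    covered = set(manual_terms)
    for term in manual_terms:
        covered |= ancestors(term, parents)
    return [t for t in iea_terms if t not in covered]

# Toy hierarchy, assumed for the example only.
toy_parents = {
    "toy:specific_transport": ["toy:transport"],
    "toy:transport": ["toy:process"],
}
manual = ["toy:specific_transport"]
iea = ["toy:transport", "toy:specific_transport", "toy:atp_binding"]
retained = non_redundant(iea, manual, toy_parents)
print(retained)
```

In this toy case only the mapping with no manual counterpart survives, which mirrors the tim44 outcome where only ATP binding is retained.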
>>>>
>>>> This procedure has a number of advantages
>>>>
>>>> i) Clearer for Users
>>>> It removes a massive over-presentation of data to the user.
>>>> I cannot see any major advantage in presenting  redundant  
>>>> mappings to the user.
>>>>
>>>> ii) Quality control.
>>>> The curator is not presented with so many mappings, and  
>>>> complete annotation should, in theory, cover the mappings  
>>>> (except in a minority of cases, where it should be possible to  
>>>> make an ISS to a characterised ortholog).
>>>> By following this annotation protocol, spurious mappings are  
>>>> easily identified and can be filtered and fixed.
>>>> Many hundreds of mappings have been fixed in this way:
>>>> http://sourceforge.net/tracker/?atid=605890&group_id=36855&func=browse
>>>> This also flags problems in the ontology files: if a parent  
>>>> is accidentally removed, and this parent contains a valid mapping,  
>>>> the annotation will 'reappear', alerting the curator to problems  
>>>> with the ontology (this doesn't happen very often but it does  
>>>> provide an additional layer of QC).
>>>>
>>>> iii) Space
>>>> It would generally reduce the size of the mapping file.
>>>> I have no idea of the size reduction. The reduction for pombe  
>>>> is >90%, but this is because the annotation coverage is high.
>>>> However, even un-annotated organisms could have an associated  
>>>> reduction in mappings, if only the most granular mapping is  
>>>> retained.
>>>> The number of IEAs will increase; as you can see above, the pombe  
>>>> mappings have doubled in the past couple of years, but most  
>>>> mappings do not add any new information to the annotation.
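For organisms with no manual annotation, the "most granular mapping only" idea might look something like the sketch below: within one gene's set of mappings, drop any term that is an ancestor of another term in the same set. The names and the toy hierarchy are assumptions for illustration, not real GO structure:

```python
def ancestors(term, parents):
    """Collect every ancestor of term from a child -> parents mapping."""
    found = set()
    todo = list(parents.get(term, ()))
    while todo:
        t = todo.pop()
        if t not in found:
            found.add(t)
            todo.extend(parents.get(t, ()))
    return found

def most_granular(terms, parents):
    """Drop any term that is an ancestor of another term in the same set."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return sorted(terms - implied)

# Toy hierarchy, assumed for the example only.
toy_parents = {
    "toy:dna_binding": ["toy:nucleic_acid_binding"],
    "toy:nucleic_acid_binding": ["toy:binding"],
}
mappings = ["toy:binding", "toy:nucleic_acid_binding",
            "toy:dna_binding", "toy:atp_binding"]
granular = most_granular(mappings, toy_parents)
print(granular)
```

Terms on unrelated branches are all kept, so the reduction depends on how many mappings for a gene fall along the same lineage.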
>>>>
>>>> Just a suggestion,
>>>> Val
>>>>
>>>>
>>>> Mike Cherry wrote:
>>>>
>>>>> This afternoon the software group agreed to change how we  
>>>>> store the goa_uniprot GAF file.  The large file will still be  
>>>>> removed from CVS.  This is okay because the file is too big for  
>>>>> CVS and cannot currently be retrieved.  This file will still be  
>>>>> available via the GO FTP and from the EBI FTP.  Both the  
>>>>> submitted and filtered goa_uniprot files will be removed from  
>>>>> CVS.  A new filtered file will be created that has all the IEA  
>>>>> annotations removed, and this file will be in the CVS  
>>>>> repository.  Suggestions for this new file's name are welcome;  
>>>>> we were thinking of: gene_association.goa_uniprot_noiea.gz
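A filtered file of this kind could be produced with something like the sketch below. It assumes a gzipped, tab-separated GAF file with the evidence code in column 7 and header lines starting with "!". The function name is hypothetical and this is not the script the GO software group uses:

```python
import gzip

def write_noiea(src_path, dst_path):
    """Copy a gzipped GAF file, dropping rows whose evidence code is IEA.

    Returns (kept, dropped) counts of annotation rows; header lines
    are copied through and not counted.
    """
    kept = dropped = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if not line.startswith("!"):
                cols = line.split("\t")
                if len(cols) > 6 and cols[6] == "IEA":
                    dropped += 1
                    continue
                kept += 1
            dst.write(line)
    return kept, dropped
```

It streams line by line, so the full file never has to fit in memory. A call might look like `write_noiea("gene_association.goa_uniprot.gz", "gene_association.goa_uniprot_noiea.gz")`, where the input file name is an assumption and the output name is the one proposed above.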
>>>>>
>>>>> Removing the files from CVS will happen almost immediately; as  
>>>>> mentioned above, you cannot get it from CVS anyway.  Older  
>>>>> versions of the goa_uniprot file are available from the EBI FTP  
>>>>> site.  The files will still be available via HTTP and FTP at  
>>>>> www.geneontology.org.  This change simply means they will not be  
>>>>> obtainable via a checkout from CVS.  I'll work on creating the new  
>>>>> noiea file and add it to CVS next week.
>>>>>
>>>>> -Mike
>>>>>
>>>>> _______________________________________________
>>>>> Go mailing list
>>>>> Go at geneontology.org
>>>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
> =====================================
> Jim Hu
> Associate Professor
> Dept. of Biochemistry and Biophysics
> 2128 TAMU
> Texas A&M Univ.
> College Station, TX 77843-2128
> 979-862-4054
>

--
Ben Hitz
Senior Scientific Programmer
Saccharomyces Genome Project
Stanford University
hitz at genome.stanford.edu
