[Go] Change for GOA UniProtKB GAF file./ A suggestion

Jim Hu jimhu at tamu.edu
Fri May 8 08:52:57 PDT 2009


Hi Mike,

We were planning to use the UniProtKB file from the submissions  
directory to pull some IEAs into GONUTS.  Will there still be a way to  
do that?

Thanks

Jim


On May 8, 2009, at 8:49 AM, Mike Cherry wrote:

> This change is all about allowing IEAs from all the GAF files into
> AmiGO, except the IEAs that are in the UniProtKB file.  There are
> too many IEAs in the UniProtKB file for AmiGO and the GO database to
> provide a reasonable return.  Actually this is not all about the
> IEAs.  A big part of this is getting the UniProtKB file out of CVS,
> as it's too big for that system.  With the change, the GO DB loading
> can use all the GAFs in CVS and load all their annotations,
> including IEAs.
>
> -Mike
>
>
> On May 8, 2009, at 5:12 AM, Valerie Wood wrote:
>
>> Correction, there are only 44104 IEA mappings for pombe not 55939  
>> but all of the other numbers are correct (my taxon ID is a  
>> substring of other taxon IDs....).
>>
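
The miscount corrected above is the kind that arises when a taxon ID is matched as a substring instead of as a whole field. A minimal sketch of the difference, assuming GAF-style taxon values of the form `taxon:4896`; the function names and the longer sample IDs are made up for illustration and are not taken from any real file:

```python
def count_substring(taxon_fields, taxon_id):
    """Naive count: matches any field that merely contains the ID."""
    return sum(1 for field in taxon_fields if taxon_id in field)


def count_exact(taxon_fields, taxon_id):
    """Exact count: split multi-taxon fields on '|' and compare whole IDs."""
    wanted = f"taxon:{taxon_id}"
    return sum(1 for field in taxon_fields if wanted in field.split("|"))


# Illustrative values only: the second and third IDs are invented and
# simply contain "4896" as a substring.
sample = ["taxon:4896", "taxon:48960", "taxon:148961", "taxon:4896|taxon:9606"]
naive = count_substring(sample, "4896")
exact = count_exact(sample, "4896")
```

Comparing whole fields keeps rows for other organisms out of the count, at the cost of having to know how the field is delimited.
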
>> Valerie Wood wrote:
>>
>>>
>>> Slightly related, what is the long term strategy for getting IEA  
>>> data into AmiGO?
>>> As the main problem is the volume of annotations, I have a suggestion:
>>>
>>> For pombe we only include the IEA mappings in the data set  
>>> provided to GO when they are non redundant with existing  
>>> annotations.
>>> In 2006 there were ~30000 electronic mappings, and ~15000 were
>>> retained.
>>> Today there are 55939 mappings and 4686 are retained.
>>>
>>> For example tim44 has the following mappings:
>>> From IPR007379
>>> Process       GO:0006886 intracellular protein transport
>>> Function     GO:0015450 P-P-bond-hydrolysis-driven protein  
>>> transmembrane transporter activity
>>> Component     GO:0005744 mitochondrial inner membrane presequence  
>>> translocase complex
>>> From IPR005682
>>> Process       GO:0006886 intracellular protein transport
>>> Function     GO:0015450 P-P-bond-hydrolysis-driven protein  
>>> transmembrane transporter activity
>>> Component     GO:0005744 mitochondrial inner membrane presequence  
>>> translocase complex
>>> From SP-KW
>>> intracellular protein transmembrane transport
>>> ATP binding
>>>
>>> Only the mapping to ATP binding is retained, as all of the others
>>> are covered by the manual annotation.
>>>
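
The filtering rule described above can be sketched roughly as follows: an electronic mapping is dropped when a manual annotation already points at the same term or at one of its descendants. This is only an illustration under stated assumptions; the term names, the `PARENTS` table, and both function names are hypothetical and do not come from the GO files or from the pombe pipeline:

```python
# Hypothetical toy ontology: child term -> set of direct parent terms.
PARENTS = {
    "T:specific_transport": {"T:transport"},
    "T:transport": {"T:process"},
    "T:atp_binding": {"T:binding"},
}


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable via parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def non_redundant(iea_terms, manual_terms, parents=PARENTS):
    """Keep IEA terms that no manual annotation already implies."""
    covered = set()
    for term in manual_terms:
        covered |= ancestors(term, parents)
    return [t for t in iea_terms if t not in covered]


manual = ["T:specific_transport"]
iea = ["T:transport", "T:specific_transport", "T:atp_binding"]
kept = non_redundant(iea, manual)
```

As in the tim44 case, only the mapping that the manual annotation does not cover (here the made-up `T:atp_binding`) survives the filter.
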
>>> Other genes have many more redundant mappings; for example, top2
>>> has 60 mappings, including 7 InterPro domains mapping to the same
>>> GO:0003677. These 60 mappings are fully represented by the 12
>>> manual experimental annotations.
>>>
>>> This procedure has a number of advantages
>>>
>>> i) Clearer for Users
>>> It removes a massive over-presentation of data to the user.
>>> I cannot see any major advantage in presenting  redundant mappings  
>>> to the user.
>>>
>>> ii) Quality control.
>>> The curator is not presented with so many mappings, and
>>> complete annotation should, in theory, cover the mappings (except
>>> in a minority of cases, it should be possible to make an ISS to a
>>> characterised ortholog).
>>> By following this annotation protocol, spurious mappings are
>>> easily identified and can be filtered and fixed.
>>> Many hundreds of mappings have been fixed in this way:
>>> http://sourceforge.net/tracker/?atid=605890&group_id=36855&func=browse
>>> This also alerts to problems in the ontology files: if a parent
>>> is accidentally removed, and this parent contains a valid mapping,
>>> the annotation will 'reappear', alerting the curator to problems
>>> with the ontology (this doesn't happen very often but it does
>>> provide an additional layer of QC).
>>>
>>> iii) Space
>>> It would generally reduce the size of the mapping file.
>>> I have no idea of the size reduction. The reduction for pombe is >  
>>> 90% but this is because the annotation coverage is high.
>>> However, even un-annotated organisms could have an associated  
>>> reduction in mappings, if only the most granular mapping is  
>>> retained.
>>> The number of IEAs will increase; as you can see above, the pombe
>>> mappings have doubled in the past couple of years, but most
>>> mappings do not add any new information to the annotation.
>>>
>>> Just a suggestion,
>>> Val
>>>
>>>
>>> Mike Cherry wrote:
>>>
>>>> This afternoon the software group agreed to change how we store
>>>> the goa_uniprot GAF file.  The large file will still be removed
>>>> from CVS.  This is okay because the file is too big for CVS and
>>>> cannot currently be retrieved.  This file will still be
>>>> available via the GO FTP and from the EBI FTP.  Both the
>>>> submitted and filtered goa_uniprot files will be removed from
>>>> CVS.  A new filtered file will be created that has all the IEA
>>>> annotations removed, and this file will be in the CVS
>>>> repository.  Suggestions for this new file's name are welcome; we
>>>> were thinking of: gene_association.goa_uniprot_noiea.gz
>>>>
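
A file with the IEA annotations removed, as proposed above, could be produced along these lines. This is a rough sketch and not the script the GO software group uses: it assumes a tab-delimited GAF file with the evidence code in column 7 and `!`-prefixed comment lines; the function names and the sample rows are invented for illustration:

```python
import gzip

EVIDENCE_COLUMN = 6  # zero-based index of the GAF evidence code (column 7)


def strip_iea(lines):
    """Yield GAF lines, dropping annotation rows whose evidence code is IEA."""
    for line in lines:
        if line.startswith("!"):
            yield line  # keep header/comment lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > EVIDENCE_COLUMN and fields[EVIDENCE_COLUMN] == "IEA":
            continue
        yield line


def write_noiea(src_path, dst_path):
    """Copy a gzipped GAF file to dst_path without its IEA rows."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        dst.writelines(strip_iea(src))


# Invented sample rows (truncated to 9 columns) for demonstration only.
sample = [
    "!gaf-version: 1.0\n",
    "UniProtKB\tX00001\tgeneA\t\tGO:0000001\tREF:1\tIEA\t\tP\n",
    "UniProtKB\tX00002\tgeneB\t\tGO:0000002\tREF:2\tIDA\t\tP\n",
]
filtered = list(strip_iea(sample))
```

Calling `write_noiea` with the full goa_uniprot file as input and `gene_association.goa_uniprot_noiea.gz` as output would give a file of the kind described; the input path is left to the reader since it is not specified here. Streaming line by line avoids holding the whole file in memory, which matters for a file of this size.
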
>>>> Removing the files from CVS will happen almost immediately; as
>>>> mentioned above, you cannot get them from CVS anyway.  Older
>>>> versions of the goa_uniprot file are available from the EBI FTP
>>>> site.  The files will still be available via HTTP and FTP at
>>>> www.geneontology.org.  This change simply means they will not be
>>>> obtainable via a checkout from CVS.  I'll work on creating the new
>>>> noiea file and add it to CVS next week.
>>>>
>>>> -Mike
>>>>
>>>> _______________________________________________
>>>> Go mailing list
>>>> Go at geneontology.org
>>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>>
>>>
>>>
>>>
>>
>>
>>
>> -- 
>> The Wellcome Trust Sanger Institute is operated by Genome Research  
>> Limited, a charity registered in England with number 1021457 and a  
>> company registered in England with number 2742969, whose registered  
>> office is 215 Euston Road, London, NW1 2BE.
>

=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054

