[Go] Change for GOA UniProtKB GAF file./ A suggestion

Valerie Wood val at sanger.ac.uk
Fri May 8 07:08:45 PDT 2009


So with this change, the filtered IEAs that I provide in my gene
association file will go into AmiGO? That is good.

The other suggestions still apply. They are just some general comments about
how IEAs are most useful. The number of mappings is increasing all the
time, and most are to much higher-level terms than existing annotations.

Val

Mike Cherry wrote:

> This change is all about allowing IEAs from all the GAF files into
> AmiGO, except the IEAs that are in the UniProtKB file.  There are too
> many IEAs in the UniProtKB file for AmiGO and the GO database to provide
> a reasonable return.  Actually, this is not all about the IEAs.  A big
> part of this is getting the UniProtKB file out of CVS, as it's too big
> for that system.  With the change, the GO DB loading can use all the
> GAFs in CVS and load all their annotations, including IEAs.
>
> -Mike
>
>
> On May 8, 2009, at 5:12 AM, Valerie Wood wrote:
>
>> Correction, there are only 44104 IEA mappings for pombe not 55939  
>> but all of the other numbers are correct (my taxon ID is a substring  
>> of other taxon IDs....).
>>
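The miscount described in the correction above is the kind that arises when a taxon ID is matched as a substring. A minimal sketch of counting by whole taxon token instead, assuming a tab-separated GAF with the evidence code in column 7 and the taxon in column 13; the function name is hypothetical and 4896 is assumed here to be the pombe taxon ID:

```python
def count_iea_for_taxon(gaf_lines, taxon_id):
    """Count IEA rows whose taxon column names taxon_id exactly."""
    wanted = f"taxon:{taxon_id}"
    count = 0
    for line in gaf_lines:
        if line.startswith("!"):          # GAF header/comment line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:                # malformed row, skip it
            continue
        # The taxon column may hold several IDs separated by "|".
        # List membership compares whole tokens, so "taxon:4896"
        # does not match "taxon:48960".
        taxa = cols[12].split("|")
        if cols[6] == "IEA" and wanted in taxa:
            count += 1
    return count
```

A plain text search for the ID would count both taxa in the comment above; the token comparison counts only the first.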
>> Valerie Wood wrote:
>>
>>>
>>> Slightly related, what is the long term strategy for getting IEA  
>>> data into AmiGO?
>>> As the main problem is the volume of annotations, I have a suggestion:
>>>
>>> For pombe we only include the IEA mappings in the data set provided
>>> to GO when they are non-redundant with existing annotations.
>>> In 2006 there were ~30000 electronic mappings, and ~15000 were
>>> retained.
>>> Today there are 55939 mappings and 4686 are retained.
>>>
>>> For example tim44 has the following mappings:
>>> From IPR007379
>>> Process       GO:0006886 intracellular protein transport
>>> Function      GO:0015450 P-P-bond-hydrolysis-driven protein
>>>               transmembrane transporter activity
>>> Component     GO:0005744 mitochondrial inner membrane presequence
>>>               translocase complex
>>> From IPR005682
>>> Process       GO:0006886 intracellular protein transport
>>> Function      GO:0015450 P-P-bond-hydrolysis-driven protein
>>>               transmembrane transporter activity
>>> Component     GO:0005744 mitochondrial inner membrane presequence
>>>               translocase complex
>>> From SP-KW
>>> intracellular protein transmembrane transport
>>> ATP binding
>>>
>>> Only the mapping to ATP binding is retained, as all of the others
>>> are covered by the manual annotation.
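The filtering rule in the tim44 example can be sketched as follows. This is only an illustration under stated assumptions: a mapping is treated as "covered" when its term is the same as, or an ancestor of, a manually annotated term. The function names and the toy ontology in the usage note are hypothetical, not the actual pombe pipeline or real GO relationships.

```python
def ancestors(term, parents):
    """Return every ancestor of term in a child -> parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def non_redundant(iea_terms, manual_terms, parents):
    """Keep IEA terms that are not implied by any manual annotation."""
    covered = set(manual_terms)
    for term in manual_terms:
        covered |= ancestors(term, parents)
    # dict.fromkeys drops repeats (the same term mapped from several
    # sources) while preserving the original order.
    return [t for t in dict.fromkeys(iea_terms) if t not in covered]
```

With a toy hierarchy such as parents = {"specific transport": ["transport"]}, a manual annotation to "specific transport" covers IEA mappings to "transport" and to "specific transport" itself, while an unrelated mapping such as "ATP binding" is retained.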
>>>
>>> Other genes have many more redundant mappings. For example, top2 has
>>> 60 mappings, including 7 InterPro domains mapping to the same term,
>>> GO:0003677. These 60 mappings are fully represented by the 12 manual
>>> experimental annotations.
>>>
>>> This procedure has a number of advantages:
>>>
>>> i) Clearer for Users
>>> It removes a massive over-presentation of data to the user.
>>> I cannot see any major advantage in presenting  redundant mappings  
>>> to the user.
>>>
>>> ii) Quality control.
>>> The curator is not presented with so many mappings, and complete
>>> annotation should, in theory, cover the mappings (except in a
>>> minority of cases; it should be possible to make an ISS to a
>>> characterised ortholog).
>>> By following this annotation protocol, spurious mappings are easily
>>> identified and can be filtered and fixed.
>>> Many hundreds of mappings have been fixed in this way:
>>> http://sourceforge.net/tracker/?atid=605890&group_id=36855&func=browse
>>> This also alerts the curator to problems in the ontology files: if a
>>> parent is accidentally removed, and this parent has a valid mapping,
>>> the annotation will 'reappear', alerting the curator to problems with
>>> the ontology (this doesn't happen very often, but it does provide an
>>> additional layer of QC).
>>>
>>> iii) Space
>>> It would generally reduce the size of the mapping file.
>>> I have no idea of the size reduction overall. The reduction for
>>> pombe is > 90%, but this is because the annotation coverage is high.
>>> However, even un-annotated organisms could have an associated
>>> reduction in mappings if only the most granular mapping is retained.
>>> The number of IEAs will increase; as you can see above, the pombe
>>> mappings have doubled in the past couple of years, but most
>>> mappings do not add any new information to the annotation.
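For the un-annotated case, "retaining only the most granular mapping" could look something like the sketch below: any mapped term that is an ancestor of another mapped term for the same gene is dropped. The function names and the child -> parents dictionary are assumptions for illustration, not an existing GO tool.

```python
def ancestors(term, parents):
    """Return every ancestor of term in a child -> parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_specific(terms, parents):
    """Drop terms that are implied by a more granular term in the set."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return sorted(terms - implied)
```

Terms on unrelated branches are all kept, so a gene can still end up with several mappings; only the higher-level duplicates along one branch are removed.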
>>>
>>> Just a suggestion,
>>> Val
>>>
>>>
>>> Mike Cherry wrote:
>>>
>>>> This afternoon the software group agreed to change how we store
>>>> the goa_uniprot GAF file.  The large file will still be removed
>>>> from CVS.  This is okay because the file is too big for CVS and
>>>> cannot currently be retrieved.  This file will still be available
>>>> via the GO FTP and from the EBI FTP.  Both the submitted and
>>>> filtered goa_uniprot files will be removed from CVS.  A new
>>>> filtered file will be created that has all the IEA annotations
>>>> removed, and this file will be in the CVS repository.  Suggestions
>>>> for this new file's name are welcome; we were thinking of:
>>>> gene_association.goa_uniprot_noiea.gz
>>>>
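Producing a file with the IEA annotations removed can be sketched in a few lines. This is a rough illustration, not the script the software group used: it assumes a gzipped, tab-separated GAF with the evidence code in column 7, and the function names are hypothetical.

```python
import gzip


def strip_iea(lines):
    """Yield GAF lines, dropping annotation rows whose evidence is IEA."""
    for line in lines:
        if line.startswith("!"):      # keep header/comment lines as-is
            yield line
            continue
        cols = line.split("\t")
        if len(cols) > 6 and cols[6] == "IEA":
            continue                  # electronic annotation, filtered out
        yield line


def write_noiea(src_path, dst_path):
    """Copy a gzipped GAF to dst_path without its IEA rows."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        dst.writelines(strip_iea(src))
```

Streaming line by line keeps memory use flat, which matters for a file of this size; write_noiea would be called with the full goa_uniprot file as the source and the proposed noiea name as the destination.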
>>>> Removing the files from CVS will happen almost immediately; as
>>>> mentioned above, you cannot get them from CVS anyway.  Older
>>>> versions of the goa_uniprot file are available from the EBI FTP
>>>> site.  The files will still be available via HTTP and FTP at
>>>> www.geneontology.org.  This change simply means they will not be
>>>> obtainable via a checkout from CVS.  I'll work on creating the new
>>>> noiea file and add it to CVS next week.
>>>>
>>>> -Mike
>>>>
>>>> _______________________________________________
>>>> Go mailing list
>>>> Go at geneontology.org
>>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>>
>>>
>>>
>>>
>>
>>
>>
>> -- 
>> The Wellcome Trust Sanger Institute is operated by Genome Research  
>> Limited, a charity registered in England with number 1021457 and a  
>> company registered in England with number 2742969, whose registered  
>> office is 215 Euston Road, London, NW1 2BE.

