[Go] Change for GOA UniProtKB GAF file./ A suggestion

Valerie Wood val at sanger.ac.uk
Fri May 8 05:12:23 PDT 2009


Correction: there are only 44104 IEA mappings for pombe, not 55939, but 
all of the other numbers are correct (my taxon ID is a substring of 
other taxon IDs, which inflated the count).
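
The substring problem above can be avoided by comparing the whole taxon
field rather than searching for the ID as text. A minimal sketch, assuming
a tab-separated GAF layout with the evidence code in column 7 and the
taxon in column 13; the function names, the sample rows and the taxon
values 48960 and 489612 are made up for illustration, and 4896 is only
assumed to be the pombe ID used in the count:

```python
def make_row(db_object_id, go_id, evidence, taxon):
    """Build one made-up 15-column GAF row (illustrative values only)."""
    cols = ["UniProtKB", db_object_id, "sym", "", go_id, "GOA:interpro",
            evidence, "", "P", "", "", "protein", taxon, "20090508",
            "UniProtKB"]
    return "\t".join(cols)


def count_iea(gaf_lines, taxon_id, exact=True):
    """Count IEA rows for one taxon.

    exact=True compares whole taxon entries (the field may hold several,
    separated by '|'); exact=False reproduces the substring match.
    """
    wanted = f"taxon:{taxon_id}"
    count = 0
    for line in gaf_lines:
        if line.startswith("!"):          # header/comment line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13 or cols[6] != "IEA":
            continue
        if exact:
            hit = wanted in cols[12].split("|")
        else:
            hit = wanted in cols[12]      # also matches taxon:48960 etc.
        if hit:
            count += 1
    return count


SAMPLE = [
    "! made-up rows for illustration",
    make_row("X00001", "GO:0006886", "IEA", "taxon:4896"),
    make_row("X00002", "GO:0006886", "IEA", "taxon:48960"),
    make_row("X00003", "GO:0006886", "IEA", "taxon:489612"),
    make_row("X00004", "GO:0006886", "IDA", "taxon:4896"),
]

print(count_iea(SAMPLE, "4896"))               # exact match
print(count_iea(SAMPLE, "4896", exact=False))  # substring match
```

On these sample rows the exact comparison counts one IEA row and the
substring comparison counts three.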

Valerie Wood wrote:

>
> Slightly related, what is the long term strategy for getting IEA data 
> into AmiGO?
> As the main problem is the volume of annotations, I have a suggestion:
>
> For pombe we only include the IEA mappings in the data set provided to 
> GO when they are non-redundant with existing annotations.
> In 2006 there were ~30000 electronic mappings, and ~15000 were retained.
> Today there are 55939 mappings, and 4686 are retained.
>
> For example tim44 has the following mappings:
> From IPR007379
> Process       GO:0006886 intracellular protein transport
> Function     GO:0015450 P-P-bond-hydrolysis-driven protein 
> transmembrane transporter activity
> Component     GO:0005744 mitochondrial inner membrane presequence 
> translocase complex
> From IPR005682
> Process       GO:0006886 intracellular protein transport
> Function     GO:0015450 P-P-bond-hydrolysis-driven protein 
> transmembrane transporter activity
> Component     GO:0005744 mitochondrial inner membrane presequence 
> translocase complex
> From SP-KW
> intracellular protein transmembrane transport
> ATP binding
>
> Only the mapping to ATP binding is retained, as all of the others are 
> covered by the manual annotation.
>
> Other genes have many more redundant mappings; for example, top2 has 60 
> mappings, including 7 InterPro domains mapping to the same GO:0003677. 
> These 60 mappings are fully represented by the 12 manual experimental 
> annotations.
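
The filtering described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the actual pombe
pipeline: an IEA mapping is treated as redundant when its term is one of
the manually annotated terms or an ancestor of one. The names `ancestors`,
`retained_iea` and `PARENTS` are hypothetical, the parent links are a toy
graph rather than real GO structure, "MANUAL:import" is a placeholder for
a more specific manual annotation, and the two SP-KW mappings are keyed by
name because no IDs are given for them:

```python
def ancestors(term, parents):
    """Return the term itself plus everything reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def retained_iea(iea_terms, manual_terms, parents):
    """Keep only IEA terms not already implied by a manual annotation."""
    covered = set()
    for t in manual_terms:
        covered |= ancestors(t, parents)
    return sorted(set(iea_terms) - covered)


# Toy parent links (child -> parents); NOT the real ontology.
PARENTS = {
    "MANUAL:import": ["GO:0006886",
                      "intracellular protein transmembrane transport"],
}

# tim44 mappings from IPR007379, IPR005682 and SP-KW, duplicates included.
TIM44_IEA = [
    "GO:0006886", "GO:0015450", "GO:0005744",
    "GO:0006886", "GO:0015450", "GO:0005744",
    "intracellular protein transmembrane transport", "ATP binding",
]

# Hypothetical manual annotations for tim44.
TIM44_MANUAL = ["MANUAL:import", "GO:0015450", "GO:0005744"]

print(retained_iea(TIM44_IEA, TIM44_MANUAL, PARENTS))
```

With these toy inputs only the "ATP binding" mapping is left.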
>
> This procedure has a number of advantages
>
> i) Clearer for Users
> It removes a massive over-presentation of data to the user.
> I cannot see any major advantage in presenting redundant mappings to 
> the user.
>
> ii) Quality control
> Because the curator is not presented with so many mappings, and 
> complete annotation should, in theory, cover the mappings (except in 
> a minority of cases, it should be possible to make an ISS to a 
> characterised ortholog), spurious mappings are easily identified 
> under this annotation protocol and can be filtered and fixed.
> Many hundreds of mappings have been fixed in this way:
> http://sourceforge.net/tracker/?atid=605890&group_id=36855&func=browse
> This also alerts us to problems in the ontology files: if a parent is 
> accidentally removed, and this parent has a valid mapping, the 
> annotation will 'reappear', alerting the curator to problems with the 
> ontology (this doesn't happen very often, but it does provide an 
> additional layer of QC).
>
> iii) Space
> It would generally reduce the size of the mapping file.
> I have no idea of the size reduction. The reduction for pombe is > 90% 
> but this is because the annotation coverage is high.
> However, even un-annotated organisms could have an associated 
> reduction in mappings if only the most granular mapping is retained.
> The number of IEAs will increase; as you can see above, the pombe 
> mappings have doubled in the past couple of years, but most mappings 
> do not add any new information to the annotation.
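
For an organism with no manual annotation, "only the most granular mapping
is retained" could be read as: drop any mapped term that is an ancestor of
another mapped term for the same gene. A minimal sketch under that
reading; `proper_ancestors`, `most_granular` and the A/B/C/D graph are
hypothetical, and the ontology is assumed to be acyclic:

```python
def proper_ancestors(term, parents):
    """Return every term above `term`, excluding the term itself."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def most_granular(terms, parents):
    """Drop terms that are implied by a more specific term in the set."""
    terms = set(terms)
    implied = set()
    for t in terms:
        implied |= proper_ancestors(t, parents)
    return sorted(terms - implied)


# Toy graph (child -> parents): C is under B, B and D are under A.
TOY_PARENTS = {"C": ["B"], "B": ["A"], "D": ["A"]}

print(most_granular(["A", "B", "C", "D"], TOY_PARENTS))
```

Here A and B are implied by C (and A also by D), so only C and D remain.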
>
> Just a suggestion,
> Val
>
>
> Mike Cherry wrote:
>
>> This afternoon the software group agreed to change how we store the 
>> goa_uniprot GAF file. The large file will still be removed from CVS. 
>> This is okay because the file is too big for CVS and cannot currently 
>> be retrieved. The file will still be available via the GO FTP and 
>> from the EBI FTP. Both the submitted and filtered goa_uniprot files 
>> will be removed from CVS. A new filtered file will be created that 
>> has all the IEA annotations removed, and this file will be in the CVS 
>> repository. Suggestions for this new file's name are welcome; we were 
>> thinking of: 
>> gene_association.goa_uniprot_noiea.gz
>>
>> Removing the files from CVS will happen almost immediately since, as 
>> mentioned above, you cannot get them from CVS anyway. Older versions 
>> of the goa_uniprot file are available from the EBI FTP site. The 
>> files will still be available via HTTP and FTP at 
>> www.geneontology.org. This change simply means they will not be 
>> obtainable via a checkout from CVS. I'll work on creating the new 
>> noiea file and add it to CVS next week.
>>
>> -Mike


