[Go] Change for GOA UniProtKB GAF file./ A suggestion
Valerie Wood
val at sanger.ac.uk
Fri May 8 05:12:23 PDT 2009
Correction, there are only 44104 IEA mappings for pombe not 55939 but
all of the other numbers are correct (my taxon ID is a substring of
other taxon IDs....).
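
To illustrate how a substring match inflates the count, here is a minimal sketch. It assumes GAF-style tab-separated rows with the taxon in column 13 (written as taxon:4896 for pombe). The sample rows, the two extra taxon IDs and the helper names are made up for illustration; this is not the script that produced the numbers above.

```python
# Illustrative only: rows, extra taxon IDs and helper names are invented.

def make_row(taxon):
    # Build a 15-column GAF-like row; only column 13 (index 12) matters here.
    cols = ["DB", "ID", "sym", "", "GO:0005524", "REF", "IEA", "", "F",
            "", "", "protein", taxon, "20090508", "SRC"]
    return "\t".join(cols)

rows = [
    make_row("taxon:4896"),    # the taxon we want
    make_row("taxon:48960"),   # made-up ID that merely contains "4896"
    make_row("taxon:14896"),   # made-up ID that merely contains "4896"
]

def count_substring(rows, taxon_id):
    # Buggy: matches the digits anywhere in the line.
    return sum(1 for r in rows if taxon_id in r)

def count_exact(rows, taxon_id):
    # Compare against the whole taxon field; column 13 may hold "a|b".
    wanted = "taxon:" + taxon_id
    return sum(1 for r in rows if wanted in r.split("\t")[12].split("|"))

print(count_substring(rows, "4896"), count_exact(rows, "4896"))
```

Splitting on tabs and comparing the whole field avoids counting rows that belong to other taxa.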
Valerie Wood wrote:
>
> Slightly related, what is the long term strategy for getting IEA data
> into AmiGO?
> As the main problem is the volume of annotations, I have a suggestion:
>
> For pombe we only include the IEA mappings in the data set provided to
> GO when they are non redundant with existing annotations.
> In 2006 there were ~30000 electronic mappings, and ~15000 were retained.
> Today there are 55939 mappings and 4686 are retained.
>
> For example tim44 has the following mappings:
> From IPR007379
> Process GO:0006886 intracellular protein transport
> Function GO:0015450 P-P-bond-hydrolysis-driven protein
> transmembrane transporter activity
> Component GO:0005744 mitochondrial inner membrane presequence
> translocase complex
> From IPR005682
> Process GO:0006886 intracellular protein transport
> Function GO:0015450 P-P-bond-hydrolysis-driven protein
> transmembrane transporter activity
> Component GO:0005744 mitochondrial inner membrane presequence
> translocase complex
> From SP-KW
> intracellular protein transmembrane transport
> ATP binding
>
> Only the mapping to ATP binding is retained, as all of the others are
> covered by the manual annotation.
>
> Other genes have many more redundant mappings; for example, top2 has 60
> mappings, including 7 InterPro domains mapping to the same GO:0003677.
> These 60 mappings are fully represented by the 12 manual experimental
> annotations.
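
The filtering described above can be sketched roughly as follows. This is a minimal illustration, assuming an IEA mapping counts as redundant when its term is the same as, or an ancestor of, a manually annotated term. The toy ontology, the placeholder term names and the function names are hypothetical and are not the actual pombe pipeline.

```python
# Toy ontology: child -> set of direct parents. All term names are placeholders.
PARENTS = {
    "specific_transport": {"transport"},
    "transport": {"process_root"},
    "atp_binding": {"binding"},
    "binding": {"function_root"},
}

def ancestors_and_self(term):
    # Walk up the parent links, collecting the term and every ancestor.
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS.get(t, ()))
    return seen

def non_redundant(iea_terms, manual_terms):
    # An IEA term is redundant if it equals, or is an ancestor of,
    # any manually annotated term.
    covered = set()
    for m in manual_terms:
        covered |= ancestors_and_self(m)
    return [t for t in iea_terms if t not in covered]

manual = ["specific_transport"]
iea = ["transport", "specific_transport", "atp_binding"]
kept = non_redundant(iea, manual)
print(kept)
```

In this toy case only the mapping with no manual counterpart survives, which mirrors the tim44 outcome where only ATP binding is retained.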
>
> This procedure has a number of advantages:
>
> i) Clearer for Users
> It removes a massive over-presentation of data to the user.
> I cannot see any major advantage in presenting redundant mappings to
> the user.
>
> ii) Quality control.
> The curator is not presented with so many mappings, and
> complete annotation should, in theory, cover the mappings (except in
> a minority of cases, it should be possible to make an ISS to a
> characterised ortholog).
> By following this annotation protocol, spurious mappings are easily
> identified and can be filtered and fixed.
> Many hundreds of mappings have been fixed in this way:
> http://sourceforge.net/tracker/?atid=605890&group_id=36855&func=browse
> This also alerts the curator to problems in the ontology files: if a
> parent is accidentally removed, and this parent contains a valid mapping,
> the annotation will 'reappear', alerting the curator to problems with the
> ontology (this doesn't happen very often, but it does provide an
> additional layer of QC).
>
> iii) Space
> It would generally reduce the size of the mapping file.
> I have no idea of the size reduction. The reduction for pombe is > 90%
> but this is because the annotation coverage is high.
> However, even un-annotated organisms could have an associated
> reduction in mappings, if only the most granular mapping is retained.
> The number of IEAs will increase; as you can see above, the pombe
> mappings have doubled in the past couple of years, but most mappings
> do not add any new information to the annotation.
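
For organisms without manual annotation, retaining only the most granular mapping could look something like the sketch below. It assumes "most granular" means dropping any mapped term that is an ancestor of another term mapped to the same gene. The hierarchy and term names (A, A.1, A.1.1, B) are placeholders, not real GO terms.

```python
# Toy hierarchy: child -> set of direct parents. Names are placeholders.
PARENTS = {
    "A.1": {"A"},
    "A.1.1": {"A.1"},
}

def ancestors(term):
    # Collect every strict ancestor of a term by following parent links.
    seen, stack = set(), list(PARENTS.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS.get(t, ()))
    return seen

def most_granular(terms):
    # Drop exact duplicates (keeping first-seen order), then drop any term
    # that is implied by a more specific term in the same set.
    unique = list(dict.fromkeys(terms))
    implied = set()
    for t in unique:
        implied |= ancestors(t)
    return [t for t in unique if t not in implied]

mappings = ["A", "A.1", "A.1.1", "A.1.1", "B"]
reduced = most_granular(mappings)
print(reduced)
```

Several sources mapping to the same term collapse to one entry, and less specific terms disappear once a more specific one is present.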
>
> Just a suggestion,
> Val
>
>
> Mike Cherry wrote:
>
>> This afternoon the software group agreed to changing how we store
>> the goa_uniprot GAF file. The large file will still be removed from
>> CVS. This is okay because the file is too big for CVS and cannot
>> currently be retrieved. This file will still be available via the GO
>> FTP and from the EBI FTP. Both the submitted and filtered
>> goa_uniprot files will be removed from CVS. A new filtered file
>> will be created that has all the IEA annotations removed and this
>> file will be in the CVS repository. Suggestions for this new file's
>> name are welcome; we were thinking of:
>> gene_association.goa_uniprot_noiea.gz
>>
>> Removing the files from CVS will happen almost immediately; as
>> mentioned above, you cannot get them from CVS anyway. The older
>> versions of the goa_uniprot file are available from the EBI FTP
>> site. The files will still be available via HTTP and FTP at
>> www.geneontology.org. This change simply means they will not be
>> obtainable via a checkout from CVS. I'll work on creating the new
>> noiea file and add it to CVS next week.
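
Producing a file with the IEA annotations removed could be done along these lines. This is only a sketch, assuming a gzipped, tab-separated GAF file with the evidence code in column 7 and header or comment lines starting with "!". The function name and any input path are hypothetical; it is not the script the software group will use.

```python
import gzip

def write_noiea(src_path, dst_path):
    # Copy a gzipped GAF file, dropping rows whose evidence code
    # (column 7, index 6) is IEA. Lines starting with "!" are kept as-is.
    kept = dropped = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if not line.startswith("!"):
                cols = line.rstrip("\n").split("\t")
                if len(cols) > 6 and cols[6] == "IEA":
                    dropped += 1
                    continue
                kept += 1
            dst.write(line)
    # Return how many annotation rows were kept and dropped.
    return kept, dropped
```

Called with the full goa_uniprot file as input and gene_association.goa_uniprot_noiea.gz as output, it streams line by line, so the large input never has to fit in memory.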
>>
>> -Mike
>>
>>
>
>
>
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.