[Go] Change for GOA UniProtKB GAF file./ A suggestion
Valerie Wood
val at sanger.ac.uk
Fri May 8 04:19:27 PDT 2009
Slightly related, what is the long term strategy for getting IEA data
into AmiGO?
As the main problem is the volume of annotations, I have a suggestion:
For pombe we only include the IEA mappings in the data set provided to
GO when they are non-redundant with existing annotations.
In 2006 there were ~30000 electronic mappings, and ~15000 were retained.
Today there are 55939 mappings and 4686 are retained.
For example tim44 has the following mappings:
From IPR007379
Process GO:0006886 intracellular protein transport
Function GO:0015450 P-P-bond-hydrolysis-driven protein transmembrane
transporter activity
Component GO:0005744 mitochondrial inner membrane presequence
translocase complex
From IPR005682
Process GO:0006886 intracellular protein transport
Function GO:0015450 P-P-bond-hydrolysis-driven protein transmembrane
transporter activity
Component GO:0005744 mitochondrial inner membrane presequence
translocase complex
From SP-KW
intracellular protein transmembrane transport
ATP binding
Only the mapping to ATP binding is retained, as all of the others are
covered by the manual annotation.
Other genes have many more redundant mappings; for example, top2 has 60
mappings, including 7 InterPro domains mapping to the same GO:0003677.
These 60 mappings are fully represented by the 12 manual experimental
annotations.
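The filtering described above can be sketched roughly as follows. This is
only an illustration, not the actual pombe pipeline: the function names, the
toy term identifiers and the child-to-parents dictionary are all made up for
the example, and it assumes a mapping counts as "covered" when its term is
the same as, or an ancestor of, a term already used in a manual annotation
for the same gene. Collapsing identical mappings that come from several
sources is also an assumption.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical toy ontology: child term -> set of direct parent terms.
Parents = Dict[str, Set[str]]
Annotation = Tuple[str, str]  # (gene, term)


def ancestors_or_self(term: str, parents: Parents) -> Set[str]:
    """Return the term plus every term reachable by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


def filter_redundant_iea(
    manual: Iterable[Annotation],
    iea: Iterable[Annotation],
    parents: Parents,
) -> List[Annotation]:
    """Keep IEA pairs whose term is not implied by a manual annotation."""
    covered: Dict[str, Set[str]] = {}
    for gene, term in manual:
        covered.setdefault(gene, set()).update(ancestors_or_self(term, parents))

    retained: List[Annotation] = []
    already_kept: Set[Annotation] = set()
    for gene, term in iea:
        pair = (gene, term)
        # Skip mappings covered by manual annotation, and exact repeats
        # of a mapping that has already been kept (assumption, see above).
        if term in covered.get(gene, set()) or pair in already_kept:
            continue
        already_kept.add(pair)
        retained.append(pair)
    return retained


# Toy data only; these are not real GO terms or real annotations.
parents = {"toy:child_process": {"toy:parent_process"}}
manual = [("geneA", "toy:child_process"), ("geneA", "toy:function_x")]
iea = [
    ("geneA", "toy:parent_process"),  # ancestor of a manual term: dropped
    ("geneA", "toy:function_x"),      # same as a manual term: dropped
    ("geneA", "toy:function_y"),      # new information: kept
    ("geneA", "toy:parent_process"),  # repeat from a second source: dropped
]
retained = filter_redundant_iea(manual, iea, parents)
print(retained)
```

In this toy case only the toy:function_y mapping survives, in the same way
that only ATP binding survives for tim44.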
This procedure has a number of advantages:
i) Clearer for Users
It removes a massive over-presentation of data to the user.
I cannot see any major advantage in presenting redundant mappings to
the user.
ii) Quality control.
Because the curator is not presented with so many mappings, and complete
annotation should, in theory, cover the mappings (except in a minority
of cases, it should be possible to make an ISS to a characterised
ortholog), spurious mappings are easily identified by following this
annotation protocol and can be filtered and fixed.
Many hundreds of mappings have been fixed in this way:
http://sourceforge.net/tracker/?atid=605890&group_id=36855&func=browse
This also draws attention to problems in the ontology files: if a parent
is accidentally removed, and this parent contains a valid mapping, the
annotation will 'reappear', alerting the curator to the problem (this
doesn't happen very often, but it does provide an additional layer of
QC).
iii) Space
It would generally reduce the size of the mapping file.
I have no idea what the overall size reduction would be. The reduction
for pombe is > 90%, but this is because the annotation coverage is high.
However, even un-annotated organisms could have an associated reduction
in mappings, if only the most granular mapping is retained.
The number of IEAs will increase (as you can see above, the pombe
mappings have doubled in the past couple of years), but most mappings do
not add any new information to the annotation.
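For organisms without manual annotation, "only the most granular mapping is
retained" could look something like the sketch below. Again this is just an
illustration under stated assumptions: the function names and toy terms are
invented, the ontology is a plain child-to-parents dictionary, and "most
granular" is taken to mean "not an ancestor of any other term mapped to the
same gene".

```python
from typing import Dict, Iterable, Set

# Hypothetical toy ontology: child term -> set of direct parent terms.
Parents = Dict[str, Set[str]]


def proper_ancestors(term: str, parents: Parents) -> Set[str]:
    """Return every term reachable from `term` via parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


def most_specific(terms: Iterable[str], parents: Parents) -> Set[str]:
    """Drop any term that is an ancestor of another term in the same set."""
    term_set = set(terms)
    implied: Set[str] = set()
    for term in term_set:
        implied |= proper_ancestors(term, parents)
    return term_set - implied


# Toy data only: toy:c is a child of toy:b, which is a child of toy:a;
# toy:d is unrelated to the others.
parents = {"toy:c": {"toy:b"}, "toy:b": {"toy:a"}}
mapped_to_gene = ["toy:a", "toy:b", "toy:c", "toy:d"]
kept = most_specific(mapped_to_gene, parents)
print(sorted(kept))
```

Here four mappings shrink to two (toy:c and toy:d) without losing any
information, since toy:a and toy:b are implied by toy:c.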
Just a suggestion,
Val
Mike Cherry wrote:
> This afternoon the software group agreed to changing how we store the
> goa_uniprot GAF file. The large file will still be removed from
> CVS. This is okay because the file is too big for CVS and cannot
> currently be retrieved. This file will still be available via the GO
> FTP and from the EBI FTP. Both the submitted and filtered
> goa_uniprot files will be removed from CVS. A new filtered file will
> be created that has all the IEA annotations removed and this file
> will be in the CVS repository. Suggestions for this new file's name
> are welcome, we were thinking of : gene_association.goa_uniprot_noiea.gz
>
> Removing the files from CVS will happen almost immediately; as
> mentioned above, you cannot get them from CVS anyway. Older versions
> of the goa_uniprot file are available from the EBI FTP site. The
> files will still be available via HTTP and FTP at
> www.geneontology.org. This change simply means they will not be
> obtainable via a checkout from CVS. I'll work on creating the new
> noiea file and add it to CVS next week.
>
> -Mike