[Go] Change for GOA UniProtKB GAF file./ A suggestion
Benjamin Hitz
hitz at genome.stanford.edu
Fri May 8 10:52:31 PDT 2009
I would recommend getting it directly from EBI.
Ben
On May 8, 2009, at 8:52 AM, Jim Hu wrote:
> Hi Mike,
>
> We were planning to use the UniProtKB file from the submissions
> directory to pull some IEAs into GONUTS. Will there still be a way
> to do that?
>
> Thanks
>
> Jim
>
>
> On May 8, 2009, at 8:49 AM, Mike Cherry wrote:
>
>> This change is all about allowing IEAs from all the GAF files into
>> AmiGO, except the IEAs that are in the UniProtKB file. There are
>> too many IEAs in the UniProtKB file for AmiGO and the GO database to
>> provide a reasonable return. Actually, this is not all about the
>> IEAs. A big part of this is getting the UniProtKB file out of CVS,
>> as it's too big for that system. With the change, the GO DB loading
>> can use all the GAFs in CVS and load all their annotations,
>> including IEAs.
>>
>> -Mike
>>
>>
>> On May 8, 2009, at 5:12 AM, Valerie Wood wrote:
>>
>>> Correction, there are only 44104 IEA mappings for pombe not 55939
>>> but all of the other numbers are correct (my taxon ID is a
>>> substring of other taxon IDs....).
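The substring pitfall described in this correction can be sketched as follows. This is only an illustration, not the script actually used for the counts above. It assumes GAF-style tab-separated rows with the evidence code in column 7 and the taxon in column 13, written as taxon:<id>. The function name count_iea_for_taxon is hypothetical. Comparing the whole taxon token, rather than searching for the ID as a substring, keeps an ID such as 4896 from also matching 48960.

```python
def count_iea_for_taxon(lines, taxon_id):
    """Count IEA rows whose taxon column contains exactly taxon:<taxon_id>.

    Assumes GAF-style tab-separated rows: evidence code in column 7,
    taxon in column 13 (possibly several values joined by '|').
    """
    wanted = f"taxon:{taxon_id}"
    count = 0
    for line in lines:
        if line.startswith("!"):
            continue  # header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            continue  # malformed row, skip it
        # Compare whole tokens, never substrings of the column text.
        if cols[6] == "IEA" and wanted in cols[12].split("|"):
            count += 1
    return count
```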
>>>
>>> Valerie Wood wrote:
>>>
>>>>
>>>> Slightly related, what is the long term strategy for getting IEA
>>>> data into AmiGO?
>>>> As the main problem is the volume of annotations, I have a
>>>> suggestion:
>>>>
>>>> For pombe we only include the IEA mappings in the data set
>>>> provided to GO when they are non redundant with existing
>>>> annotations.
>>>> In 2006 there were ~30000 electronic mappings, and ~15000 were
>>>> retained.
>>>> Today there are 55939 mappings and 4686 are retained.
>>>>
>>>> For example tim44 has the following mappings:
>>>> From IPR007379
>>>> Process GO:0006886 intracellular protein transport
>>>> Function GO:0015450 P-P-bond-hydrolysis-driven protein
>>>> transmembrane transporter activity
>>>> Component GO:0005744 mitochondrial inner membrane presequence
>>>> translocase complex
>>>> From IPR005682
>>>> Process GO:0006886 intracellular protein transport
>>>> Function GO:0015450 P-P-bond-hydrolysis-driven protein
>>>> transmembrane transporter activity
>>>> Component GO:0005744 mitochondrial inner membrane presequence
>>>> translocase complex
>>>> From SP-KW
>>>> intracellular protein transmembrane transport
>>>> ATP binding
>>>>
>>>> Only the mapping to ATP binding is retained as all of the others
>>>> are covered by the manual annotation
>>>>
>>>> Other genes have many more redundant mappings, for example top2
>>>> has 60 mappings including 7 InterPro domains mapping to the same
>>>> GO:0003677. These 60 mappings are fully represented by the 12
>>>> manual experimental annotations.
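The filtering idea in the tim44 and top2 examples can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the pombe pipeline itself: it assumes an IEA mapping counts as redundant when its term is the same as, or an ancestor of, a term already annotated manually, and it assumes the ontology is available as a simple mapping from each term to its direct parents. The function names ancestors and retain_nonredundant are hypothetical.

```python
def ancestors(term, parents):
    """Return all ancestors of term, following parent links transitively.

    parents maps a term ID to an iterable of its direct parent IDs
    (an assumed, simplified view of the ontology).
    """
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def retain_nonredundant(iea_terms, manual_terms, parents):
    """Keep only IEA terms not already covered by manual annotation.

    A term is treated as covered if it equals a manually annotated term
    or is an ancestor of one. Duplicate mappings (for example the same
    term arriving from several sources) are collapsed to one.
    """
    covered = set(manual_terms)
    for term in manual_terms:
        covered |= ancestors(term, parents)
    kept = []
    for term in iea_terms:
        if term not in covered and term not in kept:
            kept.append(term)
    return kept
```

With a gene annotated manually to a specific term, mappings to that term or to any of its parents drop out, and only mappings that add something new (such as ATP binding for tim44) remain.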
>>>>
>>>> This procedure has a number of advantages
>>>>
>>>> i) Clearer for Users
>>>> It removes a massive over-presentation of data to the user.
>>>> I cannot see any major advantage in presenting redundant
>>>> mappings to the user.
>>>>
>>>> ii) Quality control.
>>>> The curator is not presented with so many mappings, and
>>>> complete annotation should, in theory, cover the mappings
>>>> (except in a minority of cases; it should be possible to make an
>>>> ISS to a characterised ortholog).
>>>> By following this annotation protocol, spurious mappings are
>>>> easily identified and can be filtered and fixed.
>>>> Many hundreds of mappings have been fixed in this way:
>>>> http://sourceforge.net/tracker/?atid=605890&group_id=36855&func=browse
>>>> This also alerts us to problems in the ontology files: if a parent
>>>> is accidentally removed, and this parent contains a valid mapping,
>>>> the annotation will 'reappear', alerting the curator to problems
>>>> with the ontology (this doesn't happen very often but it does
>>>> provide an additional layer of QC).
>>>>
>>>> iii) Space
>>>> It would generally reduce the size of the mapping file.
>>>> I have no idea of the size reduction. The reduction for pombe is
>>>> over 90%, but this is because the annotation coverage is high.
>>>> However, even un-annotated organisms could have an associated
>>>> reduction in mappings, if only the most granular mapping is
>>>> retained.
>>>> The number of IEAs will increase, as you can see above the pombe
>>>> mappings have doubled in the past couple of years, but most
>>>> mappings do not add any new information to the annotation.
>>>>
>>>> Just a suggestion,
>>>> Val
>>>>
>>>>
>>>> Mike Cherry wrote:
>>>>
>>>>> This afternoon the software group agreed to changing how we
>>>>> store the goa_uniprot GAF file. The large file will still be
>>>>> removed from CVS. This is okay because the file is too big for
>>>>> CVS and cannot currently be retrieved. This file will still be
>>>>> available via the GO FTP and from the EBI FTP. Both the
>>>>> submitted and filtered goa_uniprot files will be removed from
>>>>> CVS. A new filtered file will be created that has all the IEA
>>>>> annotations removed and this file will be in the CVS
>>>>> repository. Suggestions for this new file's name are welcome,
>>>>> we were thinking of: gene_association.goa_uniprot_noiea.gz
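Producing a file with the IEA annotations removed could look roughly like the sketch below. It is an assumption-laden illustration, not the script that will actually build the new file: it assumes GAF tab-separated rows with the evidence code in column 7 and header lines starting with "!", and the function name strip_iea is hypothetical.

```python
def strip_iea(lines):
    """Yield GAF lines unchanged, skipping rows whose evidence code is IEA.

    Header/comment lines (starting with '!') are passed through as-is.
    Assumes the evidence code is in column 7 of each tab-separated row.
    """
    for line in lines:
        if line.startswith("!"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 6 and cols[6] == "IEA":
            continue  # drop electronically inferred annotation
        yield line
```

Because it works on any iterable of lines, the same function can be fed from a plain or gzip-compressed file opened in text mode.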
>>>>>
>>>>> Removing the files from CVS will happen almost immediately; as
>>>>> mentioned above, you cannot get them from CVS anyway. Older
>>>>> versions of the goa_uniprot file are available from the EBI FTP
>>>>> site. The files will still be available via HTTP and FTP at
>>>>> www.geneontology.org. This change simply means they will not be
>>>>> obtainable via a checkout from CVS. I'll work on creating the new
>>>>> noiea file and add it to CVS next week.
>>>>>
>>>>> -Mike
>>>>>
>>>>> _______________________________________________
>>>>> Go mailing list
>>>>> Go at geneontology.org
>>>>> http://fafner.stanford.edu/mailman/listinfo/go
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>> Limited, a charity registered in England with number 1021457 and a
>>> company registered in England with number 2742969, whose
>>> registered office is 215 Euston Road, London, NW1 2BE.
>>
>
> =====================================
> Jim Hu
> Associate Professor
> Dept. of Biochemistry and Biophysics
> 2128 TAMU
> Texas A&M Univ.
> College Station, TX 77843-2128
> 979-862-4054
>
>
--
Ben Hitz
Senior Scientific Programmer
Saccharomyces Genome Project
Stanford University
hitz at genome.stanford.edu