<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hi Mike,<div><br></div><div>We were planning to use the UniProtKB file from the submissions directory to pull some IEAs into GONUTS. Will there still be a way to do that?</div><div><br></div><div>Thanks</div><div><br></div><div>Jim</div><div><br></div><div><br><div><div>On May 8, 2009, at 8:49 AM, Mike Cherry wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>This change is all about allowing IEAs from all the GAF files into AmiGO, except the IEAs that are in the UniProtKB file. There are too many IEAs in the UniProtKB file for AmiGO and the GO database to provide a reasonable return. Actually, this is not all about the IEAs. A big part of this is getting the UniProtKB file out of CVS, as it's too big for that system. With the change, the GO DB loading can use all the GAFs in CVS and load all their annotations, including IEAs.<br><br>-Mike<br><br><br>On May 8, 2009, at 5:12 AM, Valerie Wood wrote:<br><br><blockquote type="cite">Correction, there are only 44104 IEA mappings for pombe, not 55939, but all of the other numbers are correct (my taxon ID is a substring of other taxon IDs....).<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Valerie Wood wrote:<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Slightly related, what is the long-term strategy for getting IEA data into AmiGO?<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">As the main problem is the volume of annotations, I have a suggestion:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">For pombe we only include the IEA mappings in the data set provided to GO when they are non-redundant 
with existing annotations.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">In 2006 there were ~30000 electronic mappings, and ~15000 were retained<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Today there are 55939 mappings and 4686 are retained.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">For example tim44 has the following mappings:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">From IPR007379<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Process GO:0006886 intracellular protein transport<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Function GO:0015450 P-P-bond-hydrolysis-driven protein transmembrane transporter activity<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Component GO:0005744 mitochondrial inner membrane presequence translocase complex<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">From IPR005682<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">GO:0006886 intracellular protein transport<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Function GO:0015450 P-P-bond-hydrolysis-driven protein transmembrane transporter activity<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Component GO:0005744 mitochondrial inner membrane presequence translocase complex<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">From SP-KW<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">intracellular protein transmembrane transport<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">ATP binding<br></blockquote></blockquote><blockquote type="cite"><blockquote 
type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Only the mapping to ATP binding is retained, as all of the others are covered by the manual annotation.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Other genes have many more redundant mappings; for example, top2 has 60 mappings, including 7 InterPro domains mapping to the same GO:0003677. These 60 mappings are fully represented by the 12 manual experimental annotations.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">This procedure has a number of advantages:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">i) Clearer for Users<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">It removes a massive over-presentation of data to the user.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">I cannot see any major advantage in presenting redundant mappings to the user.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">ii) Quality control.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">The curator is not presented with so many mappings, and complete annotation should, in theory, cover the mappings (except in a minority of cases, it should be possible to make an ISS to a characterised ortholog).<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">By following this annotation protocol, spurious mappings are easily identified and can be filtered and fixed.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Many hundreds of mappings 
have been fixed in this way:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><a href="http://sourceforge.net/tracker/?atid=605890&group_id=36855&func=browse">http://sourceforge.net/tracker/?atid=605890&group_id=36855&func=browse</a><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">This also flags problems in the ontology files: if a parent is accidentally removed, and this parent contains a valid mapping, the annotation will 'reappear', alerting the curator to problems with the ontology (this doesn't happen very often, but it does provide an additional layer of QC).<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">iii) Space<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">It would generally reduce the size of the mapping file.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">I have no idea of the size reduction. 
The reduction for pombe is > 90%, but this is because the annotation coverage is high.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">However, even un-annotated organisms could have an associated reduction in mappings, if only the most granular mapping is retained.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">The number of IEAs will increase (as you can see above, the pombe mappings have doubled in the past couple of years), but most mappings do not add any new information to the annotation.<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Just a suggestion,<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Val<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Mike Cherry wrote:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">This afternoon the software group agreed to change how we store the goa_uniprot GAF file. The large file will still be removed from CVS. This is okay because the file is too big for CVS and cannot currently be retrieved. This file will still be available via the GO FTP and from the EBI FTP. Both the submitted and filtered goa_uniprot files will be removed from CVS. A new filtered file will be created that has all the IEA annotations removed, and this file will be in the CVS repository. 
Suggestions for this new file's name are welcome; we were thinking of: gene_association.goa_uniprot_noiea.gz<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Removing the files from CVS will happen almost immediately; as mentioned above, you cannot get the large file from CVS anyway. Older versions of the goa_uniprot file are available from the EBI FTP site. The files will still be available via HTTP and FTP at <a href="http://www.geneontology.org">www.geneontology.org</a>. This change simply means they will not be obtainable via a checkout from CVS. I'll work on creating the new noiea file and add it to CVS next week.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">-Mike<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">_______________________________________________<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Go mailing list<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><a href="mailto:Go@geneontology.org">Go@geneontology.org</a><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><a href="http://fafner.stanford.edu/mailman/listinfo/go">http://fafner.stanford.edu/mailman/listinfo/go</a><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote 
type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">-- <br></blockquote><blockquote type="cite">The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.<br></blockquote><br>_______________________________________________<br>Go mailing list<br><a href="mailto:Go@geneontology.org">Go@geneontology.org</a><br>http://fafner.stanford.edu/mailman/listinfo/go<br></div></blockquote></div><br><div> <span class="Apple-style-span" style="border-collapse: separate; border-spacing: 0px 0px; color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; text-align: auto; -khtml-text-decorations-in-effect: none; text-indent: 0px; -apple-text-size-adjust: auto; text-transform: none; orphans: 2; white-space: normal; widows: 2; word-spacing: 0px; "><div style="word-wrap: break-word; -khtml-nbsp-mode: space; -khtml-line-break: after-white-space; "><p style="margin: 0.0px 0.0px 0.0px 0.0px"><font face="Helvetica" size="3" style="font: 12.0px Helvetica">=====================================</font></p><p style="margin: 0.0px 0.0px 0.0px 0.0px"><font face="Helvetica" size="3" style="font: 12.0px Helvetica">Jim Hu</font></p><p style="margin: 0.0px 0.0px 0.0px 0.0px"><font face="Helvetica" size="3" style="font: 12.0px Helvetica">Associate Professor</font></p><p style="margin: 0.0px 0.0px 0.0px 0.0px"><font face="Helvetica" 
size="3" style="font: 12.0px Helvetica">Dept. of Biochemistry and Biophysics</font></p><p style="margin: 0.0px 0.0px 0.0px 0.0px"><font face="Helvetica" size="3" style="font: 12.0px Helvetica">2128 TAMU</font></p><p style="margin: 0.0px 0.0px 0.0px 0.0px"><font face="Helvetica" size="3" style="font: 12.0px Helvetica">Texas A&M Univ.</font></p><p style="margin: 0.0px 0.0px 0.0px 0.0px"><font face="Helvetica" size="3" style="font: 12.0px Helvetica">College Station, TX 77843-2128</font></p><p style="margin: 0.0px 0.0px 0.0px 0.0px"><font face="Helvetica" size="3" style="font: 12.0px Helvetica">979-862-4054</font></p></div><br class="Apple-interchange-newline"></span> </div><br></div></body></html>