[Annotation] evidence code advice

David Hill dph at informatics.jax.org
Wed Apr 2 18:00:58 PDT 2008


Mike,

The vast majority of our RCA annotations at MGI come from the FANTOM 
mouse cDNA annotation project. For this project, a suite of analyses was 
run on cDNA clones, and curators were given a set of GO annotations that 
they could either accept or reject as they annotated the cDNAs. This 
was all done during two huge jamborees at the FANTOM meetings. So in this 
case, the original data came from a suite of computational analyses, and 
curators then used their judgment to decide whether the GO 
predictions were acceptable.

I think Harold has also curated RCA annotations from a limited number of 
papers that reported on large-scale experiments where there were a few 
hundred annotations.

In your case, RCA seems to be the most reasonable evidence code to use. 
I actually remember some discussion at a GOC meeting a while back about 
whether every annotation had to be reviewed for RCA, or whether carefully 
reviewing the analysis was enough to put more confidence in the 
annotations than in a plain IEA. I remember the sticking point was that 
even IEA methods are reviewed, so where is the cutoff? I don't think we 
ever came to a firm conclusion, but I remember it was discussed.

David

Mike Cherry wrote:
> We need an evidence code for the data Rama mentioned.  IEA  
> annotations have a cardinality of 1 for the WITH field (this was defined  
> at the Jesus College GOC meeting), and RCA seems to require each  
> association to be curated, so we have a catch-22.  I too agree that  
> Kara's proposal would be useful, but it gets us into some bigger  
> changes.  I also agree with Suzanna that a solution would be to remove  
> the requirement that every association be curated for RCA.  This is  
> not perfect but could be a temporary solution.  There likely needs to  
> be a curated and a non-curated form of an RCA-like evidence code.
>
> On curation of RCA:
>
> The RCA documentation lists two examples, the Samanta and the  
> Troyanskaya papers.  In those papers only a slice of their predictions  
> was published to make their case for the methods used.  They did not  
> include all the significant predictions from their databases.  We  
> curated the published slice but, because of the curation requirement,  
> did not pull out other significant results from their datasets.  We now  
> have other papers like those, with many more potential annotations  
> reported in the paper.  We also still have potential annotations that  
> could be added from the Troyanskaya database (BioPixie) that are  
> continually refined and updated.
>
> Ability to curate all these annotations:
>
> The two papers mentioned by Rama include several hundred assertions,  
> not just 100, made from the combination of experimental  
> results.  We disagree that these annotations are often wrong.  The  
> combination of all these data removes the questionable results.   
> These methods are generally reviewed for publication to allow the  
> specificity and recall to be determined.  SGD has been involved in  
> some of these analyses by reviewing a large number of their results --  
> but not all.  These annotations are generally very useful in our view.
>
> For us there are too many of these annotations to curate.  These are  
> assertions that are made by an analysis of IGI, IPI, IEP, IDA and  
> sometimes ISM evidence to make new, interesting and statistically  
> significant associations.  There is no literature for many of the  
> specific associations, so they would not be possible to curate.  These  
> associations often identify errors in the literature and also add new  
> associations that have not been reported but are supported by the  
> combined data.  They are not based on just HTP data; the methods are  
> typically trained using all existing non-IEA data from SGD.  We use  
> the results from these papers to identify problems with the literature  
> annotations, but we are not able to review each of the assertions from  
> these new papers.
>
> I am interested to learn how Gramene (60,938 - 75% of all  
> associations), TAIR (23,486 - 22%), MGI (12,999 - 8%), RGD (5,089 -  
> 2%) and PseudoCAP (2,572 - 35%) use RCA -- that's the number of RCA  
> annotations and the percentage of total associations provided by the  
> project.  If everyone has curated all those annotations then more power  
> to them and SGD just needs to figure out how to do more.
>
> We don't believe any of the current evidence codes as defined are  
> appropriate for the associations we would like to include.  IEA  
> requires the WITH field and RCA requires every annotation to be  
> curated.  So what should we do?
>
> -Mike
>
>
> On Apr 2, 2008, at 3:14 AM, Valerie Wood wrote:
>   
>> Rama Balakrishnan wrote:
>>     
>>>> Anyway, in light of that history, I think it would make most sense
>>>> if the absolute requirement for the with column to be filled for IEA
>>>> was dropped in the short term, so that we can use the IEA code for
>>>> unreviewed annotations from RCA methods.
>>>>
>>>>         
>>> I think it is important to require the 'with' column for IEAs to
>>> prevent circular annotations.
>>> The other option is to revert the RCA code to its original version
>>> which required only the computational method to be reviewed and not
>>> every annotation.
>>>
>>>       
>> Hi Rama,
>>
>> I wonder about the value of RCA annotations as part of the body of GO
>> annotations if they are not reviewed?
>> This code usually provides the most tentative annotation, because they
>> are generally 'function predictions'
>>
>> i.e.
>>
>>   * Predictions based on computational analyses of large-scale
>>     experimental data sets
>>   * Predictions based on computational analyses that integrate
>>     datasets of several types, including experimental data (e.g.
>>     expression data, protein-protein interaction data, genetic
>>     interaction data, etc.), sequence data (e.g. promoter sequence,
>>     sequence-based structural predictions, etc.), or mathematical
>>     models
>>
>> they frequently seem to be
>>
>> i) Obviously wrong, in a way which would easily be spotted by a curator
>> ii) Redundant with existing experimental, or other manually curated
>> annotations, or even IEA annotations
>> iii) Obvious annotation omissions (i.e. when there is an ISS to
>> transporter activity, but no ISS to transporter)
>>
>> Several hundred doesn't seem so many to manually review (at least to
>> make sure they satisfy the criteria above).  It would probably save
>> time in the long run... (I'm also amazed there are so many good
>> 'predictions' for S. cerevisiae which are unannotated already?).
>>
>> For these reasons, pending any long term solution, I'd prefer RCA
>> annotations which were not reviewed by a curator to be classed as
>> 'electronically inferred' because they are essentially "automated".
>>
>> My 2p
>>
>> Val
>>
>>
>> On Sun, 30 Mar 2008, Suzanna Lewis wrote:
>>
>>
>>     
>>> This is very much along the lines that I've been trying to foster
>>> (remember the meeting in Cambridge at Jesus College). The bit-code
>>> (or bar-code) for evidence codes, with each bit indicating one of
>>> these flags for a different piece of information. Not only
>>> automated/manual, but also large-scale/small-scale, and other
>>> characteristics of the evidence.
>>>
>>> As Kara (and many others) have said, there is quite a bit of
>>> overloading of multiple pieces of information in the current evidence
>>> codes. It would be nice one day to see these distinguished into
>>> different constituent bits of information.
>>>
>>> -S
>>>
>>> p.s. I thought that IEA did not -require- the with column.
>>> p.p.s Was the decision tree a step in this direction?
>>>
>>> On Mar 26, 2008, at 1:59 PM, Kara Dolinski wrote:
>>>
>>>
>>>       
>>>> Hi,
>>>>
>>>> The root of the problem, as I see it, is that we are mixing apples
>>>> and oranges with evidence codes.  All but one of the evidence codes
>>>> indicate the type of experimental evidence for a GO annotation, but
>>>> we have one oddball, IEA, that indicates not what the experiment is,
>>>> but rather how the annotation was done.  We keep running into
>>>> variations of the same problem:  we have some evidence (whether
>>>> experimental or computational) for a GO annotation, but also want to
>>>> indicate whether a curator looked at it or not.
>>>>
>>>> My proposed (albeit radical) solution:
>>>>
>>>> Remove IEA as an evidence code.
>>>>
>>>> Create a new property for GO annotations (or add a new type of
>>>> qualifier) that captures how the annotation was done:  manual or
>>>> automated.
>>>>
>>>> Everything that is currently IEA would be given the 'automated'
>>>> property/qualifier, and then would be given a new evidence code as
>>>> appropriate (mostly a flavor of ISS I would assume).
>>>> There can be a rule that all 'automated' annotations that are a
>>>> flavor of ISS must have a 'with' value.
>>>>
>>>> This would allow us to use 'RCA' as appropriate, in some cases
>>>> they'd be 'manual', in others, they'd be 'automated'.  In Rama's
>>>> case, the annotations would be 'RCA' with an 'automated' qualifier.
>>>>
>>>> I realize the issues involved in making such a drastic change, so I
>>>> understand if we don't go there, but I do think that some approach
>>>> such as the one above is the best representation of the information
>>>> that we are trying to capture.
>>>>
>>>> Cheers,
>>>> Kara
>>>>
>>>> On Mar 26, 2008, at 4:30 PM, Rama Balakrishnan wrote:
>>>>
>>>>
>>>>         
>>>>> Hi All,
>>>>>
>>>>> SGD has come across a couple of computationally predicted GO
>>>>> annotation data sets for S. cerevisiae that we would like to add to
>>>>> our database. The GO annotations from these data sets are
>>>>> predictions based on multiple high-throughput data sets. The RCA
>>>>> evidence code came to mind, but according to the documentation,
>>>>> the annotations all have to be manually reviewed by a curator to
>>>>> use this evidence code. There are several hundred annotations of
>>>>> this kind and it is not feasible for us to manually review them.
>>>>>
>>>>> Hence, we thought these annotations could be bulk loaded with the
>>>>> IEA evidence code. However, at the Jan 2007 (Cambridge) GO meeting,
>>>>> it was decided that the 'with' column information has to be filled
>>>>> in for all IEAs (else Mike's filtering script strips them out). But
>>>>> these GO annotations, being predictions based on multiple
>>>>> high-throughput data sets, don't have any information for the with
>>>>> column.  So, we are left with no suitable choice.
>>>>>
>>>>> Which evidence code do people think should be used for these kinds
>>>>> of computational datasets when there is not an obvious "with"?
>>>>>
>>>>> Thanks for your input.
>>>>>
>>>>> Rama
>>>>>
>>>>>           
>> _______________________________________________
>> Annotation mailing list
>> Annotation at geneontology.org
>> http://fafner.stanford.edu/mailman/listinfo/annotation
>>
>>     
>
>

