[Annotation] evidence code advice
Mike Cherry
cherry at stanford.edu
Wed Apr 2 14:15:03 PDT 2008
We need an evidence code for the data Rama mentioned. Because IEA
annotations have cardinality of 1 for the WITH field (this was defined
at the Jesus College GOC meeting) and RCA seems to require each
association to be curated, we have a catch-22. I too agree that
Kara's proposal would be useful, but it gets us into some bigger
changes. I also agree with Suzanna that a solution would be to remove
the requirement that every association be curated for RCA. This is
not perfect but could be a temporary solution. There likely needs to
be a curated and a non-curated form of an RCA-like evidence code.
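To make the catch-22 concrete, here is a minimal sketch in Python. The
Annotation record, its field names, the violations() helper, and the
gene and GO identifiers are all hypothetical placeholders, not part of
any GO tool or file format; the two rules are only the constraints as
described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    # Hypothetical record; field names are illustrative only.
    gene: str
    go_id: str
    evidence: str
    with_field: Optional[str] = None
    curator_reviewed: bool = False


def violations(a: Annotation) -> List[str]:
    """Return the rules (as described in this thread) that `a` breaks."""
    problems = []
    if a.evidence == "IEA" and not a.with_field:
        problems.append("IEA requires a WITH value")
    if a.evidence == "RCA" and not a.curator_reviewed:
        problems.append("RCA requires curator review")
    return problems


# An unreviewed prediction from combined data sets, with no obvious WITH.
# "GENE1" and "GO:0000000" are placeholders, not real identifiers.
as_iea = Annotation("GENE1", "GO:0000000", "IEA")
as_rca = Annotation("GENE1", "GO:0000000", "RCA")

print(violations(as_iea))
print(violations(as_rca))
```

Under these assumed rules the same prediction is rejected under either
code, which is the problem we are trying to solve.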
On curation of RCA:
The RCA documentation lists two examples, the Samanta and the
Troyanskaya papers. In those papers only a slice of their predictions
was published to make their case for the methods used. They did not
include all their significant predictions from their databases. We
curated the published slice but, because of the curation requirement,
did not pull out other significant results from their datasets. We now
have other papers like those with many more potential annotations
reported in the paper. Also we still have potential annotations that
could be added from the Troyanskaya database (BioPixie) that are
continually refined and updated.
Ability to curate all these annotations:
The two papers mentioned by Rama include several hundred assertions,
not just 100, made from the combination of experimental
results. We disagree that these annotations are often wrong. The
combination of all these data removes the questionable results.
These methods are generally reviewed for publication to allow the
specificity and recall to be determined. SGD has been involved in
some of these analyses by reviewing a large number of their results --
but not all. These annotations are generally very useful in our view.
For us there are too many of these annotations to curate. These are
assertions that are made by an analysis of IGI, IPI, IEP, IDA and
sometimes ISM evidence to make new interesting and statistically
significant associations. There is no literature for many of the
specific associations, so they would not be possible to curate. These
associations often identify errors in the literature and also add new
associations that have not been reported but are supported by the
combined data. These are not based on just HTP data; the methods are
typically trained using all existing non-IEA data from SGD. We use
the results from these papers to identify problems with the literature
annotations, but we are not able to review each of the assertions from
these new papers.
I am interested to learn how Gramene (60,938 - 75% of all
associations), TAIR (23,486 - 22%), MGI (12,999 - 8%), RGD (5,089 -
2%) and PseudoCAP (2,572 - 35%) use RCA -- that's the number of RCA
annotations and the percent of total associations provided by each
project. If everyone has curated all those annotations then more power
to them and SGD just needs to figure out how to do more.
We don't believe any of the current evidence codes as defined are
appropriate for the associations we would like to include. IEA
requires the WITH field and RCA requires every annotation to be
curated. So what should we do?
-Mike
On Apr 2, 2008, at 3:14 AM, Valerie Wood wrote:
> Rama Balakrishnan wrote:
>>> Anyway, in light of that history, I think it would make most sense
>>> if the absolute requirement for the with column to be filled for
>>> IEA was dropped in the short term, so that we can use the IEA code
>>> for unreviewed annotations from RCA methods.
>>>
>>
>> I think it is important to require the 'with' column for IEAs to
>> prevent circular annotations.
>> The other option is to revert the RCA code to its original version
>> which required only the computational method to be reviewed and not
>> every annotation.
>>
>
>
> Hi Rama,
>
> I wonder about the value of RCA annotations as part of the body of GO
> annotations if they are not reviewed?
> This code usually provides the most tentative annotation, because
> they are generally 'function predictions'
>
> i.e.
>
> * Predictions based on computational analyses of large-scale
> experimental data sets
> * Predictions based on computational analyses that integrate
> datasets of several types, including experimental data (e.g.
> expression data, protein-protein interaction data, genetic
> interaction data, etc.), sequence data (e.g. promoter sequence,
> sequence-based structural predictions, etc.), or mathematical
> models
>
> they frequently seem to be
>
> i) Obviously wrong, in a way which would easily be spotted by a
> curator
> ii) Redundant with existing experimental, or other manually curated
> annotations, or even IEA annotations
> iii) Obvious annotation omissions (i.e. when there is an ISS to
> transporter activity, but no ISS to transporter)
>
> Several 100 doesn't seem so many to manually review (at least to make
> sure they satisfy the criteria above). It would probably save time in
> the long run....(I'm also amazed there are so many good 'predictions'
> for S. cerevisiae which are unannotated already?).
>
> For these reasons, pending any long term solution, I'd prefer RCA
> which were not reviewed by a curator to be classed as
> 'electronically inferred' because they are essentially "automated".
>
> My 2p
>
> Val
>
>
> On Sun, 30 Mar 2008, Suzanna Lewis wrote:
>
>
>> This is very much along the lines that I've been trying to foster
>> (remember the meeting in Cambridge at Jesus College). The bit-code
>> (or bar-code) for evidence codes, with each bit indicating one of
>> these flags for a different piece of information. Not only
>> automated/manual, but also large-scale/small-scale, and other
>> characteristics of the evidence.
>>
>> As Kara (and many others) have said, there is quite a bit of
>> overloading of multiple pieces of information in the current
>> evidence codes. It would be nice one day to see these distinguished
>> into different constituent bits of information.
>>
>> -S
>>
>> p.s. I thought that IEA did not -require- the with column.
>> p.p.s Was the decision tree a step in this direction?
>>
>> On Mar 26, 2008, at 1:59 PM, Kara Dolinski wrote:
>>
>>
>>> Hi,
>>>
>>> The root of the problem, as I see it, is that we are mixing apples
>>> and oranges with evidence codes. All but one of the evidence codes
>>> indicate the type of experimental evidence for a GO annotation, but
>>> we have one oddball, IEA, that indicates not what the experiment is,
>>> but rather how the annotation was done. We keep running into
>>> variations of the same problem: we have some evidence (whether
>>> experimental or computational) for a GO annotation, but also want to
>>> indicate whether a curator looked at it or not.
>>>
>>> My proposed (albeit radical) solution:
>>>
>>> Remove IEA as an evidence code.
>>>
>>> Create a new property for GO annotations (or add a new type of
>>> qualifier) that captures how the annotation was done: manual or
>>> automated.
>>>
>>> Everything that is currently IEA would be given the 'automated'
>>> property/qualifier, and then would be given a new evidence code as
>>> appropriate (mostly a flavor of ISS I would assume).
>>> There can be a rule that all 'automated' annotations that are a
>>> flavor of ISS must have a 'with' value.
>>>
>>> This would allow us to use 'RCA' as appropriate, in some cases
>>> they'd be 'manual', in others, they'd be 'automated'. In Rama's
>>> case, the annotations would be 'RCA' with an 'automated' qualifier.
>>>
>>> I realize the issues involved in making such a drastic change, so I
>>> understand if we don't go there, but I do think that some approach
>>> such as the one above is the best representation of the information
>>> that we are trying to capture.
>>>
>>> Cheers,
>>> Kara
>>>
>>> On Mar 26, 2008, at 4:30 PM, Rama Balakrishnan wrote:
>>>
>>>
>>>> Hi All,
>>>>
>>>> SGD has come across a couple of computationally predicted GO
>>>> annotation data sets for S. cerevisiae that we would like to add to
>>>> our database. The GO annotations from these data sets are
>>>> predictions based on multiple high-throughput data sets. RCA
>>>> evidence code came to our minds but according to the documentation,
>>>> the annotations all have to be manually reviewed by a curator to
>>>> use this evidence. There are several 100 annotations of this kind
>>>> and it is not feasible for us to manually review these annotations.
>>>>
>>>> Hence, we thought these annotations can be bulk loaded with IEA
>>>> evidence code. However, in the Jan 2007 (Cambridge) GO meeting, it
>>>> was decided that the 'with' column information has to be filled in
>>>> for all IEAs (else Mike's filtering script strips them out). But
>>>> these GO annotations, being predictions based on multiple
>>>> high-throughput data sets, don't have any information for the with
>>>> column. So, we are left with no choice.
>>>>
>>>> Which evidence code do people think should be used for these kinds
>>>> of computational datasets when there is not an obvious "with"?
>>>>
>>>> Thanks for your input.
>>>>
>>>> Rama
>>>>