[go] Requirement for all 'unknown' annotations to use ND code
Karen Christie
kchris at genome.Stanford.EDU
Wed Sep 12 14:05:19 PDT 2007
Responding only to this portion of the discussion:
>> Valerie Wood (22 Jun 2007)
>>>
>>> 3. I'm pretty sure that when the unknowns disappeared, we advised
>>> software developers that they could retrieve the unknown annotations
>>> using the ND evidence code.....
On Wed, 12 Sep 2007, Chris Mungall wrote:
>
> I hope not!
Actually, as I understand it, the reason stated at the Jan GO meeting for
disallowing unknown/root annotations to be made with any code other than
ND was so that we --could-- advise software developers that they could
retrieve the unknown annotations using the ND evidence code.
I think that the evidence code statements should be only about the type of
evidence, not about anything else. They are not statements of quality and
I don't think that we should encode additional meaning into ND by saying
that all unknown/root annotations will be made with this code. There are
occasionally times where using an author statement is valid.
-Karen
>
> On Sep 11, 2007, at 12:17 AM, Karen Christie wrote:
>
>> Requirement for all 'unknown' annotations to use ND code
>> -----------------------------------------------------------
>>
>> Hi all,
>>
>> A question was brought up within the Evidence Code Committee about the
>> requirement that ND be the only evidence code allowed for (unknown)
>> annotations to the root nodes, and it was not resolved there.
>> Discussion so far on the list is also mixed.
>>
>> To me, the issue is that at the Jan GO meeting we agreed that
>> evidence codes are ONLY about the type of evidence used to make the
>> annotation, and not about anything else. However, by saying that
>> people can use the ND evidence code as a way to find all the unknown
>> annotations, we are encoding an extra meaning into it.
>>
>> The email discussion of this issue is below.
>>
>> -Karen
>>
>>
>> Requirement that ND be the only allowable evidence code for unknown
>> annotations
>>
>> proposed new rule for ND:
>> Even if an author states in a paper that there is no data available or
>> nothing is known about the gene product in a particular GO aspect,
>> annotation to the corresponding root node should be made with ND
>> evidence code citing either the annotating group's internal reference
>> or the GOC's reference on use of the ND evidence code, not a specific
>> paper.
>>
>> comment in red in draft document:
>> I realize that we agreed to the above statement at the last GOC
>> meeting, but...
>>
>> The more I think about it, the more I'm uncomfortable with the
>> decision that we made that unknown annotations can only be made with
>> ND, especially since the reason stated for doing so has nothing to do with
>> evidence, but is to help people better identify the unknown
>> annotations.
>>
>> I think this is encoding information into the evidence code that is
>> about something other than the evidence itself. I think this is poor
>> practice, especially when we spent so much time at the Jan GO meeting
>> discussing that evidence codes would be JUST a statement of the method
>> by which the annotation was made.
>>
>>
>> Jane Lomax (15 Jun 2007)
>>
>> I was under the impression that we'd agreed 2. at the Jan meeting
>> i.e. ND is now the only allowable evidence code for unknown
>> annotations?
>>
>>
>> Midori Harris (15 Jun 2007)
>>
>> I understand, and would add that it also loses the information that at
>> the time of writing, the authors -- who are presumably pretty well
>> informed about the genes/gene products they study -- are aware of no
>> relevant data. (Though this concern is not as grave as that of
>> overloading an evidence code.)
>>
>>
>> Valerie Wood (22 Jun 2007)
>>>
>>> I'm not so sure because:
>>>
>>> 1. If authors have specifically asserted that there is no information,
>>> this is usually a statement which is made based on looking at the
>>> database (for example if the author is dealing with a gene set).
>>>
>>> 2. Papers are frequently published concurrently and it is clear that
>>> the authors have no knowledge of the parallel papers, so an author
>>> statement is not always necessarily a good indication that there is no
>>> functional data without a curator check.
>>>
>>> 3. I'm pretty sure that when the unknowns disappeared, we advised
>>> software developers that they could retrieve the unknown annotations
>>> using the ND evidence code.....
>
> I hope not!
>
>>> Although I agree it seems bad practice to put info in the evidence
>>> code other than the evidence itself, I think it's more important that
>>> there is a very clear way to identify 'unknown' annotations.
>
> I think the current practice is reasonably clear.
>
> Software has an unambiguous way to find genes that have been studied and
> nothing specifically is known about their process, function or localization,
> without use of an evidence code. The software can then take appropriate
> action when reporting these to the user. I would argue that the software
> doesn't have to do much, other than a little cosmetic enhancement to explain
> what the implications of direct annotation to root are.
>
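The point above, that root annotations can be found without looking at
the evidence code, can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the input is a tab-delimited
gene association file with the object ID in column 2, the symbol in
column 3, the GO ID in column 5 and the evidence code in column 7, and
the function name root_annotations is hypothetical.

```python
# GO IDs of the three root terms, one per aspect.
ROOT_TERMS = {
    "GO:0008150": "biological_process",
    "GO:0003674": "molecular_function",
    "GO:0005575": "cellular_component",
}


def root_annotations(gaf_lines):
    """Yield (object_id, symbol, go_id, evidence) for direct root annotations.

    Selection is by GO ID only; the evidence code is passed through
    but never used as a filter.
    """
    for line in gaf_lines:
        # Skip blank lines and "!" comment/header lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # not a usable annotation row
        go_id = cols[4]
        if go_id in ROOT_TERMS:
            yield cols[1], cols[2], go_id, cols[6]
```

Because the filter is the term and not the code, a root annotation made
with an author statement would be returned alongside the ND ones, so
nothing extra has to be encoded into ND for retrieval to work.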
>>> It seems like not many software tools have caught up with the
>>> previous change to unknowns (for example I haven't yet managed to find
>>> a way to look at GO term enrichment which recognises the unknown
>>> annotations.... does anybody know of one?)
>
> What changes are required? I think most software will just do the right thing
> by default, without requiring modifications or special-case handling.
>
> I think GO::TermFinder recognises them, in that it reports any genes in the
> set annotated to the root of BP as being annotated to "biological process
> unknown". However, this is just a cosmetic reporting issue; I don't think
> GO::TF has any special case code when it comes to calculating p-values. I
> think the GO::TF behaviour is correct (although I would split hairs and argue
> that it should report direct annotations to root as something like "directly
> annotated to biological process, which means the gene has been examined and
> nothing is known specifically about the biological processes it participates
> in").
>
> In the old GO, using some tools, it would have been possible to get the
> unknown pseudo-term as being enriched, if the tool did not have
> special-purpose behaviour for these terms. For example, doing an enrichment
> analysis on:
>
> YPR204W
> YPR203W
> YPR202W
> YPR196W
> ...
>
> Would have yielded "biological process unknown" as an enriched term. I don't
> know what GO::TF used to do here, but in the current version I tried putting
> all ~1400 bp-unknown genes in and failed to get GO::TF to report anything
> with a p-value. I think GO::TF has the correct behaviour here. I also think
> that GO::TF is doing the correct thing by default here, and is not treating
> the root node any differently.
>
> I am sure that in the past, at least some tools would have reported
> "biological process unknown" as being enriched, and would have provided a
> p-value. I would argue that this was dubious.
>
> With the new way of handling lack of knowledge (ie root nodes), it is much
> harder for software to get it wrong. It was the old way that required special
> case code.
>
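The difference between the old pseudo-term and the root node can be
illustrated with a small calculation. This sketch assumes a one-sided
hypergeometric test, which is a common choice for term enrichment; I have
not checked it against what GO::TermFinder actually does. The background
size of 6000 genes is a made-up number for illustration, and the function
name is hypothetical. The 1400 is the approximate bp-unknown count
mentioned above.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k term-carrying genes in a
    sample of n, drawn without replacement from N genes of which K
    carry the term."""
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


N = 6000        # hypothetical number of annotated genes in the background
UNKNOWN = 1400  # approximate number of bp-unknown genes
SAMPLE = 4      # e.g. a four-gene input list, all of them bp-unknown

# Old style: "biological process unknown" is an ordinary child term,
# carried by only the 1400 unknown genes, so it can come out enriched.
p_old = hypergeom_pvalue(SAMPLE, SAMPLE, UNKNOWN, N)

# New style: the annotation is to the root, and after propagation every
# annotated gene carries the root term, so K equals N.
p_root = hypergeom_pvalue(SAMPLE, SAMPLE, N, N)

print(p_old)   # a small p-value
print(p_root)  # 1.0
```

With K equal to N the p-value is 1.0 for any sample, so a tool that
treats the root like every other term never reports it as enriched, and
no special-case code is involved.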
> I would concede that it can sometimes be useful to know if a significant
> proportion of a gene set were of unknown function, although here I do not
> think it is so useful to distinguish "known unknowns" (NDs to roots) from
> "unknown unknowns" (no annotation). I think this kind of result, if useful,
> falls naturally out of a more general analysis of the information content of
> a gene set, without any requirements for handling the root nodes as a special
> case.
>
> Disclaimer: I'm no statistician. I'd welcome a wider discussion with some of
> the enrichment tool developers.
>