[go] Requirement for all 'unknown' annotations to use ND code
David Hill
dph at informatics.jax.org
Thu Sep 13 04:49:02 PDT 2007
Hi Everyone,
I may be missing something, but when would we ever use ND and not
annotate to the root node? When would we ever annotate to the root node
and not use ND as an evidence code? I can't think of an example.
David
Karen Christie wrote:
> Responding only to this portion of the discussion:
>
>>> Valerie Wood (22 Jun 2007)
>>>>
>>>> 3. I'm pretty sure that when the unknowns disappeared, we advised
>>>> software developers that they could retrieve the unknown annotations
>>>> using the ND evidence code.....
>
> On Wed, 12 Sep 2007, Chris Mungall wrote:
>>
>> I hope not!
>
>
> Actually, as I understand it, the reason stated at the Jan GO meeting
> for disallowing unknown/root annotations to be made with any other
> code than ND was so that we --could-- advise software developers that
> they could retrieve the unknown annotations using the ND evidence code.
>
> I think that the evidence code statements should be only about the
> type of evidence, not about anything else. They are not statements of
> quality and I don't think that we should encode additional meaning
> into ND by saying that all unknown/root annotations will be made with
> this code. There are occasionally times where using an author
> statement is valid.
>
> -Karen
>
>
>
>>
>> On Sep 11, 2007, at 12:17 AM, Karen Christie wrote:
>>
>>> Requirement for all 'unknown' annotations to use ND code
>>> -----------------------------------------------------------
>>>
>>> Hi all,
>>>
>>> A question was brought up within the Evidence Code Committee about
>>> the requirement that ND be the only evidence code allowed for
>>> (unknown) annotations to the root nodes, and was not resolved
>>> there. Discussion so far on the list is also mixed.
>>>
>>> To me, the issue is that at the Jan GO meeting we agreed that
>>> evidence codes are ONLY about the type of evidence used to make the
>>> annotation, and not about anything else. However, by saying that
>>> people can use the ND evidence code as a way to find all the unknown
>>> annotations, we are encoding an extra meaning into it.
>>>
>>> The email discussion of this issue is below.
>>>
>>> -Karen
>>>
>>>
>>> Requirement that ND be the only allowable evidence code for unknown
>>> annotations
>>>
>>> proposed new rule for ND:
>>> Even if an author states in a paper that there is no data available or
>>> nothing is known about the gene product in a particular GO aspect,
>>> annotation to the corresponding root node should be made with ND
>>> evidence code citing either the annotating group's internal reference
>>> or the GOC's reference on use of the ND evidence code, not a specific
>>> paper.
>>>
>>> comment in red in draft document:
>>> I realize that we agreed to the above statement at the last GOC
>>> meeting, but...
>>>
>>> The more I think about it, the more I'm uncomfortable with the
>>> decision that we made that unknown annotations can only be made with
>>> ND, especially since the reason stated to do so has nothing to do with
>>> evidence, but is to help people better identify the unknown
>>> annotations.
>>>
>>> I think this is encoding information into the evidence code that is
>>> about something other than the evidence itself. I think this is poor
>>> practice, especially when we spent so much time at the Jan GO meeting
>>> discussing that evidence codes would be JUST a statement of the method
>>> by which the annotation was made.
>>>
>>>
>>> Jane Lomax (15 Jun 2007)
>>>
>>> I was under the impression that we'd agreed 2. at the Jan meeting
>>> i.e. ND is now the only allowable evidence code for unknown
>>> annotations?
>>>
>>>
>>> Midori Harris (15 Jun 2007)
>>>
>>> I understand, and would add that it also loses the information that at
>>> the time of writing, the authors -- who are presumably pretty well
>>> informed about the genes/gene products they study -- are aware of no
>>> relevant data. (Tho this concern is not as grave as that of
>>> overloading an evidence code.)
>>>
>>>
>>> Valerie Wood (22 Jun 2007)
>>>>
>>>> I'm not so sure because:
>>>>
>>>> 1. If authors have specifically asserted that there is no
>>>> information,
>>>> this is usually a statement which is made based on looking at the
>>>> database (for example if the author is dealing with a gene set).
>>>>
>>>> 2. Papers are frequently published concurrently and it is clear that
>>>> the authors have no knowledge of the parallel papers, so an author
>>>> statement is not always necessarily a good indication that there
>>>> is no
>>>> functional data without a curator check.
>>>>
>>>> 3. I'm pretty sure that when the unknowns disappeared, we advised
>>>> software developers that they could retrieve the unknown annotations
>>>> using the ND evidence code.....
>>
>> I hope not!
>>
>>>> Although I agree it seems bad practice to put info in the evidence
>>>> code other than the evidence itself, I think it's more important that
>>>> there is a very clear way to identify 'unknown' annotations.
>>
>> I think the current practice is reasonably clear.
>>
>> Software has an unambiguous way to find genes that have been studied
>> and nothing specifically is known about their process, function or
>> localization, without use of an evidence code. The software can then
>> take appropriate action when reporting these to the user. I would
>> argue that the software doesn't have to do much, other than a little
>> cosmetic enhancement to explain what the implications of direct
>> annotation to root are.
>>
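A rough sketch of the point above, for anyone who wants to check their own
file: direct root annotations can be picked out by GO ID alone, and the
evidence codes attached to them can then be tallied separately. This assumes
a tab-delimited gene association file with the GO ID in column 5 and the
evidence code in column 7; the function name and the sample rows are made up
for illustration.

```python
from collections import Counter

# Root term IDs of the three GO aspects.
ROOT_TERMS = {
    "GO:0008150": "biological_process",
    "GO:0003674": "molecular_function",
    "GO:0005575": "cellular_component",
}

def root_annotation_evidence(gaf_lines):
    """Tally evidence codes on direct root annotations.

    Also collects (gene, GO ID) pairs where ND is used on a non-root term.
    Assumes column 2 = object ID, column 5 = GO ID, column 7 = evidence code.
    """
    root_codes = Counter()
    nd_off_root = []
    for line in gaf_lines:
        # Skip blank lines and "!" comment/header lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue
        gene, go_id, evidence = cols[1], cols[4], cols[6]
        if go_id in ROOT_TERMS:
            root_codes[evidence] += 1
        elif evidence == "ND":
            nd_off_root.append((gene, go_id))
    return root_codes, nd_off_root

# Made-up rows, not real annotations.
sample = [
    "! hypothetical example rows",
    "DB\tGENE1\tgene1\t\tGO:0008150\tREF:1\tND\t\tP",
    "DB\tGENE2\tgene2\t\tGO:0008150\tREF:2\tNAS\t\tP",
    "DB\tGENE3\tgene3\t\tGO:0006412\tREF:3\tIDA\t\tP",
]
root_codes, nd_off_root = root_annotation_evidence(sample)
print(dict(root_codes), nd_off_root)
```

If ND and root annotation always coincide, the tally contains only ND and
the second list is empty; any other result shows where the two differ.
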
>>>> It seems like not many of the software tools have caught up with the
>>>> previous change to unknowns (for example I haven't yet managed to find
>>>> a way to look at GO term enrichment which recognises the unknown
>>>> annotations.... does anybody know of one?)
>>
>> What changes are required? I think most software will just do the
>> right thing by default, without requiring modifications or
>> special-case handling.
>>
>> I think GO::TermFinder recognises them, in that it reports any genes
>> in the set annotated to the root of BP as being annotated to
>> "biological process unknown". However, this is just a cosmetic
>> reporting issue; I don't think GO::TF has any special case code when
>> it comes to calculating p-values. I think the GO::TF behaviour is
>> correct (although I would split hairs and argue that it should report
>> direct annotations to root as something like "directly annotated to
>> biological process, which means the gene has been examined and nothing
>> is known specifically about the biological processes it participates
>> in").
>>
>> In the old GO, using some tools, it would have been possible to get
>> the unknown pseudo-term as being enriched, if the tool did not have
>> special-purpose behaviour for these terms. For example, doing an
>> enrichment analysis on:
>>
>> YPR204W
>> YPR203W
>> YPR202W
>> YPR196W
>> ...
>>
>> Would have yielded "biological process unknown" as an enriched term.
>> I don't know what GO::TF used to do here, but in the current version
>> I tried putting all ~1400 bp-unknown genes in and failed to get
>> GO::TF to report anything with a p-value. I think GO::TF has the
>> correct behaviour here. I also think that GO::TF is doing the correct
>> thing by default here, and is not treating the root node any
>> differently.
>>
>> I am sure that in the past, at least some tools would have reported
>> "biological process unknown" as being enriched, and would have
>> provided a p-value. I would argue that this was dubious.
>>
>> With the new way of handling lack of knowledge (ie root nodes), it is
>> much harder for software to get it wrong. It was the old way that
>> required special case code.
>>
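To make the argument above concrete, here is a minimal sketch using a plain
hypergeometric upper tail. It is not a description of what GO::TermFinder
actually does internally. The background size of 6000 and the study set of
50 are hypothetical, and the background is assumed to contain only annotated
genes, so every one of them reaches the root by propagation.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the background population
    K: background genes annotated to the term (directly or via descendants)
    n: genes in the study set
    k: study-set genes annotated to the term
    """
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total

N = 6000        # hypothetical background size
UNKNOWN = 1400  # roughly the number of bp-unknown genes mentioned above
n = 50          # hypothetical study set, every gene of unknown process

# Old model: a separate "biological process unknown" term held only the
# unknown genes, so a study set made of them looks strongly enriched.
p_old = enrichment_p_value(k=n, n=n, K=UNKNOWN, N=N)

# Current model: unknown genes sit directly on the root, and every
# background gene is annotated to the root, so K == N.
p_new = enrichment_p_value(k=n, n=n, K=N, N=N)

print(p_old, p_new)
```

With the root as the term, the study set cannot contain more root-annotated
genes than chance would give, so the p-value is 1 without any special-case
handling. Under the old model the same gene set gives a very small p-value
for the unknown pseudo-term unless the tool filters it out explicitly.
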
>> I would concede that it can sometimes be useful to know if a
>> significant proportion of a gene set were of unknown function,
>> although here I do not think it is so useful to distinguish "known
>> unknowns" (NDs to roots) from "unknown unknowns" (no annotation). I
>> think this kind of result, if useful, falls naturally out of a more
>> general analysis of the information content of a gene set, without
>> any requirements for handling the root nodes as a special case.
>>
>> Disclaimer: I'm no statistician. I'd welcome a wider discussion with
>> some of the enrichment tool developers.
>>