[go] Requirement for all 'unknown' annotations to use ND code
Pascale Gaudet
pgaudet at northwestern.edu
Thu Sep 13 05:24:38 PDT 2007
David Hill wrote:
> Hi Everyone,
>
> I may be missing something, but when would we ever use ND and not
> annotate to the root node? When would we ever annotate to the root
> node and not use ND as an evidence code?
I think some people use NAS for the latter case (is that right?). But
that seems incorrect to me: If we use a paper to make the annotation to
the root term, there is still no data.
Pascale
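
David's two questions can be checked directly against an annotation file. The following is only a minimal sketch, assuming tab-delimited GAF 2.x rows (DB object ID in column 2, GO ID in column 5, evidence code in column 7). The function name find_mismatches and the sample rows are hypothetical, made up for illustration; the three root term IDs are the real ones.

```python
# Root terms of the three GO aspects (process, function, component).
ROOT_TERMS = {"GO:0008150", "GO:0003674", "GO:0005575"}


def find_mismatches(gaf_lines):
    """Split annotations that break the 'root <=> ND' correspondence.

    Returns two lists of (object_id, go_id, evidence) tuples:
    ND used on a non-root term, and a root term with a non-ND code.
    """
    nd_not_root, root_not_nd = [], []
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # malformed row, ignore in this sketch
        object_id, go_id, evidence = cols[1], cols[4], cols[6]
        is_root = go_id in ROOT_TERMS
        is_nd = evidence == "ND"
        if is_nd and not is_root:
            nd_not_root.append((object_id, go_id, evidence))
        elif is_root and not is_nd:
            root_not_nd.append((object_id, go_id, evidence))
    return nd_not_root, root_not_nd


# Hypothetical sample rows (identifiers and references are invented).
SAMPLE = [
    "!gaf-version: 2.0",
    "DB\tGENE1\tg1\t\tGO:0008150\tREF:hypothetical\tND\t\tP",
    "DB\tGENE2\tg2\t\tGO:0008150\tPMID:0\tNAS\t\tP",
    "DB\tGENE3\tg3\t\tGO:0006412\tPMID:0\tIDA\t\tP",
]

nd_not_root, root_not_nd = find_mismatches(SAMPLE)
print("ND on non-root terms:", nd_not_root)
print("Root terms without ND:", root_not_nd)
```

If both lists come back empty for a real file, ND and "annotated to root" pick out the same annotations, and it does not matter which of the two a tool filters on.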
> I can't think of an example.
>
> David
>
>
> Karen Christie wrote:
>> Responding only to this portion of the discussion:
>>
>>>> Valerie Wood (22 Jun 2007)
>>>>>
>>>>> 3. I'm pretty sure that when the unknowns disappeared, we advised
>>>>> software developers that they could retrieve the unknown annotations
>>>>> using the ND evidence code.....
>>
>> On Wed, 12 Sep 2007, Chris Mungall wrote:
>>>
>>> I hope not!
>>
>>
>> Actually, as I understand it, the reason stated at the Jan GO meeting
>> for disallowing unknown/root annotations to be made with any other
>> code than ND was so that we --could-- advise software developers that
>> they could retrieve the unknown annotations using the ND evidence code.
>>
>> I think that the evidence code statements should be only about the
>> type of evidence, not about anything else. They are not statements of
>> quality and I don't think that we should encode additional meaning
>> into ND by saying that all unknown/root annotations will be made with
>> this code. There are occasionally times where using an author
>> statement is valid.
>>
>> -Karen
>>
>>
>>
>>>
>>> On Sep 11, 2007, at 12:17 AM, Karen Christie wrote:
>>>
>>>> Requirement for all 'unknown' annotations to use ND code
>>>> -----------------------------------------------------------
>>>>
>>>> Hi all,
>>>>
>>>> A question was brought up within the Evidence Code Committee about
>>>> the requirement that ND be the only evidence code allowed for
>>>> (unknown) annotations to the root nodes, and was not resolved
>>>> there. Discussion so far on the list is also mixed.
>>>>
>>>> To me, the issue is that at the Jan GO meeting we agreed that
>>>> evidence codes are ONLY about the type of evidence used to make the
>>>> annotation, and not about anything else. However, by saying that
>>>> people can use the ND evidence code as a way to find all the unknown
>>>> annotations, we are encoding an extra meaning into it.
>>>>
>>>> The email discussion of this issue is below.
>>>>
>>>> -Karen
>>>>
>>>>
>>>> Requirement that ND be the only allowable evidence code for unknown
>>>> annotations
>>>>
>>>> proposed new rule for ND:
>>>> Even if an author states in a paper that there is no data available
>>>> or nothing is known about the gene product in a particular GO aspect,
>>>> annotation to the corresponding root node should be made with ND
>>>> evidence code citing either the annotating group's internal reference
>>>> or the GOC's reference on use of the ND evidence code, not a specific
>>>> paper.
>>>>
>>>> comment in red in draft document:
>>>> I realize that we agreed to the above statement at the last GOC
>>>> meeting, but...
>>>>
>>>> The more I think about it, the more I'm uncomfortable with the
>>>> decision that we made that unknown annotations can only be made with
>>>> ND, especially since the reason stated to do so has nothing to do
>>>> with evidence, but is to help people better identify the unknown
>>>> annotations.
>>>>
>>>> I think this is encoding information into the evidence code that is
>>>> about something other than the evidence itself. I think this is poor
>>>> practice, especially when we spent so much time at the Jan GO meeting
>>>> discussing that evidence codes would be JUST a statement of the
>>>> method by which the annotation was made.
>>>>
>>>>
>>>> Jane Lomax (15 Jun 2007)
>>>>
>>>> I was under the impression that we'd agreed 2. at the Jan meeting
>>>> i.e. ND is now the only allowable evidence code for unknown
>>>> annotations?
>>>>
>>>>
>>>> Midori Harris (15 Jun 2007)
>>>>
>>>> I understand, and would add that it also loses the information that
>>>> at the time of writing, the authors -- who are presumably pretty well
>>>> informed about the genes/gene products they study -- are aware of no
>>>> relevant data. (Though this concern is not as grave as that of
>>>> overloading an evidence code.)
>>>>
>>>>
>>>> Valerie Wood (22 Jun 2007)
>>>>>
>>>>> I'm not so sure because:
>>>>>
>>>>> 1. If authors have specifically asserted that there is no
>>>>> information, this is usually a statement which is made based on
>>>>> looking at the database (for example if the author is dealing with
>>>>> a gene set).
>>>>>
>>>>> 2. Papers are frequently published concurrently and it is clear that
>>>>> the authors have no knowledge of the parallel papers, so an author
>>>>> statement is not necessarily a good indication that there is no
>>>>> functional data without a curator check.
>>>>>
>>>>> 3. I'm pretty sure that when the unknowns disappeared, we advised
>>>>> software developers that they could retrieve the unknown annotations
>>>>> using the ND evidence code.....
>>>
>>> I hope not!
>>>
>>>>> Although I agree it seems bad practice to put info in the evidence
>>>>> code other than the evidence itself, I think it's more important that
>>>>> there is a very clear way to identify 'unknown' annotations.
>>>
>>> I think the current practice is reasonably clear.
>>>
>>> Software has an unambiguous way to find genes that have been studied
>>> and nothing specifically is known about their process, function or
>>> localization, without use of an evidence code. The software can then
>>> take appropriate action when reporting these to the user. I would
>>> argue that the software doesn't have to do much, other than a little
>>> cosmetic enhancement to explain what the implications of direct
>>> annotation to root are.
>>>
>>>>> It seems like not many of the software tools have caught up with the
>>>>> previous change to unknowns (for example I haven't yet managed to
>>>>> find a way to look at GO term enrichment which recognises the
>>>>> unknown annotations.... does anybody know of one?)
>>>
>>> What changes are required? I think most software will just do the
>>> right thing by default, without requiring modifications or
>>> special-case handling.
>>>
>>> I think GO::TermFinder recognises them, in that it reports any genes
>>> in the set annotated to the root of BP as being annotated to
>>> "biological process unknown". However, this is just a cosmetic
>>> reporting issue, I don't think GO::TF has any special case code when
>>> it comes to calculating p-values. I think the GO::TF behaviour is
>>> correct (although I would split hairs and argue that it should
>>> report direct annotations to root as something like "directly
>>> annotated to biological process, which means the gene has been
>>> examined and nothing is known specifically about the biological
>>> processes it participates in").
>>>
>>> In the old GO, using some tools, it would have been possible to get
>>> the unknown pseudo-term as being enriched, if the tool did not have
>>> special-purpose behaviour for these terms. For example, doing an
>>> enrichment analysis on:
>>>
>>> YPR204W
>>> YPR203W
>>> YPR202W
>>> YPR196W
>>> ...
>>>
>>> Would have yielded "biological process unknown" as an enriched term.
>>> I don't know what GO::TF used to do here, but in the current version
>>> I tried putting all ~1400 bp-unknown genes in and failed to get
>>> GO::TF to report anything with a p-value. I think GO::TF has the
>>> correct behaviour here. I also think that GO::TF is doing the
>>> correct thing by default here, and is not treating the root node any
>>> differently.
>>>
>>> I am sure that in the past, at least some tools would have reported
>>> "biological process unknown" as being enriched, and would have
>>> provided a p-value. I would argue that this was dubious.
>>>
>>> With the new way of handling lack of knowledge (ie root nodes), it
>>> is much harder for software to get it wrong. It was the old way that
>>> required special case code.
>>>
>>> I would concede that it can sometimes be useful to know if a
>>> significant proportion of a gene set were of unknown function,
>>> although here I do not think it is so useful to distinguish "known
>>> unknowns" (NDs to roots) from "unknown unknowns" (no annotation). I
>>> think this kind of result, if useful, falls naturally out of a more
>>> general analysis of the information content of a gene set, without
>>> any requirements for handling the root nodes as a special case.
>>>
>>> Disclaimer: I'm no statistician. I'd welcome a wider discussion with
>>> some of the enrichment tool developers.
>>>
>
>
>
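
The enrichment point in the quoted message can be made concrete with a one-sided hypergeometric test, which is a common choice for term enrichment. This is a rough sketch only, not a description of how GO::TermFinder computes its p-values. The population size of 6000 annotated genes and the function name enrichment_p are assumptions for illustration; the figure of 1400 unknown genes is the approximate count mentioned in the thread.

```python
from math import comb


def enrichment_p(population, annotated, sample, hits):
    """Upper-tail hypergeometric p-value, P(X >= hits).

    population: genes in the background set
    annotated:  background genes annotated to the term (directly or
                through its descendants)
    sample:     genes in the query set
    hits:       query genes annotated to the term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total


N = 6000        # assumed background size (hypothetical)
UNKNOWN = 1400  # roughly the number of bp-unknown genes cited above

# Old scheme: "biological process unknown" is an ordinary term covering
# 1400 genes, so a query of 4 unknown genes looks enriched.
p_old = enrichment_p(N, UNKNOWN, 4, 4)

# New scheme: the same genes are annotated to the root, and every
# annotated gene is implicitly annotated to the root as well, so the
# root covers the whole background and can never be enriched.
p_root = enrichment_p(N, N, UNKNOWN, UNKNOWN)

print("old 'unknown' pseudo-term:", p_old)
print("root term:", p_root)
```

Under these assumptions no special-case code is needed for the root: the test itself returns a p-value of 1 for a term that covers the entire background.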