[go] Requirement for all 'unknown' annotations to use ND code

Pascale Gaudet pgaudet at northwestern.edu
Thu Sep 13 05:24:38 PDT 2007


David Hill wrote:
> Hi Everyone,
>
> I may be missing something, but when would we ever use ND and not 
> annotate to the root node? When would we ever annotate to the root 
> node and not use ND as an evidence code? 
I think some people use NAS for the latter case (is that right?). But 
that seems incorrect to me: If we use a paper to make the annotation to 
the root term, there is still no data.

Pascale


> I can't think of an example.
>
> David
>
>
> Karen Christie wrote:
>> Responding only to this portion of the discussion:
>>
>>>> Valerie Wood (22 Jun 2007)
>>>>>
>>>>>  3. I'm pretty sure that when the unknowns disappeared, we advised
>>>>>  software developers that they could retrieve the unknown annotations
>>>>>  using the ND evidence code.....
>>
>> On Wed, 12 Sep 2007, Chris Mungall wrote:
>>>
>>> I hope not!
>>
>>
>> Actually, as I understand it, the reason stated at the Jan GO meeting 
>> for disallowing unknown/root annotations to be made with any other 
>> code than ND was so that we --could-- advise software developers that 
>> they could retrieve the unknown annotations using the ND evidence code.
>>
>> I think that the evidence code statements should be only about the 
>> type of evidence, not about anything else. They are not statements of 
>> quality and I don't think that we should encode additional meaning 
>> into ND by saying that all unknown/root annotations will be made with 
>> this code. There are occasionally times where using an author 
>> statement is valid.
>>
>> -Karen
>>
>>
>>
>>>
>>> On Sep 11, 2007, at 12:17 AM, Karen Christie wrote:
>>>
>>>> Requirement for all 'unknown' annotations to use ND code
>>>> -----------------------------------------------------------
>>>>
>>>> Hi all,
>>>>
>>>> A question was brought up within the Evidence Code Committee about
>>>> the requirement that ND be the only evidence code allowed for
>>>> (unknown) annotations to the root nodes, and it was not resolved
>>>> there. Discussion so far on the list is also mixed.
>>>>
>>>> To me, the issue is that at the Jan GO meeting we agreed that
>>>> evidence codes are ONLY about the type of evidence used to make the
>>>> annotation, and not about anything else. However, by saying that
>>>> people can use the ND evidence code as a way to find all the unknown
>>>> annotations, we are encoding an extra meaning into it.
>>>>
>>>> The email discussion of this issue is below.
>>>>
>>>> -Karen
>>>>
>>>>
>>>> Requirement that ND be the only allowable evidence code for unknown
>>>> annotations
>>>>
>>>> proposed new rule for ND:
>>>>  Even if an author states in a paper that there is no data available or
>>>>  nothing is known about the gene product in a particular GO aspect,
>>>>  annotation to the corresponding root node should be made with ND
>>>>  evidence code citing either the annotating group's internal reference
>>>>  or the GOC's reference on use of the ND evidence code, not a specific
>>>>  paper.
>>>>
>>>> comment in red in draft document:
>>>>  I realize that we agreed to the above statement at the last GOC
>>>>  meeting, but...
>>>>
>>>>  The more I think about it, the more I'm uncomfortable with the
>>>>  decision that we made that unknown annotations can only be made with
>>>>  ND, especially since the reason stated to do so has nothing to do with
>>>>  evidence, but is to help people better identify the unknown
>>>>  annotations.
>>>>
>>>>  I think this is encoding information into the evidence code that is
>>>>  about something other than the evidence itself. I think this is poor
>>>>  practice, especially when we spent so much time at the Jan GO meeting
>>>>  discussing that evidence codes would be JUST a statement of the method
>>>>  by which the annotation was made.
>>>>
>>>>
>>>> Jane Lomax (15 Jun 2007)
>>>>
>>>>  I was under the impression that we'd agreed 2. at the Jan meeting
>>>>  i.e. ND is now the only allowable evidence code for unknown
>>>>  annotations?
>>>>
>>>>
>>>> Midori Harris (15 Jun 2007)
>>>>
>>>>  I understand, and would add that it also loses the information that at
>>>>  the time of writing, the authors -- who are presumably pretty well
>>>>  informed about the genes/gene products they study -- are aware of no
>>>>  relevant data.  (Tho this concern is not as grave as that of
>>>>  overloading an evidence code.)
>>>>
>>>>
>>>> Valerie Wood (22 Jun 2007)
>>>>>
>>>>>  I'm not so sure because:
>>>>>
>>>>>  1. If authors have specifically asserted that there is no information,
>>>>>  this is usually a statement which is made based on looking at the
>>>>>  database (for example if the author is dealing with a gene set).
>>>>>
>>>>>  2. Papers are frequently published concurrently and it is clear that
>>>>>  the authors have no knowledge of the parallel papers, so an author
>>>>>  statement is not always necessarily a good indication that there is no
>>>>>  functional data without a curator check.
>>>>>
>>>>>  3. I'm pretty sure that when the unknowns disappeared, we advised
>>>>>  software developers that they could retrieve the unknown annotations
>>>>>  using the ND evidence code.....
>>>
>>> I hope not!
>>>
>>>>>  Although I agree it seems bad practice to put info in the evidence
>>>>>  code other than the evidence itself, I think it's more important that
>>>>>  there is a very clear way to identify 'unknown' annotations.
>>>
>>> I think the current practice is reasonably clear.
>>>
>>> Software has an unambiguous way to find genes that have been studied 
>>> and nothing specifically is known about their process, function or 
>>> localization, without use of an evidence code. The software can then 
>>> take appropriate action when reporting these to the user. I would 
>>> argue that the software doesn't have to do much, other than a little 
>>> cosmetic enhancement to explain what the implications of direct 
>>> annotation to root are.
>>>
>>>>>  It seems like not many of the software tools have caught up with the
>>>>>  previous change to unknowns (for example I haven't yet managed to find
>>>>>  a way to look at GO term enrichment which recognises the unknown
>>>>>  annotations.... does anybody know of one?)
>>>
>>> What changes are required? I think most software will just do the 
>>> right thing by default, without requiring modifications or 
>>> special-case handling.
>>>
>>> I think GO::TermFinder recognises them, in that it reports any genes 
>>> in the set annotated to the root of BP as being annotated to 
>>> "biological process unknown". However, this is just a cosmetic 
>>> reporting issue, I don't think GO::TF has any special case code when 
>>> it comes to calculating p-values. I think the GO::TF behaviour is 
>>> correct (although I would split hairs and argue that it should 
>>> report direct annotations to root as something like "directly 
>>> annotated to biological process, which means the gene has been 
>>> examined and nothing is known specifically about the biological 
>>> processes it participates in").
>>>
>>> In the old GO, using some tools, it would have been possible to get 
>>> the unknown pseudo-term as being enriched, if the tool did not have 
>>> special-purpose behaviour for these terms. For example, doing an 
>>> enrichment analysis on:
>>>
>>> YPR204W
>>> YPR203W
>>> YPR202W
>>> YPR196W
>>> ...
>>>
>>> would have yielded "biological process unknown" as an enriched term. 
>>> I don't know what GO::TF used to do here, but in the current version 
>>> I tried putting all ~1400 bp-unknown genes in and failed to get 
>>> GO::TF to report anything with a p-value. I think GO::TF has the 
>>> correct behaviour here. I also think that GO::TF is doing the 
>>> correct thing by default here, and is not treating the root node any 
>>> differently.
>>>
>>> I am sure that in the past, at least some tools would have reported 
>>> "biological process unknown" as being enriched, and would have 
>>> provided a p-value. I would argue that this was dubious.
>>>
>>> With the new way of handling lack of knowledge (ie root nodes), it 
>>> is much harder for software to get it wrong. It was the old way that 
>>> required special case code.
>>>
>>> I would concede that it can sometimes be useful to know if a 
>>> significant proportion of a gene set were of unknown function, 
>>> although here I do not think it is so useful to distinguish "known 
>>> unknowns" (NDs to roots) from "unknown unknowns" (no annotation). I 
>>> think this kind of result, if useful, falls naturally out of a more 
>>> general analysis of the information content of a gene set, without 
>>> any requirements for handling the root nodes as a special case.
>>>
>>> Disclaimer: I'm no statistician. I'd welcome a wider discussion with 
>>> some of the enrichment tool developers.
>>>
>
>
>



