[go] Requirement for all 'unknown' annotations to use ND code

Karen Christie kchris at genome.Stanford.EDU
Wed Sep 12 14:05:19 PDT 2007


Responding only to this portion of the discussion:

>> Valerie Wood (22 Jun 2007)
>>>
>>>  3. I'm pretty sure that when the unknowns disappeared, we advised
>>>  software developers that they could retrieve the unknown annotations
>>>  using the ND evidence code.....

On Wed, 12 Sep 2007, Chris Mungall wrote:
>
> I hope not!


Actually, as I understand it, the reason stated at the Jan GO meeting for 
disallowing unknown/root annotations with any code other than ND was 
precisely so that we --could-- advise software developers that they could 
retrieve the unknown annotations using the ND evidence code.

I think that the evidence code statements should be only about the type of 
evidence, not about anything else. They are not statements of quality and 
I don't think that we should encode additional meaning into ND by saying 
that all unknown/root annotations will be made with this code. There are 
occasionally times where using an author statement is valid.

-Karen



>
> On Sep 11, 2007, at 12:17 AM, Karen Christie wrote:
>
>> Requirement for all 'unknown' annotations to use ND code
>> -----------------------------------------------------------
>> 
>> Hi all,
>> 
>> A question about the requirement that ND be the only evidence code
>> allowed for (unknown) annotations to the root nodes was brought up
>> within the Evidence Code Committee, and was not resolved there.
>> Discussion so far on the list is also mixed.
>> 
>> To me, the issue is that at the Jan GO meeting we agreed that
>> evidence codes are ONLY about the type of evidence used to make the
>> annotation, and not about anything else. However, by saying that
>> people can use the ND evidence code as a way to find all the unknown
>> annotations, we are encoding an extra meaning into it.
>> 
>> The email discussion of this issue is below.
>> 
>> -Karen
>> 
>> 
>> Requirement that ND be the only allowable evidence code for unknown
>> annotations
>> 
>> proposed new rule for ND:
>>  Even if an author states in a paper that there is no data available or
>>  nothing is known about the gene product in a particular GO aspect,
>>  annotation to the corresponding root node should be made with ND
>>  evidence code citing either the annotating group's internal reference
>>  or the GOC's reference on use of the ND evidence code, not a specific
>>  paper.
>> 
>> comment in red in draft document:
>>  I realize that we agreed to the above statement at the last GOC
>>  meeting, but...
>>
>>  The more I think about it, the more I'm uncomfortable with the
>>  decision that we made that unknown annotations can only be made with
>>  ND, especially since the reason stated to do so has nothing to do with
>>  evidence, but is to help people better identify the unknown
>>  annotations.
>>
>>  I think this is encoding information into the evidence code that is
>>  about something other than the evidence itself. I think this is poor
>>  practice, especially when we spent so much time at the Jan GO meeting
>>  discussing that evidence codes would be JUST a statement of the method
>>  by which the annotation was made.
>> 
>> 
>> Jane Lomax (15 Jun 2007)
>>
>>  I was under the impression that we'd agreed 2. at the Jan meeting
>>  i.e. ND is now the only allowable evidence code for unknown
>>  annotations?
>> 
>> 
>> Midori Harris (15 Jun 2007)
>>
>>  I understand, and would add that it also loses the information that at
>>  the time of writing, the authors -- who are presumably pretty well
>>  informed about the genes/gene products they study -- are aware of no
>>  relevant data.  (Tho this concern is not as grave as that of
>>  overloading an evidence code.)
>> 
>> 
>> Valerie Wood (22 Jun 2007)
>>>
>>>  I'm not so sure because:
>>>
>>>  1. If authors have specifically asserted that there is no information,
>>>  this is usually a statement which is made based on looking at the
>>>  database (for example if the author is dealing with a gene set).
>>>
>>>  2. Papers are frequently published concurrently and it is clear that
>>>  the authors have no knowledge of the parallel papers, so an author
>>>  statement is not always necessarily a good indication that there is no
>>>  functional data without a curator check.
>>>
>>>  3. I'm pretty sure that when the unknowns disappeared, we advised
>>>  software developers that they could retrieve the unknown annotations
>>>  using the ND evidence code.....
>
> I hope not!
>
>>>  Although I agree it seems bad practice to put info in the evidence
>>>  code other than the evidence itself, I think it's more important that
>>>  there is a very clear way to identify 'unknown' annotations.
>
> I think the current practice is reasonably clear.
>
> Software has an unambiguous way to find genes that have been studied and 
> nothing specifically is known about their process, function or localization, 
> without use of an evidence code. The software can then take appropriate 
> action when reporting these to the user. I would argue that the software 
> doesn't have to do much, other than a little cosmetic enhancement to explain 
> what the implications of direct annotation to root are.
>
>>>  It seems like not many software tools have caught up with the
>>>  previous change to unknowns (for example I haven't yet managed to find
>>>  a way to look at GO term enrichment which recognises the unknown
>>>  annotations.... does anybody know of one?)
>
> What changes are required? I think most software will just do the right thing 
> by default, without requiring modifications or special-case handling.
>
> I think GO::TermFinder recognises them, in that it reports any genes in the 
> set annotated to the root of BP as being annotated to "biological process 
> unknown". However, this is just a cosmetic reporting issue, I don't think 
> GO::TF has any special case code when it comes to calculating p-values. I 
> think the GO::TF behaviour is correct (although I would split hairs and argue 
> that it should report direct annotations to root as something like "directly 
> annotated to biological process, which means the gene has been examined and 
> nothing is known specifically about the biological processes it participates 
> in").
>
> In the old GO, using some tools, it would have been possible to get the 
> unknown pseudo-term as being enriched, if the tool did not have 
> special-purpose behaviour for these terms. For example, doing an enrichment 
> analysis on:
>
> YPR204W
> YPR203W
> YPR202W
> YPR196W
> ...
>
> Would have yielded "biological process unknown" as an enriched term. I don't 
> know what GO::TF used to do here, but in the current version I tried putting 
> all ~1400 bp-unknown genes in and failed to get GO::TF to report anything 
> with a p-value. I think GO::TF has the correct behaviour here. I also think 
> that GO::TF is doing the correct thing by default here, and is not treating 
> the root node any differently.
>
> I am sure that in the past, at least some tools would have reported 
> "biological process unknown" as being enriched, and would have provided a 
> p-value. I would argue that this was dubious.
>
> With the new way of handling lack of knowledge (ie root nodes), it is much 
> harder for software to get it wrong. It was the old way that required special 
> case code.
>
> I would concede that it can sometimes be useful to know if a significant 
> proportion of a gene set were of unknown function, although here I do not 
> think it is so useful to distinguish "known unknowns" (NDs to roots) from 
> "unknown unknowns" (no annotation). I think this kind of result, if useful, 
> falls naturally out of a more general analysis of the information content of 
> a gene set, without any requirements for handling the root nodes as a special 
> case.
>
> Disclaimer: I'm no statistician. I'd welcome a wider discussion with some of 
> the enrichment tool developers.
>
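
To make the two retrieval strategies discussed in this thread concrete, here
is a minimal Python sketch. It assumes tab-separated annotation lines in the
GO annotation file column order (object ID in column 2, GO ID in column 5,
evidence code in column 7) and the three root term IDs GO:0008150,
GO:0003674 and GO:0005575. The sample records, gene names and references are
invented for illustration, and the helper names are hypothetical rather than
part of any existing tool.

```python
# Root terms of the three GO aspects: process, function, component.
ROOT_TERMS = {"GO:0008150", "GO:0003674", "GO:0005575"}


def parse_annotations(lines):
    """Yield (object_id, go_id, evidence_code) tuples, skipping '!' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        yield cols[1], cols[4], cols[6]


def unknowns_by_root(records):
    """'Unknown' annotations found by term: direct annotations to a root node."""
    return {(gene, term) for gene, term, _ in records if term in ROOT_TERMS}


def unknowns_by_nd(records):
    """'Unknown' annotations found by evidence code: anything tagged ND."""
    return {(gene, term) for gene, term, code in records if code == "ND"}


# Invented toy records (not real annotations): GENE_B is annotated to the
# root with an author statement (NAS) instead of ND.
SAMPLE = [
    "!toy annotation lines, invented for illustration",
    "\t".join(["DB", "GENE_A", "geneA", "", "GO:0008150", "REF:1", "ND", "", "P"]),
    "\t".join(["DB", "GENE_B", "geneB", "", "GO:0008150", "REF:2", "NAS", "", "P"]),
    "\t".join(["DB", "GENE_C", "geneC", "", "GO:0006412", "REF:3", "IDA", "", "P"]),
]

records = list(parse_annotations(SAMPLE))
by_root = unknowns_by_root(records)
by_nd = unknowns_by_nd(records)

# Root annotations that an ND-only query would miss.
missed_by_nd = by_root - by_nd
print(sorted(missed_by_nd))
```

If every root annotation were required to use ND, the two sets would be
identical and either query would do. If author statements are allowed on root
annotations, only the term-based query finds them all, and it does so without
assigning any extra meaning to the evidence code.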


