[go] NOT annotations
Chris Mungall
cjm at fruitfly.org
Fri Feb 1 16:01:08 PST 2008
On Feb 1, 2008, at 1:48 PM, Benjamin Hitz wrote:
>
> On Feb 1, 2008, at 12:47 PM, Judith Blake wrote:
>
>>
>> I think by far the most common case right now is not with
>> isoforms, but with conflicting data. And that should be represented.
>
> The following is a philosophical argument, and as such may have
> limited bearing on biology.
>
> I wonder if the value of reporting these is overstated. To head
> off an obvious counter argument, it's data, and valid data should
> not be thrown out - but since we do not curate to an infinite
> depth, we are already "throwing out" information so it's really a
> matter of where you draw the line.
And the line should presumably be drawn by the curator?
We are already throwing out too much information - see for example
the additional information recorded by MGI [some of which we'll have
a place for in the GAFs Real Soon Now].
In addition, the cost/benefit ratio the curator applies should weigh
possible benefit against effort expended. I don't think the
curator should worry about people misinterpreting what they say
(given that the curator is using a standard that is unambiguous).
> It is of course true that a NOT annotation represents both
> experimental and curational work indicating "something" was done.
> But I think that reporting gene_product x term associations via a
> positive standard only has its merits.
>
> Mainly what I am thinking about is the coverage of negative
> experiments reported in the literature and some theoretical
> "completeness" of the GO/annotation system.
>
> In principle, at some theoretical level - if you are using any sort
> of negative standard, you would have to assert (technically) that
> for gene product X it is NOT (by experiment) all GO terms that it
> isn't. I think that GO associations (and biology in general) have
> an implicit NOT.
This is not true. You are making the closed world assumption here:
lack of knowledge == not true.
The CWA is fine for modeling, say, engineered artefacts like
aeroplanes, but is problematic for science, where we rarely have
complete knowledge of the whole system.
Biological information systems and GO annotations in particular have
to be based on the open world assumption. The OWA does not require
you to go around making negative statements about things you do not know.
http://en.wikipedia.org/wiki/Closed_world_assumption
> For some protein annotated as protein kinase, it is NOT a
> glycohydrolase unless someone publishes (and a curator curates) an
> experiment indicating that it is, in fact, both.
Either it is or isn't a glycohydrolase. Protein kinases generally
don't care what humans publish about them.
It may be valid to say that we cannot assume it is a glycohydrolase
until we have evidence for this - but this is not the same as saying
it is not a glycohydrolase. The statements are different from both
common-sense and logical points of view.
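
To make the distinction concrete, here is a minimal sketch of the two
assumptions applied to a lookup. Everything in it is hypothetical: the
toy store, the product name "kinaseX", and the function names are
illustrations only, not part of any GO tool or file format.

```python
from enum import Enum


class Status(Enum):
    ANNOTATED = "annotated"
    NOT = "explicitly NOT"
    UNKNOWN = "unknown"


# Hypothetical toy store: (gene_product, term) -> True if the annotation is a NOT.
ANNOTATIONS = {
    ("kinaseX", "protein kinase activity"): False,
    ("kinaseX", "nucleus"): True,  # an explicit, curated NOT
}


def open_world(product, term):
    """Open world: absence of an annotation means 'we do not know'."""
    if (product, term) not in ANNOTATIONS:
        return Status.UNKNOWN
    return Status.NOT if ANNOTATIONS[(product, term)] else Status.ANNOTATED


def closed_world(product, term):
    """Closed world: absence of an annotation is collapsed into NOT."""
    status = open_world(product, term)
    return Status.NOT if status is Status.UNKNOWN else status


print(open_world("kinaseX", "glycohydrolase activity"))
print(closed_world("kinaseX", "glycohydrolase activity"))
```

Under the open world reading the glycohydrolase question stays
unanswered; only the closed world reading turns missing data into a
negative statement.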
> Couple this with the fact that many (I would venture to say MOST)
> negative results do not get published, you have a very small number
> of "valid, useful, explicit" NOT associations (such as conflicting
> experiments) relative to positive associations.
The lack of reporting of negative results is a big problem.
Hopefully not forever:
http://www.jnrbm.com/
I think GO should encourage reporting of negative results
Anyway, I don't think it has to be an entirely negative result to
yield a NOT annotation.
Taking an article title from pubmed at random: "Lysine-{alpha}-
Ketoglutarate Reductase and Saccharopine Dehydrogenase Are Located
Only in the Mitochondrial Matrix in Rat Liver"
I haven't read it let alone curated it, but there is an implicit NOT
there (we don't currently capture these kinds of NOTs but I see no
reason why we shouldn't)
> What this means, to me, is that IN THE AGGREGATE NOT annotations
> are not very useful to the community at large. The probability of
> someone misinterpreting a given NOT experiment is vastly greater
> than someone finding it useful.
How did you arrive at this conclusion? I mean it may be true, but we
don't have any evidence one way or another.
(I'm assuming we are talking about the presentation to a user through
a web page here)
> As an imperfect analogy - if you have a genetic disorder that
> occurs 1 in 1,000,000 persons, and a test that gives a false
> positive result 5% of the time, is the test useful? (ANSWER: it's
> not, because the ratio of false positives to true positives is
> roughly 47,000:1.)
>
> Lest you think those numbers are not in the correct ball park,
> there are 3634 qualified associations in the January gofull. There
> were ~730000 non-IEA associations and over 19 million associations
> including IEAs.
As you say, the analogy is flawed.
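
For anyone who wants to check the arithmetic in the quoted analogy, a
rough sketch follows. It assumes the test never misses a true case
(sensitivity of 1.0), which the quoted text does not state; the function
name and that parameter are mine. The association counts are the ones
quoted above.

```python
def false_to_true_positive_ratio(prevalence, false_positive_rate, sensitivity=1.0):
    """Expected false positives per true positive in a screened population."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return false_positives / true_positives


# 1 in 1,000,000 prevalence, 5% false positive rate, assumed perfect sensitivity.
ratio = false_to_true_positive_ratio(1e-6, 0.05)
print(f"false:true positives is about {ratio:,.0f}:1")

# Share of qualified associations among the quoted non-IEA associations.
qualified_fraction = 3634 / 730000
print(f"qualified / non-IEA is about {qualified_fraction:.2%}")
```

Under that assumption the ratio comes out near 50,000:1, the same
ballpark as the quoted figure, and qualified associations are roughly
half a percent of the non-IEA set.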
> This is why I think all "qualified" associations should be in
> separate files, and never shown by default on interfaces.
I don't have massive objections to your conclusions - even though I
disagree with the arguments above. I think there is always an
argument for presenting information in a way that minimises confusion.
I'm not sure how the not-showing-not-by-default would work. In a
previous email you said:
> Then we can choose a display option best suited to the "advanced
> user" (which is pretty much GOC people only), and not worry about
> fooling or confusing a newbie
I am uncomfortable with this suggestion -- I think there should be a
single unified consistent view of the data. Having divergent views
can lead to even more confusion. Two users should be able to share a
URL and see the same content. By all means give them control over
their color schemes - but I don't like the parental control /
filtering idea.
Even filtering information from the default view seems kind of icky.
It's like having a special newbie edition of a journal with the hard-
to-understand bits of evidence removed. [ok, I'm allowed flawed
analogies too..].
I think you underestimate the users. I think most people can grasp
the idea of negative evidence, provided it is presented clearly. And
just because people may get a bit confused when presented with
something new (actually, here it's not even anything new), it doesn't
mean they will always be confused. 10 years ago, before the GO,
99.99% of biologists had no clue / did not care what an ontology
was. That figure is now perhaps 99% [*]. That's massive progress.
But even so, an astonishing number of people don't get the basics.
Many databases and tools that allow querying by GO don't even use the
DAG. Is that an argument for dumbing down and annotating to a flat
list? After all, these databases are delivering misleading results,
with possibly worse consequences than your misinterpretation-of-NOT
example.
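
As an illustration of what "not using the DAG" costs, here is a small
sketch comparing a flat query with one that follows is_a links upward.
The graph is a heavily simplified toy (real GO has more intermediate
terms and other relationship types), and the gene names and function
names are hypothetical.

```python
# Toy DAG: term -> direct parents (simplified, not the real GO structure).
PARENTS = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}

# Hypothetical direct annotations: gene -> terms it is annotated to.
GENE_TERMS = {
    "geneA": {"protein kinase activity"},
    "geneB": {"kinase activity"},
}


def ancestors(term):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def flat_query(term):
    """Genes annotated directly to `term`, ignoring the graph."""
    return sorted(g for g, terms in GENE_TERMS.items() if term in terms)


def dag_query(term):
    """Genes annotated to `term` or to any of its descendants."""
    return sorted(
        g for g, terms in GENE_TERMS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


print(flat_query("kinase activity"))
print(dag_query("kinase activity"))
```

The flat query silently drops geneA even though a protein kinase is a
kinase, which is the kind of misleading result meant above.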
GO has been an active force in educating users - there's no reason
why after 10 years we should become reactive and shy away from doing
something because it has the potential to confuse a few newbies. I am
optimistic about educating people.
One thing we could do to educate people is to stop calling NOT a
qualifier. If I say that intelligent design does NOT hold, I am not
qualifying my support for the theory. The definition of qualified is:
limited or modified in some way <properly qualified conclusions --
W.J.Reilly> <the author's outlook ... is one of qualified optimism
-- P.B.Sears>; specifically : modified by the attachment of
conditions <a qualified acceptance of a bill of exchange>
We can't call it a qualifier and then act surprised when people
ignore it; after all, all statements are implicitly qualified in one
way or another.
On the other hand, I'm totally fine with having separate files, *so
long as* any filtering of content is clearly signaled in the
filename / url.
Files are different since they are read by dumb programs (written by
sometimes sloppy programmers) rather than intelligent humans.
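
A minimal sketch of what such a split could look like, assuming
tab-delimited gene association lines where the fourth column holds the
qualifier (possibly several values joined by "|") and lines starting
with "!" are comments. The file-naming scheme and function names are my
own invention, and the sample rows are truncated toy data, not real
annotations.

```python
def split_by_not(gaf_lines):
    """Separate association lines into (positive, negated) lists.

    Comment lines (starting with '!') and blank lines are dropped.
    """
    positive, negated = [], []
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        qualifiers = cols[3].split("|") if len(cols) > 3 else []
        (negated if "NOT" in qualifiers else positive).append(line)
    return positive, negated


def output_names(stem):
    """Hypothetical naming scheme that states the filtering in the filename."""
    return {
        "positive": f"{stem}.no_NOT.gaf",
        "negated": f"{stem}.NOT_only.gaf",
    }


# Truncated toy rows: db, id, symbol, qualifier, GO id.
sample = [
    "!gaf-version: 1.0\n",
    "DB\tid1\tgeneA\t\tGO:0000001\n",
    "DB\tid2\tgeneB\tNOT\tGO:0000002\n",
    "DB\tid3\tgeneC\tNOT|contributes_to\tGO:0000003\n",
]
positive, negated = split_by_not(sample)
print(len(positive), len(negated), output_names("gene_association.demo"))
```

The point is only that a program reading either output can tell from
the name alone which lines were left out.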
I'm not sure how strongly I'll feel about this after the weekend. I
think I just have a purist's dislike of the idea of dumbing down
content, in general. (Having said that, I will probably start
watching the new series of LOST).
Cheers
Chris
[*] utterly fabricated numbers. No slight intended - I imagine
biologists may be better informed than computer scientists here
> Ben
>
>> Pankaj Jaiswal wrote:
>>>
>>>
>>> Judith Blake wrote:
>>>> hummm
>>>> I think the case that Harold was saying, and that we currently
>>>> have in other annotations here at MGI, is that we have
>>>>
>>>> x A
>>>> x NOT A
>>>>
>>>> both lines of evidence exist at this point.
>>>>
>>>> In some cases, different experiments give different results
>>>>
>>>> In the case that Harold discussed, the issue really was that we
>>>> don't properly distinguish isoforms, so from the gene level, the
>>>> two are combined whereas if you could represent each isoform,
>>>> the one would be x-1 A and one would be x-2 NOT A.
>>>>
>>>> judy
>>>>
>>>
>>> I think unless the correct object_type is defined in the
>>> annotations, it may not be very obvious. Ideally annotations
>>> should not be made to the gene, but to the transcripts/proteins
>>> or their isoforms (meaning all coming from the same gene/locus on
>>> the genome).
>>>
>>> So it can be
>>>
>>> x.1 (object_type: protein isoform) and x.2 (object_type: protein
>>> isoform) map to x (object_type: gene)
>>>
>>> Annotations
>>>
>>> x.1 A
>>> x.2 NOT A
>>>
>>> Or
>>>
>>> x A With x.1
>>> x NOT A With x.2
>>>
>>
>
> --
> Ben Hitz
> Senior Scientific Programmer ** Saccharomyces Genome Database ** GO
> Consortium
> Stanford University ** hitz at genome.stanford.edu
>
>
>
>