[go] Putting method/program names into the with field for ISS

Karen Christie kchris at genome.Stanford.EDU
Thu Sep 20 09:16:39 PDT 2007


comments inserted below

On Thu, 20 Sep 2007, Valerie Wood wrote:

>
> The "with" field should be mandatory for all sequence similarity inferences. 
> However, all the methods discussed where a sequence cannot be identified
> to go in the with column are algorithms which use other things in addition 
> to sequence similarity (although these may be sequence based, they are not 
> similarity).

I'm not sure I'd agree with you on the snoRNA or tRNA ones. They seem like 
they are based on similarity to the consensus for the appropriate type of 
RNA gene, there's just no ID for these.

> Would we change the evidence code from ISS to ISB or something similar to 
> indicate this?
> i.e. inferred from sequence *based* methods?
>
> I said previously that the identification of a "with" sequence could be used 
> as a distinction, I did not mean this should be the criterion, but it is an 
> indication to a curator that 'something else' is used by the algorithm, in 
> addition to sequence similarity, to produce the annotation.

I'm not really following what you mean in the above comment.

> It seems that RCA has not been considered, because most of the function 
> predictions using RCA so far have some experimental component (In fact the 
> RCA code says non-sequence-based computational method).

I think we did consider RCA for tRNAscan and snoRNAs (at the 2006 Annot 
Camp), but then rejected it at the Jan 2007 GO meeting in response to 
Michelle Gwinn's argument that everything based purely on analysis of the 
sequence of the gene product should be ISS, even if multiple types of 
sequence analysis were combined.

While it is true that the original RCA documentation did say non-sequence 
based method, at the St. Croix meeting, Sue Rhee brought up the point 
about analyses that combined non-sequence and sequence based data and we 
agreed that these could be RCA. Thus in the draft document, I've made a 
new section called ICA with proposed guidelines that I think are more in 
line with the idea that these analyses involve multiple data sets or even 
multiple kinds of data sets.

http://www-dev.yeastgenome.org/draftGO/go/www/GO.evidence.new.shtml

> Looking ahead, what happens when we get predictions which combine (for 
> example) sequence similarity with phylogenetic distribution, copy number, 
> and other statistically analysed but non experimental data? Can we use 
> RCA for this? (it would not be ISS) or will the evidence codes need 
> revisiting?

> I guess my question is, what excludes tRNA scan, and TMM predictors from 
> being RCA? and, if it is because they are sequence-based, will we need 
> to revisit this all over again when we have other predictions which 
> don't fit RCA or ISS in future?

From your brief descriptions, the other types of analyses do seem like RCA 
to me since they incorporate more than just the sequence of the gene 
product. The distinction I relied on in my proposal for RCA/ICA, that is, 
the distinction between RCA/ICA and ISS, was that if the analysis was based 
purely on the sequence of the gene product, it should be ISS. The 
RCA/ICA analyses were varied, but included:

   - multiple sets of experimental data

   - expression data with promoter sequence data

   - experimental data with sequence-based structural predictions

   - experimental data with a mathematical model
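
As a rough illustration of the distinction above, here is a minimal sketch in Python. It is not part of any GO tool or of the draft documentation; the names DataKind, SEQUENCE_ONLY and suggest_evidence_code are hypothetical, and treating a structure prediction derived from the gene product's own sequence as "purely sequence" is my reading of the tRNAscan and transmembrane-predictor cases discussed in this thread.

```python
from enum import Enum, auto


class DataKind(Enum):
    """Kinds of input an analysis might use (hypothetical labels)."""
    GENE_PRODUCT_SEQUENCE = auto()        # primary sequence of the gene product
    SEQUENCE_DERIVED_STRUCTURE = auto()   # e.g. predicted secondary structure
    EXPERIMENTAL = auto()                 # any experimental data set
    EXPRESSION = auto()                   # expression data
    PROMOTER_SEQUENCE = auto()            # sequence that is not the gene product
    MATHEMATICAL_MODEL = auto()           # a model combined with other data


# Assumption: inputs computed solely from the gene product's own sequence
# still count as "based purely on the sequence of the gene product".
SEQUENCE_ONLY = {
    DataKind.GENE_PRODUCT_SEQUENCE,
    DataKind.SEQUENCE_DERIVED_STRUCTURE,
}


def suggest_evidence_code(kinds):
    """Suggest ISS for purely sequence-based analyses, else RCA/ICA."""
    kinds = set(kinds)
    if not kinds:
        raise ValueError("an analysis must use at least one kind of data")
    if kinds <= SEQUENCE_ONLY:
        return "ISS"
    return "RCA/ICA"
```

Under these assumptions, a tRNAscan-style analysis (sequence plus predicted secondary structure) comes out as ISS, while expression data combined with promoter sequence data comes out as RCA/ICA. A curator's judgement would still be needed for anything the labels above do not cover.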

Going forwards, I think it might be a good idea for the group, or perhaps 
even better, another Evidence Code Committee, to carefully review new 
types of analyses before we add them to our documentation and examples. 
Once we agree that something is appropriate to make annotations from and 
belongs in RCA/ICA, then it will be helpful to all if we keep our 
documentation up to date and add the new examples to docs for the code.

I think the RCA code has suffered from original documentation that was 
unclear because it tried to write something very broad without a clear 
idea of what types of analyses might occur, and then a series of hasty 
decisions that has resulted in some flipflopping of the boundary between 
RCA/ICA and ISS that has been a bit confusing.

-Karen


>
>
> Val
>
>
>
>
>
> Suzanna Lewis wrote:
>
>> 
>> On Sep 19, 2007, at 10:08 PM, Karen Christie wrote:
>> 
>>> I don't think anyone is suggesting that such identifiers, including domain 
>>> and HMM identifiers as well as individual sequence identifiers, shouldn't 
>>> be put into the 'with/from' column when available.
>>> 
>>> However, there are cases when there just isn't anything of that sort to 
>>> put in this column. Both snoRNAs and tRNAs are a good example. Both of 
>>> these types of RNAs are generally predicted by methods that analyze both 
>>> the primary sequence and the predicted nucleic acid secondary structures 
>>> of the gene product, not by orthology methods. The two protein examples 
>>> were both based on algorithms that analyze sequence to determine 
>>> hydrophobicity and predict transmembrane domains. In all of these 
>>> examples, the method is clearly based purely upon the sequence of the gene 
>>> product. Thus these all fit into ISS, but there is no identifier for a 
>>> sequence, domain, or HMM that can be put into the with column.
>>> 
>>> I really think that the evidence code should be based on the method used, 
>>> not on how the 'with/from' column can be filled; this is supporting 
>>> evidence after all. In the interest of having a logical system that makes 
>>> sense, especially when teaching it to new people, I think it is important 
>>> that we don't implement arcane rules where the type of supporting evidence 
>>> takes precedence over the method used.
>>> 
>>> So, regardless of what we decide about filling the with column for these 
>>> types of situations, I think that these situations should stay in ISS 
>>> because they are clearly all methods based purely on the sequence of the 
>>> gene product. Personally, I can live with any of three options that have 
>>> come up in this thread:
>>> 
>>> 1. the system I proposed where we start maintaining a new file to track 
>>> methods, not necessarily elegant and even the 10 or so examples I used 
>>> highlight the difficulties in tracking down references for some methods, 
>>> but meets our other requirements that things have both a namespace and an 
>>> ID.
>> 
>> 
>> This is the way to go IMO
>> 
>>> 2. Allow the with column to be filled with 'not applicable', or some other 
>>> descriptive phrase, for cases when there is no ID for a sequence, domain, 
>>> HMM, etc, but just a method or sequence consensus without an ID
>> 
>> 
>> Nope
>> 
>>> 
>>> 3. Relax the rule that the with column is mandatory for ISS
>> 
>> 
>> Nope
>> 
>>> 
>>> -Karen
>>> 
>>> P.S. Could we start calling this column the 'supporting evidence' column 
>>> or something else descriptive? Right now, its full name is 'with/from', 
>>> but we've also allowed the column to be filled for IMP where neither of 
>>> those prepositions is really appropriate.
>>> 
>>> 
>>> 
>>> 
>>> On Tue, 18 Sep 2007, Suzanna Lewis wrote:
>>> 
>>>> Actually there are (hoped for) operational reasons for requiring a 
>>>> sequence accession in the 'with' column (and if there is >1 then a 
>>>> representative one is just fine, because from there we could get to the 
>>>> other orthologs).
>>>> 
>>>> The hope is that doing this should, in theory, make it possible to build 
>>>> in triggers such that if the annotation of the sequence in the 'with' 
>>>> column changes, then this could ripple back to all the annotations that 
>>>> were dependent/derived from this original.
>>>> 
>>>> I would very much hate to see us give up on this. The GO is one of the 
>>>> few groups that is even trying to indicate provenance and traceability. It 
>>>> is difficult, but very important.
>>>> 
>>>> -S
>>>> 
>>>> On Sep 12, 2007, at 8:44 AM, Susan Tweedie wrote:
>>>> 
>>>>> At the risk of returning us to square one on this... I'd like to take a
>>>>> step back and revisit why we decided it was vital to have something in
>>>>> the with column for ISS. I thought this stemmed from an attempt at
>>>>> enforcing quality annotations - we wanted to identify the similar
>>>>> 'thing' for which there is experimental evidence and to use ISS only
>>>>> where this was available. We then shifted ground a bit to acknowledge
>>>>> that there are cases where there is a strong case for ISS annotation but
>>>>> no single sequence can be identified for this column. So what do we
>>>>> actually achieve by filling-in the slot for these cases? It seems to me
>>>>> this is more to do with us saying 'yup I'm being stringent about my use
>>>>> of ISS so I've stuck something in this column to prove it' than actually
>>>>> helping users. The 'how they did it' is in the paper, just like it is
>>>>> for other evidence codes. I'm not sure we 'gain' enough here to justify
>>>>> mixing methods and objects in the 'with' column and I am struggling to
>>>>> see the justification for making ISS a special case in this respect. If
>>>>> we show a method for ISS, do we set a precedent and run the risk of
>>>>> users wanting to know whether it was RNAi or knock-out for IMP etc?
>>>>> I guess I'd just like to know we haven't just made this column mandatory
>>>>> as a means of policing curators. I strongly agree that we should fill in
>>>>> a sequence where possible and do our best (within reason) to be sure
>>>>> there is an experiment there somewhere but, if we are going to accept
>>>>> that there are cases where we can't identify a suitable sequence, can't
>>>>> we just trust curator judgement i.e. leave the column blank and let
>>>>> people read the paper to see details of how it was done?
>>>>> If we stick with the plan to keep 'with' mandatory for ISS then Karen's
>>>>> system is very nice. But what do we do for cases like Michelle's example
>>>>> where a whole variety of similarity based methods are used? I find this
>>>>> crops up time after time and I wouldn't want to have to list all methods
>>>>> in this column and it doesn't seem very satisfactory to pick
>>>>> representative examples?
>>>>> Susan
>>>>> On Tue, 2007-09-11 at 19:03 +0000, Valerie Wood wrote:
>>>>> 
>>>>>> That's OK,
>>>>>> I just think it's rather a trawl to have to create something to go in 
>>>>>> the 'with' field when the PMID of the published algorithm is 
>>>>>> sufficient.
>>>>>> My other reasoning was that these aren't purely based on 'sequence 
>>>>>> similarity', they always include some 'other additional step' (although 
>>>>>> I agree they are 'sequence based')
>>>>>> and thirdly, this could become hazy if we got functional prediction 
>>>>>> methods which combined sequence data with some experimental data (like 
>>>>>> cellular localization), which, for example, would be RCA (I presume). It 
>>>>>> therefore seemed that if the distinction was that ISS needed to have 
>>>>>> some 'object' which represented a sequence in the 'with' column (rather 
>>>>>> than allowing the with column to contain other types of things, 
>>>>>> referring to algorithms), it would be quite a nice distinction. If you 
>>>>>> can't locate this object then the method probably includes something 
>>>>>> else in addition to 'sequence similarity'.
>>>>>> However, these were just for consideration, I really have no strong 
>>>>>> preference either way..... although I prefer easy :)
>>>>>> Val
>>>>>> "Gwinn-Giglio, Michelle" <MLGwinn at jcvi.org> wrote:
>>>>>> 
>>>>>>> Ben,
>>>>>>> Yes, sorry to not be clear - I was disagreeing with Val's suggestion 
>>>>>>> to use RCA for things like TMHMM and tRNAscan. At least I think that 
>>>>>>> was Val's suggestion and that is what I disagree with.
>>>>>>> Sorry to disagree with you Val. :)
>>>>>>> Michelle
>>>>>>> -----Original Message-----
>>>>>>> From: Benjamin Hitz [mailto:hitz at genome.stanford.edu]
>>>>>>> Sent: Tue 9/11/2007 1:05 PM
>>>>>>> To: Gwinn-Giglio, Michelle
>>>>>>> Cc: GO mailing list
>>>>>>> Subject: Re: [go] Putting method/program names into the with field for 
>>>>>>> ISS
>>>>>>> On Sep 11, 2007, at 7:55 AM, Gwinn-Giglio, Michelle wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> I disagree. I think taking this approach would significantly muddy
>>>>>>>> the waters in terms of distinguishing between ISS and RCA.
>>>>>>>> Anything that is based only on sequence analysis, be it simple
>>>>>>>> Blast or vastly more complicated modeling methods, should be ISS
>>>>>>>> because at their heart they are all comparing sequences of known
>>>>>>>> function to ones with unknown function. Whether they do simple
>>>>>>>> alignments to make that comparison or more complicated models, it
>>>>>>>> is still a sequence based analysis.
>>>>>>> 
>>>>>>> I did not suggest otherwise.
>>>>>>> Ben
>>>>>>> -- 
>>>>>>> Ben Hitz
>>>>>>> Senior Scientific Programmer ** Saccharomyces Genome Database ** GO
>>>>>>> Consortium
>>>>>>> Stanford University ** hitz at genome.stanford.edu
>>>>>> 
>>>> 
>>> 
>> 
>> 
>
>
>


