CC to MF links (was Re: [go] contributes_to question)

Chris Mungall cjm at fruitfly.org
Mon Aug 20 11:51:19 PDT 2007


On Aug 20, 2007, at 9:51 AM, Ben Hitz wrote:

> Chris -
>
> I agree this is how it should work.  But there are some "gotchas"  
> from the software/database side that need to be addressed (not  
> necessarily at this instant).
>
> Say I want a list of all genes "directly involved" in the histone  
> deacetylase activity.    Now, whether or not this should include  
> SIR2 might be a matter of debate - but let's say that I at least  
> want all members of the complex.  The _software_ has to infer  
> backwards that when I say "show me these genes" I also want an  
> exhaustive list of members of the complex.

Ben, did you consciously shift the example from SIF2 to SIR2?

Just to recap:

SIR2 is annotated as having HD activity, and is localised to the RENT  
complex
	(amongst other things)
SIF2 is annotated as contributing to HD activity, but should not be,  
according to Val

Let's assume the latter is rectified and SIF2 is annotated to HD  
complex but not HD activity (neither contributes_to nor direct).

If you want to know the known members of a specific complex in a  
particular species, this is just a standard, par-for-the-course GO query.

If you want to know the genes involved in HD activity you do a normal  
GO DAG query, but do not traverse any CC-MF links. If you  
specifically want "directly involved in HD activity", it is the same  
query but omitting any annotations with the contributes_to qualifier.
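
As a rough illustration of those two queries (a toy sketch only, not how
AmiGO or the GO database implements them; the edge list, the annotation
tuples and the function names are all made up for the example, and the
annotations reflect the current, uncorrected state described in the recap):

```python
# Toy data: structures and names are illustrative assumptions.
# Edges are (child, relation, parent).
EDGES = [
    ("GO:0004407", "is_a", "GO:0033558"),          # histone deacetylase activity
    ("GO:0033558", "is_a", "GO:0019213"),          # protein deacetylase activity
    ("GO:0000118", "has_function", "GO:0004407"),  # hypothetical CC-to-MF link
]

# Annotations are (gene, term, qualifier); qualifier is None when unqualified.
ANNOTATIONS = [
    ("RPD3", "GO:0004407", None),
    ("SIR2", "GO:0004407", None),
    ("SIF2", "GO:0004407", "contributes_to"),  # the disputed annotation
    ("SIF2", "GO:0000118", None),
]


def descendants(term, relations=("is_a", "part_of")):
    """Return term plus everything below it, following only the given relations."""
    found = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for child, rel, parent in EDGES:
            if parent == current and rel in relations and child not in found:
                found.add(child)
                stack.append(child)
    return found


def genes_for(term, include_contributes_to=True):
    """Normal DAG query; the CC-to-MF link is never traversed."""
    terms = descendants(term)
    return sorted({
        gene
        for gene, annotated_term, qualifier in ANNOTATIONS
        if annotated_term in terms
        and (include_contributes_to or qualifier != "contributes_to")
    })


involved = genes_for("GO:0004407")                               # "involved in"
direct = genes_for("GO:0004407", include_contributes_to=False)   # "directly involved in"
```

With this toy data, SIF2 shows up in the first result only because of its
contributes_to annotation, and never because of its complex annotation.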

I think the tricky question is whether it is a good idea to allow  
queries of the form "show me genes involved in X activity or  
localised to complexes that have X activity", and if so how these  
queries should be presented to a user in a non-confusing way.
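
One possible shape for such a query, purely as a sketch: keep the two kinds
of hit in separate groups so that the result never reads as "these genes
have X activity". The has_function mapping, the annotation list (which
assumes SIF2 has already been corrected to complex-only) and the function
name are hypothetical, and DAG traversal is left out for brevity:

```python
# Hypothetical CC-to-MF links: complex term -> activity term.
HAS_FUNCTION = {"GO:0000118": "GO:0004407"}

# (gene, term) annotations, assuming SIF2 is annotated to the complex only.
ANNOTATIONS = [
    ("RPD3", "GO:0004407"),
    ("SIF2", "GO:0000118"),
    ("SPCC1235.09", "GO:0000118"),
]


def query_activity_or_complex(activity):
    """Group genes by why they matched; the two groups are never merged."""
    result = {"has_activity": set(), "in_complex_with_activity": {}}
    complexes = {cc for cc, mf in HAS_FUNCTION.items() if mf == activity}
    for gene, term in ANNOTATIONS:
        if term == activity:
            result["has_activity"].add(gene)
        elif term in complexes:
            # Keyed by complex so the route to the gene stays visible.
            result["in_complex_with_activity"].setdefault(term, set()).add(gene)
    return result


hits = query_activity_or_complex("GO:0004407")
```

An interface built on something like this could list the second group under
the complex it came from, rather than under the activity term.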

I don't think there should be any debate involved on a case-by-case  
basis - we should have rules about how information is propagated. I'm  
not quite following your example about inferring backwards.

> Maybe this is obvious, but I think much software exists which  
> doesn't make any inferences.

I place the external software that allows GO queries or GO based  
analyses into 3 categories:

[1] makes no inferences - i.e. no DAG traversal whatsoever
[2] uses the DAG, but ignores the relation, and assumes information  
can be propagated up the DAG regardless
[3] uses the DAG and the relations in the DAG

There is a scarily high number of tools and interfaces in [1], which  
is something we have to work on as part of our outreach, but can be  
considered separately from the CC to MF links.

The majority falls into [2], which means the CC-to-MF links should be  
an optional extension to the main GO files. This will ensure [2] will  
continue to work correctly without erroneous inferences, and the more  
advanced providers can consciously use the additional links to  
provide more advanced capabilities.
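
To make the risk concrete, here is a toy sketch of the three categories
run over a graph in which the CC-to-MF link has been merged into the main
file. Everything in it is an assumption for illustration: the edge list,
the has_function label, the choice of is_a and part_of as the propagating
relations, and the function names.

```python
# Edges are (child, relation, parent).
EDGES = [
    ("GO:0004407", "is_a", "GO:0033558"),
    ("GO:0033558", "is_a", "GO:0019213"),
    ("GO:0000118", "has_function", "GO:0004407"),  # the optional CC-to-MF link
]
ANNOTATIONS = [("RPD3", "GO:0004407"), ("SIF2", "GO:0000118")]
PROPAGATING = {"is_a", "part_of"}  # assumed safe for propagating gene products


def ancestors(term, follow):
    """Terms reachable upwards from term over edges accepted by follow(rel)."""
    seen, stack = {term}, [term]
    while stack:
        current = stack.pop()
        for child, rel, parent in EDGES:
            if child == current and follow(rel) and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_at(term, follow):
    """Genes whose annotated term is, or propagates up to, the given term."""
    return sorted(g for g, t in ANNOTATIONS if term in ancestors(t, follow))


target = "GO:0019213"  # deacetylase activity
category_1 = genes_at(target, lambda rel: False)               # no traversal
category_2 = genes_at(target, lambda rel: True)                # ignores the relation
category_3 = genes_at(target, lambda rel: rel in PROPAGATING)  # relation-aware
```

Here category [2] wrongly reports SIF2 under deacetylase activity, while
[3] reports only RPD3 and [1] reports nothing at all. Keeping the link out
of the main file removes the offending edge for [2] without affecting [3].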

>
> Ben
>
> On Aug 17, 2007, at 5:01 PM, Chris Mungall wrote:
>
>> Related to the contributes_to question and the relations between  
>> proteins, protein complexes and molecular functions:
>>
>> Currently in GO there is no explicitly asserted link between:
>>
>> CC - GO:0000118 histone deacetylase complex
>> MF - GO:0004407 histone deacetylase activity
>> BP - GO:0016575 histone deacetylation
>>
>> Clearly the function, process and components denoted by these terms
>> are inter-related: the CC executes the MF, the MF catalyses the BP
>>
>> The parts of a whole do not necessarily inherit the function of the
>> whole; the whole does not inherit the function of the parts; and the
>> sibling parts of a whole do not necessarily share the same
>> function. These kinds of rules can be stated formally so that there is
>> less room for confusion (just like the true path rule).
>>
>> I suspect that one reason annotators may be tempted to make the
>> erroneous transitive inference and transfer the function of the whole
>> (complex) to the part (gene product) is because there is a perceived
>> loss of information in *not* doing so.
>>
>> For example, if correct curation protocol is followed, then SIF2
>> should not be annotated to HD Activity (MF), only to HD complex
>> (CC). Searches for the MF "HD Activity" will exclude SIF2. This is
>> correct behavior. However, it may be useful to have some intuitive
>> way of navigating from a search on "HD activity" to SIF2, by means of
>> the complex, so long as it is obvious that SIF2 does not inherit the
>> function of the complex.
>>
>> Using the latest results from Obol, we can now link terms across GO
>> ontologies. For links between CC and MF, the relation would be labeled
>> something like 'executes' or simply 'has function'. In a tree-type
>> display we might show:
>>
>>    [i] GO:0019213 deacetylase activity
>>      [i] GO:0033558 protein deacetylase activity
>>       [i] GO:0004407 histone deacetylase activity    [RPD3]
>>        [X] GO:0000118 histone deacetylase complex    [SIF2, SPCC1235.09]
>>         [i] GO:0000508 Rpd3L complex                 [RPD3]
>>         [i] GO:0000509 Rpd3S complex
>>         [i] GO:0032221 Clr6 histone deacetylase complex
>>         ...
>>
>> This display correctly represents the biology, but the danger here is
>> that over the years we have built up an expectation in our users that
>> the relation label can be ignored and gene products can be propagated
>> up the DAG, willy-nilly. The correct way to read the DAG above is:
>>
>>   SIF2 is localized_to HD complex,
>>   HD complex has_function HD activity
>>
>> And we can infer
>>
>>   SIF2 is localized_to some complex that has_function deacetylase
>>   activity
>>
>> But we *cannot* infer anything about the activity of SIF2 without
>> further evidence. We would not propagate SIF2 up in slimmers, term
>> enrichment, gene product count summaries or any other graph based
>> operation (a curator *may* apply their expertise and decide to make
>> contributes_to annotations based on these CC to MF links, but this
>> would not be automatic).
>>
>> This means we have to be careful about how we release these (valuable)
>> cross-ontology links to the public, and ensure they are not abused.
>> From a software perspective we are almost ready to load these kinds of
>> links and start showing them in AmiGO, but we should proceed carefully
>> to make sure these kinds of relations are better understood both
>> within GO and outside.
>>
>> This seems to be related to the contributes_to issue. Is this worth
>> discussing in the same slot at the GO meeting?
>>
>> The (unvetted) CC to MF links are in cvs:
>>
>>   go/scratch/obol_results/cellular_component_links_to_molecular_function.obo
>>
>> Cheers
>> Chris
>>
>> On Aug 16, 2007, at 5:19 AM, Valerie Wood wrote:
>>
>>> It seems we have all used it slightly differently anyway.
>>>
>>> But here are two examples of why it is bad.
>>>
>>> 1.
>>> I had annotated the ortholog of S. cerevisiae SIF2 (histone  
>>> deacetylase complex subunit) to
>>> histone deacetylase activity, contributes_to ISS.
>>> It is a WD repeat protein (which doesn't have HD activity, so it  
>>> seems odd to attribute this function); the original SGD annotation  
>>> is IPI.
>>> I am now removing the pombe annotation.
>>>
>>> 2.
>>> FET3/YMR058W
>>> is a copper oxidase involved in iron assimilation by reduction  
>>> and transport. It isn't a transporter but it is part of the  
>>> transporter complex.
>>> This has an iron transporter activity (with contributes_to) in  
>>> SGD, and has been ISS'd to this activity (without contributes_to)  
>>> by two drosophila genes (FBgn0032116 and FBgn0039387)
>>>
>>> I see many, many examples of this (too many to give feedback on)
>>>
>>> Can this go on the agenda for September meeting?
>>>
>>> Val
>>>
>>>
>>>
>>>
>>>
>>> Pascale Gaudet wrote:
>>>
>>>> I did mean unessential role; ie, the complex might have the  
>>>> activity without the protein you're annotating, but adding it  
>>>> enhances the activity (but not a regulator-- that would be  
>>>> 'positive regulation of...'). But if adding it does nothing, I  
>>>> would annotate to unknown.
>>>>
>>>> Pascale
>>>>
>>>> Valerie Wood wrote:
>>>>
>>>>> It seems so to me too, these are equivalent to process annotations
>>>>> But did you mean essential role in the activity ? This is how I  
>>>>> would use it.
>>>>>
>>>>> Val
>>>>>
>>>>>
>>>>> Pascale Gaudet <pgaudet at northwestern.edu> wrote:
>>>>>> Val,
>>>>>>
>>>>>> My understanding was that the subunit had to have at least an
>>>>>> unessential role in the activity, although the documentation is
>>>>>> very ambiguous. But what you are describing is really capturing
>>>>>> component information with a function annotation. That seems wrong.
>>>>>>
>>>>>> Pascale
>>>>>>
>>>>>> Valerie Wood wrote:
>>>>>>
>>>>>>> I'm really asking the question why arbitrarily add these
>>>>>>> function annotations to the 'unknown' subunits of complexes in
>>>>>>> the first place, when they are clearly not the subunit that
>>>>>>> possesses the catalytic activity, or when they clearly have
>>>>>>> another activity.
>>>>>>>
>>>>>>> Some of these complexes have
>>>>>>> ATPase activity,
>>>>>>> ubiquitin ligase activity,
>>>>>>> acetyltransferase activity,
>>>>>>> etc.
>>>>>>>
>>>>>>> So if this type of annotation was valid (or useful) then we
>>>>>>> would (presumably) add all these annotations to all subunits
>>>>>>> for completion?
>>>>>>>
>>>>>>> Wouldn't users rather see which subunits had known function
>>>>>>> and which had 'unknown function'?
>>>>>>> It just seems that the qualifier is being used much more
>>>>>>> liberally than was originally intended (i.e. as a filler to
>>>>>>> avoid adding an 'unknown' annotation),
>>>>>>>
>>>>>>> and it skews functional predictions/genome comparisons.
>>>>>>>
>>>>>>> val
>>>>>>>
>>>>>>> Chris Mungall <cjm at fruitfly.org> wrote:
>>>>>>>
>>>>>>>> ..which like many such recommendations will be ignored by the
>>>>>>>> majority of implementations (in this case it is forgivable if
>>>>>>>> we issue the recommendation at this late stage..)
>>>>>>>>
>>>>>>>> Perhaps any association qualified in any way should be omitted
>>>>>>>> from the default annotations we provide. We would of course
>>>>>>>> also provide the full annotation set but it would be made
>>>>>>>> obvious that this 'advanced' set came with certain caveats
>>>>>>>>
>>>>>>>> On Aug 14, 2007, at 8:00 AM, Midori Harris wrote:
>>>>>>>>
>>>>>>>>> Whatever we decide, I would recommend that computational
>>>>>>>>> analyses omit 'contributes_to' annotations.
>>>>>>>>>
>>>>>>>>> m
>>>>>>>>>
>>>>>>>>> On Mon, 13 Aug 2007, Valerie Wood wrote:
>>>>>>>>>
>>>>>>>>>> Recently I'm wondering why we have 2 meanings for
>>>>>>>>>> contributes_to:
>>>>>>>>>>
>>>>>>>>>> When the qualifier was initially implemented, it was so
>>>>>>>>>> function terms could be added to complexes like DNA
>>>>>>>>>> polymerase and the F1 Fo ATPase where the function cannot be
>>>>>>>>>> attributed to a single subunit. This seems fine.
>>>>>>>>>>
>>>>>>>>>> Increasingly I see annotations to complexes which are
>>>>>>>>>> described as (for example) a histone acetyltransferase
>>>>>>>>>> complex, and all of the subunits are given histone
>>>>>>>>>> de/acetyltransferase or methyltransferase activity with
>>>>>>>>>> contributes_to, even though the other subunits clearly have
>>>>>>>>>> other functions (I see ATPases, ubiquitin ligases, actin-like
>>>>>>>>>> proteins etc., which are commonly associated with histone
>>>>>>>>>> acetyltransferases and methyltransferases).
>>>>>>>>>>
>>>>>>>>>> This seems odd, for a number of reasons.
>>>>>>>>>> Often these subunits are not required for the activity, but
>>>>>>>>>> their deletion (sometimes, but not always) affects the rate
>>>>>>>>>> of the activity.
>>>>>>>>>>
>>>>>>>>>> Primarily I don't understand what this type of
>>>>>>>>>> 'contributes_to' annotation provides to GO users above a
>>>>>>>>>> process annotation to the histone acetylation (if this has
>>>>>>>>>> been shown), a complex annotation, and a function term to
>>>>>>>>>> unknown/root node. Isn't it more useful to know that there
>>>>>>>>>> is some information about the process, but the molecular
>>>>>>>>>> function is not known?
>>>>>>>>>>
>>>>>>>>>> 1) Another problem is that these particular chromatin
>>>>>>>>>> associated complexes often have shared subunits so the
>>>>>>>>>> function annotations aren't so clear-cut (i.e. some of these
>>>>>>>>>> subunits may be members of other complexes which do not have
>>>>>>>>>> this activity).
>>>>>>>>>>
>>>>>>>>>> 2) Also, computational analyses using RCA infer these
>>>>>>>>>> 'functions' for similar proteins which, from their domain
>>>>>>>>>> composition, are unlikely to possess this activity.
>>>>>>>>>>
>>>>>>>>>> 3) It makes cross species comparisons difficult because you
>>>>>>>>>> get different numbers of functions to what you would expect
>>>>>>>>>> when comparing annotations between species. For example it is
>>>>>>>>>> known how many histone acetyltransferases/methyltransferases
>>>>>>>>>> etc. pombe has, compared to S. cerevisiae, but when I compare
>>>>>>>>>> the 2 the numbers are skewed.
>>>>>>>>>>
>>>>>>>>>> The documentation clearly allows this (although there is not
>>>>>>>>>> an example of this type of annotation in the documentation,
>>>>>>>>>> so I wonder if this is what we meant?):
>>>>>>>>>>
>>>>>>>>>> From the documentation:
>>>>>>>>>>
>>>>>>>>>> Annotating individual gene products according to attributes of
>>>>>>>>>> a complex is especially useful for molecular function
>>>>>>>>>> annotations in cases where a complex has an activity, but not
>>>>>>>>>> all of the individual subunits do. (For example, there may be
>>>>>>>>>> a known catalytic subunit and one or more additional
>>>>>>>>>> subunits, or the activity may only be present when the
>>>>>>>>>> complex is assembled.) Molecular function annotations of
>>>>>>>>>> complex subunits that are not known to possess the activity
>>>>>>>>>> of the complex must include the entry contributes_to in the
>>>>>>>>>> Qualifier column.
>>>>>>>>>>
>>>>>>>>>> Note that contributes_to is not needed to annotate a
>>>>>>>>>> catalytic subunit. Furthermore, contributes_to may be used
>>>>>>>>>> for any non-catalytic subunit, whether the subunit is
>>>>>>>>>> essential for the activity of the complex or not.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>>
>
> --
> Ben Hitz
> Senior Scientific Programmer ** Saccharomyces Genome Database ** GO  
> Consortium
> Stanford University ** hitz at genome.stanford.edu
>
>
>
>
