[go] partitioning gene association files

Doug Howe dhowe at cs.uoregon.edu
Tue Jan 29 10:55:30 PST 2008


I understand Mike's position and concern.  However, the converse of that 
position (if we go with Chris' option [2]) is that all other communities 
which may be used to having IEA annotations included in the one file, 
will now have to learn to fetch and cat the two new files to get what 
they used to.  Either way someone has to change/learn something.   
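
For what it's worth, the "fetch and cat" step under option [2] is small. Below is a minimal sketch in Python, not an agreed tool: the function name merge_ga_files is hypothetical, and it assumes both files are gzipped, tab-delimited GA files whose comment lines start with "!". Comment lines from the second file are dropped so the header is not repeated in the middle of the merged file.

```python
import gzip


def merge_ga_files(non_iea_path, iea_path, merged_path):
    """Concatenate two gzipped GA files into one gzipped file.

    Returns the number of annotation (non-comment) lines written.
    Assumption: lines starting with '!' are comments.
    """
    written = 0
    with gzip.open(merged_path, "wt", encoding="utf-8") as out:
        for index, path in enumerate((non_iea_path, iea_path)):
            with gzip.open(path, "rt", encoding="utf-8") as src:
                for line in src:
                    is_comment = line.startswith("!")
                    # Keep the comment header from the first file only.
                    if is_comment and index > 0:
                        continue
                    out.write(line)
                    if not is_comment:
                        written += 1
    return written


# Example call, using the file names proposed in this thread:
# merge_ga_files("gene_association.mgi.non-iea_annotations.gz",
#                "gene_association.mgi.iea_annotations.gz",
#                "gene_association.mgi.merged.gz")  # output name is made up
```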

I agree with Chris' contention that we should not overload file names or 
paths by changing the content of the files without changing the name. 
For that reason I prefer Chris' option [1] if we must introduce a split 
in the existing file.  It seems a bit silly however to store so much 
data redundantly...

If even advanced users who work with GA files can't get past the 
distinction between IEA and experimental codes, I have to wonder whether 
the codes are serving any purpose worth their hassle.  By splitting the file we 
are just shielding users from the complexity of the evidence codes and 
allowing them to continue to not understand them.
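
To illustrate how little is involved, here is a minimal sketch of the split itself. It is not Mike's script, which I have not seen. It assumes the evidence code sits in the seventh tab-delimited column of each annotation line and that comment lines start with "!"; the function names are hypothetical. Lines with fewer than seven columns are treated as non-IEA.

```python
import gzip

EVIDENCE_COLUMN = 6  # zero-based index of the evidence code (column 7)


def is_iea(line):
    """Return True if an annotation line carries the IEA evidence code."""
    fields = line.rstrip("\n").split("\t")
    return len(fields) > EVIDENCE_COLUMN and fields[EVIDENCE_COLUMN] == "IEA"


def partition_ga_file(src_path, iea_path, non_iea_path):
    """Split one gzipped GA file into IEA and non-IEA gzipped files.

    Comment lines are copied to both outputs, so an output with no
    annotations still contains the comments. Returns line counts.
    """
    counts = {"iea": 0, "non_iea": 0}
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(iea_path, "wt", encoding="utf-8") as iea_out, \
         gzip.open(non_iea_path, "wt", encoding="utf-8") as non_iea_out:
        for line in src:
            if line.startswith("!"):
                iea_out.write(line)
                non_iea_out.write(line)
            elif not line.strip():
                continue  # skip blank lines
            elif is_iea(line):
                iea_out.write(line)
                counts["iea"] += 1
            else:
                non_iea_out.write(line)
                counts["non_iea"] += 1
    return counts
```

A user who wants only the experimental annotations could just as easily apply is_iea as a filter to the single combined file.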

-Doug

Mike Cherry wrote:
> Why.  SGD will not put IEA annotations into our current file.  We 
> believe that is a disservice to the community because so many 
> developers and users are now used to the budding yeast data without 
> IEAs.  Most of the developers use the SGD file to train their 
> algorithms and seem to have little understanding of the various 
> evidence codes and their significance.  Thus we want to add a second 
> file.  At the Princeton meeting I asked the group if there was a 
> preference for the name we would use.  I recall Michael then suggested 
> that if SGD does this all RefGenome annotation files should also be 
> partitioned.
>
> -Mike
>
>
> On Jan 28, 2008, at 10:11 AM, Chris Mungall wrote:
>
>>
>> I don't think we should overload names or file paths.
>>
>> If gene_association.mgi.gz currently means all associations, we 
>> shouldn't change this, even with lead time: if we want to change the 
>> meaning we should obsolete the URL/file path.
>>
>> I think the options are:
>>
>> [1] maintain 3 files:
>>
>> gene_association.<ORG>.gz
>> gene_association.<ORG>.iea_annotations.gz
>> gene_association.<ORG>.non-iea_annotations.gz
>>
>> [2] maintain 2 files:
>>
>> gene_association.<ORG>.iea_annotations.gz
>> gene_association.<ORG>.non-iea_annotations.gz
>>
>> (and force people to cat if they want [1])
>>
>> There is no requirement to give users a lead time for [2]. There 
>> would have to be a lead time for [1], and [2] would be a necessary 
>> intermediate step towards [1] to give software time to adjust.
>>
>> If nobody can remember why this is important I suggest going with [2].
>>
>> OTOH if we do go with [1] and we force people to change their URLs 
>> and file paths, I suggest a mildly more radical change: we should 
>> abandon the practice of using dbnames and arbitrary strings as file 
>> suffixes. The file suffix should denote a file format, each of which 
>> should be documented on the site. E.g.
>>
>>     .obo
>>     .go
>>     .assoc (proposed for association files)
>>     .<fmt>.gz -- compressed
>>     .txt - unstructured text
>>
>> etc
>>
>> On Jan 28, 2008, at 6:29 AM, Judith Blake wrote:
>>
>>> Mike,
>>> My sense was that this was to be for the GA files for reference 
>>> genomes only.
>>> I am fine with your naming proposal.
>>>
>>> Judy
>>>
>>> Mike Cherry wrote:
>>>> At the Princeton GOC meeting (our 18th) it was decided to partition 
>>>> each GA file in two.  One file would contain all annotations with 
>>>> non-IEA evidence, the other would contain all the annotations with 
>>>> IEA evidence.
>>>>
>>>> We need to specify this a bit more.  I have a script that divides 
>>>> up the annotations.
>>>>
>>>> Question:  Names of the resulting files?  At Princeton I recall it 
>>>> was agreed to have the file without IEA annotations to keep the 
>>>> name of the current file.  Then there would be a new file for just 
>>>> the IEA annotations.  I didn't find the name mentioned in the minutes 
>>>> but I recall it was something long like  
>>>> gene_association.XXX.iea_annotations.gz
>>>>
>>>> For example:
>>>>
>>>> current file:
>>>>
>>>>  gene_association.mgi.gz
>>>>
>>>> after partitioning happens:
>>>>
>>>>  gene_association.mgi.gz  -- non-IEA annotations
>>>>  gene_association.mgi.iea_annotations.gz  -- IEA annotations
>>>>
>>>> Question: Both files would be created for all projects?  In some 
>>>> cases all the current annotations are IEA.  Here the xxx.gz file 
>>>> would have no annotations, just a comment to say check the other 
>>>> file.  For other projects there are no IEA annotations, here the 
>>>> xxx.iea_annotations.gz files would have no annotations, just 
>>>> comments.  Most projects will have annotations in both files.
>>>>
>>>> The submission of files would not change.  Each project would 
>>>> continue to submit the ga file as is done now.
>>>>
>>>> All this is about changing the processing of the submitted file, it 
>>>> would become filtered and partitioned in one step.
>>>>
>>>> We would need to announce and give ample notice of this change, at 
>>>> least 2-3 months after the announcement.
>>>>
>>>> -Mike
>>>>
>>>
>


