[go] partitioning gene association files

Chris Mungall cjm at fruitfly.org
Mon Jan 28 10:45:46 PST 2008


On Jan 28, 2008, at 10:11 AM, Chris Mungall wrote:

>
> I don't think we should overload names or file paths.
>
> If gene_association.mgi.gz currently means all associations, we  
> shouldn't change this, even with lead time: if we want to change  
> the meaning we should obsolete the URL/file path.
>
> I think the options are:
>
> [1] maintain 3 files:
>
> gene_association.<ORG>.gz
> gene_association.<ORG>.iea_annotations.gz
> gene_association.<ORG>.non-iea_annotations.gz
>
> [2] maintain 2 files:
>
> gene_association.<ORG>.iea_annotations.gz
> gene_association.<ORG>.non-iea_annotations.gz
>
> (and force people to cat if they want [1])
>

Oops, I switched numbers halfway through (thanks JohnM); strike:

> There is no requirement to give users a lead time for [2]. There  
> would have to be a lead time for [1], and [2] would be a necessary  
> intermediate step towards [1] to give software time to adjust.
>
> If nobody can remember why this is important I suggest going with [2].

should be:

There is no requirement to give users a lead time for [1]. There  
would have to be a lead time for [2], and [1] would be a necessary  
intermediate step towards [2] to give software time to adjust.

If nobody can remember why this is important I suggest going with [1].

sorry!

>
> OTOH if we do go with [1] and we force people to change their URLs  
> and file paths, I suggest a mildly more radical change: we should  
> abandon the practice of using dbnames and arbitrary strings as file  
> suffixes. The file suffix should denote a file format, each of  
> which should be documented on the site. E.g.
>
> 	.obo
> 	.go
> 	.assoc (proposed for association files)
> 	.<fmt>.gz -- compressed
> 	.txt - unstructured text
>
> etc
>
> On Jan 28, 2008, at 6:29 AM, Judith Blake wrote:
>
>> Mike,
>> My sense was that this was to be for the GA files for reference  
>> genomes only.
>> I am fine with your naming proposal.
>>
>> Judy
>>
>> Mike Cherry wrote:
>>> At the Princeton GOC meeting (our 18th) it was decided to  
>>> partition each GA file in two.  One file would contain all  
>>> annotations with non-IEA evidence, the other would contain all  
>>> the annotations with IEA evidence.
>>>
>>> We need to specify this a bit more.  I have a script that divides  
>>> up the annotations.
>>>
>>> Question:  Names of the resulting files?  At Princeton I recall  
>>> it was agreed to have the file without IEA annotations to keep  
>>> the name of the current file.  Then there would be a new file for  
>>> just the IEA annotations. I didn't find the name mentioned in the  
>>> minutes, but I recall it was something long like   
>>> gene_association.XXX.iea_annotations.gz
>>>
>>> For example:
>>>
>>> current file:
>>>
>>>   gene_association.mgi.gz
>>>
>>> after partitioning happens:
>>>
>>>   gene_association.mgi.gz  -- non-IEA annotations
>>>   gene_association.mgi.iea_annotations.gz  -- IEA annotations
>>>
>>> Question: Both files would be created for all projects?  In some  
>>> cases all the current annotations are IEA.  Here the xxx.gz file  
>>> would have no annotations, just a comment to say check the other  
>>> file.  For other projects there are no IEA annotations, here the  
>>> xxx.iea_annotations.gz files would have no annotations just  
>>> comments.  Most projects will have annotations in both files.
>>>
>>> The submission of files would not change.  Each project would  
>>> continue to submit the ga file as is done now.
>>>
>>> All this is about changing the processing of the submitted file,  
>>> it would become filtered and partitioned in one step.
>>>
>>> We would need to announce and give ample notice of this change,  
>>> at least 2-3 months after the announcement.
>>>
>>> -Mike
>>>
>>
>
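
On "force people to cat": gzip files can be concatenated byte for byte,
and the result is still a valid gzip stream that decompresses to the
concatenation of the inputs, so rebuilding the combined file from the
two partitions would not need recompression. A small Python check of
that property, using in-memory data instead of real files (the sample
rows are invented for illustration):

```python
import gzip

# Stand-ins for the two partitioned files (contents are made up).
non_iea_gz = gzip.compress(b"row-with-IDA\n")
iea_gz = gzip.compress(b"row-with-IEA\n")

# Equivalent to: cat non-iea.gz iea.gz > combined.gz
combined_gz = non_iea_gz + iea_gz

# gzip.decompress reads every member of a multi-member stream.
combined = gzip.decompress(combined_gz)
```

If both partitions carry the same "!" header comments, the concatenated
file would contain them twice.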
>
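
For reference, the partitioning step Mike describes (one pass over the
submitted file, non-IEA annotations to one output and IEA annotations
to the other) might look roughly like the Python sketch below. This is
not Mike's script. It assumes the tab-delimited gene association layout
in which the evidence code is the seventh column and comment lines
begin with "!"; the function name partition_annotations and the wording
of the placeholder comments are hypothetical.

```python
def partition_annotations(lines):
    """Split gene association lines into (non_iea, iea) lists.

    Comment lines (starting with '!') are copied to both outputs.
    Assumes tab-delimited rows with the evidence code in column 7;
    rows with fewer columns are kept with the non-IEA annotations.
    """
    non_iea, iea = [], []
    n_non_iea = n_iea = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("!"):
            non_iea.append(line)
            iea.append(line)
            continue
        cols = line.split("\t")
        evidence = cols[6] if len(cols) > 6 else ""
        if evidence == "IEA":
            iea.append(line)
            n_iea += 1
        else:
            non_iea.append(line)
            n_non_iea += 1
    # Hypothetical placeholder text for an output with no annotations.
    if n_non_iea == 0:
        non_iea.append("! No non-IEA annotations; see the IEA file.")
    if n_iea == 0:
        iea.append("! No IEA annotations; see the non-IEA file.")
    return non_iea, iea
```

A project whose annotations are all IEA (or all non-IEA) still gets
both outputs; the empty one holds only comments, as proposed in the
quoted message.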



