[go] partitioning gene association files

Chris Mungall cjm at fruitfly.org
Mon Jan 28 10:11:18 PST 2008


I don't think we should overload names or file paths.

If gene_association.mgi.gz currently means all associations, we  
shouldn't change this, even with lead time: if we want to change the  
meaning we should obsolete the URL/file path.

I think the options are:

[1] maintain 3 files:

gene_association.<ORG>.gz
gene_association.<ORG>.iea_annotations.gz
gene_association.<ORG>.non-iea_annotations.gz

[2] maintain 2 files:

gene_association.<ORG>.iea_annotations.gz
gene_association.<ORG>.non-iea_annotations.gz

(and force people to cat if they want [1])
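
For illustration, the split behind both options could be sketched roughly as below. This is a minimal sketch and not the script Mike mentions: it assumes the tab-delimited gene association format with the evidence code in the seventh column and "!" marking comment lines. The names partition_lines and partition_file are made up for this example, and copying comment lines into both outputs is likewise an assumption.

```python
import gzip

# Assumption: the evidence code is the 7th tab-delimited column (index 6).
EVIDENCE_COLUMN = 6


def partition_lines(lines):
    """Split association lines into (iea, non_iea) lists.

    Comment lines (starting with "!") are copied to both lists; blank
    lines are dropped. Lines with too few columns land in non_iea.
    """
    iea, non_iea = [], []
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("!"):
            iea.append(line)
            non_iea.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > EVIDENCE_COLUMN and fields[EVIDENCE_COLUMN] == "IEA":
            iea.append(line)
        else:
            non_iea.append(line)
    return iea, non_iea


def partition_file(src_path, iea_path, non_iea_path):
    """Read a gzipped association file and write the two gzipped partitions."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        iea, non_iea = partition_lines(src)
    with gzip.open(iea_path, "wt", encoding="utf-8") as out:
        out.writelines(iea)
    with gzip.open(non_iea_path, "wt", encoding="utf-8") as out:
        out.writelines(non_iea)
```

Since concatenated gzip members form a valid gzip stream, the "cat" in [2] can be applied to the two .gz files directly. Under the assumption above, the header comments would then appear twice in the combined file.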

There is no requirement to give users a lead time for [2]. There  
would have to be a lead time for [1], and [2] would be a necessary  
intermediate step towards [1] to give software time to adjust.

If nobody can remember why this is important I suggest going with [2].

OTOH if we do go with [1] and we force people to change their URLs  
and file paths, I suggest a mildly more radical change: we should  
abandon the practice of using dbnames and arbitrary strings as file  
suffixes. The file suffix should denote a file format, each of which  
should be documented on the site. E.g.

	.obo
	.go
	.assoc -- proposed for association files
	.<fmt>.gz -- compressed
	.txt -- unstructured text

etc
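
As a rough illustration of what "suffix denotes format" would buy software, here is a minimal sketch of a filename check. The set of suffixes is taken from the list above; the function name classify and the return shape are hypothetical, not part of any existing GO tooling.

```python
import os

# Suffixes from the proposal above; ".assoc" is only proposed, not in use.
KNOWN_FORMATS = {".obo", ".go", ".assoc", ".txt"}


def classify(filename):
    """Return (format_suffix, is_compressed).

    format_suffix is None when the suffix is not a documented format,
    e.g. when it is a database name as in the current naming scheme.
    """
    base, ext = os.path.splitext(filename)
    compressed = ext == ".gz"
    if compressed:
        # Strip ".gz" and look at the suffix underneath it.
        base, ext = os.path.splitext(base)
    return (ext if ext in KNOWN_FORMATS else None), compressed
```

With the current scheme, classify("gene_association.mgi.gz") finds ".mgi" under the ".gz" and reports no known format, which is the ambiguity the proposal is meant to remove.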

On Jan 28, 2008, at 6:29 AM, Judith Blake wrote:

> Mike,
> My sense was that this was to be for the GA files for reference  
> genomes only.
> I am fine with your naming proposal.
>
> Judy
>
> Mike Cherry wrote:
>> At the Princeton GOC meeting (our 18th) it was decided to  
>> partition each GA file in two.  One file would contain all  
>> annotations with non-IEA evidence, the other would contain all the  
>> annotations with IEA evidence.
>>
>> We need to specify this a bit more.  I have a script that divides  
>> up the annotations.
>>
>> Question:  Names of the resulting files?  At Princeton I recall it
>> was agreed that the file without IEA annotations would keep the
>> name of the current file.  Then there would be a new file for just
>> the IEA annotations.  I didn't find the name mentioned in the
>> minutes, but I recall it was something long like
>> gene_association.XXX.iea_annotations.gz
>>
>> For example:
>>
>> current file:
>>
>>   gene_association.mgi.gz
>>
>> after partitioning happens:
>>
>>   gene_association.mgi.gz  -- non-IEA annotations
>>   gene_association.mgi.iea_annotations.gz  -- IEA annotations
>>
>> Question: Would both files be created for all projects?  In some
>> cases all the current annotations are IEA.  Here the xxx.gz file
>> would have no annotations, just a comment saying to check the other
>> file.  For other projects there are no IEA annotations; here the
>> xxx.iea_annotations.gz file would have no annotations, just
>> comments.  Most projects will have annotations in both files.
>>
>> The submission of files would not change.  Each project would  
>> continue to submit the ga file as is done now.
>>
>> All this is about changing the processing of the submitted file:
>> it would be filtered and partitioned in one step.
>>
>> We would need to announce this change and give ample notice, at
>> least 2-3 months after the announcement.
>>
>> -Mike
>>
>



