[go] partitioning gene association files
Chris Mungall
cjm at fruitfly.org
Mon Jan 28 10:11:18 PST 2008
I don't think we should overload names or file paths.
If gene_association.mgi.gz currently means all associations, we
shouldn't change this, even with lead time: if we want to change the
meaning we should obsolete the URL/file path.
I think the options are:
[1] maintain 3 files:
gene_association.<ORG>.gz
gene_association.<ORG>.iea_annotations.gz
gene_association.<ORG>.non-iea_annotations.gz
[2] maintain 2 files:
gene_association.<ORG>.iea_annotations.gz
gene_association.<ORG>.non-iea_annotations.gz
(and force people to cat if they want [1])
There is no requirement to give users a lead time for [1], since the
existing file keeps its current meaning. There would have to be a
lead time for [2], and [1] would be a necessary intermediate step
towards [2] to give software time to adjust.
If nobody can remember why this is important I suggest going with [1].
OTOH if we do go with [2] and we force people to change their URLs
and file paths, I suggest a mildly more radical change: we should
abandon the practice of using dbnames and arbitrary strings as file
suffixes. The file suffix should denote a file format, each of which
should be documented on the site. E.g.
.obo
.go
.assoc (proposed for association files)
.<fmt>.gz -- compressed
.txt -- unstructured text
etc.
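For concreteness, here is a minimal sketch of the partitioning step
discussed in the quoted thread. It is not Mike's script. It assumes the
evidence code sits in the 7th tab-separated column of each annotation
line and that comment lines start with "!". The function names and the
directory argument are hypothetical, and the output names follow the
ones proposed above.

```python
import gzip

# 0-based index of the evidence code column (assumed: 7th column of a GA line).
EVIDENCE_COL = 6


def partition_lines(lines):
    """Split GA lines into (non_iea, iea) lists.

    Comment lines (starting with "!") and blank lines are copied to both outputs.
    """
    non_iea, iea = [], []
    for line in lines:
        if line.startswith("!") or not line.strip():
            non_iea.append(line)
            iea.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= EVIDENCE_COL:
            raise ValueError(f"too few columns in line: {line!r}")
        target = iea if cols[EVIDENCE_COL] == "IEA" else non_iea
        target.append(line)
    return non_iea, iea


def partition_file(org, directory="."):
    """Hypothetical wrapper: read gene_association.<org>.gz, write the two parts."""
    base = f"{directory}/gene_association.{org}"
    with gzip.open(f"{base}.gz", "rt", encoding="utf-8") as src:
        non_iea, iea = partition_lines(src)
    with gzip.open(f"{base}.non-iea_annotations.gz", "wt", encoding="utf-8") as out:
        out.writelines(non_iea)
    with gzip.open(f"{base}.iea_annotations.gz", "wt", encoding="utf-8") as out:
        out.writelines(iea)
```

Because concatenated gzip files are themselves a valid gzip stream,
running cat on the two outputs gives back a readable combined file.
With this sketch the header comments would appear twice in it, since
they are copied to both parts.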
On Jan 28, 2008, at 6:29 AM, Judith Blake wrote:
> Mike,
> My sense was that this was to be for the GA files for reference
> genomes only.
> I am fine with your naming proposal.
>
> Judy
>
> Mike Cherry wrote:
>> At the Princeton GOC meeting (our 18th) it was decided to
>> partition each GA file in two. One file would contain all
>> annotations with non-IEA evidence, the other would contain all the
>> annotations with IEA evidence.
>>
>> We need to specify this a bit more. I have a script that divides
>> up the annotations.
>>
>> Question: Names of the resulting files? At Princeton I recall it
>> was agreed that the file without IEA annotations would keep the
>> name of the current file. Then there would be a new file for just
>> the IEA annotations. I didn't find the name mentioned in the
>> minutes, but I recall it was something long like
>> gene_association.XXX.iea_annotations.gz
>>
>> For example:
>>
>> current file:
>>
>> gene_association.mgi.gz
>>
>> after partitioning happens:
>>
>> gene_association.mgi.gz -- non-IEA annotations
>> gene_association.mgi.iea_annotations.gz -- IEA annotations
>>
>> Question: Would both files be created for all projects? In some
>> cases all the current annotations are IEA. Here the xxx.gz file
>> would have no annotations, just a comment saying to check the other
>> file. For other projects there are no IEA annotations; here the
>> xxx.iea_annotations.gz file would have no annotations, just
>> comments. Most projects will have annotations in both files.
>>
>> The submission of files would not change. Each project would
>> continue to submit the ga file as is done now.
>>
>> All this is about changing the processing of the submitted file:
>> it would be filtered and partitioned in one step.
>>
>> We would need to announce this change and give ample notice, at
>> least 2-3 months after the announcement.
>>
>> -Mike
>>
>