[go] partitioning gene association files
Chris Mungall
cjm at fruitfly.org
Mon Jan 28 10:45:46 PST 2008
On Jan 28, 2008, at 10:11 AM, Chris Mungall wrote:
>
> I don't think we should overload names or file paths.
>
> If gene_association.mgi.gz currently means all associations, we
> shouldn't change this, even with lead time: if we want to change
> the meaning we should obsolete the URL/file path.
>
> I think the options are:
>
> [1] maintain 3 files:
>
> gene_association.<ORG>.gz
> gene_association.<ORG>.iea_annotations.gz
> gene_association.<ORG>.non-iea_annotations.gz
>
> [2] maintain 2 files:
>
> gene_association.<ORG>.iea_annotations.gz
> gene_association.<ORG>.non-iea_annotations.gz
>
> (and force people to cat if they want [1])
>
Oops, I switched numbers halfway through (thanks JohnM); strike:
> There is no requirement to give users a lead time for [2]. There
> would have to be a lead time for [1], and [2] would be a necessary
> intermediate step towards [1] to give software time to adjust.
>
> If nobody can remember why this is important I suggest going with [2].
should be:
There is no requirement to give users a lead time for [1]. There
would have to be a lead time for [2], and [1] would be a necessary
intermediate step towards [2] to give software time to adjust.
If nobody can remember why this is important I suggest going with [1].
sorry!
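
For anyone wondering what "cat" would involve under the two-file option: gzip files can be joined byte-for-byte, and the result is still a valid gzip stream. A minimal sketch in Python follows. The function name concat_gzip is made up for illustration, and the file names in the usage comment simply substitute mgi for <ORG> in the names proposed above.

```python
import gzip
import shutil


def concat_gzip(parts, dest):
    """Join gzip files by appending their raw bytes to `dest`.

    No decompression is needed: a sequence of gzip members is itself
    a valid gzip file, which is why plain `cat` works for this.
    """
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Hypothetical usage, assuming both partitioned files exist locally:
# concat_gzip(
#     ["gene_association.mgi.non-iea_annotations.gz",
#      "gene_association.mgi.iea_annotations.gz"],
#     "gene_association.mgi.all.gz",
# )
```

Note that any "!" comment header present in both parts would appear twice in the joined file, so a consumer that cares about headers may want to filter them.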
>
> OTOH if we do go with [1] and we force people to change their URLs
> and file paths, I suggest a mildly more radical change: we should
> abandon the practice of using dbnames and arbitrary strings as file
> suffixes. The file suffix should denote a file format, each of
> which should be documented on the site. E.g.
>
> .obo
> .go
> .assoc (proposed for association files)
> .<fmt>.gz -- compressed
> .txt -- unstructured text
>
> etc
>
> On Jan 28, 2008, at 6:29 AM, Judith Blake wrote:
>
>> Mike,
>> My sense was that this was to be for the GA files for reference
>> genomes only.
>> I am fine with your naming proposal.
>>
>> Judy
>>
>> Mike Cherry wrote:
>>> At the Princeton GOC meeting (our 18th) it was decided to
>>> partition each GA file in two. One file would contain all
>>> annotations with non-IEA evidence, the other would contain all
>>> the annotations with IEA evidence.
>>>
>>> We need to specify this a bit more. I have a script that divides
>>> up the annotations.
>>>
>>> Question: Names of the resulting files? At Princeton I recall
>>> it was agreed that the file without IEA annotations would keep
>>> the name of the current file. Then there would be a new file for
>>> just the IEA annotations. I didn't find the name mentioned in the
>>> minutes but I recall it was something long like
>>> gene_association.XXX.iea_annotations.gz
>>>
>>> For example:
>>>
>>> current file:
>>>
>>> gene_association.mgi.gz
>>>
>>> after partitioning happens:
>>>
>>> gene_association.mgi.gz -- non-IEA annotations
>>> gene_association.mgi.iea_annotations.gz -- IEA annotations
>>>
>>> Question: Would both files be created for all projects? In some
>>> cases all the current annotations are IEA. Here the xxx.gz file
>>> would have no annotations, just a comment saying to check the
>>> other file. For other projects there are no IEA annotations; here
>>> the xxx.iea_annotations.gz file would have no annotations, just
>>> comments. Most projects will have annotations in both files.
>>>
>>> The submission of files would not change. Each project would
>>> continue to submit the ga file as is done now.
>>>
>>> All this is about changing the processing of the submitted file,
>>> it would become filtered and partitioned in one step.
>>>
>>> We would need to announce this change and give ample notice,
>>> at least 2-3 months after the announcement.
>>>
>>> -Mike
>>>
>>
>
>
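
The partitioning step described in the quoted message (all IEA annotations in one file, everything else in the other) could be sketched roughly as below. This is not the script Mike mentions, which is not shown in this thread. It assumes a tab-delimited association file with the evidence code in the seventh column and comment lines starting with "!". The names partition_lines and partition_file are hypothetical.

```python
import gzip

# Zero-based index of the evidence code column (assumed file layout).
EVIDENCE_COLUMN = 6


def partition_lines(lines):
    """Split association lines into (non_iea, iea) lists.

    Comment lines (starting with "!") and blank lines are copied to
    both outputs so that each file keeps its header.
    """
    non_iea, iea = [], []
    for line in lines:
        if line.startswith("!") or not line.strip():
            non_iea.append(line)
            iea.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > EVIDENCE_COLUMN and cols[EVIDENCE_COLUMN] == "IEA":
            iea.append(line)
        else:
            non_iea.append(line)
    return non_iea, iea


def partition_file(src, non_iea_path, iea_path):
    """Read a gzipped association file and write the two partitions."""
    with gzip.open(src, "rt", encoding="utf-8") as fh:
        non_iea, iea = partition_lines(fh)
    with gzip.open(non_iea_path, "wt", encoding="utf-8") as out:
        out.writelines(non_iea)
    with gzip.open(iea_path, "wt", encoding="utf-8") as out:
        out.writelines(iea)
```

With this approach a project whose annotations are all IEA (or all non-IEA) still gets both files; one of them simply contains only the comment lines, which matches the behaviour discussed in the quoted message.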