[go] partitioning gene association files
Mike Cherry
cherry at stanford.edu
Mon Jan 28 11:33:31 PST 2008
Why: SGD will not put IEA annotations into our current file. We
believe doing so would be a disservice to the community because so many
developers and users are now used to the budding yeast data without
IEAs. Most of the developers use the SGD file to train their
algorithms and seem to have little understanding of the various
evidence codes and their significance. Thus we want to add a second
file. At the Princeton meeting I asked the group if there was a
preference for the name we would use. I recall Michael then suggested
that if SGD does this, all RefGenome annotation files should also be
partitioned.
-Mike
On Jan 28, 2008, at 10:11 AM, Chris Mungall wrote:
>
> I don't think we should overload names or file paths.
>
> If gene_association.mgi.gz currently means all associations, we
> shouldn't change this, even with lead time: if we want to change the
> meaning we should obsolete the URL/file path.
>
> I think the options are:
>
> [1] maintain 3 files:
>
> gene_association.<ORG>.gz
> gene_association.<ORG>.iea_annotations.gz
> gene_association.<ORG>.non-iea_annotations.gz
>
> [2] maintain 2 files:
>
> gene_association.<ORG>.iea_annotations.gz
> gene_association.<ORG>.non-iea_annotations.gz
>
> (and force people to cat if they want [1])
>
> There is no requirement to give users a lead time for [2]. There
> would have to be a lead time for [1], and [2] would be a necessary
> intermediate step towards [1] to give software time to adjust.
>
> If nobody can remember why this is important I suggest going with [2].
>
> OTOH if we do go with [1] and we force people to change their URLs
> and file paths, I suggest a mildly more radical change: we should
> abandon the practice of using dbnames and arbitrary strings as file
> suffixes. The file suffix should denote a file format, each of which
> should be documented on the site. E.g.
>
> .obo
> .go
> .assoc (proposed for association files)
> .<fmt>.gz -- compressed
> .txt - unstructured text
>
> etc
>
> On Jan 28, 2008, at 6:29 AM, Judith Blake wrote:
>
>> Mike,
>> My sense was that this was to be for the GA files for reference
>> genomes only.
>> I am fine with your naming proposal.
>>
>> Judy
>>
>> Mike Cherry wrote:
>>> At the Princeton GOC meeting (our 18th) it was decided to
>>> partition each GA file in two. One file would contain all
>>> annotations with non-IEA evidence, the other would contain all the
>>> annotations with IEA evidence.
>>>
>>> We need to specify this a bit more. I have a script that divides
>>> up the annotations.
>>>
>>> Question: Names of the resulting files? At Princeton I recall it
>>> was agreed that the file without IEA annotations would keep the
>>> name of the current file. Then there would be a new file for just
>>> the IEA annotations. I didn't find the name mentioned in the minutes
>>> but I recall it was something long like
>>> gene_association.XXX.iea_annotations.gz
>>>
>>> For example:
>>>
>>> current file:
>>>
>>> gene_association.mgi.gz
>>>
>>> after partitioning happens:
>>>
>>> gene_association.mgi.gz -- non-IEA annotations
>>> gene_association.mgi.iea_annotations.gz -- IEA annotations
>>>
>>> Question: Both files would be created for all projects? In some
>>> cases all the current annotations are IEA. Here the xxx.gz file
>>> would have no annotations, just a comment to say check the other
>>> file. For other projects there are no IEA annotations. Here the
>>> xxx.iea_annotations.gz file would have no annotations, just
>>> comments. Most projects will have annotations in both files.
>>>
>>> The submission of files would not change. Each project would
>>> continue to submit the ga file as is done now.
>>>
>>> All this is about changing the processing of the submitted file:
>>> it would be filtered and partitioned in one step.
>>>
>>> We would need to announce and give ample notice of this change, at
>>> least 2-3 months after the announcement.
>>>
>>> -Mike
>>>
>>
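
For illustration only, the partitioning step described in the quoted
message could look roughly like the sketch below. This is not the script
mentioned in the thread. It assumes the usual tab-delimited GA layout in
which column 7 holds the evidence code and comment lines start with "!".
The function names, the placeholder comment wording, and the choice to
copy header comments into both outputs are all hypothetical.

```python
import gzip
from typing import Iterable, List, Tuple

EVIDENCE_COLUMN = 6  # zero-based index of column 7, the evidence code
# Hypothetical wording for the "check the other file" comment.
PLACEHOLDER = "! No annotations in this file; see the companion file.\n"


def partition_lines(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split GA lines into (non_iea, iea) lists."""
    non_iea: List[str] = []
    iea: List[str] = []
    for line in lines:
        if line.startswith("!"):
            # Assumption: header comments are copied into both outputs.
            non_iea.append(line)
            iea.append(line)
            continue
        if not line.strip():
            continue  # drop blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > EVIDENCE_COLUMN and fields[EVIDENCE_COLUMN] == "IEA":
            iea.append(line)
        else:
            # Rows with too few columns also land here (an assumption).
            non_iea.append(line)
    return non_iea, iea


def with_placeholder(lines: List[str]) -> List[str]:
    """Append a pointer comment when a partition has no annotation rows."""
    if any(not line.startswith("!") for line in lines):
        return lines
    return lines + [PLACEHOLDER]


def partition_file(src_path: str, non_iea_path: str, iea_path: str) -> None:
    """Read one gzipped GA file and write the two gzipped partitions."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        non_iea, iea = partition_lines(src)
    for path, part in ((non_iea_path, non_iea), (iea_path, iea)):
        with gzip.open(path, "wt", encoding="utf-8") as out:
            out.writelines(with_placeholder(part))
```

The output names are left to the caller, so the same sketch works with
either naming proposal in this thread: keeping gene_association.mgi.gz
for the non-IEA file, or writing two new files with .iea_annotations.gz
and .non-iea_annotations.gz suffixes.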