[go] mapping between DB_Object_ID and DB_Object_Symbol
Gavin Sherlock
sherlock at genome.Stanford.EDU
Mon Aug 13 13:47:42 PDT 2007
Hi Mike,
In this case, the check is easy to implement in the filtering script,
rather than as a check run after the database has been loaded. I think
it would be much easier to check the file and reject it than to have
to roll back the database to a previous successfully loaded file if a
database check throws an error after loading.
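
A minimal sketch of what such a pre-load check could look like, written in Python purely for illustration (the filtering script itself is not shown in this thread). It assumes tab-separated lines with DB_Object_ID in column 2 and DB_Object_Symbol in column 3, and it skips lines starting with "!" as comments. The function names and the "reject the whole file" policy are hypothetical, not part of any existing script.

```python
from collections import defaultdict


def find_mapping_conflicts(lines):
    """Collect IDs with several symbols and symbols with several IDs.

    Assumption: column 2 is DB_Object_ID and column 3 is DB_Object_Symbol.
    """
    id_to_symbols = defaultdict(set)
    symbol_to_ids = defaultdict(set)
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed line; left to the per-line format checks
        db_id, symbol = fields[1], fields[2]
        id_to_symbols[db_id].add(symbol)
        symbol_to_ids[symbol].add(db_id)
    # Keep only the entries that break the 1:1 relationship.
    many_symbols = {k: sorted(v) for k, v in id_to_symbols.items() if len(v) > 1}
    many_ids = {k: sorted(v) for k, v in symbol_to_ids.items() if len(v) > 1}
    return many_symbols, many_ids


def should_reject(lines):
    """Hypothetical policy: reject the file if either direction is not 1:1."""
    many_symbols, many_ids = find_mapping_conflicts(lines)
    return bool(many_symbols or many_ids)
```

Run against two of the pombe rows quoted further down, the first function would report SPCC777.13 as having both SPCC777.13 and vps35 as symbols, so the file would be rejected before anything is loaded.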
Cheers,
Gavin
On Aug 10, 2007, at 1:31 PM, Mike Cherry wrote:
> Several times we have discussed adding checks on the database, i.e.
> after the data is loaded. I guess that would create a warning
> because the data was already loaded, as opposed to an error where
> the data is rejected.
>
> I'm off for the next three weeks so will not work on this until
> September.
>
> Questions to the group:
>
> 1. Are there other data checks that need to be implemented at the
> same time?
> 2. What should happen when an "error" is found? Remove all rows
> with the gene id in question, or just all the rows that don't have
> the same symbol as the first one observed?
>
> -Mike
>
>
> On Aug 9, 2007, at 9:26 PM, Gavin Sherlock wrote:
>
>> Hi Mike,
>>
>> It's an easy check to add. For a given file, where $databaseId
>> and $name are the DB_Object_ID and DB_Object_Symbol for the
>> current line respectively, something like:
>>
>> if (exists $databaseId2StandardName{$databaseId}
>>     && $name ne $databaseId2StandardName{$databaseId}) {
>>
>>     # do something to say that the databaseId has more than one
>>     # standard name in the file, and thus reject it
>>
>> } else {
>>
>>     # process
>>
>>     # now record that we saw it
>>
>>     $databaseId2StandardName{$databaseId} = $name;
>>
>> }
>>
>> works just fine (and is essentially what my GO::TermFinder code
>> does). Probably the reverse check should be done too, i.e. a
>> DB_Object_Symbol maps to only one DB_Object_ID.
>>
>> If it is part of the spec (and is spelled out on the annotation
>> file format page, which it isn't currently), then I think files
>> that don't follow the rule should be rejected.
>>
>> Cheers,
>> Gavin
>>
>> On Aug 9, 2007, at 9:13 PM, Mike Cherry wrote:
>>
>>> I think 1:1 is the intent.
>>>
>>> The documentation says the DB_Object_Symbol field has cardinality
>>> of 1. The checking script is looking for the pipe symbol in that
>>> field. It requires just one symbol, and will not allow zero or
>>> more than one symbol to be stated on a line.
>>>
>>> The checking script will not find multiple relationships if they
>>> are spread across multiple lines in the file. The filtering
>>> script is concerned with format of the information, cardinality,
>>> and some very basic things like whether an abbreviation is okay. As
>>> written it does not compare two lines within the file; it just
>>> checks each line independently.
>>>
>>> A check of the database could report these errors. There is an
>>> easy UNIX command method to check for this problem. For example
>>> with the pombe file, it's all one long command on one line:
>>>
>>> % gzcat gene_association.GeneDB_Spombe.gz | cut -f2,3 | sort -u |
>>> cut -f1 | sort | uniq -c | sort -rn | grep -v ' 1 '
>>>
>>> Any ID in the result that has a number greater than 1 is an ID
>>> that has more than 1 symbol associated somewhere within the gene
>>> association file. For the current pombe file that would be 388
>>> of the 5073 gene IDs.
>>>
>>> Yes, the RGD (96), pseudocap (1) and WormBase (16) gene
>>> association files all have a few of this type of issue.
>>>
>>> -Mike
>>>
>>>
>>> On Aug 9, 2007, at 7:38 PM, Gavin Sherlock wrote:
>>>
>>>> Hi all,
>>>>
>>>> An issue came up with GO::TermFinder, because it chokes on files
>>>> where the relationship between DB_Object_ID and DB_Object_Symbol
>>>> is not 1:1, and there are a number of files that have for
>>>> instance a 1:2 relationship between these columns, e.g.:
>>>>
>>>> GeneDB_Spombe: SPCC777.13 maps to SPCC777.13, vps35
>>>> pseudocap: PA5429 maps to aspA, adhA
>>>> RGD: RGD:1359623 maps to Tuba4a, Tuba4
>>>> WB: WBGene00000386 maps to cdc-25.1, cdc25.1
>>>>
>>>> My question is: should this be a 1:1 relationship, in which case
>>>> the annotation file checking script needs to reject files that
>>>> deviate from it (presumably these additional names would
>>>> become synonyms instead), or is a 1:2 or more relationship
>>>> allowed between those columns, in which case I'll have to
>>>> modify GO::TermFinder appropriately?
>>>>
>>>> As an additional data point, the pombe file actually lists both
>>>> SPCC777.13 and vps35 as synonyms for the gene too:
>>>>
>>>> whitbread 1001 % grep 'SPCC777.13' gene_association.GeneDB_Spombe
>>>> GeneDB_Spombe SPCC777.13 SPCC777.13 GO:0003674
>>>> GO_REF:0000015 ND F gene taxon:4896
>>>> 20070711 GeneDB_Spombe
>>>> GeneDB_Spombe SPCC777.13 vps35 GO:0005768
>>>> PMID:16622069 IMP C retromer complex subunit
>>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>>> 20060424 GeneDB_Spombe
>>>> GeneDB_Spombe SPCC777.13 vps35 GO:0030904
>>>> PMID:16622069 IMP C retromer complex subunit
>>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>>> 20040625 GeneDB_Spombe
>>>> GeneDB_Spombe SPCC777.13 vps35 GO:0030904
>>>> PMID:16622069 ISS SGD:S000003690 C retromer complex
>>>> subunit Vps35 SPCC777.13|vps35 gene taxon:4896
>>>> 20040625 GeneDB_Spombe
>>>> GeneDB_Spombe SPCC777.13 vps35 GO:0006886
>>>> PMID:16622069 IMP P retromer complex subunit
>>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>>> 20040625 GeneDB_Spombe
>>>> GeneDB_Spombe SPCC777.13 vps35 GO:0042147
>>>> PMID:16622069 IMP P retromer complex subunit
>>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>>> 20060424 GeneDB_Spombe
>>>> GeneDB_Spombe SPCC777.13 vps35 GO:0030437
>>>> PMID:15189449 IMP P retromer complex subunit
>>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>>> 20040625 GeneDB_Spombe
>>>> GeneDB_Spombe SPCC777.13 vps35 GO:0005829
>>>> PMID:16823372 IDA C retromer complex subunit
>>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>>> 20060724 GeneDB_Spombe
>>>>
>>>> - is there a rule (I couldn't find one) that says the synonyms
>>>> should not repeat the DB_Object_ID and DB_Object_Symbol, or
>>>> should there be? Would it save any space in the file sizes?
>>>>
>>>> Cheers,
>>>> Gavin
>>>> ________________________________________________________
>>>>
>>>> Gavin Sherlock
>>>> Dept. of Genetics
>>>> S201A, Grant Building,
>>>> Stanford University Medical School,
>>>> Stanford,
>>>> CA 94305-5120
>>>>
>>>> Tel: 650 498 6012
>>>> Fax: 650 724 3701
>>>>
>>>>