[go] mapping between DB_Object_ID and DB_Object_Symbol
Mike Cherry
cherry at stanford.edu
Fri Aug 10 10:31:04 PDT 2007
Several times we have discussed adding checks on the database, i.e.
after the data is loaded. I guess that would create a warning
because the data was already loaded, as opposed to an error where the
data is rejected.
I'm off for the next three weeks so will not work on this until
September.
Questions to the group:
1. Are there other data checks that need to be implemented at the
same time?
2. What should happen when an "error" is found? Remove all rows with
the gene id in question, or just all the rows that don't have the
same symbol as the first one observed?
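
The two options in question 2 could be sketched roughly as below. This is
only an illustration, not an agreed policy. It assumes a tab-delimited gene
association file with DB_Object_ID in column 2 and DB_Object_Symbol in
column 3, and comment lines starting with "!". The function names
drop_conflicting_ids and keep_first_symbol are hypothetical.

```shell
#!/bin/sh
# Sketch only: both functions read a tab-delimited gene association file
# given as $1 and write the filtered rows to standard output.

# Option A (hypothetical name): remove every row whose DB_Object_ID
# (column 2) is seen with more than one DB_Object_Symbol (column 3).
# The file is read twice: pass 1 counts distinct symbols per ID,
# pass 2 prints comment lines and rows whose ID has a single symbol.
drop_conflicting_ids() {
    awk -F'\t' '
        NR == FNR {
            if ($0 !~ /^!/ && !(($2, $3) in seen)) {
                seen[$2, $3] = 1
                n[$2]++
            }
            next
        }
        /^!/ || n[$2] <= 1
    ' "$1" "$1"
}

# Option B (hypothetical name): keep only the rows whose symbol matches
# the first symbol observed for that ID; comment lines pass through.
keep_first_symbol() {
    awk -F'\t' '
        /^!/ { print; next }
        !($2 in first) { first[$2] = $3 }
        $3 == first[$2]
    ' "$1"
}
```

Option B depends on row order, so sorting the file differently can change
which symbol "wins"; option A does not have that property but discards more
annotations.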
-Mike
On Aug 9, 2007, at 9:26 PM, Gavin Sherlock wrote:
> Hi Mike,
>
> It's an easy check to add. For a given file, where $databaseId and
> $name are the DB_Object_ID and DB_Object_Symbol for the current
> line respectively, something like:
>
>     if (exists $databaseId2StandardName{$databaseId}
>         && $name ne $databaseId2StandardName{$databaseId}) {
>
>         # do something to say that the databaseId has more than one
>         # standard name in the file, and thus reject it
>
>     } else {
>
>         # process
>
>         # now record that we saw it
>         $databaseId2StandardName{$databaseId} = $name;
>
>     }
>
> works just fine (and is essentially what my GO::TermFinder code
> does). Probably the reverse check should be done too, i.e. a
> DB_Object_Symbol maps to only one DB_Object_ID.
>
> If it is part of the spec (and is spelled out on the annotation
> file format page, which it isn't currently), then I think files
> that don't follow the rule should be rejected.
>
> Cheers,
> Gavin
>
> On Aug 9, 2007, at 9:13 PM, Mike Cherry wrote:
>
>> I think 1:1 is the intent.
>>
>> The documentation says the DB_Object_Symbol field has cardinality
>> of 1. The checking script is looking for the pipe symbol in that
>> field. It requires just one symbol, and will not allow zero or
>> more than one symbol to be stated on a line.
>>
>> The checking script will not find multiple relationships if they
>> are spread across multiple lines in the file. The filtering
>> script is concerned with the format of the information, cardinality,
>> and some very basic things like whether an abbreviation is okay. As
>> written it does not compare two lines within the file; it just
>> checks each line independently.
>>
>> A check of the database could report these errors. There is an
>> easy UNIX command method to check for this problem. For example,
>> with the pombe file (it's all one long command on one line):
>>
>> % gzcat gene_association.GeneDB_Spombe.gz | cut -f2,3 | sort -u |
>> cut -f1 | sort | uniq -c | sort -rn | grep -v ' 1 '
>>
>> Any ID in the result that has a number greater than 1 is an ID
>> that has more than 1 symbol associated somewhere within the gene
>> association file. For the current pombe file that would be 388 of
>> the 5073 gene IDs.
>>
>> Yes, the RGD (96), pseudocap (1) and WormBase (16) gene association
>> files all have a few issues of this type.
>>
>> -Mike
>>
>>
>> On Aug 9, 2007, at 7:38 PM, Gavin Sherlock wrote:
>>
>>> Hi all,
>>>
>>> An issue came up with GO::TermFinder, because it chokes on files
>>> where the relationship between DB_Object_ID and DB_Object_Symbol
>>> is not 1:1, and there are a number of files that have for
>>> instance a 1:2 relationship between these columns, e.g.:
>>>
>>> GeneDB_Spombe: SPCC777.13 maps to SPCC777.13, vps35
>>> pseudocap: PA5429 maps to aspA, adhA
>>> RGD: RGD:1359623 maps to Tuba4a, Tuba4
>>> WB: WBGene00000386 maps to cdc-25.1, cdc25.1
>>>
>>> My question is: should this be a 1:1 relationship, in which case
>>> the annotation file checking script needs to reject files that
>>> deviate from it (presumably these additional names would become
>>> synonyms instead), or is a 1:2 or more relationship allowed
>>> between those columns, in which case I'll have to modify
>>> GO::TermFinder appropriately?
>>>
>>> As an additional data point, the pombe file actually lists both
>>> SPCC777.13 and vps35 as synonyms for the gene too:
>>>
>>> whitbread 1001 % grep 'SPCC777.13' gene_association.GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 SPCC777.13 GO:0003674 GO_REF:0000015 ND F gene taxon:4896 20070711 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0005768 PMID:16622069 IMP C retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20060424 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0030904 PMID:16622069 IMP C retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20040625 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0030904 PMID:16622069 ISS SGD:S000003690 C retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20040625 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0006886 PMID:16622069 IMP P retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20040625 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0042147 PMID:16622069 IMP P retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20060424 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0030437 PMID:15189449 IMP P retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20040625 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0005829 PMID:16823372 IDA C retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20060724 GeneDB_Spombe
>>>
>>> - is there a rule (I couldn't find one) that says the synonyms
>>> should not repeat the DB_Object_ID and DB_Object_Symbol, or
>>> should there be? Would it save any space in the file sizes?
>>>
>>> Cheers,
>>> Gavin
>>> ________________________________________________________
>>>
>>> Gavin Sherlock
>>> Dept. of Genetics
>>> S201A, Grant Building,
>>> Stanford University Medical School,
>>> Stanford,
>>> CA 94305-5120
>>>
>>> Tel: 650 498 6012
>>> Fax: 650 724 3701
>>>
>>>