[go] mapping between DB_Object_ID and DB_Object_Symbol
Valerie Wood
val at sanger.ac.uk
Fri Aug 10 05:40:13 PDT 2007
Hi David,
Chris is referring to our (GeneDB) odd practice of repeating the names in
the synonyms column (so the gene name might be repeated in the synonym
field for a single gene), rather than between genes (which is OK for
synonyms).
The reason we did this is explained in a later e-mail.
Val
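
For readers of the archive, the kind of defensive filter discussed in the
quoted thread below can be sketched in a few lines of Python. This is only
a minimal sketch under stated assumptions: the function name
filter_synonyms is hypothetical, the synonym column is assumed to be
pipe-separated (as in the pombe lines quoted below), and names are compared
case-sensitively. It is not taken from GeneDB or GO::TermFinder.

```python
def filter_synonyms(object_id, symbol, synonym_field):
    """Return synonyms that add information beyond the ID and symbol.

    Hypothetical helper: drops entries equal to DB_Object_ID or
    DB_Object_Symbol, drops repeats, and keeps the original order.
    Comparison is case-sensitive (an assumption, not a GO rule).
    """
    seen = {object_id, symbol}
    kept = []
    for name in synonym_field.split("|"):
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            kept.append(name)
    return kept
```

With the pombe example from the thread, filter_synonyms("SPCC777.13",
"vps35", "SPCC777.13|vps35") returns an empty list, since both entries
only repeat the ID and the symbol.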
David Hill wrote:
>
>>
>>
>> I think it's best not to repeat symbols as synonyms, as you lead
>> people to believe that these will always be present, which may
>> potentially lead to them implementing buggy software (if they are
>> extremely sloppy).
>
> But, synonyms for gene symbols are harvested directly from the
> literature. Unfortunately, bench scientists don't often consider
> whether the 'handle' they are using for their gene is unique. This is
> a huge issue in mouse and often a lot of the work of a curator is to
> determine which gene an author is actually talking about. However,
> every official gene symbol should only correspond to one database gene
> ID. All other uses of symbols than the official symbol for a gene
> should go in the 'synonyms' field.
>
>> Those writing software correctly have to defensively implement some
>> kind of filter, if they want to avoid reporting back (mildly
>> confusing) duplicates to their users. Consistency is always a good
>> thing.
>>
>> I think the 1:1 violation is more serious though
>>
>> On Aug 9, 2007, at 7:38 PM, Gavin Sherlock wrote:
>>
>>> Hi all,
>>>
>>> An issue came up with GO::TermFinder, because it chokes on files
>>> where the relationship between DB_Object_ID and DB_Object_Symbol is
>>> not 1:1, and there are a number of files that have for instance a
>>> 1:2 relationship between these columns, e.g.:
>>>
>>> GeneDB_Spombe: SPCC777.13 maps to SPCC777.13, vps35
>>> pseudocap: PA5429 maps to aspA, adhA
>>> RGD: RGD:1359623 maps to Tuba4a, Tuba4
>>> WB: WBGene00000386 maps to cdc-25.1, cdc25.1
>>>
>>> My question is: should this be a 1:1 relationship, in which case the
>>> annotation file checking script needs to reject files that deviate
>>> from it (presumably these additional names would become synonyms
>>> instead), or is a 1:2 or more relationship allowed between those
>>> columns, in which case I'll have to modify GO::TermFinder
>>> appropriately?
>>>
>>> As an additional data point, the pombe file actually lists both
>>> SPCC777.13 and vps35 as synonyms for the gene too:
>>>
>>> whitbread 1001 % grep 'SPCC777.13' gene_association.GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 SPCC777.13
>>> GO:0003674 GO_REF:0000015 ND
>>> F gene taxon:4896 20070711 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0005768
>>> PMID:16622069 IMP C retromer complex subunit
>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>> 20060424 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0030904
>>> PMID:16622069 IMP C retromer complex subunit
>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>> 20040625 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0030904
>>> PMID:16622069 ISS SGD:S000003690 C retromer complex
>>> subunit Vps35 SPCC777.13|vps35 gene taxon:4896
>>> 20040625 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0006886
>>> PMID:16622069 IMP P retromer complex subunit
>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>> 20040625 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0042147
>>> PMID:16622069 IMP P retromer complex subunit
>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>> 20060424 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0030437
>>> PMID:15189449 IMP P retromer complex subunit
>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>> 20040625 GeneDB_Spombe
>>> GeneDB_Spombe SPCC777.13 vps35 GO:0005829
>>> PMID:16823372 IDA C retromer complex subunit
>>> Vps35 SPCC777.13|vps35 gene taxon:4896
>>> 20060724 GeneDB_Spombe
>>>
>>> - is there a rule (I couldn't find one) that says the synonyms
>>> should not repeat the DB_Object_ID and DB_Object_Symbol, or should
>>> there be? Would it save any space in the file sizes?
>>>
>>> Cheers,
>>> Gavin
>>> ________________________________________________________
>>>
>>> Gavin Sherlock
>>> Dept. of Genetics
>>> S201A, Grant Building,
>>> Stanford University Medical School,
>>> Stanford,
>>> CA 94305-5120
>>>
>>> Tel: 650 498 6012
>>> Fax: 650 724 3701
>>>
>>>
>>
>
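
The 1:1 check on DB_Object_ID and DB_Object_Symbol that Gavin asks about
above could look roughly like the following. This is a minimal sketch, not
the actual annotation file checking script: the function name
find_multi_symbol_ids is hypothetical, and it assumes a tab-separated gene
association file with DB, DB_Object_ID and DB_Object_Symbol in the first
three columns and comment lines starting with "!". It only reports IDs that
carry more than one symbol, which is the direction described in the thread.

```python
from collections import defaultdict


def find_multi_symbol_ids(lines):
    """Map (DB, DB_Object_ID) to its symbols when there is more than one.

    Hypothetical helper: `lines` is any iterable of tab-separated
    gene association lines. Blank lines, "!" comment lines and lines
    with fewer than three columns are skipped.
    """
    symbols = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue
        db, object_id, symbol = cols[0], cols[1], cols[2]
        symbols[(db, object_id)].add(symbol)
    # Keep only the IDs that violate the 1:1 expectation.
    return {key: sorted(vals) for key, vals in symbols.items() if len(vals) > 1}
```

Passing an open file handle for gene_association.GeneDB_Spombe as `lines`
would, on the lines quoted above, report SPCC777.13 with the two symbols
SPCC777.13 and vps35. Checking the reverse direction (one symbol used for
several IDs) would need a second map keyed on the symbol.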
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.