[go] mapping between DB_Object_ID and DB_Object_Symbol
Gavin Sherlock
sherlock at genome.Stanford.EDU
Thu Aug 9 21:26:05 PDT 2007
Hi Mike,
It's an easy check to add. For a given file, where $databaseId and
$name are the DB_Object_ID and DB_Object_Symbol for the current line
respectively, something like:
if (exists($databaseId2StandardName{$databaseId})
    && $name ne $databaseId2StandardName{$databaseId}) {
    # the databaseId has more than one standard name in the file,
    # so say so and reject the file
} else {
    # process the line
    # now record that we saw it
    $databaseId2StandardName{$databaseId} = $name;
}
works just fine (and is essentially what my GO::TermFinder code
does). Probably the reverse check should be done too, i.e. that a
DB_Object_Symbol maps to only one DB_Object_ID.
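Both directions can be checked in a single pass. What follows is only a minimal sketch, assuming the (DB_Object_ID, DB_Object_Symbol) pairs have already been pulled out of columns 2 and 3 of each line. The sub name find_mapping_conflicts and the wording of the messages are made up for illustration and are not taken from GO::TermFinder:

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of [DB_Object_ID, DB_Object_Symbol]
# array refs and returns one message for each pair that breaks the
# 1:1 mapping in either direction.
sub find_mapping_conflicts {
    my @pairs = @_;
    my (%id2symbol, %symbol2id, @conflicts);

    foreach my $pair (@pairs) {
        my ($id, $symbol) = @{$pair};

        # forward check: an ID we have seen before must keep its symbol
        if (exists $id2symbol{$id} && $id2symbol{$id} ne $symbol) {
            push @conflicts,
                "ID $id has symbols $id2symbol{$id} and $symbol";
            next;
        }

        # reverse check: a symbol we have seen before must keep its ID
        if (exists $symbol2id{$symbol} && $symbol2id{$symbol} ne $id) {
            push @conflicts,
                "symbol $symbol has IDs $symbol2id{$symbol} and $id";
            next;
        }

        # consistent so far, so record the pair in both directions
        $id2symbol{$id}     = $symbol;
        $symbol2id{$symbol} = $id;
    }

    return @conflicts;
}

# Example using the pombe case from this thread
my @conflicts = find_mapping_conflicts(
    [ 'SPCC777.13', 'SPCC777.13' ],
    [ 'SPCC777.13', 'vps35' ],
);
print "$_\n" for @conflicts;
```

A checking script could reject the file whenever the returned list is non-empty. Only the first mapping seen for an ID or symbol is remembered, so an ID with three different symbols is reported twice, each time against the first symbol.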
If it is part of the spec (and is spelled out on the annotation file
format page, which it isn't currently), then I think files that don't
follow the rule should be rejected.
Cheers,
Gavin
On Aug 9, 2007, at 9:13 PM, Mike Cherry wrote:
> I think 1:1 is the intent.
>
> The documentation says the DB_Object_Symbol field has cardinality
> of 1. The checking script is looking for the pipe symbol in that
> field. It requires just one symbol, and will not allow zero or
> more than one symbol to be stated on a line.
>
> The checking script will not find multiple relationships if they
> are spread across multiple lines in the file. The filtering script
> is concerned with format of the information, cardinality, and some
> very basic things like whether an abbreviation is okay. As written
> it does not compare two lines within the file; it just checks each
> line independently.
>
> A check of the database could report these errors. There is an
> easy UNIX command method to check for this problem. For example
> with the pombe file, it's all one long command on one line:
>
> % gzcat gene_association.GeneDB_Spombe.gz | cut -f2,3 | sort -u | cut -f1 | sort | uniq -c | sort -rn | grep -v ' 1 '
>
> Any ID in the result that has a number greater than 1 is an ID that
> has more than 1 symbol associated somewhere within the gene
> association file. For the current pombe file that would be 388 of
> the 5073 gene IDs.
>
> Yes, the RGD (96), pseudocap (1) and WormBase (16) gene association
> files all have a few cases of this type of issue.
>
> -Mike
>
>
> On Aug 9, 2007, at 7:38 PM, Gavin Sherlock wrote:
>
>> Hi all,
>>
>> An issue came up with GO::TermFinder, because it chokes on files
>> where the relationship between DB_Object_ID and DB_Object_Symbol
>> is not 1:1, and there are a number of files that have for instance
>> a 1:2 relationship between these columns, e.g.:
>>
>> GeneDB_Spombe: SPCC777.13 maps to SPCC777.13, vps35
>> pseudocap: PA5429 maps to aspA, adhA
>> RGD: RGD:1359623 maps to Tuba4a, Tuba4
>> WB: WBGene00000386 maps to cdc-25.1, cdc25.1
>>
>> My question is: should this be a 1:1 relationship, with the
>> annotation file checking script rejecting files that deviate
>> from it (presumably these additional names would become synonyms
>> instead), or is a 1:2 or more relationship allowed between those
>> columns, in which case I'll have to modify GO::TermFinder
>> appropriately?
>>
>> As an additional data point, the pombe file actually lists both
>> SPCC777.13 and vps35 as synonyms for the gene too:
>>
>> whitbread 1001 % grep 'SPCC777.13' gene_association.GeneDB_Spombe
>> GeneDB_Spombe SPCC777.13 SPCC777.13 GO:0003674 GO_REF:0000015 ND F gene taxon:4896 20070711 GeneDB_Spombe
>> GeneDB_Spombe SPCC777.13 vps35 GO:0005768 PMID:16622069 IMP C retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20060424 GeneDB_Spombe
>> GeneDB_Spombe SPCC777.13 vps35 GO:0030904 PMID:16622069 IMP C retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20040625 GeneDB_Spombe
>> GeneDB_Spombe SPCC777.13 vps35 GO:0030904 PMID:16622069 ISS SGD:S000003690 C retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20040625 GeneDB_Spombe
>> GeneDB_Spombe SPCC777.13 vps35 GO:0006886 PMID:16622069 IMP P retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20040625 GeneDB_Spombe
>> GeneDB_Spombe SPCC777.13 vps35 GO:0042147 PMID:16622069 IMP P retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20060424 GeneDB_Spombe
>> GeneDB_Spombe SPCC777.13 vps35 GO:0030437 PMID:15189449 IMP P retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20040625 GeneDB_Spombe
>> GeneDB_Spombe SPCC777.13 vps35 GO:0005829 PMID:16823372 IDA C retromer complex subunit Vps35 SPCC777.13|vps35 gene taxon:4896 20060724 GeneDB_Spombe
>>
>> - is there a rule (I couldn't find one) that says the synonyms
>> should not repeat the DB_Object_ID and DB_Object_Symbol, or should
>> there be? Would it save any space in the file sizes?
>>
>> Cheers,
>> Gavin
>> ________________________________________________________
>>
>> Gavin Sherlock
>> Dept. of Genetics
>> S201A, Grant Building,
>> Stanford University Medical School,
>> Stanford,
>> CA 94305-5120
>>
>> Tel: 650 498 6012
>> Fax: 650 724 3701
>>
>>