[go] mapping between DB_Object_ID and DB_Object_Symbol

Gavin Sherlock sherlock at genome.Stanford.EDU
Thu Aug 9 21:26:05 PDT 2007


Hi Mike,

It's an easy check to add.  For a given file, where $databaseId and  
$name are the DB_Object_ID and DB_Object_Symbol for the current line  
respectively, something like:

if (exists ($databaseId2StandardName{$databaseId}) &&
    $name ne $databaseId2StandardName{$databaseId}){

	# do something to say that the databaseId has more than one standard
	# name in the file, and thus reject it

}else{

	# process

	# now record that we saw it

	$databaseId2StandardName{$databaseId} = $name;

}

works just fine (and is essentially what my GO::TermFinder code
does).  Probably the reverse check should be done too, i.e. checking
that a DB_Object_Symbol maps to only one DB_Object_ID.
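
A minimal sketch of that reverse check, written in the style of the UNIX pipeline Mike gives in the quoted message below. It assumes an uncompressed, tab-delimited gene association file with DB_Object_ID in column 2 and DB_Object_Symbol in column 3, and comment lines starting with "!". The function name symbols_with_multiple_ids is hypothetical and is not part of GO::TermFinder or the checking script.

```shell
# Hypothetical helper: print every DB_Object_Symbol (column 3) that is
# paired with more than one DB_Object_ID (column 2) in the given file.
symbols_with_multiple_ids() {
    grep -v '^!' "$1" |   # drop comment lines
    cut -f2,3 |           # keep only the ID and symbol columns
    sort -u |             # one line per distinct (ID, symbol) pair
    cut -f2 |             # keep just the symbol
    sort |
    uniq -d               # symbols that appear with two or more IDs
}
```

Calling symbols_with_multiple_ids on a gene association file prints nothing when every symbol maps to a single ID. Changing the last cut to -f1 gives the forward check instead, i.e. IDs with more than one symbol.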

If it is part of the spec (and is spelled out on the annotation file  
format page, which it isn't currently), then I think files that don't  
follow the rule should be rejected.

Cheers,
Gavin

On Aug 9, 2007, at 9:13 PM, Mike Cherry wrote:

> I think 1:1 is the intent.
>
> The documentation says the DB_Object_Symbol field has cardinality  
> of 1.  The checking script is looking for the pipe symbol in that  
> field.  It requires just one symbol, and will not allow zero or  
> more than one symbol to be stated on a line.
>
> The checking script will not find multiple relationships if they  
> are spread across multiple lines in the file.  The filtering script  
> is concerned with format of the information, cardinality, and some  
> very basic things like is an abbreviation okay.  As written it does
> not compare two lines within the file; it just checks each line
> independently.
>
> A check of the database could report these errors.  There is an  
> easy UNIX command method to check for this problem.  For example  
> with the pombe file, it's all one long command on one line:
>
> % gzcat gene_association.GeneDB_Spombe.gz | cut -f2,3 | sort -u | cut -f1 | sort | uniq -c | sort -rn | grep -v '  1 '
>
> Any ID in the result that has a number greater than 1 is an ID that  
> has more than 1 symbol associated somewhere within the gene  
> association file.  For the current pombe file that would be 388 of  
> the 5073 gene IDs.
>
> Yes the RGD (96), pseudocap (1) and WormBase (16) gene association  
> files all have a few of this type of issue.
>
> -Mike
>
>
> On Aug 9, 2007, at 7:38 PM, Gavin Sherlock wrote:
>
>> Hi all,
>>
>> An issue came up with GO::TermFinder, because it chokes on files  
>> where the relationship between DB_Object_ID and DB_Object_Symbol  
>> is not 1:1, and there are a number of files that have for instance  
>> a 1:2 relationship between these columns, e.g.:
>>
>> GeneDB_Spombe: SPCC777.13 maps to SPCC777.13, vps35
>> pseudocap: PA5429 maps to aspA, adhA
>> RGD: RGD:1359623 maps to Tuba4a, Tuba4
>> WB: WBGene00000386 maps to cdc-25.1, cdc25.1
>>
>> My question is, should this be a 1:1 relationship, and the  
>> annotation files checking script needs to reject files that  
>> deviate from that (presumably these additional names would become  
>> synonyms instead), or is a 1:2 or more relationship allowed  
>> between those columns, in which case, I'll have to modify  
>> GO::TermFinder appropriately.
>>
>> As an additional data point, the pombe file actually lists both  
>> SPCC777.13 and vps35 as synonyms for the gene too :
>>
>> whitbread 1001 % grep 'SPCC777.13' gene_association.GeneDB_Spombe
>> GeneDB_Spombe   SPCC777.13      SPCC777.13              GO:0003674      GO_REF:0000015  ND                              F                                                               gene    taxon:4896      20070711        GeneDB_Spombe
>> GeneDB_Spombe   SPCC777.13      vps35                   GO:0005768      PMID:16622069   IMP                             C       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20060424        GeneDB_Spombe
>> GeneDB_Spombe   SPCC777.13      vps35                   GO:0030904      PMID:16622069   IMP                             C       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20040625        GeneDB_Spombe
>> GeneDB_Spombe   SPCC777.13      vps35                   GO:0030904      PMID:16622069   ISS     SGD:S000003690          C       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20040625        GeneDB_Spombe
>> GeneDB_Spombe   SPCC777.13      vps35                   GO:0006886      PMID:16622069   IMP                             P       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20040625        GeneDB_Spombe
>> GeneDB_Spombe   SPCC777.13      vps35                   GO:0042147      PMID:16622069   IMP                             P       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20060424        GeneDB_Spombe
>> GeneDB_Spombe   SPCC777.13      vps35                   GO:0030437      PMID:15189449   IMP                             P       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20040625        GeneDB_Spombe
>> GeneDB_Spombe   SPCC777.13      vps35                   GO:0005829      PMID:16823372   IDA                             C       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20060724        GeneDB_Spombe
>>
>> - is there a rule (I couldn't find one) that says the synonyms  
>> should not repeat the DB_Object_ID and DB_Object_Symbol, or should  
>> there be?  Would it save any space in the file sizes?
>>
>> Cheers,
>> Gavin
>> ________________________________________________________
>>
>> Gavin Sherlock
>> Dept. of Genetics
>> S201A, Grant Building,
>> Stanford University Medical School,
>> Stanford,
>> CA 94305-5120
>>
>> Tel: 650 498 6012
>> Fax: 650 724 3701
>>
>>



