[go] mapping between DB_Object_ID and DB_Object_Symbol

Mike Cherry cherry at stanford.edu
Thu Aug 9 21:13:57 PDT 2007


I think 1:1 is the intent.

The documentation says the DB_Object_Symbol field has a cardinality of  
1.  The checking script looks for the pipe symbol in that field.  It  
requires exactly one symbol, and will not allow zero or more than one  
symbol to be stated on a line.

The checking script will not find multiple relationships if they are  
spread across multiple lines in the file.  The filtering script is  
concerned with the format of the information, cardinality, and some  
very basic things such as whether an abbreviation is okay.  As written  
it does not compare two lines within the file; it just checks each  
line independently.
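
To make the per-line behaviour concrete, here is a minimal Python sketch of that kind of check.  It is not the actual filtering script: the function name, the messages and the sample rows are hypothetical, and it assumes tab-delimited lines with DB_Object_Symbol in the third column and '!' marking comment lines.

```python
# Hypothetical per-line check; NOT the actual GO filtering script.
SYMBOL_COL = 2  # DB_Object_Symbol is the third column (0-based index 2)


def symbol_errors(lines):
    """Yield (line_number, message) for lines whose symbol field is
    empty or contains a pipe.  Each line is judged on its own."""
    for n, line in enumerate(lines, start=1):
        if line.startswith("!"):  # comment/header line, skip
            continue
        fields = line.rstrip("\n").split("\t")
        symbol = fields[SYMBOL_COL] if len(fields) > SYMBOL_COL else ""
        if not symbol.strip():
            yield n, "DB_Object_Symbol is empty"
        elif "|" in symbol:
            yield n, "DB_Object_Symbol holds more than one symbol"


# Illustrative rows only (first three columns), not real file content.
sample = [
    "GeneDB_Spombe\tSPCC777.13\tSPCC777.13\n",
    "GeneDB_Spombe\tSPCC777.13\tvps35\n",
    "GeneDB_Spombe\tSPCC777.13\tSPCC777.13|vps35\n",
]

for line_number, message in symbol_errors(sample):
    print(line_number, message)
```

Only the third sample row is flagged.  The first two rows pass, even though together they give one ID two different symbols, which is exactly the case a line-by-line check cannot see.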

A check of the database could report these errors.  There is also an  
easy way to check for this problem with standard UNIX commands.  For  
example, with the pombe file (it's all one long command on one line):

% gzcat gene_association.GeneDB_Spombe.gz | cut -f2,3 | sort -u | cut -f1 | sort | uniq -c | sort -rn | grep -v '  1 '

Each line of the result is a count followed by an ID.  Any ID with a  
count greater than 1 has more than one symbol associated with it  
somewhere within the gene association file.  For the current pombe  
file that would be 388 of the 5073 gene IDs.
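
For anyone who would rather do this in a script than on the command line, the same cross-line comparison can be sketched in Python.  This is only an illustration under the same assumptions as the pipeline (tab-delimited, ID in column 2, symbol in column 3); the function names and sample rows are made up, and lines starting with '!' are skipped as comments.

```python
import gzip
from collections import defaultdict


def ids_with_multiple_symbols(lines):
    """Map each DB_Object_ID (column 2) to its symbols (column 3) and
    return only the IDs that have more than one distinct symbol."""
    symbols = defaultdict(set)
    for line in lines:
        if line.startswith("!"):  # comment/header line, skip
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:  # too short to hold an ID and a symbol
            continue
        symbols[fields[1]].add(fields[2])
    return {obj_id: sorted(s) for obj_id, s in symbols.items() if len(s) > 1}


def check_file(path):
    """Run the check on a gzipped gene association file."""
    with gzip.open(path, "rt") as handle:
        return ids_with_multiple_symbols(handle)


# Illustrative rows only (first three columns), not real file content.
sample = [
    "GeneDB_Spombe\tSPCC777.13\tSPCC777.13\n",
    "GeneDB_Spombe\tSPCC777.13\tvps35\n",
    "GeneDB_Spombe\tSPCC777.13\tvps35\n",
    "GeneDB_Spombe\tHYPOTHETICAL.01\tabc1\n",
]

for obj_id, syms in ids_with_multiple_symbols(sample).items():
    print(obj_id, ", ".join(syms))
```

Unlike the pipeline, this reports the conflicting symbols alongside each ID, which makes it easier to decide which one should become a synonym.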

Yes, the RGD (96), pseudocap (1) and WormBase (16) gene association  
files all have a few issues of this type.

-Mike


On Aug 9, 2007, at 7:38 PM, Gavin Sherlock wrote:

> Hi all,
>
> An issue came up with GO::TermFinder, because it chokes on files  
> where the relationship between DB_Object_ID and DB_Object_Symbol is  
> not 1:1, and there are a number of files that have for instance a  
> 1:2 relationship between these columns, e.g.:
>
> GeneDB_Spombe: SPCC777.13 maps to SPCC777.13, vps35
> pseudocap: PA5429 maps to aspA, adhA
> RGD: RGD:1359623 maps to Tuba4a, Tuba4
> WB: WBGene00000386 maps to cdc-25.1, cdc25.1
>
> My question is, should this be a 1:1 relationship, and the  
> annotation files checking script needs to reject files that deviate  
> from that (presumably these additional names would become synonyms  
> instead), or is a 1:2 or more relationship allowed between those  
> columns, in which case, I'll have to modify GO::TermFinder  
> appropriately.
>
> As an additional data point, the pombe file actually lists both  
> SPCC777.13 and vps35 as synonyms for the gene too:
>
> whitbread 1001 % grep 'SPCC777.13' gene_association.GeneDB_Spombe
> GeneDB_Spombe   SPCC777.13      SPCC777.13              GO: 
> 0003674      GO_REF:0000015  ND               
> F                       gene    taxon:4896      20070711GeneDB_Spombe
> GeneDB_Spombe   SPCC777.13      vps35           GO:0005768       
> PMID:16622069  IMP              C       retromer complex subunit  
> Vps35  SPCC777.13|vps35       gene     taxon:4896       
> 20060424        GeneDB_Spombe
> GeneDB_Spombe   SPCC777.13      vps35           GO:0030904       
> PMID:16622069  IMP              C       retromer complex subunit  
> Vps35  SPCC777.13|vps35       gene     taxon:4896       
> 20040625        GeneDB_Spombe
> GeneDB_Spombe   SPCC777.13      vps35           GO:0030904       
> PMID:16622069  ISS      SGD:S000003690  C       retromer complex  
> subunit Vps35  SPCC777.13|vps35       gene    taxon:4896       
> 20040625        GeneDB_Spombe
> GeneDB_Spombe   SPCC777.13      vps35           GO:0006886       
> PMID:16622069  IMP              P       retromer complex subunit  
> Vps35  SPCC777.13|vps35       gene     taxon:4896       
> 20040625        GeneDB_Spombe
> GeneDB_Spombe   SPCC777.13      vps35           GO:0042147       
> PMID:16622069  IMP              P       retromer complex subunit  
> Vps35  SPCC777.13|vps35       gene     taxon:4896       
> 20060424        GeneDB_Spombe
> GeneDB_Spombe   SPCC777.13      vps35           GO:0030437       
> PMID:15189449  IMP              P       retromer complex subunit  
> Vps35  SPCC777.13|vps35       gene     taxon:4896       
> 20040625        GeneDB_Spombe
> GeneDB_Spombe   SPCC777.13      vps35           GO:0005829       
> PMID:16823372  IDA              C       retromer complex subunit  
> Vps35  SPCC777.13|vps35       gene     taxon:4896       
> 20060724        GeneDB_Spombe
>
> - is there a rule (I couldn't find one) that says the synonyms  
> should not repeat the DB_Object_ID and DB_Object_Symbol, or should  
> there be?  Would it save any space in the file sizes?
>
> Cheers,
> Gavin
> ________________________________________________________
>
> Gavin Sherlock
> Dept. of Genetics
> S201A, Grant Building,
> Stanford University Medical School,
> Stanford,
> CA 94305-5120
>
> Tel: 650 498 6012
> Fax: 650 724 3701
>
>



