[go] mapping between DB_Object_ID and DB_Object_Symbol

Gavin Sherlock sherlock at genome.Stanford.EDU
Mon Aug 13 13:47:42 PDT 2007


Hi Mike,

In this case, the check is easy to implement in the filtering script,  
rather than as a post database loading check.  I think it would be  
much easier to deal with the file and reject it, rather than having  
to roll back the database to a previous successfully loaded file if a  
database check throws an error after loading.
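
To make the order of operations concrete, here is a minimal sketch in Python (for illustration only; the filtering script itself is not shown in this thread). The names `first_conflict` and `load_if_consistent` and the `loader` callback are hypothetical, and the sketch assumes DB_Object_ID and DB_Object_Symbol are the second and third tab-separated columns, with "!" marking comment lines:

```python
def first_conflict(lines):
    """Return (db_object_id, first_symbol, other_symbol) for the first ID seen
    with two different symbols, or None if every ID has a single symbol."""
    symbol_for_id = {}
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue  # malformed row; left to the per-line format checks
        # Assumption: column 2 is DB_Object_ID, column 3 is DB_Object_Symbol.
        db_object_id, symbol = cols[1], cols[2]
        seen = symbol_for_id.setdefault(db_object_id, symbol)
        if seen != symbol:
            return (db_object_id, seen, symbol)
    return None


def load_if_consistent(lines, loader):
    """Check the whole file first; call loader(lines) only if it is clean.

    `loader` is a hypothetical callback standing in for the database load.
    Returns True if the file was loaded, False if it was rejected."""
    lines = list(lines)  # keep the rows so they can be read twice
    if first_conflict(lines) is not None:
        return False  # rejected before touching the database
    loader(lines)
    return True
```

Because the check runs over the whole file before `loader` is called, a bad file never reaches the database and no rollback is needed.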

Cheers,
Gavin

On Aug 10, 2007, at 1:31 PM, Mike Cherry wrote:

> Several times we have discussed adding checks on the database, i.e.
> after the data is loaded.  I guess that would create a warning  
> because the data was already loaded, as opposed to an error where  
> the data is rejected.
>
> I'm off for the next three weeks so will not work on this until  
> September.
>
> Questions to the group:
>
> 1. Are there other data checks that need to be implemented at the
> same time?
> 2. What should happen when an "error" is found?  Remove all rows
> with the gene id in question, or just all the rows that don't have
> the same symbol as the first one observed?
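
The two options in question 2 could be sketched as follows, in Python purely for illustration. The function names and the (db_object_id, symbol, rest) tuple used to represent a row are assumptions made for this example, not part of any existing script:

```python
from collections import defaultdict


def drop_all_rows_for_conflicting_ids(rows):
    """Option A: remove every row whose ID appears with more than one symbol."""
    symbols = defaultdict(set)
    for db_object_id, symbol, _rest in rows:
        symbols[db_object_id].add(symbol)
    return [row for row in rows if len(symbols[row[0]]) == 1]


def keep_rows_matching_first_symbol(rows):
    """Option B: keep only rows whose symbol matches the first one seen
    for that ID (so the result depends on row order)."""
    first_symbol = {}
    kept = []
    for row in rows:
        db_object_id, symbol = row[0], row[1]
        if first_symbol.setdefault(db_object_id, symbol) == symbol:
            kept.append(row)
    return kept
```

Option B is order-dependent: whichever symbol happens to come first in the file wins, which may or may not be the preferred name.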
>
> -Mike
>
>
> On Aug 9, 2007, at 9:26 PM, Gavin Sherlock wrote:
>
>> Hi Mike,
>>
>> It's an easy check to add.  For a given file, where $databaseId  
>> and $name are the DB_Object_ID and DB_Object_Symbol for the  
>> current line respectively, something like:
>>
>> if (exists($databaseId2StandardName{$databaseId})
>>     && $name ne $databaseId2StandardName{$databaseId}) {
>>
>> 	# do something to say that the databaseId has more than one
>> 	# standard name in the file, and thus reject it
>>
>> } else {
>>
>> 	# process
>>
>> 	# now record that we saw it
>>
>> 	$databaseId2StandardName{$databaseId} = $name;
>>
>> }
>>
>> works just fine (and is essentially what my GO::TermFinder code
>> does).  Probably the reverse check should be done too - i.e. a
>> DB_Object_Symbol maps to only one DB_Object_ID.
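
A minimal sketch of checking both directions in one pass, in Python rather than Perl and purely for illustration. `check_pairs` and its message format are made up for this example and are not GO::TermFinder code:

```python
def check_pairs(pairs):
    """Yield a message for each (db_object_id, symbol) pair that breaks 1:1.

    `pairs` is any iterable of (db_object_id, symbol) tuples in file order.
    Assumption: the first symbol seen for an ID (and the first ID seen for a
    symbol) is treated as the reference value, as in the Perl snippet."""
    symbol_for_id = {}
    id_for_symbol = {}
    for db_object_id, symbol in pairs:
        # Forward check: one ID should have only one symbol.
        known_symbol = symbol_for_id.setdefault(db_object_id, symbol)
        if known_symbol != symbol:
            yield f"{db_object_id} has symbols {known_symbol} and {symbol}"
        # Reverse check: one symbol should belong to only one ID.
        known_id = id_for_symbol.setdefault(symbol, db_object_id)
        if known_id != db_object_id:
            yield f"{symbol} is used by IDs {known_id} and {db_object_id}"
```

An empty result means the file is 1:1 in both directions; any message would be grounds for rejecting it.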
>>
>> If it is part of the spec (and is spelled out on the annotation  
>> file format page, which it isn't currently), then I think files  
>> that don't follow the rule should be rejected.
>>
>> Cheers,
>> Gavin
>>
>> On Aug 9, 2007, at 9:13 PM, Mike Cherry wrote:
>>
>>> I think 1:1 is the intent.
>>>
>>> The documentation says the DB_Object_Symbol field has cardinality  
>>> of 1.  The checking script is looking for the pipe symbol in that  
>>> field.  It requires just one symbol, and will not allow zero or
>>> more than one symbol to be stated on a line.
>>>
>>> The checking script will not find multiple relationships if they  
>>> are spread across multiple lines in the file.  The filtering  
>>> script is concerned with format of the information, cardinality,  
>>> and some very basic things, like whether an abbreviation is okay.
>>> As written it does not compare two lines within the file; it just
>>> checks each line independently.
>>>
>>> A check of the database could report these errors.  There is an  
>>> easy UNIX command method to check for this problem.  For example  
>>> with the pombe file, it's all one long command on one line:
>>>
>>> % gzcat gene_association.GeneDB_Spombe.gz | cut -f2,3 | sort -u |  
>>> cut -f1 | sort | uniq -c | sort -rn | grep -v '  1 '
>>>
>>> Any ID in the result that has a number greater than 1 is an ID  
>>> that has more than 1 symbol associated somewhere within the gene  
>>> association file.  For the current pombe file that would be 388  
>>> of the 5073 gene IDs.
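
Roughly what the pipeline computes, sketched in Python for illustration. The function names are made up, and the file is assumed to be tab-separated with DB_Object_ID and DB_Object_Symbol in columns 2 and 3, matching `cut -f2,3`:

```python
import gzip
from collections import defaultdict


def symbols_per_id(lines):
    """Return [(count, db_object_id), ...] for IDs with more than one distinct
    symbol, highest count first (like `sort -rn` on the `uniq -c` output)."""
    symbols = defaultdict(set)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            # Distinct (ID, symbol) pairs: the `cut -f2,3 | sort -u` step.
            symbols[cols[1]].add(cols[2])
    # Count symbols per ID and drop the IDs that have exactly one.
    counts = [(len(s), i) for i, s in symbols.items() if len(s) > 1]
    return sorted(counts, reverse=True)


def symbols_per_id_in_gzip(path):
    """Convenience wrapper for a gzipped gene association file."""
    with gzip.open(path, "rt") as handle:
        return symbols_per_id(handle)
```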
>>>
>>> Yes, the RGD (96), pseudocap (1) and WormBase (16) gene
>>> association files all have a few issues of this type.
>>>
>>> -Mike
>>>
>>>
>>> On Aug 9, 2007, at 7:38 PM, Gavin Sherlock wrote:
>>>
>>>> Hi all,
>>>>
>>>> An issue came up with GO::TermFinder, because it chokes on files  
>>>> where the relationship between DB_Object_ID and DB_Object_Symbol  
>>>> is not 1:1, and there are a number of files that have for  
>>>> instance a 1:2 relationship between these columns, e.g.:
>>>>
>>>> GeneDB_Spombe: SPCC777.13 maps to SPCC777.13, vps35
>>>> pseudocap: PA5429 maps to aspA, adhA
>>>> RGD: RGD:1359623 maps to Tuba4a, Tuba4
>>>> WB: WBGene00000386 maps to cdc-25.1, cdc25.1
>>>>
>>>> My question is: should this be a 1:1 relationship, with the
>>>> annotation file checking script rejecting files that deviate
>>>> from it (presumably these additional names would become
>>>> synonyms instead), or is a 1:2 or more relationship allowed
>>>> between those columns, in which case I'll have to modify
>>>> GO::TermFinder appropriately?
>>>>
>>>> As an additional data point, the pombe file actually lists both
>>>> SPCC777.13 and vps35 as synonyms for the gene too:
>>>>
>>>> whitbread 1001 % grep 'SPCC777.13' gene_association.GeneDB_Spombe
>>>> GeneDB_Spombe   SPCC777.13      SPCC777.13              GO:0003674      GO_REF:0000015  ND              F                       gene    taxon:4896      20070711        GeneDB_Spombe
>>>> GeneDB_Spombe   SPCC777.13      vps35           GO:0005768      PMID:16622069   IMP             C       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20060424        GeneDB_Spombe
>>>> GeneDB_Spombe   SPCC777.13      vps35           GO:0030904      PMID:16622069   IMP             C       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20040625        GeneDB_Spombe
>>>> GeneDB_Spombe   SPCC777.13      vps35           GO:0030904      PMID:16622069   ISS     SGD:S000003690  C       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20040625        GeneDB_Spombe
>>>> GeneDB_Spombe   SPCC777.13      vps35           GO:0006886      PMID:16622069   IMP             P       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20040625        GeneDB_Spombe
>>>> GeneDB_Spombe   SPCC777.13      vps35           GO:0042147      PMID:16622069   IMP             P       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20060424        GeneDB_Spombe
>>>> GeneDB_Spombe   SPCC777.13      vps35           GO:0030437      PMID:15189449   IMP             P       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20040625        GeneDB_Spombe
>>>> GeneDB_Spombe   SPCC777.13      vps35           GO:0005829      PMID:16823372   IDA             C       retromer complex subunit Vps35  SPCC777.13|vps35        gene    taxon:4896      20060724        GeneDB_Spombe
>>>>
>>>> - is there a rule (I couldn't find one) that says the synonyms  
>>>> should not repeat the DB_Object_ID and DB_Object_Symbol, or  
>>>> should there be?  Would it save any space in the file sizes?
>>>>
>>>> Cheers,
>>>> Gavin
>>>> ________________________________________________________
>>>>
>>>> Gavin Sherlock
>>>> Dept. of Genetics
>>>> S201A, Grant Building,
>>>> Stanford University Medical School,
>>>> Stanford,
>>>> CA 94305-5120
>>>>
>>>> Tel: 650 498 6012
>>>> Fax: 650 724 3701
>>>>
>>>>



