[go] additional checks for gene association files

Mike Cherry cherry at stanford.edu
Tue Sep 18 14:04:30 PDT 2007


I have an update ready for the script that checks the syntax of the
submitted gene association files.  Each of the new checks listed below
has either been discussed and approved, or drew no comment.  I'll wait
until after the GOC meetings next week to put the update into
production.  Please let me know if you have any problems with these
checks.

-Mike


1. If a database name is included in DB_OBJECT_ID, it must be a valid
name found in the go/doc/GO.xrf_abbs file.  This check should have been
in place since last year, but I found a bug that caused it to never
run.  Currently none of the gene association files have this type of
problem.

2. Check for double colons ('::') in the DB_OBJECT_ID, GOID, REFERENCE,
WITH and TAXON ID fields.  If a double colon is found, that line is
not included in the filtered output and an error message is created.
There are only two of these errors in the current files, one in
GeneDB_Tbrucei and the other in RGD.

3. Check for multiple DB_OBJECT_SYMBOLs associated with a
DB_OBJECT_ID.  This error was reported by Gavin Sherlock and
discussed in late July and early August.  There was no comment, so I'm
assuming everyone agrees this is an error.  The checking script will
allow only one symbol to be associated with an ID.  If a second symbol
is found, the lines containing the second (or third, ...) symbol will
not be included in the filtered file, and an error is added to the
report.  At the moment there are errors of this type in the RGD and
WB files.  There were errors in the pseudocap file, but I've fixed
those, as that file is not active.
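
For anyone who wants to pre-check a file before submitting it, the
three checks above can be roughly sketched as follows.  This is only an
illustration in Python, not the production script.  The function name
check_lines, the 0-based column positions, the error wording, and the
idea of passing in the valid database names as a ready-made set (rather
than parsing go/doc/GO.xrf_abbs) are all assumptions made for the
example.

```python
# Assumed 0-based positions of the relevant columns in a tab-delimited
# gene association line. Adjust if your files are laid out differently.
DB_OBJECT_ID = 1
DB_OBJECT_SYMBOL = 2
COLON_CHECKED = {
    "DB_OBJECT_ID": 1,
    "GOID": 4,
    "REFERENCE": 5,
    "WITH": 7,
    "TAXON": 12,
}


def check_lines(lines, valid_db_names):
    """Apply the three checks; return (kept_lines, error_messages)."""
    kept, errors = [], []
    first_symbol = {}  # DB_OBJECT_ID -> first symbol seen for that ID

    for number, line in enumerate(lines, start=1):
        if line.startswith("!"):
            kept.append(line)  # comment lines pass through untouched
            continue

        fields = line.rstrip("\n").split("\t")
        # Pad short lines so indexing is safe; validating the column
        # count itself is outside the scope of this sketch.
        fields += [""] * (max(COLON_CHECKED.values()) + 1 - len(fields))
        problems = []

        # Check 1: a database name in DB_OBJECT_ID must be a known name.
        object_id = fields[DB_OBJECT_ID]
        prefix, sep, _ = object_id.partition(":")
        if sep and prefix not in valid_db_names:
            problems.append(f"unknown database name '{prefix}' in DB_OBJECT_ID")

        # Check 2: no double colons in the listed fields.
        for name, index in COLON_CHECKED.items():
            if "::" in fields[index]:
                problems.append(f"'::' in {name}")

        # Check 3: only the first symbol seen for an ID is allowed.
        symbol = fields[DB_OBJECT_SYMBOL]
        expected = first_symbol.setdefault(object_id, symbol)
        if symbol != expected:
            problems.append(
                f"symbol '{symbol}' conflicts with '{expected}' for {object_id}"
            )

        if problems:
            errors.append(f"line {number}: " + "; ".join(problems))
        else:
            kept.append(line)

    return kept, errors
```

Calling check_lines(open(path), names) on a file returns the lines that
would survive filtering plus one message per rejected line.  In this
sketch the first symbol encountered for an ID is the one that is kept,
which matches the description above but may differ in detail from how
the real script orders its checks.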




