[go] additional checks for gene association files
Mike Cherry
cherry at stanford.edu
Tue Sep 18 14:04:30 PDT 2007
I have an update ready for the script that checks the syntax of
submitted gene association files. Each of the changes below has either
been discussed and approved or drew no comment. I'll wait until after
the GOC meetings next week to put this into production. Please let me
know if you have any problems with these.
-Mike
1. If a database name is included in DB_OBJECT_ID, it must be a valid
name found in the go/doc/GO.xrf_abbs file. This check should have been
in place since last year, but I found a bug that kept it from ever
running. Currently none of the gene association files has this type of
problem.
2. Check for double colons ('::') in the DB_OBJECT_ID, GOID, REFERENCE,
WITH and TAXON ID fields. If a double colon is found, that line is
left out of the filtered output and an error message is created.
There are only two of these errors in the current files, one in
GeneDB_Tbrucei and the other in RGD.
3. Check for multiple DB_OBJECT_SYMBOLs associated with a single
DB_OBJECT_ID. This error was reported by Gavin Sherlock and
discussed in late July and early August. There was no comment, so I'm
assuming everyone agrees this is an error. The checking script will
allow only one symbol to be associated with an ID. If a second symbol
is found, the lines containing the second (or third, ...) symbol will
not be included in the filtered file, and an error is added to the
report. At the moment there are errors of this type in the RGD and
WB files. There were errors in the pseudocap file as well, but I've
fixed those because that file is not active.
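
As a rough illustration of the three checks above (this is not the
actual checking script), here is a minimal Python sketch. Several
things in it are assumptions made for the example: the function name
filter_lines, the 0-based column positions (taken from the usual
15-column tab-delimited gene association layout), passing the valid
database names in as a set instead of parsing go/doc/GO.xrf_abbs, an
exact case-sensitive match on the database name, and the order in
which the checks run. The post does not say whether a line failing
check 1 is dropped from the filtered output; the sketch drops it.

```python
# Hypothetical sketch of the three checks; not the production script.
# Column indices (0-based) are assumed from the 15-column tab-delimited
# gene association file layout.
DB_OBJECT_ID, DB_OBJECT_SYMBOL, GOID, REFERENCE, WITH, TAXON = 1, 2, 4, 5, 7, 12

DOUBLE_COLON_FIELDS = {
    "DB_OBJECT_ID": DB_OBJECT_ID,
    "GOID": GOID,
    "REFERENCE": REFERENCE,
    "WITH": WITH,
    "TAXON": TAXON,
}


def filter_lines(lines, valid_db_names):
    """Return (kept_lines, errors); errors is a list of (line_number, message)."""
    kept, errors = [], []
    first_symbol = {}  # DB_OBJECT_ID -> first DB_OBJECT_SYMBOL seen for it

    for lineno, line in enumerate(lines, start=1):
        if line.startswith("!"):  # comment/header lines pass through unchanged
            kept.append(line)
            continue

        cols = line.rstrip("\n").split("\t")
        if len(cols) <= TAXON:
            # Guard added for the sketch only; not one of the three checks.
            errors.append((lineno, "too few columns"))
            continue

        # Check 2: double colons in any of the listed fields.
        bad = [name for name, idx in DOUBLE_COLON_FIELDS.items() if "::" in cols[idx]]
        if bad:
            errors.append((lineno, "double colon in " + ", ".join(bad)))
            continue

        # Check 1: a database name in DB_OBJECT_ID must be a known abbreviation.
        obj_id = cols[DB_OBJECT_ID]
        if ":" in obj_id:
            prefix = obj_id.split(":", 1)[0]
            if prefix not in valid_db_names:
                errors.append((lineno, f"unknown database name '{prefix}' in DB_OBJECT_ID"))
                continue  # assumption: the line is dropped

        # Check 3: only the first symbol seen for an ID is allowed.
        symbol = cols[DB_OBJECT_SYMBOL]
        expected = first_symbol.setdefault(obj_id, symbol)
        if symbol != expected:
            errors.append((lineno, f"symbol '{symbol}' conflicts with '{expected}' for {obj_id}"))
            continue

        kept.append(line)

    return kept, errors
```

To try it on a file, something like
kept, errors = filter_lines(open(path), valid_names) should work,
where path and valid_names are whatever file and set of abbreviations
you supply. Lines rejected by checks 1 or 2 do not register a symbol
for check 3, which is a choice made for the sketch.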