[go] Re: dictyBase gp2protein file

Mike Cherry cherry at stanford.edu
Fri Jan 18 16:14:48 PST 2008


Pascale,

** every group should check the list below **

I've figured out this problem.  The double entry has nothing to do
with the gp2protein file.  The problem was the filtering script.  This
script has a list of taxonomy IDs, and it excludes entries for a
listed taxon unless they come from the particular source assigned to
that taxon.  The current list is below so others can check that
everything is still up to date.

The Q54IT9 entry was getting into the database from the goa_uniprot
file because that file gives its taxonomy ID as taxon:352472
(Dictyostelium discoideum AX4).  I only had taxon:44689 (Dictyostelium
discoideum) and taxon:5782 (Dictyostelium) tagged for dictyBase, so
the AX4 entry was not filtered out.

The script requires every taxonomy ID to be explicitly stated.  I've  
updated the script and am reprocessing the goa_uniprot file now.

-Mike

     'taxon:5476'=>'cgd',
     'taxon:352472'=>'dictyBase',
     'taxon:44689'=>'dictyBase',
     'taxon:5782'=>'dictyBase',
     'taxon:7227'=>'fb',
     'taxon:5664'=>'GeneDB_Lmajor',
     'taxon:5833'=>'GeneDB_Pfalciparum',
     'taxon:4896'=>'GeneDB_Spombe',
     'taxon:185431'=>'GeneDB_Tbrucei',
     'taxon:37546'=>'GeneDB_tsetse',
     'taxon:9031'=>'goa_chicken',
     'taxon:9913'=>'goa_cow',
     'taxon:9606'=>'goa_human',
     'taxon:110450'=>'gramene_oryza',
     'taxon:110451'=>'gramene_oryza',
     'taxon:127571'=>'gramene_oryza',
     'taxon:29689'=>'gramene_oryza',
     'taxon:29690'=>'gramene_oryza',
     'taxon:364099'=>'gramene_oryza',
     'taxon:364100'=>'gramene_oryza',
     'taxon:39946'=>'gramene_oryza',
     'taxon:39947'=>'gramene_oryza',
     'taxon:40148'=>'gramene_oryza',
     'taxon:40149'=>'gramene_oryza',
     'taxon:4528'=>'gramene_oryza',
     'taxon:4529'=>'gramene_oryza',
     'taxon:4530'=>'gramene_oryza',
     'taxon:4532'=>'gramene_oryza',
     'taxon:4533'=>'gramene_oryza',
     'taxon:4534'=>'gramene_oryza',
     'taxon:4535'=>'gramene_oryza',
     'taxon:4536'=>'gramene_oryza',
     'taxon:4537'=>'gramene_oryza',
     'taxon:4538'=>'gramene_oryza',
     'taxon:4539'=>'gramene_oryza',
     'taxon:52545'=>'gramene_oryza',
     'taxon:63629'=>'gramene_oryza',
     'taxon:65489'=>'gramene_oryza',
     'taxon:65491'=>'gramene_oryza',
     'taxon:77588'=>'gramene_oryza',
     'taxon:83307'=>'gramene_oryza',
     'taxon:83308'=>'gramene_oryza',
     'taxon:83309'=>'gramene_oryza',
     'taxon:10090'=>'mgi',
     'taxon:10116'=>'rgd',
     'taxon:285006'=>'sgd',
     'taxon:307796'=>'sgd',
     'taxon:41870'=>'sgd',
     'taxon:4932'=>'sgd',
     'taxon:3702'=>'tair',
     'taxon:212042'=>'tigr_Aphagocytophilum',
     'taxon:198094'=>'tigr_Banthracis',
     'taxon:227377'=>'tigr_Cburnetii',
     'taxon:246194'=>'tigr_Chydrogenoformans',
     'taxon:195099'=>'tigr_Cjejuni',
     'taxon:195103'=>'tigr_Cperfringens',
     'taxon:167879'=>'tigr_Cpsychrerythraea',
     'taxon:243164'=>'tigr_Dethenogenes',
     'taxon:205920'=>'tigr_Echaffeensis',
     'taxon:243231'=>'tigr_Gsulfurreducens',
     'taxon:228405'=>'tigr_Hneptunium',
     'taxon:265669'=>'tigr_Lmonocytogenes',
     'taxon:243233'=>'tigr_Mcapsulatus',
     'taxon:222891'=>'tigr_Nsennetsu',
     'taxon:220664'=>'tigr_Pfluorescens',
     'taxon:223283'=>'tigr_Psyringae',
     'taxon:264730'=>'tigr_Psyringae_phaseolicola',
     'taxon:211586'=>'tigr_Soneidensis',
     'taxon:246200'=>'tigr_Spomeroyi',
     'taxon:5691'=>'tigr_Tbrucei_chr2',
     'taxon:686'=>'tigr_Vcholerae',
     'taxon:6239'=>'wb',
     'taxon:7955'=>'zfin',
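
The filtering rule described above can be sketched roughly as follows. This is only an illustration in Python, not the actual script, which is not shown in this thread. The function name `keep_entry`, the name `TAXON_SOURCE`, and the use of "goa_uniprot" as a source label are assumptions made for the example. The table holds just a few entries copied from the list above.

```python
# Hypothetical sketch of the filtering rule; not the real script.
# A handful of entries copied from the list above.
TAXON_SOURCE = {
    "taxon:352472": "dictyBase",  # the strain-level ID that was missing
    "taxon:44689": "dictyBase",
    "taxon:5782": "dictyBase",
    "taxon:9606": "goa_human",
}


def keep_entry(taxon_id, source, taxon_source=TAXON_SOURCE):
    """Return True if an entry with this taxon ID should be loaded
    when it arrives from the given source (assumed label)."""
    owner = taxon_source.get(taxon_id)
    # Unlisted taxa pass through untouched; listed taxa are kept only
    # when the entry comes from the source assigned to that taxon.
    return owner is None or owner == source


# The list as it stood before the fix, without taxon:352472.
old_list = {k: v for k, v in TAXON_SOURCE.items() if k != "taxon:352472"}

# Before the fix, the goa_uniprot copy of Q54IT9 slipped through.
print(keep_entry("taxon:352472", "goa_uniprot", old_list))
# After the fix, it is excluded.
print(keep_entry("taxon:352472", "goa_uniprot"))
```

Because the lookup is by exact ID, a strain-level taxon such as 352472 is not covered by its parent species 44689 unless it is listed separately, which matches the behaviour reported in this thread.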




On Jan 18, 2008, at 10:28 AM, Pascale Gaudet wrote:

> Well, is this what you want? The dataflow is: we send data to  
> GenBank, then Uniprot integrates it. Does it make sense for you to  
> have both versions of the same sequence?
>
> Stan Dong wrote:
>>
>> From the two fasta headers, one sequence is from dicty and another  
>> from uniprot. Is this a problem or just the result of submission from
>> two distinct sources?
>>
>> >DICTYBASE|DDB0191090 symbol:sadA species:44689
>> >UNIPROTKB|Q54IT9 symbol:Q54IT9_DICDI species:352472
>>
>> -Stan
>>
>> On Jan 18, 2008, at 9:51 AM, Pascale Gaudet wrote:
>>
>>> Hello,
>>>
>>> I have another question about our gp2protein file. It looks like  
>>> now our sequences have been loaded into the GO database :) but some
>>> have been loaded in duplicate. For example:
>>> http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?search_constraint=gp&view=details&session_id=1067b1200677806&gp=DDB0191090
>>> http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?search_constraint=gp&view=details&session_id=1067b1200677806&gp=Q54IT9
>>>
>>> Anything we can do to prevent that?
>>>


