[go] total genes/gene products annotated versus genome size

Judith Blake jblake at informatics.jax.org
Thu Dec 20 06:18:35 PST 2007


Hi Sue,
You have set a difficult task.

Take the mouse situation, for example.

The 2nd column is 'how many total experimental annotations'. OK, this is
fine. But in the mouse metrics that we collect here each week, the
number we like to use is 'how many genes with manual annotation'
[includes ISS, RCA, IC and ND], and that number is '10,058'. So this
number is not exactly 'experimental'. That number here is '47,986'. So
does your number of 28,540 mouse experimental annotations reflect the
ones with the five experimental evidence codes?
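A minimal sketch of how the two kinds of counts discussed above (annotation rows versus distinct annotated genes) could be tallied per evidence-code set. It assumes a tab-delimited gene association file in which column 1 is the database, column 2 the object ID and column 7 the evidence code. The function name, the sample rows and all IDs in them are made up for illustration. The "manual" set below simply adds the four codes named in the brackets above to the five experimental codes; the full list used for the weekly mouse metrics is not given in this message.

```python
from collections import namedtuple

# The five experimental evidence codes named in Sue's request.
EXPERIMENTAL = frozenset({"IDA", "IMP", "IGI", "IPI", "IEP"})

# Assumption for illustration only: experimental codes plus the four
# codes listed in the brackets above. Not a confirmed MGI definition.
MANUAL_EXAMPLE = EXPERIMENTAL | {"ISS", "RCA", "IC", "ND"}

Tally = namedtuple("Tally", "annotations genes")


def tally_by_evidence(lines, codes):
    """Count annotation rows, and distinct annotated objects, whose
    evidence code (column 7) is in `codes`."""
    n_rows = 0
    objects = set()
    for line in lines:
        # Skip blank lines and '!' comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # malformed row: too few columns
        if cols[6] in codes:
            n_rows += 1
            objects.add((cols[0], cols[1]))  # (database, object ID)
    return Tally(n_rows, len(objects))


# Made-up sample rows; every ID here is a placeholder.
sample = [
    "! made-up sample rows",
    "MGI\tMGI:0000001\tGeneA\t\tGO:0000001\tPMID:1\tIDA\t\tP",
    "MGI\tMGI:0000001\tGeneA\t\tGO:0000002\tPMID:2\tIMP\t\tP",
    "MGI\tMGI:0000002\tGeneB\t\tGO:0000003\tPMID:3\tISS\t\tF",
    "MGI\tMGI:0000003\tGeneC\t\tGO:0000004\tPMID:4\tIEA\t\tC",
]

print(tally_by_evidence(sample, EXPERIMENTAL))
print(tally_by_evidence(sample, MANUAL_EXAMPLE))
```

Keeping the row count and the distinct-gene count side by side makes it harder to confuse a number like 28,540 annotations with a number like 10,058 genes.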

The third column is 'how many annotations', and in mouse on 7 Dec., that
number was 148,372; again the other number of interest is the number of
genes with any annotation, and that number is '18315', which is
reflected in your table in column 5.

As to column 6, as you point out, this is a tricky number. In mouse, we
have '27,289' gene objects with associated genomic data. This is maybe
the number that you should use. The other number includes inherited
phenotypes, genetraps, and other gene-type markers without localization
to the genome. So, in our system, as probably in all the others, this
number needs to be clarified.
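To show how much the choice of denominator matters, here is a small sketch of the "% annotated" arithmetic for mouse. It uses only numbers quoted in this thread. The helper name `percent` is made up, and the two inputs come from different dates (the table is as of October 2007, the MGI figures are from December), so this is an illustration rather than a corrected table entry.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Mouse row of Sue's table: annotated gene products / total gene products.
table_annotated = 18052
table_total = 35466

# Figures quoted in this message.
mgi_annotated = 18315  # genes with any annotation
mgi_total = 27289      # gene objects with associated genomic data

print(percent(table_annotated, table_total))  # matches the 50.9% in the table
print(percent(mgi_annotated, mgi_total))      # same calculation, smaller denominator
```

With the smaller denominator the mouse figure rises from about 51% to about 67%, which is why the definition of "total genes" has to be agreed on before genomes are compared.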

AND we know that the 27,289 number is high because we have just
integrated build 37, and there are some thousands of genes from NCBI
gene model sets and some thousands of genes from Ensembl gene model sets
that do not overlap, are in the database, and are currently in the
curation queue.

Also, I would not favor combining protein coding genes, pseudogenes, and
functional RNAs in the same set. For one thing, we should not have any
annotations to pseudogenes. That was decided some time back. Also, for
mouse at least, while we have good algorithms for identifying protein
coding genes, we have little or no confidence in any representation of
numbers of functional RNAs. So this would lead to a distortion of
comparison between genomes.

[by the way, the problem for others in using transcripts is that, for
mouse at least, we have many multiples of transcripts for the same
protein or protein-coding genomic region.]

I would favor revising your request number one to be only for
protein-coding genes; or, if of interest to others, maybe 1a and 1b
(RNA genes), and you would not have an entry from mouse for that.


That's all for now. :)

Judy

I'm not





Sue Rhee wrote:
> Dear GO consortium,
>
> It looks like different GOC members are submitting annotations to
> different objects (genes, proteins, transcripts) and the total number
> of the same objects in the genome is not included in the association
> files. Also, some databases are submitting 'genes' but it looks like
> they are more like transcripts (e.g. TAIR). Therefore, it is difficult
> for me (or anyone else) to generate a simple table as follows. Michael
> suggested that I use genes rather than gene products for this table,
> but I am having trouble doing this by querying the GO database.
>
> If the organism(s) for which you are submitting annotation files are
> included in the following table, would you kindly send me the
> following three numbers?
>
> 1. total number of genes (including protein-coding, RNA and
> pseudogenes) in the genome as of October 2007
> 2. total number of genes annotated with GO as of October 2007
> 3. total number of genes annotated with evidence codes IDA, IMP, IGI,
> IPI, IEP as of October 2007
>
> Thanks much,
> Sue
>
> species (NCBI taxon ID) | experimental annotations | total annotations | % expt annotations | annotated gene products | total gene products^a | % annotated^b | % known in genome^c
> baker’s yeast (4932) | 23993 | 36746 | 65.3% | 6476 | 7137 | 90.7% | 59.2%
> fission yeast (4896) | 12343 | 33385 | 37.0% | 5243 | 5463 | 96.0% | 35.5%
> fruit fly (7227) | 14148 | 20303 | 69.7% | 10581 | 30971 | 34.2% | 23.8%
> worm (6239) | 27472 | 68594 | 40.1% | 12534 | 28866 | 43.4% | 17.4%
> Candida albicans (5476) | 3413 | 5326 | 64.1% | 1262 | 6344 | 19.9% | 12.7%
> arabidopsis (3702) | 14060 | 103850 | 13.5% | 34683 | 42929 | 80.8% | 10.9%
> mouse (10090) | 28540 | 133743 | 21.3% | 18052 | 35466 | 50.9% | 10.9%
> human^d (9606) | 12437 | 166419 | 7.5% | 33760 | 37,106 | 91.0% | 6.8%
> slime mold (44689) | 2691 | 30299 | 8.9% | 4328 | 6729 | 64.3% | 5.7%
> Pseudomonas aeruginosa PAO1 (208964) | 1123 | 7377 | 15.2% | 1519 | 5670 | 26.8% | 4.1%
> rat^d (10116) | 12986 | 135246 | 9.6% | 11606 | 37,106 | 31.3% | 3.0%
> zebrafish (7955) | 4204 | 70340 | 6.0% | 13194 | 27532 | 47.9% | 2.9%
> Plasmodium falciparum (5833) | 196 | 12026 | 1.6% | 3165 | 5595 | 56.6% | 0.9%
> Trypanosoma brucei (5691) | 438 | 19006 | 2.3% | 3921 | 10966 | 35.8% | 0.8%
> rice (39947) | 265 | 49582 | 0.5% | 37548 | 58587 | 64.1% | 0.3%
> cow^d (9913) | 278 | 85951 | 0.3% | 22727 | 42836 | 53.1% | 0.2%
> chicken^d (9031) | 179 | 55498 | 0.3% | 16067 | 33566 | 47.9% | 0.2%
>
> -- 
> Sue Rhee
> Staff Scientist
> Carnegie Institution, Department of Plant Biology
> 260 Panama Street, Stanford, CA 94305
> Email: (650) 325-1521 x251
> Fax: (650) 325-6857


