Table 1.

Numbers of Sequences and Paralogs Found for Each of the 14 Model Plants Investigated

| Species | Sequences in Initial Dataset (a) | Sequences in Cleaned Dataset (b) | Paralogs (c) | Percentage of Paralogs (d) | Gene Families (e) | Gene Family Size (f) | Duplication Events with Median Ks < 2 (g) |
|---|---|---|---|---|---|---|---|
| M. crystallinum (ice plant) | 6,975 | 6,920 | 1,334 | 19% | 519 | 2.57 | 380 |
| H. annuus (sunflower) | 15,248 | 15,196 | 2,713 | 18% | 1,007 | 2.69 | 625 |
| H. vulgare (barley) | 39,667 | 39,108 | 6,388 | 16% | 2,062 | 3.10 | 1,523 |
| L. sativa (lettuce) | 21,960 | 21,803 | 5,160 | 24% | 1,905 | 2.71 | 1,634 |
| Z. mays (maize) | 32,362 | 32,272 | 10,346 | 32% | 3,767 | 2.75 | 2,015 |
| L. esculentum (tomato) | 32,317 | 30,838 | 7,963 | 26% | 2,876 | 2.77 | 2,222 |
| S. tuberosum (potato) | 23,561 | 23,418 | 6,597 | 28% | 2,452 | 2.69 | 2,462 |
| G. hirsutum (tetraploid cotton) | 8,660 | 8,646 | 2,212 | 26% | 797 | 2.78 | 799 |
| G. arboreum (diploid cotton) | 18,962 | 18,791 | 8,721 | 46% | 2,686 | 3.25 | 2,600 |
| M. truncatula (barrel medic) | 33,765 | 33,380 | 7,961 | 24% | 2,813 | 2.83 | 2,653 |
| T. aestivum (wheat) | 52,352 | 52,197 | 19,128 | 37% | 5,719 | 3.34 | 4,362 |
| G. max (soybean) | 55,990 | 55,762 | 17,663 | 32% | 6,067 | 2.91 | 5,076 |
| O. sativa (rice) gene models | 56,056 | 18,562 | 9,149 | 49% | 2,334 | 3.92 | 4,977 |
| O. sativa (rice) unigenes | 30,087 | 29,857 | 7,006 | 23% | 2,652 | 2.64 | 1,250 |
| Arabidopsis gene models | 26,157 | 25,557 | 11,937 | 47% | 3,978 | 3.00 | 6,801 |
| Arabidopsis unigenes | 23,458 | 19,554 | 3,708 | 19% | 1,483 | 2.50 | 1,562 |

  • a Number of sequences in dataset after download.

  • b Number of sequences in dataset after removing redundant entries of the same gene and transposable element sequences. For rice, hypothetical protein gene models were also removed in the cleaning process.

  • c Number of paralogous sequences found in the cleaned dataset using nucleotide alignment search.

  • d Percentage of paralogous sequences found in the cleaned dataset.

  • e Number of gene families constructed from the paralogous sequences in the Paralogs column (c) using single linkage clustering.

  • f Average gene family size (number of genes per family).

  • g Number of duplication events used in the distributions in Figure 2 and for which median Ks values are < 2.
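
The footnotes describe how the derived columns relate to one another: families are built from paralogs by single linkage clustering, and the percentage and average family size follow from the counts. The following is a minimal sketch of that bookkeeping, not the authors' pipeline. It assumes the alignment search has already produced a list of pairwise paralog hits; the function names (`single_linkage_families`, `summarize`), the sequence IDs, and the example totals are all hypothetical.

```python
from collections import defaultdict


def single_linkage_families(pairs):
    """Group sequences into families by single linkage: two sequences end up
    in the same family if any chain of pairwise hits connects them."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = defaultdict(set)
    for seq in list(parent):
        families[find(seq)].add(seq)
    return list(families.values())


def summarize(families, cleaned_total):
    """Derive the table's count-based columns from the clustered families."""
    paralogs = sum(len(f) for f in families)
    return {
        "paralogs": paralogs,                                       # column c
        "percent_paralogs": round(100 * paralogs / cleaned_total),  # column d
        "gene_families": len(families),                             # column e
        "family_size": round(paralogs / len(families), 2),          # column f
    }


# Hypothetical pairwise hits among made-up sequence IDs.
hits = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
families = single_linkage_families(hits)
stats = summarize(families, cleaned_total=20)
```

Here `s1` and `s3` fall into the same family even though they were never paired directly, which is the defining property of single linkage. The table's own values are consistent with these formulas: for ice plant, 1,334 / 6,920 rounds to 19% and 1,334 / 519 rounds to 2.57.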