Extensive Duplication and Reshuffling in the Arabidopsis Genome
Guillaume Blanc, Abdelali Barakat, Romain Guyot, Richard Cooke, and Michel Delseny
Laboratoire Génome et Développement des Plantes, Unité Mixte de Recherche 5096, Centre National de la Recherche Scientifique, University of Perpignan, 66860 Perpignan Cédex, France
Correspondence to: Richard Cooke, cooke{at}univ-perp.fr (E-mail), 33-468668499 (fax)
Systematic analysis of the Arabidopsis genome provides a basis for detailed studies of genome structure and evolution. Members of multigene families were mapped, and random sequence alignment was used to identify regions of extended similarity in the Arabidopsis genome. Detailed analysis showed that the number, order, and orientation of genes were conserved over large regions of the genome, revealing extensive duplication covering the majority of the known genomic sequence. Fine mapping analysis showed much rearrangement, resulting in a patchwork of duplicated regions that indicated deletion, insertion, tandem duplication, inversion, and reciprocal translocation. The implications of these observations for evolution of the Arabidopsis genome as well as their usefulness for analysis and annotation of the genomic sequence and in comparative genomics are discussed.
Since the decision to adopt Arabidopsis as a model for plant genome studies ~10 years ago, a concerted international effort has led to the accumulation of a vast amount of information. Generating and analyzing expressed sequence tags (ESTs) led the way in this effort (
A surprising observation based largely on EST studies was that despite consisting of only ~140 Mb, the Arabidopsis genome contains many small gene families (
Large-scale duplication in the Arabidopsis genome was proposed on the basis of comparative mapping of molecular markers in Arabidopsis and Brassica oleracea.
One way to obtain information on the position and extent of duplications is to locate members of small gene families and determine other conserved sequences in the vicinity of the different copies. Cytoplasmic ribosomal proteins have been shown to be encoded by small gene families (
Identification of a Large Duplicated Region
Fig 1A shows a dot plot of sequences covering nine BACs (657,655 bp) on chromosome 2 (from BACs F5H14 to F14M13) and seven BACs (550,140 bp) on chromosome 4 (from BACs F20M13 to T5J17) for which discontinuous nucleotide sequence conservation over large regions can be seen as a staggered diagonal on the dot plot. These data suggest that the two chromosome regions correspond to a single ancestral region that has been duplicated and has undergone limited rearrangement, including accumulation of point mutations and large-scale insertion or deletion, singly or in combination, of fragments. Detailed analysis of a smaller region (boxed in Fig 1A) and comparison with the GenBank annotations of the sequences (Fig 1B) revealed similarities covering regions of only a few kilobases, which apparently correspond to annotated genes. Of 12 annotated genes on chromosome 4 and 11 on chromosome 2, nine showed marked nucleotide sequence similarity. For the remaining genes, no similarity could be determined, suggesting divergent evolution of the sequences or further small-scale rearrangements since the original duplication.
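To illustrate why an insertion in one copy of a duplicated region appears as a staggered diagonal on a dot plot, the following is a minimal sketch in Python. It is not the DOTTER algorithm used in this study: it simply marks exact matches of short words (k-mers), and the function names `dot_plot` and `render`, the word size, and the toy sequences are all assumptions made for illustration.

```python
def dot_plot(seq_a, seq_b, k=5):
    """Return (i, j) pairs where the k-mer starting at seq_a[i] equals the one at seq_b[j]."""
    # Index every k-mer of seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - k + 1):
        index.setdefault(seq_b[j:j + k], []).append(j)
    # Look up every k-mer of seq_a in that index.
    hits = []
    for i in range(len(seq_a) - k + 1):
        for j in index.get(seq_a[i:i + k], []):
            hits.append((i, j))
    return hits


def render(hits, len_a, len_b, k=5):
    """Draw the hits as a text grid: rows follow seq_a, columns follow seq_b."""
    grid = [["." for _ in range(len_b - k + 1)] for _ in range(len_a - k + 1)]
    for i, j in hits:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)


# Toy data (hypothetical): two conserved blocks, with a 7-bp insertion
# between them in the second sequence only.
block1, block2, insertion = "ACGTTGCA", "GGCTATCC", "TTTTTTT"
seq_a = block1 + block2
seq_b = block1 + insertion + block2

hits = dot_plot(seq_a, seq_b, k=5)
# Offsets (j - i) are constant along a diagonal; a shift in offset
# between consecutive runs of hits marks an insertion or deletion.
offsets = sorted({j - i for i, j in hits})
print(render(hits, len(seq_a), len(seq_b), k=5))
print("diagonal offsets:", offsets)
```

In this toy case the hits fall on two diagonal segments whose offsets differ by the length of the inserted fragment, which is the pattern described above for the chromosome 2 and chromosome 4 regions, at a much smaller scale.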
Detailed Structure and Expression
By using the BLASTN program (
The order and distribution of the genes according to the Watson or Crick strand are conserved, as would be expected after duplication of a block of genes, with two notable exceptions. First, one conserved gene on BACs T26C19 and T19P19 (on chromosomes 2 and 4, respectively) shows different polarity. Second, the presence of four copies of a gene on chromosome 4, with two copies on each strand and only two copies, both on the same strand on chromosome 2, indicates that a single original gene was probably duplicated in tandem before duplication of the region and that this was followed by a duplication with an inversion on chromosome 4. Five conserved tRNA genes are also found within this region. The presence of pairs of genes showing no nucleotide similarity in regions in which sequences of the majority of the duplicated genes have been conserved could arise either simply by sequence divergence or by more recent rearrangements. If rearrangements have occurred by insertion of genes from other chromosome locations, we would expect to detect nucleotide similarity between these nonconserved genes and sequences elsewhere in the genome. Therefore, we performed BLASTN alignments of all the corresponding predicted coding sequences with all known genomic sequences and found that in addition to the 59 genes from chromosome 2 duplicated on chromosome 4, substantially similar sequences for an additional 47 could be found elsewhere in the genome. The remaining 45 predicted genes shared no sequence similarity with the genomic sequence that is currently available. Thus, sequences similar to at least 70% of all predicted genes on the region of chromosome 2 shown in Fig 1A are found elsewhere in the genome. The identification of regions containing duplicate copies of many genes whose predicted protein products have highly similar or identical sequences raises the question of whether both copies are effectively expressed. 
Although expression data are not available for all genes, ESTs have been obtained for approximately half of the genes in Arabidopsis. BLASTN alignment of coding sequences with Arabidopsis ESTs in GenBank showed that for genes duplicated between chromosomes 2 and 4, 30% of those on chromosome 2 are tagged, compared with 45% on chromosome 4. For genes that are located on chromosomes 2 and 4 and for which copies are also found elsewhere in the genome, the percentages are roughly the same (26 and 51%, respectively), whereas of the genes on chromosome 2 for which no copy could be found, 43% are tagged compared with 37% on chromosome 4.
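The figure of at least 70% quoted above for the chromosome 2 region can be checked directly from the three gene counts given in the text. The short Python sketch below does only that arithmetic; the variable names are chosen for illustration.

```python
# Gene counts for the chromosome 2 region shown in Fig 1A, as given in the text.
duplicated_on_chr4 = 59   # genes with a conserved copy in the chromosome 4 region
similar_elsewhere = 47    # genes with similar sequences elsewhere in the genome
no_similarity = 45        # genes with no detectable similarity

total_genes = duplicated_on_chr4 + similar_elsewhere + no_similarity
with_similarity = duplicated_on_chr4 + similar_elsewhere
fraction = with_similarity / total_genes

print(f"{with_similarity} of {total_genes} predicted genes ({fraction:.1%}) "
      "have a similar sequence elsewhere in the genome")
```

The same counts show that 59 of the 151 predicted genes (about 39%) are conserved specifically between the two regions.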
Patchwork Distribution of Duplications
The Majority of the Arabidopsis Genome Is Found in Duplications
The results presented here show that the Arabidopsis genome contains megabase-sized blocks on pairs of chromosomes in which as many as 45% of the gene pairs show highly similar sequences. They also demonstrate that a large part of the genome results from duplication. This observation is surprising considering the small size of the genome but confirms and considerably extends previous observations based on mapping data ( The exact extent of this duplication will become clear only when the complete genome sequence has been established. For regions in which gaps remain to be sequenced, limited rearrangements possibly could be detected, although ongoing sequencing seems to confirm and extend our results. However, the detailed analysis presented here shows that the duplication of large regions has been followed by extensive rearrangement and probably divergent evolution of the genes for which no sequence similarity can be detected elsewhere in the genome. In fact, our results, which indicate that >60% of the genome is found as duplications, provide only a minimum estimate. During these studies, we detected several short duplicated regions, containing only three or four genes, that are not shown in Fig 4. In addition, comparison of sequences of duplicated genes brought to light several obvious errors in annotation of the corresponding BAC sequences in international databases (G. Blanc, R. Guyot, R. Cooke, and M. Delseny, manuscript in preparation), including erroneously annotated tRNA genes, additional or missing exons, and genes that have not been annotated in one of the copies. These errors certainly lead to an underestimation of the extent of gene sequence conservation when BLASTN alignment of predicted coding sequences is used.
Ab initio analysis of genomic sequence, based largely on computer-assisted prediction of exons, introns, and gene models, is still relatively inefficient in predicting whole-gene models (
In light of these observations, the fact that the sequence of the genome is not yet complete, and given that the nucleolar organizing region and the pericentromeric and telomeric regions represent ~7 Mb, almost all of the "single copy" sequences of Arabidopsis appear to be found in regions resulting from ancient rearrangements. These results lead to the intriguing possibility that Arabidopsis could be a degenerate tetraploid.
Several observations suggest that these duplications are ancient events. First, the sequence of some genes has apparently diverged to the extent that no sequence similarity can be detected, although the positions of these genes in the duplicated regions strongly suggest that they are derived from a common ancestral sequence. Moreover, we have shown that some genes in duplicated regions have apparently been repositioned by transposition events since the original duplication occurred, but this is not the case for all of the genes, and the fact that many divergent regions are of similar lengths argues more favorably for divergent evolution of a common ancestral sequence than for replacement by transposition. Second, considerable sequence divergence has occurred in noncoding regions, to the extent that intron sequences, for example, vary greatly both in sequence and in length and in some cases are absent from one of the copies. This divergence is in striking contrast, for example, to the high degree of conservation of both exon and intron sequences for human and mouse ( In some cases, we observe considerable size differences between two duplicated regions. For example, the only duplicated regions between chromosomes 1 and 4 have lengths of 216 and 465 kb, respectively, and a 787-kb region of chromosome 4 is duplicated as a 1831-kb region on chromosome 5. Such extensions apparently have several origins. If we consider the former duplication, the gene number has increased (73 predicted genes in the 216-kb region of chromosome 1 and 108 in the 465-kb region of chromosome 4); however, intergenic regions have also probably increased because, assuming that most of the genes have been predicted, then one can calculate that the gene density is 1 per every 2.9 kb in the region on chromosome 1 and 1 per every 4.3 kb on chromosome 4. 
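The gene densities quoted for the chromosome 1 and chromosome 4 regions follow from dividing each region length by its number of predicted genes. The Python sketch below repeats that calculation; the helper name `kb_per_gene` is an assumption for illustration, and the inputs are the rounded lengths and gene counts given in the text.

```python
def kb_per_gene(region_kb, predicted_genes):
    """Average spacing, in kb, between predicted genes in a region."""
    return region_kb / predicted_genes


# (region length in kb, number of predicted genes), as given in the text.
regions = {
    "chromosome 1": (216, 73),
    "chromosome 4": (465, 108),
}

for name, (length_kb, genes) in regions.items():
    print(f"{name}: one gene per {kb_per_gene(length_kb, genes):.2f} kb")

# Ratios between the two copies of the duplicated region.
length_ratio = 465 / 216
gene_ratio = 108 / 73
print(f"length ratio {length_ratio:.2f}, gene-number ratio {gene_ratio:.2f}")
```

With these rounded inputs the spacing comes out at roughly 2.96 and 4.31 kb per gene, close to the 2.9 and 4.3 kb quoted; the small difference for chromosome 1 presumably reflects rounding of the region length. The region length has grown faster than the gene number, which is consistent with the suggestion that intergenic regions have also expanded.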
The increase in gene number also results from tandem duplication: only five genes are duplicated in tandem in the 216-kb region of chromosome 1 but 25 in the corresponding 465-kb region on chromosome 4. An unexpected observation regarding genes in duplicated regions is the bias in expression between duplicated genes and apparently single-copy genes and also between the copies on different chromosomes. It is true that our analysis is based on EST sequences, which contain tags to no more than half of the estimated 20,000 to 25,000 genes. However, a comparison of gene pairs clearly shows that many more genes have been tagged on chromosome 4 than the corresponding genes on chromosome 2. This bias in expression could indicate that certain chromosomes or regions of chromosomes contain a greater density of pseudogenes, although little evidence is available to suggest the presence in the Arabidopsis genome of large numbers of pseudogenes. The highly conserved exon-intron structure of untagged genes is also an indication that these genes are in fact expressed. Another possibility is that the presence of at least two copies of a gene has allowed specialization of one of the two genes and that one is expressed only under conditions that have not yet been studied with ESTs. If this is the case, however, it is not clear why there should be a bias of expression in favor of genes on one chromosome over another.
This study sheds new light on Arabidopsis genome fluidity. It illustrates that during the evolution of this genome numerous rearrangements have occurred, including duplication, translocation, inversion, and deletion. All of these mechanisms were also probably at work in many species until heterologous chromosome pairing and recombination were prevented by specific mechanisms (
Bacterial artificial chromosome (BAC) contigs were constructed using Sequencher (Gene Codes Corp., Ann Arbor, MI). Dot plot analysis was conducted with the DOTTER program ( Names and GenBank accession numbers of the BACs given in Fig 1 and Fig 2 are as follows: F5H14, AC006234; F26H11, AC006264; F7O24, AC007142; F3K23, AC006841; F2G1, AC007119; F7D8, AC007019; T16B14, AC007232; T26C19, AC007168; and F14M13, AC006592 on chromosome 2; and F20M13, AL035540; T9A14, AL035656; F19H22, AL035679; T22F8, AL050351; F23K16, AL078620; T19P19, AL022605; and T5J17, AL035708 on chromosome 4. Names and GenBank accession numbers of the BACs given in Fig 3 are as follows: F7H1, AC007134; F16F14, AC007047; F24H14, AC006135; MSF3, AC005724; F23N11, AC007048; F5H14, AC006234; T26C19, AC007168; F14M13, AC006592; T9I22, AC006340; F26B6, AC003040; F27L4, AC004482; and T19L18, AC004747 on chromosome 2; F25I24, AL049525; T1P17, AL049730; T20K18, AL049640; T10I14, AL021712; F7K2, AL033545; T32A16, AL078468; F22K18, AL035356; L73G19, AL050400; F14M19, AL049480; T27E11, AL049770; T13J8, AL035524; F9N11, AL109796; F17I23, AF160182; T10C21, AL109787; F26P21, AL031804; F4I10, AL035525; F10M10, AL035521; T4L20, AL023094; ATAP22, Z99708; F20D10, AL035538; F20M13, AL035540; and T5J17, AL035708 on chromosome 4; and K2I5, AB025613; MXC20, AB009055; and MJB24, AB019233 on chromosome 5. 
Names and GenBank accession numbers of the BACs given in Fig 4 are as follows: F10O3, AC006550; F21B7, AC002560; F19P19, AC000104; F21M11, AC003027; F22O13, AC003981; F14J9, AC003970; F12F1, AC002131; F14L17, AC012188; T15D22, AC012189; T24D18, AC010924; T7N9, AC000348; F3M18, AC010155; T19E23, AC007654; F27J15, AC016041; T6H22, AC009894; F25P12, AC009323; F24O1, AC003113; T1F15, AC004393; and F18B13, AC009322 on chromosome 1; F10A8, AC006200; T8K22, AC004136; F16F14, AC007047; T19L18, AC004747; T22O13, AC007290; F4P9, AC002332; T1B8, U78721; T20F21, AC006068; F11F19, AC007017; and F19D11, AC005310 on chromosome 2; F28J7, AC010797; T6K12, AC016829; F8A24, AC015985; F26K24, AC016795; MBK21, AB024033; MOE17, AB025629; MIL23, AB019232; MJL12, AB026647; F18N11, AL132953; F26O13, AL133452; T25B15, AL132972; and T17J13, AL138651 on chromosome 3; T14P8, AF069298; F9H3, AF071527; F25I24, AL049525; T1P17, AL049730; FCA0, Z97335; FCA4, Z97339; FCA8, Z97343; T13K14, AL080282; F7K2, AL033545; T32A16, AL078468; F22K18, AL035356; T27E11, AL049770; T13J8, AL035524; and T5J17, AL035708 on chromosome 4; and MOK16, AB005240; MUA22, AB007650; F6B6, AP000368; K9L2, AB011475; K23L20, AB016874; MNJ7, AB025628; K2I5, AB025613; MJB24, AB019233; MRG7, AB012246; MHF15, AB006700; F2O15, AB025604; and K9I9, AB013390 on chromosome 5.
This work strongly benefited from the public effort coordinated by the Arabidopsis Genome Initiative to make available Arabidopsis genomic sequences as soon as they were sequenced. We also acknowledge support of several European Union grants, which helped to make our research possible. Received January 12, 2000; accepted May 17, 2000.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215:403-410.
Ansari-Lari, M.A., Oeltjen, J.C., Schwartz, S., Zhang, Z., Muzny, D.M., Lu, J., Gorrell, J.H., Chinault, A.C., Belmont, J.W., Miller, W., and Gibbs, R.A. (1998) Comparative sequence analysis of a gene-rich cluster at human chromosome 12p13 and its syntenic region in mouse chromosome 6. Genome Res. 8:29-40.
Axelos, M., Bardet, C., Liboz, T., Le Van Thai, A., Curie, C., and Lescure, B. (1989) The gene family encoding the Arabidopsis thaliana translation elongation factor EF-1
Bevan, M. et al. (1998) Analysis of 1.9 Mb of contiguous sequence from chromosome 4 of Arabidopsis thaliana. Nature 391:485-488.
Cavell, A.C., Lydiate, D.J., Parkin, I.A., Dean, C., and Trick, M. (1998) Collinearity between a 30-centimorgan segment of Arabidopsis thaliana chromosome 4 and duplicated regions within the Brassica napus genome. Genome 41:62-69.
Conner, J.A., Conner, P., Nasrallah, M.E., and Nasrallah, J.B. (1998) Comparative mapping of the Brassica S locus region and its homeolog in Arabidopsis: Implications for the evolution of mating systems in the Brassicaceae. Plant Cell 10:801-812.
Cooke, R. et al. (1996) Further progress towards a catalogue of all Arabidopsis genes: Analysis of a set of 5000 non-redundant ESTs. Plant J. 9:101-124.
Cooke, R., Raynal, M., Laudie, M., and Delseny, M. (1997) Identification of members of gene families in Arabidopsis thaliana by contig construction from partial cDNA sequences: 106 genes encoding 50 cytoplasmic ribosomal proteins. Plant J. 11:1127-1140.
Copenhaver, G.P., and Pikaard, C.S. (1996) Two-dimensional RFLP analyses reveal megabase-sized clusters of rRNA gene variants in Arabidopsis thaliana, suggesting local spreading of variants as the mode for gene homogenization during concerted evolution. Plant J. 9:273-282.
Etzold, T., Ulyanov, U., and Argos, P. (1996) SRS: Information retrieval system for molecular biology data banks. Methods Enzymol. 266:114-128.
Gale, M.D., and Devos, K.M. (1998) Plant comparative genetics after 10 years. Science 282:656-659.
Höfte, H. et al. (1993) An inventory of 1152 expressed sequence tags obtained by partial sequencing of cDNA from Arabidopsis thaliana. Plant J. 4:1051-1061.
Kaneko, T., Katoh, T., Sato, S., Nakamura, Y., Asamizu, E., Kotani, H., Miyajima, N., and Tabata, S. (1999) Structural analysis of Arabidopsis thaliana chromosome 5. IX. Sequence features of the regions of 1,011,550 bp covered by seventeen P1 and TAC clones. DNA Res. 6:183-195.
Kowalski, S.P., Lan, T.H., Feldmann, K.A., and Paterson, A.H. (1994) Comparative mapping of Arabidopsis thaliana and Brassica oleracea chromosomes reveals islands of conserved organization. Genetics 138:499-510.
Krebbers, E., Seurinck, J., Herdies, L., Cashmore, A.R., and Timko, M.P. (1988) Determination of the processing sites of an Arabidopsis 2S albumin and characterization of the complete gene family. Plant Physiol. 87:859-866.
Kurkela, S., and Borg-Franck, M. (1992) Structure and expression of kin2, one of two cold- and ABA-induced genes of Arabidopsis thaliana. Plant Mol. Biol. 19:689-692.
Lin, X. et al. (1999) Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402:761-768.
Mayer, K. et al. (1999) Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402:769-777.
McGrath, J.M., Jansco, M.M., and Pichersky, E. (1993) Duplicate sequences with similarity to expressed genes in the genome of Arabidopsis thaliana. Theor. Appl. Genet. 86:880-888.
Membre, N., Berna, A., Neutelings, G., David, A., David, H., Staiger, D., Saez-Vasquez, J., Raynal, M., Delseny, M., and Bernier, F. (1997) cDNA sequence, genomic organization and differential expression of three Arabidopsis genes for germin/oxalate oxidase-like proteins. Plant Mol. Biol. 35:459-469.
Moore, G. (1998) To pair or not to pair: Chromosome pairing and evolution. Curr. Opin. Plant Biol. 1:116-122.
Newman, T. et al. (1994) Genes galore: A summary of the methods for accessing the results of large-scale partial sequencing of anonymous Arabidopsis thaliana cDNA clones. Plant Physiol. 106:1241-1255.
Ohno, S. (1973) Ancient linkage groups and frozen accidents. Nature 244:259-262.
Osborn, T.C., Kole, C., Parkin, I.A., Sharpe, A.G., Kuiper, M., Lydiate, D.J., and Trick, M. (1997) Comparison of flowering time genes in Brassica rapa, B. napus and Arabidopsis thaliana. Genetics 146:1123-1129.
Paterson, A.A. et al. (1996) Towards a unified genetic map of higher plants transcending the monocot dicot divergence. Nat. Genet. 14:380-382.
Romero, I., Fuertes, A., Benito, M.J., Malpica, J.M., Leyva, A., and Paz-Ares, J. (1998) More than 80 R2R3-MYB regulatory genes in the genome of Arabidopsis thaliana. Plant J. 14:273-284.
Rounsley, S.D., Ditta, G.S., and Yanofsky, M.F. (1995) Diverse roles for MADS box genes in Arabidopsis development. Plant Cell 7:1259-1269.
Rouze, P., Pavy, N., and Rombauts, S. (1999) Genome annotation: Which tools do we have for it? Curr. Opin. Plant Biol. 2:90-95.
Skrabanek, L., and Wolfe, K.H. (1998) Eukaryotic genome duplication - Where's the evidence? Curr. Opin. Genet. Dev. 8:694-700.
Sonnhammer, E.L.L., and Durbin, R. (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167:1-10.
Terryn, N. et al. (1999) Evidence for an ancient chromosomal duplication in Arabidopsis thaliana by sequencing and analyzing a 400-kb contig at the APETALA2 locus on chromosome 4. FEBS Lett. 445:237-245.
van Lijsebettens, M., Vanderhaeghen, R., De Block, M., Bauw, G., Villarroel, R., and Van Montagu, M. (1994) An S18 ribosomal protein gene copy at the Arabidopsis PFL locus affects plant development by its specific expression in meristems. EMBO J. 13:3378-3388.
Williams, M.E., and Sussex, I.M. (1995) Developmental regulation of ribosomal protein L16 genes in Arabidopsis thaliana. Plant J. 8:65-76.
Wolfe, K.H., and Shields, D.C. (1997) Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708-713.