American Society of Plant Biologists
Identification, Analysis, and Utilization of Conserved Ortholog Set Markers for Comparative Genomics in Higher Plants
Department of Plant Breeding and Department of Plant Biology, Cornell University, Ithaca, New York 14853
1 To whom correspondence should be addressed. E-mail sdt4@cornell.edu; fax 607-255-6683
We have screened a large tomato EST database against the Arabidopsis genomic sequence and report here the identification of a set of 1025 genes (referred to as a conserved ortholog set, or COS markers) that are single or low copy in both genomes (as determined by computational screens and DNA gel blot hybridization) and that have remained relatively stable in sequence since the early radiation of dicotyledonous plants. These genes were annotated, and a large portion could be assigned to putative functional categories associated with basic metabolic processes, such as energy-generating processes and the biosynthesis and degradation of cellular building blocks. We further demonstrate, through computational screens (e.g., against a Medicago truncatula database) and direct hybridization on genomic DNA of diverse plant species, that these COS markers also are conserved in the genomes of other plant families. Finally, we show that this gene set can be used for comparative mapping studies between highly divergent genomes such as those of tomato and Arabidopsis. This set of COS markers, identified computationally and experimentally, may further studies on comparative genomes and phylogenetics and elucidate the nature of genes conserved throughout plant evolution.
In the past 10 years, we have seen great progress in linking plant genomes through comparative genetic maps, especially for species belonging to the same family (for review, see Paterson et al., 2000). In all of these instances, the species within families have been linked by a common set of orthologous genes detected through DNA gel blot hybridization. The ability to detect single-copy orthologous genes among plant genomes has permitted comparative plant genomics to advance as rapidly as it has. By contrast, during this same period, relatively little progress was made in comparative genomics among more divergent plant species, that is, those belonging to different plant families. Evolutionary divergence time among plant families is greater, allowing for more genomic rearrangements. Moreover, comparisons between plant families have been impeded further by the technical difficulties in identifying conserved orthologous genes that can be used to link these plant genomes. Specifically, reduced gene similarities between plant families have made comparative mapping, via common probes and DNA gel blot hybridization, difficult at best and often impossible. As a result, at present, there is no framework for clearly interpreting genomic similarities among higher plants.
With the Arabidopsis genome having been sequenced and major genomic efforts under way on other plant species (National Science Foundation Plant Genome Research Program [http://www.nsf.gov/bio/dbi/dbi_pgr.htm]; Pennisi, 1998
We have attempted to remedy this situation and provide the basis for more robust comparative genomics and phylogenetic studies across plant taxa by identifying a set of genes conserved throughout evolution in both sequence and copy number. This set of >1000 conserved genes, which we refer to as conserved ortholog set (COS) markers, was identified by computationally comparing the Arabidopsis genomic sequence with the EST database of tomato, which comprises 130,000 ESTs representing approximately half of the tomato gene content (Van der Hoeven et al., 2002).
Tomato and Arabidopsis are both dicots, but they belong to different families (Solanaceae and Brassicaceae, respectively) that diverged early in flowering plant evolution,
Selection of COS Markers
The 1025 COS markers described here were identified initially by manually screening tomato EST sequences against the Arabidopsis BAC tiling path database (http://www.Arabidopsis.org) using the criteria described in Methods. The purpose of this screen was to identify single-copy tomato genes that have a single best match to one region of the Arabidopsis genome and hence would qualify as potential orthologs. This method inherently selects against multigene families for which orthologs between specific genes may not be readily distinguishable. Genes meeting these criteria are referred to herein as putative orthologs, with the disclaimer that we recognize that these data are not sufficient to prove orthology in the strictest evolutionary context, but nevertheless they can be a useful tool. To obtain the 1025 COS markers described here, >20,000 tomato ESTs were screened as described above. The COS markers described were identified and characterized during the past 2 years, during which time the Arabidopsis genome and tomato ESTs were being sequenced. To standardize all results, the COS marker set was rescreened against both the current Arabidopsis tiling path and the tomato EST/unigene set as of April 2001. To estimate the percentage of tomato unigenes that meet COS criteria, the entire tomato unigene set was rescreened against the Arabidopsis tiling path at the same time.
Of the 27,000 tomato unigenes, 55% had a match to the Arabidopsis genome with a tBLASTX score of <E-15. Of those, 16% met the second criterion (no close second match in the Arabidopsis genome). Hence,
Annotation of COS Markers and Functional Role Categories
The significant sequence conservation between the COS markers and Arabidopsis genes, coupled with the fact that these genes are single or low copy in both species, raises the possibility of conserved functional roles in both species and potentially in all plant species. Therefore, the COS marker genes may fulfill roles that are universally important to all plant species. In addition, these genes have remained stable during the course of plant speciation, suggesting that many of them were present in a similar form before the radiation of plant species. The COS markers were searched (BLASTX) against the GenBank protein database maintained by the National Center for Biotechnology Information, and the results were used for annotation and assignment to functional role categories (r.c.). It is important to realize that this analysis was limited in its scope with respect to the collection of ESTs used to initially identify the COS markers and will reflect, to a great extent, the types of genes in the database that have been characterized previously. In addition, the sequence information for each COS marker represents on average 553 high-quality nucleotides (range of 178 to 832 nucleotides) from the 5' end of the gene transcript. As a result of the variation in EST length, expect value scores of sequence similarity searches against Arabidopsis cannot be used reliably to compare the degree of sequence conservation between COS markers. The complete COS marker annotation and functional categorization are available online (http://sgn.cornell.edu). For the majority (751; 73%) of the 1025 COS markers, the most significant match was against predicted or characterized genes in Arabidopsis. Solanaceous species or other plant species represented another 151 (14%) and 109 (11%) best matches, respectively. In only 15 instances (2%), nonplant species provided the best match.
However, when these markers were analyzed again by BLASTX (tBLASTX) against the Arabidopsis genome (versus the predicted gene set), a more significant Arabidopsis match could be found. These instances most likely represent previously unidentified genes in the Arabidopsis genome. This result also is consistent with the fact that the COS markers were selected initially based on the screening of tomato ESTs against the entire Arabidopsis genomic sequence, rather than only the predicted Arabidopsis gene set. These results also demonstrate the utility of non-Arabidopsis EST databases in the further annotation of the Arabidopsis genome. Of the 1025 COS markers, 514 (50%) had matches to genes or sequences of unknown function and hence were assigned to the unclassified (r.c. 99) role category. Another 76 (7%) were placed in the classification unclear (r.c. 98) category. This category contains a significant number of COS markers with matches against genes that have been described previously, but uncertainty about their putative function prevented categorization. These include COS markers with matches against a number of cytochrome P450s and transferases for which the target substrates are unclear and genes that are known to be expressed only under certain environmental conditions (e.g., auxin induced) and developmental stages or in specific tissues. The remaining 435 COS markers (42%) were assigned to various functional role categories based on significant matches to proteins already assigned functional roles (Figure 2).
A large proportion (42%) of the 435 assigned COS markers represent genes that appear to be involved in basic metabolic processes, such as energy-generating processes and the biosynthesis and degradation of cellular building blocks. Genes involved with the cellular transcriptional and translational machinery represent 17% of the assigned COS markers, those involved in protein processing and destination represent 14%, and those involved in signal transduction represent 9%. These types of genes, representing many aspects of plant cellular processes and metabolism of cellular structural components, are part of the set of genes that have remained highly conserved across plant species and at an approximately equal copy number since the divergence of Arabidopsis and tomato 100 to 150 million years ago (Yang et al., 1999).
Use of COS Markers in Comparative Plant Genomics
The second problem is that many Arabidopsis genes do not hybridize with tomato genomic DNA under standard stringency conditions, and in the cases in which hybridization was detected, the signals often were weak, making interpretation difficult. DNA gel blot hybridization works well when sequences share >70% nucleic acid similarity, but this threshold often is violated when making comparisons across plant families as distant as those of tomato and Arabidopsis. By computationally identifying putative ortholog sets (composed of a single tomato EST and its best Arabidopsis match), one can use the tomato probe/sequence for mapping on tomato, resulting in clear results with DNA gel blots.
Currently, we have mapped >550 COS markers in tomato and expect to map up to 1000 to elucidate the syntenic relationships between these two genomes; the results from this study will be the topic of a future publication. The current COS marker map can be viewed at http://sgn.cornell.edu. However, what we have discovered to date is as follows: (1) the COS markers can reveal segments of conserved linkage between these two genomes; (2) the size of these conserved segments usually is restricted to <10 centimorgans (Figure 3); and (3) polyploidization events that occurred both before and near the time that plant families radiated (including Solanaceae and Brassicaceae) have resulted in networks of synteny both within and between plant genomes (Ku et al., 2000).
Strategies for Using COS Markers for Comparative Mapping in Other Plant Species
Direct Hybridization
Several strategies can be imagined for using the COS marker sequences for comparative mapping in other plants. First, because the COS markers were selected to be both highly conserved and single/low copy, it is possible that some portion of them may be useful directly as hybridization probes for restriction fragment length polymorphism mapping in other species. Depending on whether the species in question is more closely related to tomato or Arabidopsis, one might choose either the tomato or the Arabidopsis probe. To test this possibility and to determine whether these COS markers are single/low copy in most other plant genomes, we constructed a "garden blot" composed of DNA from a wide range of plant species (Figures 1 and 4). The blots were probed with the COS markers listed in Table 2, first with a tomato EST clone corresponding to the COS marker and then with the counterpart Arabidopsis COS probe. The COS markers selected for testing were among the most conserved (at the amino acid level) based on tomato-Arabidopsis comparisons. Hybridization results for two of the nine tested COS markers are depicted in Figure 4. Although these were selected for display based on the quality of the DNA gel blots, the qualitative results are representative.
Two aspects of these hybridization results are worth noting. (1) In the majority of cases, both the tomato and Arabidopsis COS probes detected single- or low-copy genes in most of the species tested (Figure 4). The only exception was COS1358, for which the Arabidopsis probe hybridized to three to seven restriction fragments in many of the genomes, reflecting a small gene family (data not shown). (2) Both the tomato and Arabidopsis probes detected many if not most of the same fragments in the genomes to which they both hybridized (Figure 4). For example, with both COS1039 and COS1263, the tomato probe and the Arabidopsis probe detected nearly identical restriction fragments in the lanes for which hybridization was detected (Figure 4). However, the tomato probe gave a much stronger signal, not only with tomato but also with other species in the Solanaceae family (e.g., pepper and eggplant). Lettuce and sunflower gave weak signals (or no signal) with all probes (both tomato and Arabidopsis), a result possibly attributable to insufficient DNA loading and/or quality of DNA for these samples (Figure 4). Rice was the only monocot included in the survey; in a number of instances, it showed clear hybridization signals with both Arabidopsis and tomato probes (Figure 4). The combined results from these hybridization experiments suggest that at least the more conserved COS markers can be used directly as hybridization probes for restriction fragment length polymorphism mapping. The advantage of this strategy is that species that do not have sequence databases at present (either genomic or ESTs) still can be mapped with some COS markers. However, it is important to note that the COS markers chosen for hybridization experiments were those with the highest tomato-Arabidopsis tBLASTX values. Hence, it is possible that less conserved COS markers may be less useful for direct mapping through hybridization.
Computational Screens with COS Marker Consensus Sequences
We tested this strategy by using the tomato sequences for 10 COS markers as queries against the Medicago truncatula unigene database, which is one of the largest EST databases for a dicot species, containing >30,000 tentative consensus sequences (http://www.tigr.org). As a control, the most similar M. truncatula unigene, identified by screening with the tomato COS sequence, was screened against the Arabidopsis BAC tiling path (tBLASTX). The goal was to determine whether the M. truncatula EST sequence would identify the same Arabidopsis BAC that was identified by the tomato EST during the original screen for COS markers (see above). Table 1 lists the tBLASTX expect values for the top three M. truncatula EST matches to each tomato COS sequence. In all cases, M. truncatula ESTs with highly significant matches to each COS sequence were identified (Table 1). Furthermore, when a COS marker had only one significant match in Arabidopsis, it had only one significant match in M. truncatula as well. In all 10 cases, using the M. truncatula counterpart of each COS marker to screen the Arabidopsis tiling path identified the same segment of the same Arabidopsis BAC that was identified originally using the tomato EST. In the majority (6 of 10) of these, this Arabidopsis BAC was the most significant hit; in the other four cases, it was one of the top four most significant hits.
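The control described above amounts to a reciprocal consistency check: the M. truncatula counterpart of each COS marker is searched against the Arabidopsis tiling path, and the rank of the originally identified BAC among its hits is recorded. The following is a minimal sketch of that bookkeeping only, not the procedure used in the study. It assumes the tBLASTX results have already been parsed into plain dictionaries; the function names, the `top_n` cutoff, and all identifiers in the usage example are hypothetical.

```python
from typing import Dict, List, Optional


def rank_of_original_bac(original_bac: str, ranked_bacs: List[str]) -> Optional[int]:
    """Return the 1-based rank of original_bac in ranked_bacs, or None if absent."""
    for rank, bac in enumerate(ranked_bacs, start=1):
        if bac == original_bac:
            return rank
    return None


def reciprocal_check(
    tomato_to_bac: Dict[str, str],
    tomato_to_medicago: Dict[str, str],
    medicago_to_ranked_bacs: Dict[str, List[str]],
    top_n: int = 4,
) -> Dict[str, Optional[int]]:
    """For each COS marker, report where the BAC found with the tomato EST
    ranks among the Arabidopsis hits of its M. truncatula counterpart.

    tomato_to_bac: COS marker -> Arabidopsis BAC from the original screen.
    tomato_to_medicago: COS marker -> best M. truncatula unigene.
    medicago_to_ranked_bacs: unigene -> Arabidopsis BACs, most significant first.
    top_n: how many of the best hits to inspect (assumed cutoff).
    """
    results: Dict[str, Optional[int]] = {}
    for cos_id, original_bac in tomato_to_bac.items():
        unigene = tomato_to_medicago.get(cos_id)
        ranked = medicago_to_ranked_bacs.get(unigene, []) if unigene else []
        results[cos_id] = rank_of_original_bac(original_bac, ranked[:top_n])
    return results


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    ranks = reciprocal_check(
        {"COS_A": "BAC_1", "COS_B": "BAC_2"},
        {"COS_A": "TC_10", "COS_B": "TC_20"},
        {"TC_10": ["BAC_1", "BAC_9"], "TC_20": ["BAC_7", "BAC_8", "BAC_2"]},
    )
    print(ranks)
```

A rank of 1 corresponds to the case in which the original BAC is the most significant hit; a value of None means the original BAC was not among the inspected hits or no M. truncatula counterpart was available.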
Three-way alignments were made for each set of tomato, Arabidopsis, and M. truncatula putative orthologs to determine the relative divergence among the three (Figure 5). In addition, pairwise distances (shown as mean character differences) were calculated for these three sets of COS markers using PAUP software (Swofford, 1999).
For each COS set, the level of amino acid sequence divergence among tomato, M. truncatula, and Arabidopsis was similar, despite the fact that the tomato lineage is thought to have diverged before the M. truncatula-Arabidopsis lineage diverged (Figure 1). However, a comparison of divergence values among COS sets showed remarkable variation among these gene sets. For example, for COS1335, the divergence values for pairwise comparisons of tomato, M. truncatula, and Arabidopsis ranged from 0.162 to 0.185; the values for COS1358 ranged from 0.120 to 0.134; and the values for COS94 ranged from 0.051 to 0.074 (Table 3). Although the computational screen was limited to only a few COS markers and against a single database (M. truncatula), these results, combined with the DNA gel blot hybridization results (see above), suggest that orthologous counterparts to many if not most COS markers exist in the genomes of other plant species. As plant EST (and genomic) databases expand, the power of the computational approach to finding the COS marker counterparts in plant genomes will increase. Eventually, it may be possible to identify sufficiently conserved consensus primers that could be used to amplify COS markers from a wide variety of plant genomes. This would facilitate mapping in plant genomes that lack genomic/EST databases and also could be used to generate multiple sequence comparisons across plant taxa for phylogenetic reconstructions.
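The divergence values quoted above are mean character differences, that is, the fraction of compared alignment positions at which two sequences differ. The sketch below illustrates that calculation for already aligned sequences. It is not PAUP itself: skipping positions where either sequence has a gap or missing-data symbol is an assumption made here, and PAUP's own treatment of gaps and ambiguity codes may differ. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def mean_character_difference(seq_a: str, seq_b: str, missing: str = "-?") -> float:
    """Fraction of compared aligned positions at which two sequences differ.

    Positions where either sequence carries a gap or missing-data symbol are
    skipped (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in missing or b in missing:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differences / compared


def pairwise_distances(aligned: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Mean character difference for every pair of named, aligned sequences."""
    return {
        (x, y): mean_character_difference(aligned[x], aligned[y])
        for x, y in combinations(sorted(aligned), 2)
    }


if __name__ == "__main__":
    # Toy aligned peptides, invented for illustration only.
    toy = {
        "tomato": "MKTAYIAK",
        "medicago": "MKTAYLAK",
        "arabidopsis": "MKSAYLAK",
    }
    for pair, distance in pairwise_distances(toy).items():
        print(pair, round(distance, 3))
```

Applied to a three-way alignment, this yields the three pairwise values per COS set of the kind reported in Table 3.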
Once a number of species have been sequenced fully, it will be possible to computationally classify the corresponding proteins into probable clusters of orthologs and paralogs (Tatusov et al., 1997). Second, because plant genomes have experienced extensive gene duplication events, most genes belong to multigene families. Thus, orthologs may not be distinguished easily from paralogs. This is why we required that there be only one best match in the Arabidopsis genome when screening for putative orthologs of tomato genes. Here, a large EST database from one plant species has been screened computationally against the Arabidopsis genome and tested experimentally in a manner that could yield a large set of genes that have a high probability of being orthologs. Although there are databases/algorithms, such as TOGA (available at http://www.tigr.org/tdb/toga/orth_search.shtml), that can search for and cluster homologous sequences across multiple genome databases, the results from these analyses do not automatically distinguish between paralogs and orthologs. Although straightforward and useful for gene alignments, such an approach for establishing orthology is highly risky. The COS markers reported here can be used for comparative mapping studies between highly divergent genomes such as those of tomato and Arabidopsis. The consensus sequences of COS markers (from tomato-Arabidopsis alignments) also can be used to search genome databases of other plants to find corresponding putative orthologous genes. Therefore, these COS markers may be useful for comparative mapping across plant families and may facilitate the development of the syntenic networks across plant taxa necessary for understanding the evolution of genes, genomes, and gene functions. This set of COS markers also may serve as the basis for extending plant phylogenetic studies that are limited at present by the availability of genes for which putative orthologs can be identified readily across plant taxa.
Tomato EST Database
The tomato (Lycopersicon esculentum) EST collection is stored and accessible through the online Solanaceae Genome Network database (http://sgn.cornell.edu). The EST collection is derived from a variety of >25 different cDNA libraries, capturing genes expressed in different tissue types and developmental stages or expressed during pathogen-elicited responses (Van der Hoeven et al., 2002).
Computational Screening of Conserved Ortholog Set Markers
ESTs that met both of these criteria were classified as conserved orthologs; all others were considered potentially paralogous and eliminated. The ESTs selected as conserved orthologs then were screened computationally against the tomato unigene set currently composed of 27,000 contigs and/or singletons (Van der Hoeven et al., 2002). The 10 COS markers with the highest expect values against the Arabidopsis genome also were used to screen the Medicago truncatula EST database (http://www.tigr.org) using tBLASTX.
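The two-step screen (a significant best match to Arabidopsis, and no close second match elsewhere in that genome) can be sketched as a simple filter over parsed tBLASTX hits. The sketch below is illustrative rather than a record of the actual screen, which was performed manually. The E-15 threshold is the one given in the Results for the unigene rescreen; the definition of a "close" second match as a gap of fewer than `min_gap_orders` orders of magnitude in expect value is an assumption, as are the `Hit` record, the function names, and the handling of overlapping BACs (ignored here).

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Hit:
    query: str     # tomato EST or unigene identifier
    subject: str   # Arabidopsis BAC identifier
    evalue: float  # tBLASTX expect value


def _log10_evalue(evalue: float) -> float:
    # Guard against expect values reported as 0.0.
    return math.log10(max(evalue, 1e-300))


def select_cos_candidates(
    hits: Iterable[Hit],
    max_evalue: float = 1e-15,     # first criterion, from the text
    min_gap_orders: float = 10.0,  # second criterion, assumed definition
) -> Dict[str, str]:
    """Map each query passing both criteria to its best Arabidopsis subject."""
    best_by_subject: Dict[str, Dict[str, float]] = defaultdict(dict)
    for hit in hits:
        previous = best_by_subject[hit.query].get(hit.subject)
        if previous is None or hit.evalue < previous:
            best_by_subject[hit.query][hit.subject] = hit.evalue

    selected: Dict[str, str] = {}
    for query, by_subject in best_by_subject.items():
        ranked = sorted(by_subject.items(), key=lambda item: item[1])
        top_subject, top_evalue = ranked[0]
        if top_evalue >= max_evalue:
            continue  # best match not significant enough
        if len(ranked) > 1:
            gap = _log10_evalue(ranked[1][1]) - _log10_evalue(top_evalue)
            if gap < min_gap_orders:
                continue  # close second match: possibly paralogous
        selected[query] = top_subject
    return selected


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    demo = [
        Hit("EST_A", "BAC_1", 1e-40), Hit("EST_A", "BAC_2", 1e-5),
        Hit("EST_B", "BAC_1", 1e-40), Hit("EST_B", "BAC_3", 1e-38),
    ]
    print(select_cos_candidates(demo))
```

Queries rejected at the second step correspond to the ESTs that were considered potentially paralogous and eliminated.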
Mapping of the COS Markers in Tomato
To date, >550 of the COS markers that meet these criteria have been mapped. In addition, 200 restriction fragment length polymorphisms from the tomato high-density map (Tanksley et al., 1992
Hybridization of COS Markers to Other Species
Genomic DNA of each species was digested with EcoRI. Tomato EST clones corresponding to each of the nine selected COS markers were radiolabeled, probed onto filters of these DNA gel blots, and washed at a stringency of 1.0× SSC at 65°C. After exposure to film, the same blots were stripped and rehybridized with probes from the corresponding region in Arabidopsis. For these probes, genomic Arabidopsis DNA was amplified with primers specific to the coding regions of Arabidopsis that correspond to each of the nine COS markers (Table 2). The blots hybridized with the Arabidopsis probes were washed at a stringency of 1.0× SSC at 65°C.
Annotation of COS Markers
Functional annotation was achieved by assigning functional role categories as developed for the analysis of the Arabidopsis genome and used in conjunction with the numerical index for categories and subcategories as defined by TIGR (http://www.tigr.org). Annotation followed the Munich Information Center for Protein Sequences (http://mips.gsf.de) role categorization. A list of the role categories can be found on the SGN World Wide Web site (http://www.sgn.cornell.edu). Criteria used for role assignment required an approximate expect value of <E-30 against an experimentally characterized gene. However, in cases in which a COS marker matched a number of characterized genes of similar function with an expect value of >E-30, occasionally a role category was assigned.
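The main role-assignment rule can be expressed as a small decision function. This is a sketch under stated assumptions, not the annotation pipeline used here: it encodes only the E-30 threshold against an experimentally characterized gene and falls back to the unclassified category (r.c. 99). The occasional assignments made above that threshold, and placement in the classification unclear category (r.c. 98), relied on curator judgment and are not modeled. The `Match` record, the function name, and the category number 1 in the usage example are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

UNCLASSIFIED = 99  # r.c. 99, as named in the text


@dataclass(frozen=True)
class Match:
    evalue: float                  # BLASTX expect value against the protein database
    role_category: Optional[int]   # role category of the matched gene, if known
    characterized: bool            # True if the gene is experimentally characterized


def assign_role_category(matches: Iterable[Match], max_evalue: float = 1e-30) -> int:
    """Return the role category of the most significant qualifying match,
    or UNCLASSIFIED when no match meets the threshold."""
    qualifying = [
        m for m in matches
        if m.characterized and m.role_category is not None and m.evalue < max_evalue
    ]
    if not qualifying:
        return UNCLASSIFIED
    return min(qualifying, key=lambda m: m.evalue).role_category


if __name__ == "__main__":
    # Hypothetical matches; category 1 is a placeholder number.
    print(assign_role_category([Match(1e-45, 1, True), Match(1e-60, None, False)]))
    print(assign_role_category([Match(1e-12, 1, True)]))
```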
Special thanks to Dan Ilut, Mark Wright, and Damon Little for computational support, Yimin Xu and Eloisa Tedeschi for technical support, and Nevin Young, Anne Frary, and Todd Vision for reviewing the manuscript. DNA for the garden blots was received from Susan Brown (Geneva Experimental Station, Geneva, NY) (apple), Susan McCouch (Cornell University) (rice), Molly Jahn (Cornell University) (pepper), Shunxue Tang (Oregon State University) (sunflower), and John Yu (U.S. Department of Agriculture, College Station, TX) (cotton). This project was supported by grants from the National Science Foundation (DBI-9872617) and the U.S. Department of Agriculture Plant Genome Program (97-35300-4384).
Article, publication date, and citation information can be found at www.plantcell.org/cgi/doi/10.1105/tpc.010479. Received November 2, 2001; accepted April 18, 2002.
Adam, D. (2000). Now for the hard ones. Nature 408, 792–793.
Bennetzen, J., et al. (1998). Grass genomes. Proc. Natl. Acad. Sci. USA 95, 1975–1978.
Boutin, S.R., Young, N.D., Olson, T., Yu, Z.-H., Shoemaker, R.C., and Vallejos, C. (1995). Genome conservation among three legume genera detected with DNA markers. Genome 38, 928–937.
Chase, M.W., Soltis, D.E., Olmstead, R.G., Morgan, D., Les, D.H., Mishler, B.D., Duvall, M.R., Price, R.A., Hills, H.G., and Qiu, Y.-L. (1993). Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL. Ann. Mo. Bot. Gard. 80, 528–580.
Feinberg, A.P., and Vogelstein, B. (1983). A technique for radiolabelling DNA restriction fragments to a high specific activity. Anal. Biochem. 132, 6–13.
Gale, M., and Devos, K. (1998). Comparative genetics in the grasses. Proc. Natl. Acad. Sci. USA 95, 1971–1974.
Gandolfo, M.A., Nixon, K.C., and Crepet, W.L. (1998). A phylogenetic analysis of modern and cretaceous Triuridaceae (Monocotyledoneae). Am. J. Bot. 85, 964–974.
Ku, H.M., Vision, T., and Liu, J. (2000). Comparing sequenced segments of the tomato and Arabidopsis genomes: Large-scale duplication followed by selective gene loss creates a network of synteny. Proc. Natl. Acad. Sci. USA 97, 9121–9126.
Lander, E.S., Green, P., Abrahamson, J., Barlow, A., Daly, M.J., Lincoln, S.E., and Newburg, L. (1987). MAPMAKER: An interactive computer package for constructing primary genetic linkage maps of experimental and natural populations. Genomics 1, 174–181.
Livingstone, K.D., Lackney, V.K., Blauth, J.R., van Wijk, R., and Jahn, M.K. (1999). Genome mapping in Capsicum and the evolution of genome structure in the Solanaceae. Genetics 152, 1183–1202.
Menancio-Hautea, D., Fatokun, C., Kumar, L., Danesh, D., and Young, N. (1993). Comparative genome analysis of mungbean (Vigna radiata [L.] Wilczek) and cowpea (V. unguiculata [L.] Walpers) using RFLP mapping data. Theor. Appl. Genet. 86, 797–810.
Paterson, A.H., Bowers, J.E., Burow, M.D., Draye, X., Elsik, C.G., Jiang, C.-X., Katsar, C.S., Lan, T.-H., Lin, Y.-R., Ming, R., and Wright, R.J. (2000). Comparative genomics of plant chromosomes. Plant Cell 12, 1523–1540.
Pennisi, E. (1998). A bonanza for plant genomics. Science 282, 652–654.
Swofford, D.L. (1999). PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods), Version 4.0. (Sunderland, MA: Sinauer Associates).
Tanksley, S.D., et al. (1992). High density molecular linkage maps of the tomato and potato genomes. Genetics 132, 1141–1160.
Tatusov, R.L., Galperin, M.Y., Natale, D.A., and Koonin, E.V. (2000). The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33–36.
Tatusov, R.L., Koonin, E.V., and Lipman, D.L. (1997). A genomic perspective on protein families. Science 278, 631–636.
Van der Hoeven, R., Ronning, C., and Tanksley, S.D. (2002). Deductions about the number, organization, and evolution of genes in the tomato genome based on analysis of a large expressed sequence tag collection and selective genomic sequencing. Plant Cell 14, 1441–1456.
Vision, T., Brown, D., and Tanksley, S.D. (2000). The origins of genome duplications in Arabidopsis. Science 290, 2114–2117.
Wilson, A., et al. (1999). Inferences on the genome structure of progenitor maize through comparative analysis of rice, maize and the domesticated panicoids. Genetics 153, 453–473.
Yang, Y.W., Lai, K.N., Tai, P.Y., and Li, W.H. (1999). Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other lineages. J. Mol. Evol. 48, 597–604.