First published online March 13, 2003; 10.1105/tpc.009308

A correction to this article has been published.
The Plant Cell, Vol. 15, 809-834, April 2003, Copyright © 2003,
American Society of Plant Biologists


GENOMICS ARTICLE

Genome-Wide Analysis of NBS-LRR–Encoding Genes in Arabidopsis

Blake C. Meyers [a,b], Alexander Kozik [a], Alyssa Griego [a], Hanhui Kuang [a], and Richard W. Michelmore [1,a]

[a] Department of Vegetable Crops, University of California, Davis, California 95616
[b] Department of Plant and Soil Sciences, University of Delaware, Newark, Delaware 19711

[1] To whom correspondence should be addressed. E-mail rwmichelmore@ucdavis.edu; fax 530-752-9659


    Abstract
The Arabidopsis genome contains ~200 genes that encode proteins with similarity to the nucleotide binding site and other domains characteristic of plant resistance proteins. Through a reiterative process of sequence analysis and reannotation, we identified 149 NBS-LRR–encoding genes in the Arabidopsis (ecotype Columbia) genomic sequence. Fifty-six of these genes were corrected from earlier annotations. At least 12 are predicted to be pseudogenes. As described previously, two distinct groups of sequences were identified: those that encoded an N-terminal domain with Toll/Interleukin-1 Receptor homology (TIR-NBS-LRR, or TNL), and those that encoded an N-terminal coiled-coil motif (CC-NBS-LRR, or CNL). The encoded proteins are distinct from the 58 predicted adapter proteins in the previously described TIR-X, TIR-NBS, and CC-NBS groups. Classification based on protein domains, intron positions, sequence conservation, and genome distribution defined four subgroups of CNL proteins, eight subgroups of TNL proteins, and a pair of divergent NL proteins that lack a defined N-terminal motif. CNL proteins generally were encoded in single exons, although two subclasses were identified that contained introns in unique positions. TNL proteins were encoded in modular exons, with conserved intron positions separating distinct protein domains. Conserved motifs were identified in the LRRs of both CNL and TNL proteins. In contrast to CNL proteins, TNL proteins contained large and variable C-terminal domains. The extant distribution and diversity of the NBS-LRR sequences has been generated by extensive duplication and ectopic rearrangements that involved segmental duplications as well as microscale events. The observed diversity of these NBS-LRR proteins indicates the variety of recognition molecules available in an individual genotype to detect diverse biotic challenges.


    INTRODUCTION
Preliminary sequence analysis suggested that a significant proportion of the Arabidopsis ecotype Columbia (Col-0) genome is devoted to encoding various components of a defense system (Arabidopsis Genome Initiative, 2000). We can now evaluate in detail the repertoire of genes available in a single genotype to defend against diverse biotic challenges. Resistance (R) genes have been shown frequently by classic genetics to be single loci that confer resistance against pathogens that express matching avirulence genes in a "gene-for-gene" manner (Flor, 1956, 1971). This type of specific resistance often is associated with a localized hypersensitive response, a form of programmed cell death, in the plant cells proximal to the site of infection triggered by recognition of a pathogen product (Dangl et al., 1996; Heath, 2000). The plant resistance response triggered by R gene recognition also includes increased expression of defense genes, generation of reactive oxygen species, production or release of salicylic acid, ion fluxes, and other factors (Heath, 2000).

During the last 8 years, numerous R genes have been cloned from many plant species (Dangl and Jones, 2001; Hulbert et al., 2001). R genes encode at least five diverse classes of proteins (R proteins) (Dangl and Jones, 2001). The largest class of known R proteins includes those that contain a nucleotide binding site and leucine-rich repeat domains (NBS-LRR proteins). NBS-LRR proteins may recognize the presence of the pathogen directly or indirectly. In theory, specific recognition of multiple pathogens could necessitate the activity of numerous R genes. The guard hypothesis proposes that NBS-LRR proteins guard plant targets against pathogen effector proteins; in this scenario, these pathogen products act as virulence factors to enhance the susceptibility of the host plant in the absence of recognition (van der Biezen and Jones, 1998a; Dangl and Jones, 2001). A small number of R genes can provide defense against diverse pathogens if a limited number of effector targets are present. The definition of a complete set of NBS-LRR proteins in a plant genome will provide insights into the diversity of defense genes available in a single plant.

The NBS-LRR R proteins contain distinct domains, several of which are composed of characteristic motifs. Nucleotide binding sites are found in diverse proteins and are required for ATP and GTP binding (Walker et al., 1982; Saraste et al., 1990). The ability of plant NBS-LRR proteins to bind nucleotides has been demonstrated for the tomato I2 and Mi R proteins (Tameling et al., 2002). The NBS contains conserved motifs that can be used to classify the sequences into subgroups with discrete functions (Saraste et al., 1990; Bourne et al., 1991; Traut, 1994). The NBS-LRR plant R proteins are members of a specific and distinct subgroup of NBS proteins that contain additional protein domains, such as a C-terminal LRR region of variable length (Bent, 1996; Hammond-Kosack and Jones, 1996; Baker et al., 1997; van der Biezen and Jones, 1998b; Meyers et al., 1999; Cannon et al., 2002). The NBS-LRR family of proteins has been subdivided further based on the presence or absence of an N-terminal Toll/Interleukin-1 Receptor (TIR) homology region (Meyers et al., 1999; Pan et al., 2000; Cannon et al., 2002; Richly et al., 2002). Most of those proteins lacking a TIR have a coiled-coil (CC) motif in the N-terminal region (Pan et al., 2000). Detailed comparative analyses of the complete set of Arabidopsis R proteins have not been made.

Genetic and genomic studies have provided insights into the evolution of R genes and the mechanisms that generate variation in these genes. Classic genetic studies demonstrated that many but not all R genes are clustered in plant genomes (reviewed by Hulbert et al., 2001). Consistent with this finding, genome sequencing demonstrated that the majority of NBS-LRR–encoding genes are clustered in the genomes of both Arabidopsis and rice (Meyers et al., 1999; Bai et al., 2002; Richly et al., 2002). The clustered arrangement of these genes may be a critical attribute allowing the generation of novel resistance specificities via recombination or gene conversion (Hulbert et al., 2001). In addition, analyses of individual clusters provided evidence of diversifying selection in the majority of plant R genes studied, suggesting that variation may be concentrated within predicted binding surfaces (Parniske et al., 1997; Botella et al., 1998; Meyers et al., 1998b; Wang et al., 1998; Cooley et al., 2000; Luck et al., 2000; Mondragon-Palomino et al., 2002). The combined data from classic and molecular studies have led to models describing the predicted evolutionary constraints on these proteins and the ways in which variation is produced and maintained (Michelmore and Meyers, 1998; Mondragon-Palomino et al., 2002). Additional NBS-LRR proteins identified through ongoing genomics projects are contributing to our understanding of the mechanisms that generate sequence diversity in these proteins.

Here, we characterize the complete set of plant R gene–related NBS-encoding genes in the Col-0 Arabidopsis genome. Bioinformatics analysis combined with experimental validation demonstrated the presence of 149 NBS-LRR–encoding genes and an additional 58 related genes lacking LRRs (Meyers et al., 2002). As demonstrated previously, the NBS-LRR–encoding genes can be subdivided into two distinct classes: those with or without a TIR region. Numerous subgroups existed in both classes, as defined by intron numbers and positions, phylogenetic analyses, and encoded protein motifs. Their distribution within the Arabidopsis Col-0 genome is the consequence of numerous duplication events and ectopic rearrangements as well as conservation and preferential amplification of particular gene pairs. This bioinformatics analysis of the R gene homologs provides a definitive resource for ongoing functional and evolutionary studies of this large family of plant genes.


    RESULTS
Identification and Classification of NBS-LRR–Encoding Genes
The complete set of NBS-encoding sequences was identified from the Arabidopsis genome of ecotype Col-0 in a reiterative process (Table 1, Figure 1). Four analytical steps were used to compile the final set of sequences. First, a set of 159 genes with the NBS motif was selected from the complete set of predicted Arabidopsis proteins (http://mips.gsf.de) using a hidden Markov model (HMM) (Eddy, 1998) for the NBS domain from the Pfam database (PF0931; http://pfam.wustl.edu).


Table 1. Numbers of Arabidopsis Genes That Encode Domains Similar to Plant R Proteins

 


Figure 1. Intron/Exon Configurations and Protein Motifs of NBS-LRR–Encoding Genes in Arabidopsis.

(A) CNL genes.

(B) TNL genes. All members of the variable TNL-A subgroup are shown; only one member of the more homogeneous subgroups is diagrammed.

(C) Additional genes that encode CC, TIR, or NBS domains similar to the CNL or TNL proteins. TN and TX genes are described in more detail by Meyers et al. (2002).

Encoded protein domains are indicated with shading and colors. Exons are drawn approximately to scale as boxes; connecting thin lines indicate the positions of introns, which are not drawn to scale. Numbers above introns indicate the phase of the intron (see text). Numbers under "# in Col-0" indicate the total number found in the Col-0 genomic sequence; the "representative" columns list the diagrammed gene for each type. Genes of known function are shown where available.

 
In the second analytical step, selected protein sequences were aligned based only on the NBS domain using CLUSTAL W. This alignment then was used to develop an Arabidopsis-specific HMM model to identify related sequences. The refined HMM was compared again against the complete set of predicted Arabidopsis proteins. All sequences that matched the model with a score of 0.05 or greater were incorporated into the HMM. The refined HMM was compared again with the entire set of Arabidopsis open reading frames (ORFs) with the threshold for acceptance decreased to 0.001. The 10 sequences with scores just above this threshold and the 15 sequences with scores just below this threshold were analyzed for the presence of the TIR, NBS, or LRR motifs using Pfam and visual inspection. Four of the 10 sequences just above the 0.001 threshold value did not contain TIR, NBS, or LRR motifs and were discarded; all sequences above these 10 contained NBS motifs. Below this threshold, only 2 of the next 15 proteins contained the NBS motif by Pfam analysis and therefore were retained in the analysis. The remaining 13 low-scoring proteins were either predominantly LRRs or were receptor-like kinases; all lacked any recognizable NBS motifs. This analysis identified 194 annotated genes that encoded homologs of NBS-LRR R proteins.
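The threshold-and-review step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `Hit` type, the function name `triage_hits`, and the gene identifiers are hypothetical, and the "score" is treated as an E-value-like quantity in which smaller values are better matches.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    gene_id: str    # hypothetical identifier, for illustration only
    evalue: float   # assumption: smaller value means a better match to the HMM


def triage_hits(hits, threshold=0.001, n_above=10, n_below=15):
    """Split HMM hits around a cutoff and flag the borderline ones for manual review.

    Returns (accepted, review_pass, review_fail):
      accepted    - hits that pass the cutoff comfortably
      review_pass - the n_above weakest hits that still pass the cutoff
      review_fail - the n_below strongest hits that miss the cutoff
    """
    ranked = sorted(hits, key=lambda h: h.evalue)  # best match first
    passing = [h for h in ranked if h.evalue <= threshold]
    failing = [h for h in ranked if h.evalue > threshold]

    review_pass = passing[-n_above:] if n_above > 0 else []
    review_fail = failing[:n_below]
    accepted = passing[:len(passing) - len(review_pass)]
    return accepted, review_pass, review_fail


if __name__ == "__main__":
    demo = [Hit("geneA", 1e-10), Hit("geneB", 1e-5), Hit("geneC", 5e-4),
            Hit("geneD", 0.002), Hit("geneE", 0.01)]
    accepted, review_pass, review_fail = triage_hits(demo, n_above=1, n_below=1)
    print([h.gene_id for h in accepted])     # ['geneA', 'geneB']
    print([h.gene_id for h in review_pass])  # ['geneC']
    print([h.gene_id for h in review_fail])  # ['geneD']
```

The borderline sets would then be inspected for TIR, NBS, or LRR motifs by other means (Pfam and visual inspection in the text); the sketch only selects which hits deserve that closer look.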

In the third step, we performed TBLASTN analyses using eight sequences selected to represent the diversity of NBS-LRR proteins to search the entire Arabidopsis genomic sequence to ensure that there were no additional related genes that had not been identified as ORFs by the automated annotation. All resulting sequences in the BLAST (Basic Local Alignment Search Tool) output (up to E = 1.0) were assessed manually for the presence of TIR, NBS, LRR, or R protein–like CC domains. This procedure identified four additional sequences. Finally, manual reannotation, intron/exon analysis, and protein motif comparisons were performed on all of the selected sequences to correct misannotation (as described below). Combined, these analyses identified 207 distinct genes encoding R protein–like TIR, CC, and NBS-LRR domains.

The predicted proteins encoded by these genes were classified initially based on Pfam protein motif analyses (Table 1). We restricted our current analyses to the 149 genes that encode both NBS and LRR domains because the LRR motif is present in diverse proteins unrelated to plant R genes. These 149 NBS sequences included 11 cloned R genes or the closest Col-0 homologs to R genes cloned from other Arabidopsis ecotypes. The additional 58 Arabidopsis genes identified during our search, most of which encode TIR motifs but not LRRs, have been described elsewhere (Meyers et al., 2002).
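The class labels used throughout this article (TNL, CNL, NL, TN, CN, TX, C) follow from which domains are detected in a predicted protein. A minimal sketch of that mapping is given below; the function name `classify` and the string domain labels are assumptions made for illustration, and the logic is a simplification that ignores domain order and the finer subgroups.

```python
def classify(domains):
    """Map a collection of detected domain names to a class label.

    Domain names are assumed to be the strings "TIR", "CC", "NBS", and "LRR".
    """
    found = set(domains)
    has_nbs = "NBS" in found
    has_lrr = "LRR" in found

    if "TIR" in found:
        if has_nbs and has_lrr:
            return "TNL"   # TIR-NBS-LRR
        if has_nbs:
            return "TN"    # TIR-NBS
        return "TX"        # TIR-X: TIR without an NBS
    if "CC" in found:
        if has_nbs and has_lrr:
            return "CNL"   # CC-NBS-LRR
        if has_nbs:
            return "CN"    # CC-NBS
        return "C"         # CC domain only
    if has_nbs and has_lrr:
        return "NL"        # NBS-LRR without a defined N-terminal motif
    return "other"


if __name__ == "__main__":
    print(classify(["TIR", "NBS", "LRR"]))  # TNL
    print(classify(["CC", "NBS"]))          # CN
    print(classify(["NBS", "LRR"]))         # NL
```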

Detailed information about these NBS-encoding sequences is presented in our online database (http://www.niblrrs.ucdavis.edu). This database of NBS sequences includes links to the MIPS and TIGR Arabidopsis databases, gene locations, Pfam analyses of motifs, EST matches, and FASTA results for these sequences compared with either the complete Arabidopsis genome or the GenBank nonredundant set.

Predicted Pseudogenes and Annotation Errors Identified by Manual Reannotation
The initial sequence comparisons indicated that numerous NBS-LRR sequences had been partially misannotated during the automated annotation process. The automated annotations available in GenBank, MIPS, and TIGR represent powerful and useful initial attempts at annotation but generally have not been verified and corrected for individual genes and gene families (Haas et al., 2002). Therefore, we undertook the complete manual reannotation and analysis of the NBS-LRR gene family to rectify incorrect start codon predictions, splicing errors, missed or extra exons, fused genes, split genes, and incorrectly predicted pseudogenes. Nonfunctional genes, or "pseudogenes," were predicted on the basis of frameshift mutations or premature stop codons (Table 2); such reading frame disruptions were not identified by automated annotation programs, which instead inserted introns around the frameshift or nonsense mutations (data not shown). Mutations were identified by comparing DNA and protein sequences and by comparing intron positions and numbers of closely related gene homologs.
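The two kinds of reading-frame disruption mentioned above can be illustrated with a simple check on a candidate coding sequence. This is a simplified sketch, not the comparative procedure used in the study (which relied on alignment with related genes): it only reports a length that is not a multiple of three and in-frame stop codons before the final codon. The function name and the example sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reading_frame_problems(cds):
    """Return a list of human-readable problems found in a coding sequence.

    cds is a DNA string that is expected to start at the first base of the
    start codon and to end with the stop codon.
    """
    cds = cds.upper()
    problems = []

    remainder = len(cds) % 3
    if remainder != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")

    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - remainder, 3)]

    # The final codon may be the legitimate stop, so it is not checked.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {index + 1}")
    return problems


if __name__ == "__main__":
    print(reading_frame_problems("ATGGCTTAA"))     # []
    print(reading_frame_problems("ATGTAAGCTTAA"))  # ['premature stop codon TAA at codon 2']
```

A gene model in which an annotation program has hidden such a disruption inside a predicted intron would pass this check, which is why the text relies on comparison with homologs rather than on the annotated coding sequence alone.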


Table 2. Pseudogenes and Annotation Errors in Arabidopsis CNL and TNL Genes

 
For each gene, the number of introns and their positions relative to encoded protein motifs and domains were determined. Intron positions and numbers generally were consistent with phylogenetic data, allowing the identification of anomalous exons and introns. Introns occurring in nonconserved locations were reanalyzed by BLASTX comparisons using the intron sequence plus ~100 bp of 5' and 3' exon sequences. In 37 genes, either (1) translation and BLAST comparison of a small predicted intron matched the predicted protein sequence of another NBS-LRR protein (indicating that the intron prediction probably was incorrect), or (2) small additional nonconserved exons (<50 bp) were identified for which no similar exons could be found in comparisons with closely related genes (Table 2). In total, our reannotation of the CNL and TNL genes (genes that encode an N-terminal CC motif [CNL] or an N-terminal domain with TIR homology [TNL]) differed from the automated annotation in 56 of 149 genes. Combined with the reannotated TX (TIR-X) and TN (TIR-NBS) genes (Meyers et al., 2002), we calculated that ~36% of automated annotations contained errors. This value is consistent with that found in previous large-scale analyses of other Arabidopsis genes (Haas et al., 2002).

We amplified by PCR and resequenced genomic DNA from Col-0 to verify experimentally the predicted frameshift and nonsense mutations in the Arabidopsis Col-0 CNL and TNL genes. Our reannotation identified 13 genes for which the translation of a predicted intron sequence encoded protein sequence that matched other NBS-LRR proteins but included either a frameshift or a nonsense mutation (Table 2). We were able to amplify the regions encoding these mutations in 11 of the 13 genes; these 11 predicted pseudogenes contained 14 predicted mutations (Table 2; two sites each in At4g14610, At1g59620, and At4g09360). In 9 of the 11 genes, containing 11 of the 14 putative mutations, the sequences matched perfectly the published genomic sequence, indicating that these genes did contain disrupted reading frames and are likely pseudogenes. Neither of two frameshift mutations predicted in At4g14610 was found in the Col-0 accession that we analyzed, indicating a single complete ORF for this gene and errors in the published sequence. In addition, an error was identified in the sequence and annotation of the TNL gene At4g19500 (Meyers et al., 2002).

Additional pseudogenes were predicted as those that lacked specific motifs or contained large deletions even though they had apparently intact ORFs (Table 2). For example, At5g47280 lacks a CC motif in the predicted protein as a result of a deletion at the 5' end of the gene. At5g45210 lacks most of the encoded LRR and C terminus present in other homologs. In the absence of functional data for these genes, it cannot be inferred with certainty whether these are pseudogenes. However, we identified 12 potential pseudogenes with uninterrupted ORFs that had deletions, in addition to the nine predicted pseudogenes with disrupted reading frames (Table 2).

In a few groups of closely related sequences, variable numbers of exons were observed, and these differences could not be attributed to disrupted reading frames or incorrect annotation (Figure 1). Among the CNL genes, At1g61180 and At1g61190 have an additional 3' exon. Greater diversity in exon numbers was observed among the TNL genes than among the CNL genes, with most TNL genes containing four exons and most CNL genes containing only one exon (Figure 1). The Col-0 homologs of the RPP1 genes (Botella et al., 1998), including genes At3g44480, At3g44510, At3g44630, At3g44670, and At3g44400, show an unusual exon configuration; some of these genes contain an additional 5' exon and/or 3' exon. Database searches with these genes identified two ESTs, providing evidence of alternative splicing of the exons at the 3' end of the gene. This finding indicates that there may be additional variation in the exon number that cannot be determined without full-length cDNA clones. In addition, we have not considered noncoding exons in the 5' and 3' untranslated regions in this analysis, although among known R genes in Arabidopsis, noncoding exons have been reported only for RPP1 (Botella et al., 1998). Analysis of cDNA sequences from the 5' and 3' ends of the NBS-LRR–encoding genes demonstrates that 10 of 80 analyzed genes contain noncoding exons (X. Tan, B. Meyers, and R.W. Michelmore, unpublished data).

Intron Positions and Phases Distinguish Subgroups and Indicate the Modular Nature of TNL Proteins
We analyzed the intron positions and phases in the different subgroups of the 149 CNL and TNL genes and in the closely related genes to assess the diversity within and between each group. Intron phases in spliceosomal introns can be classified based on the position of the intron with respect to the reading frame of the gene: phase-0 introns lie between two codons; phase-1 introns interrupt a codon between the first and second bases; and phase-2 introns interrupt a codon between the second and third bases (Sharp, 1981). Intron phases usually are conserved, because a modification of the phase on one side of the intron requires a concordant change at the distal location to maintain the reading frame (Long and Deutsch, 1999). Three distinct patterns of intron phases and positions were identified in CN and CNL genes (Figure 1A). These probably reflect the independent acquisition or loss of introns; a fourth pattern exhibited by two genes reflects the addition of a 3' exon separated by a phase-0 intron. A greater degree of variation in the number of introns was observed among TN, TX, and TNL genes, but the positions and phases of individual introns were highly conserved with respect to the protein motifs encoded by flanking exons (Figures 1B and 1C). Much of the variation in intron numbers in the TNL genes was caused by the addition of 3' exons that encode LRR motifs separated by phase-0 introns (Figure 1B). The greater diversity of intron positions and phases in the CN/CNL genes (as opposed to intron and exon numbers) may indicate that this group is more ancient than the TN/TNL gene family. Recent studies also have found shorter branch lengths for phylogenetic trees of TNL genes (Cannon et al., 2002), also suggesting that this group may have evolved more recently than the CNL genes.
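The phase definition given above reduces to simple arithmetic: the phase of an intron is the number of coding bases that precede it, modulo 3. The following sketch makes that explicit; the function name and the example exon lengths are hypothetical and not taken from any gene in this study.

```python
def intron_phases(coding_exon_lengths):
    """Return the phase (0, 1, or 2) of each intron in a gene.

    coding_exon_lengths lists, in 5' to 3' order, the number of coding bases
    contributed by each exon, counted from the first base of the start codon.
    A gene with n exons has n - 1 introns.
    """
    phases = []
    coding_bases_so_far = 0
    for length in coding_exon_lengths[:-1]:  # an intron follows every exon but the last
        coding_bases_so_far += length
        # 0: intron lies between two codons
        # 1: intron splits a codon after its first base
        # 2: intron splits a codon after its second base
        phases.append(coding_bases_so_far % 3)
    return phases


if __name__ == "__main__":
    print(intron_phases([300, 100, 50]))  # [0, 1]
    print(intron_phases([10, 7, 10]))     # [1, 2]
```

The same arithmetic shows why phases tend to be conserved: shifting an intron boundary by one base changes the phase, and the reading frame is preserved only if a compensating change occurs at the other end of the intron.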

Conserved Domains and Motifs in CNL and TNL Proteins
The 149 reannotated CNL and TNL genes were translated and subjected to protein domain and motif analyses. The protein analysis programs hmmpfam and hmmsearch (Eddy, 1998) were used initially to identify the major domains encoded in these genes. These programs were suitable for defining the presence or absence of the TIR, NBS, and LRR domains, but they could not recognize smaller individual motifs or more dispersed patterns, such as those present in the CC domain. Based on preliminary Pfam analyses of the entire predicted proteins as well as homology with previously described motifs within the NBS (Meyers et al., 1999, 2002; Cannon et al., 2002), we initially divided the 149 genes into two major classes that encode either 55 CC-NBS-LRR or 94 TIR-NBS-LRR proteins. The NBS domain was defined by Pfam analysis; the NBS, N-terminal, and LRR plus C-terminal regions then were analyzed individually using the program MEME (Multiple Expectation Maximization for Motif Elicitation) (Bailey and Elkan, 1995). These analyses are described below in the order in which the domains are positioned in the proteins, starting at the N terminus (Figure 1).

The N-Terminal Domain
Immediately adjacent to the translation initiation codon of the majority of TNL proteins, we identified N-terminal amino acid residues similar to those that may enhance gene expression and protein stability. Analysis with MEME identified the motif SSSSSRNWRY N-terminal to the first TIR motif with a score of <e-04 in 67 of 93 proteins classified as TNLs (MEME output 1; see supplemental data online). Similar Ala-polyserine sequences immediately after the N-terminal Met [MA(S)n] have been found in a variety of highly expressed genes, and mutations in these sequences have been shown to reduce reporter protein stability in plants (Sawant et al., 2001). Twenty-nine of the 67 TNL proteins with the Ser-rich motif at the N terminus had sequences close to the consensus MA(S)n; an additional 23 more TNL proteins had variants of MA(S)n with several nonconserved substitutions (see supplemental data online). The Ser-rich motif was present in 12 of the closest homologs of RPP28 (At2g14080) (N. Sepahvand, P.D. Bittner-Eddy, and J.L. Beynon, unpublished data); however, it was preceded by an ~40–amino acid N-terminal region containing a unique conserved motif (motif 13 in MEME output 1; see supplemental data online). The three closest homologs to the R gene RPP1 in the ecotype Wassilewskija also encoded motif 13 as well as an additional N-terminal novel motif encoded by a separate 5' exon that was described previously by Botella et al. (1998). No sequences related to MA(S)n were present at the N terminus of CNL proteins.
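The strict consensus MA(S)n, a Met-Ala followed by a run of Ser residues at the very start of the protein, can be expressed as a regular expression. The sketch below is illustrative only: the function name is hypothetical, the example strings are made-up N termini assembled from the motif quoted above rather than real protein sequences, and the variants with nonconserved substitutions mentioned in the text would not be matched by so strict a pattern.

```python
import re

# Met, Ala, then one or more Ser, anchored at the N terminus.
MAS_PATTERN = re.compile(r"^MA(S+)")


def mas_run_length(protein):
    """Return n for an N-terminal MA(S)n match, or 0 if the protein does not start with MAS."""
    match = MAS_PATTERN.match(protein.upper())
    return len(match.group(1)) if match else 0


if __name__ == "__main__":
    print(mas_run_length("MASSSSSRNWRY"))  # 5
    print(mas_run_length("MGSSSSSRNWRY"))  # 0
```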

Several conserved motifs were confirmed that had been identified previously in the TIR domain of plant NBS-LRRs and related proteins (motifs TIR-1, TIR-2, TIR-3, and TIR-4) (Meyers et al., 1999, 2002). The order of these motifs was well conserved. Previous findings had noted duplications of the TIR motifs in some Arabidopsis proteins (Meyers et al., 1999); these unusual proteins in the TNL-A subgroup (Figure 1) are considered in more detail below and by Meyers et al. (2002). Within the group of TNL proteins, only the TNL-A subgroup contained a slight variation on the TIR-A motif (MEME output 1; see supplemental data online). Overall, the TIR motifs of the TNL proteins were essentially as described previously (Meyers et al., 2002) and included ~175 amino acids.

The presence of an N-terminal CC domain has been identified as a characteristic motif in the N terminus of the CNL R proteins (Pan et al., 2000), and the presence or absence of a CC motif can be anticipated on the basis of characteristic motifs present in the NBS (Meyers et al., 1999, 2002). We had initially defined the group of 55 CNL proteins based on motifs in the NBS and a lack of TIR motifs (Table 1). Because CC motifs are not defined in the Pfam database, motifs within the N-terminal region of CN and CNL proteins were analyzed using the program COILS (Lupas et al., 1991) to assess the positions and prevalence of CC motifs. In total, the CC domain of the CNL proteins spanned ~175 amino acids N terminal to the NBS. The predicted CC motif was positioned from 25 to 50 amino acids from the N terminus in most CNL proteins. There was strong evidence of an N-terminal CC motif in 50 of 55 CNL proteins; evidence for a CC motif was weak in At3g14460. Four proteins (NL proteins [Table 1]) had NBS motifs similar to CNLs but lacked a CC motif. At5g47280 and At1g61310 contained apparent N-terminal deletions that removed the region of the protein in which the CC motif was found in closely related homologs of these proteins. At4g19050 and At5g45510 were divergent NBS-LRR proteins that showed no evidence of a CC motif and contained few amino acids N terminal to the NBS (Figure 1C). Four of five CN proteins had a clear CC motif; At5g45440 did not. Using COILS, CC motifs were not identified in the N terminus of TN or TNL proteins, demonstrating the specificity of this motif to the CNL group.

We identified 20 distinct motifs in the N-terminal domain from the 60 CNL proteins using MEME (Figure 2; MEME output 4; see supplemental data online). Fourteen motifs were common and found in more than six CNL proteins. Up to seven motifs were present in individual proteins. In 49 proteins, one of two distinct MEME motifs, 1 or 7, was coincident with the CC pattern identified by COILS. We identified three patterns of CC domains based on shared MEME motifs (see supplemental data online). These three CC motif patterns (CNL-A, CNL-B, and CNL-C/D) matched the subgroups defined by intron position (Figure 1) and the clades identified in phylogenetic analyses using the NBS domain (see below). Pair-wise comparisons of motifs demonstrated little sequence similarity or overlap between distinct motifs located in similar positions in the CC domains of these three subgroups. One subgroup was divided further; the CNL-C motif pattern was closely related to but distinct from the CNL-D pattern. Among the five CN proteins, the CC domain of the CN-B class was closely related to that of the CNL-B class, whereas the CN-C class was more divergent (see supplemental data online). Although At5g45440 did not contain a predicted CC motif, it did have conserved N-terminal motifs (MEME output 4; see supplemental data online). The BLAST search of the Arabidopsis genomic sequence described above also revealed a gene, At3g26470, that encodes only a CC domain related to the CNL-A subgroup (score of 5e-48); this is the C protein listed in Table 1.



Figure 2. Motif Patterns in CNL and TNL Proteins.

Different colored boxes and numbers indicate separate and distinct motifs identified using MEME (Bailey and Elkan, 1995) and displayed by MAST (Bailey and Gribskov, 1998). Motifs are colored the same in (A), (B), and (C). ID, identifier number.

(A) Examples of summarized and aligned MEME motifs for different domains of CNL and TNL proteins. All proteins are displayed in the supplemental data online. Thin dotted lines indicate their linear order. Motifs from the MEME analyses in supplemental data online (MEME outputs 1 to 6) were consolidated and aligned manually in a spreadsheet. To allow alignment, the size of the colored and numbered box does not correspond to the size of the motif. Because motif analyses had to be performed for each domain separately for each of the CNL and TNL groups of proteins, numbers and colors are specific only to that domain. The MEME "score" for the overall match of the protein to the motif models is given as a P value. Missing motifs may indicate either a poor match (>e-04) or a deleted domain.

(B) Examples of MEME output of the same proteins summarized in (A). Data for all proteins are available in the supplemental data online (MEME outputs 1 to 6). The sizes of the boxes and the gaps between motifs are drawn according to scale to illustrate the relative sizes and positions of each domain and motif that is not displayed in (A).

(C) Two examples of the motifs found in individual CNL and TNL protein sequences that are displayed in (A) and (B). Colors were added manually to illustrate the motifs identified by MEME and displayed by MAST. MEME motif alignments with the sequences are available in the output of the MAST program in the supplemental data online (MAST outputs 1 to 6).

 
The NBS Domain
Previous work had identified eight major motifs in the NBS region, and several of these motifs demonstrated different patterns depending on whether they were present in the TNL or CNL groups (van der Biezen and Jones, 1998b; Meyers et al., 1999). We analyzed the 149 TNL and CNL predicted proteins using MEME. MEME identified motifs that matched the eight major motifs identified previously. However, MEME identified more than eight motifs. The configuration of the motifs identified by MEME reflected conservation within subgroups and diversity between different subgroups of TNL and CNL sequences (Figure 2; see supplemental data online). The eight major motifs differed in their divergence within and between the CNL and TNL groups (Table 3). In the current study, the pre-P-loop sequence (described previously as part of the TIR [Meyers et al., 1999]) and the P-loop were considered as a single motif. The P-loop, kinase-2, RNBS-B, and GLPL motifs demonstrated a high level of similarity between CNL and TNL proteins (Table 3). The RNBS-A and RNBS-D motifs were dissimilar, and the RNBS-C motif had low similarity between the Arabidopsis CNL and TNL proteins (Table 3), as was observed for plant R protein homologs in general (Meyers et al., 1999).


Table 3. Major Motifs in Predicted Arabidopsis CNL and TNL Proteins

 
Although not immediately apparent from the consensus sequence shown in Table 3, the second and third amino acids of the GLPL motif in the NBS of many TNL proteins did not match the commonly identified consensus core GLPL (see NBS alignment in the supplemental data online). Rather, the most common variations contained the consensus GNLPL or SGNPL and lacked contiguous GL residues within the core of the motif. This is critical to the design of degenerate oligonucleotide primers for the amplification of R genes that often have used this motif (see Discussion).

Finally, the eighth conserved major motif in the NBS has been called MHDV, based on clearly conserved amino acids in the CNL proteins (Collins et al., 1998). This motif was beyond the most C-terminal RNBS-D motif identified in our previous work (Meyers et al., 1999) and was highly conserved in CNL proteins, with a minor variation (QHDV) present in the CNL-A subgroup (Table 3; see supplemental data online). The MHDV motif is slightly different in the TNL proteins, but it is clearly present and also starts with a conserved Met followed by a His (Table 3). The MHDV motif was not identified in any of the proteins that lacked an LRR (CN or TN), nor was it present in the divergent NL proteins At5g45510 and At4g19050. We considered this motif to represent the C-terminal end of the NBS, at least when LRRs are present. Mutations in the conserved Asp of the CNL variant of the MHDV motif resulted in a gain-of-function phenotype in the potato Rx protein (Bendahmane et al., 2002). In total, the eight NBS motifs from P-loop to MHDV spanned ~300 amino acids in the CNL and TNL proteins.

The LRR Region
The LRR region is characterized by leucine-rich repeats C-terminal to the NBS in many R genes (Jones and Jones, 1997). However, the precise start and number of LRRs had not been well defined in many NBS-LRR proteins. Therefore, we analyzed all predicted protein sequences encoded 3' to the NBS to define the boundaries, numbers, and diversity of repeats in this domain. Initially, MEME was used as described previously except that the length and number of sequences required two rounds of analysis. First, samples of the CNL and TNL groups were analyzed together; then, all sequences within each group were analyzed separately. Parallel to the MEME analysis, we used the method described by Mondragon-Palomino et al. (2002) to estimate the number of LRR units in each protein. We manually combined secondary structure analyses derived from the program SSPro (Pollastri et al., 2002) with LRR consensus sequences to identify the individual repeats.

As a first step in defining the full LRR, we sought to determine if the LRR domain began immediately C terminal to the MHDV motif (the last conserved NBS motif) or if a spacer region separated the two domains. We analyzed all amino acids encoded immediately 3' to the encoded MHDV motif. In TNL genes, a short exon averaging ~300 bp was found between the encoded NBS described above and longer exons more 3' that clearly encoded LRR motifs. This exon is conserved in diverse TNL genes from other plant species (see above). In the latter half of this exon, previous studies identified hypervariable amino acids and predicted up to two LRR motifs encoded for some Arabidopsis TNL genes (Noel et al., 1999). Our MEME analysis identified motifs matching the canonical LRR patterns (Jones and Jones, 1997) encoded at the 3' end of this exon (identified as 5 or 14 in the NBS MEME analysis; see supplemental data online). The manual analysis confirmed two LRRs encoded in this exon. In addition, two conserved motifs that were not identified as LRRs were found between the NBS and LRR domains in TNL proteins. MEME motif 8 was bisected by the intron, and motif 11 was in the middle of the short exon N-terminal to the first LRR (MEME analysis 2; see supplemental data online). Therefore, there were ~65 amino acids between the NBS and LRR domains in TNL proteins; we designated this non-LRR region the NL linker (NBS-LRR linker).

CNL genes predominantly lacked an intron between the NBS and the LRR. Only the CNL-A class had an intron in this position (Figure 1). Manual analysis of LRR motifs in the CNL proteins identified LRR repeats starting ~40 amino acids C terminal to the NBS MHDV motif, consistent with previous analyses of individual CNL proteins (Bent et al., 1994; Grant et al., 1995; Warren et al., 1998; Cooley et al., 2000). MEME motif analysis in this region of the CNL sequences identified a short conserved NL linker of ~40 amino acids. The motif for this linker was conserved within the different CNL classes but varied among classes (Table 3; motifs 9 [latter half], 14, and 28 in MEME analysis 5; see supplemental data online). In TN and CN proteins that lack the LRR (Meyers et al., 2002), we found no evidence of the NL linker protein sequences.

The C-terminal boundary of the LRR region was defined as the point at which LRRs no longer could be recognized. Based on the manual and MEME analyses, LRRs constituted approximately half of the C-terminal region in the TNL proteins and nearly the entire C-terminal region in CNL proteins. The TNL LRR domains contained a mean of 14 LRRs (standard deviation of 4.2, range of 8 to 25; see supplemental data online). MEME analysis of the TNL LRR domains identified ~10 distinct MEME motifs that spanned ~350 amino acids. The CNL proteins also had a mean of 14 LRRs (standard deviation of 3.5, range of 9 to 25; see supplemental data online), including ~10 distinct MEME motifs with >350 amino acids. Although MEME motifs did not correspond precisely to individual LRR units, duplication patterns were observed clearly as repeated motifs in >18 CNL LRRs and 46 TNLs (MEME analyses 3 and 6; see supplemental data online). These data suggest that CNL and TNL LRR domains are similar in length and that duplications of LRRs accounted for much of the variation in length.

Finally, the MEME motifs and patterns of repeats in the manually defined LRRs were examined to determine the conservation of LRRs within and among CNL and TNL proteins. MEME identified a variety of LRR-related motifs. These MEME motifs were less consistent in order, spacing, and number than MEME motifs identified in the other domains (see supplemental data online). Most proteins did not have a regular pattern; however, several predicted proteins had highly regular patterns of repeats, including At1g69550, At5g44510, and At2g14080 and to a lesser extent At1g27170 and At1g27180. Few motifs were similar between TNL and CNL proteins (MEME analysis 7; see supplemental data online). Motif 1 in the LRR domain of both TNL and CNL proteins was related (Table 3). This MEME-identified motif corresponds to the previously described, conserved third LRR, in which a mutation in the Arabidopsis CNL RPS5 had epistatic effects on disease resistance (Warren et al., 1998) and a mutation produced a gain-of-function phenotype in the potato Rx protein (Bendahmane et al., 2002).

In the TNL proteins, C terminal to the location of the motif-1 complex, duplicated patterns of LRR motifs were observed. In some subgroups, predominantly TNL-E, separate exons encoding duplications within the LRR region were common (Figure 1). These duplicated exons were recognizable by the repetition of LRR motif 1; this motif was encoded at the 5' end of these exons. The 24 proteins in subgroup TNL-H were homogeneous in the composition and arrangement of their LRR motifs, probably reflecting the recent expansion of the subgroup (see supplemental data online). Motif 4 included the most C-terminal recognizable LRR motif in most TNL subgroups (Table 3; see supplemental data online).

In the CNL proteins, the LRR motif patterns were conserved within subgroups, but each subgroup was characterized by distinct sets of motifs. Motif 1 was conserved in all CNL subgroups except for CNL-A, which lacked this motif. Several motifs were unique to individual subgroups (see supplemental data online). The final LRR motif detectable in most CNL proteins was motif 8 (Table 3; see supplemental data online). The last occurrence of this motif typically ended 40 to 80 amino acids before the C terminus of the protein.

The C-Terminal Domain
The CNL and TNL groups differed markedly in the size and composition of sequences C-terminal to the LRR domain. The difference in the C-terminal domain accounted for much of the increased total length of TNL versus CNL proteins. The CNL proteins had conserved motifs present in the 40– to 80–amino acid C-terminal domain; like the NL linker, these motifs were specific to the CNL-A, CNL-B, and CNL-C/D subgroups (Table 3; see supplemental data online). By contrast, the C termini of the TNL proteins had a large number of non-LRR conserved motifs spanning ~200 to 300 amino acids. As reported previously for TNL proteins of known function (Gassmann et al., 1999; Dodds et al., 2001), the C-terminal non-LRR domain is approximately as large as the LRR domain. The two motifs, 8 and 25 (MEME analysis 3; see supplemental data online), began subsequent to the last LRR (motif 4) in most proteins of all TNL subgroups. C-terminal motifs were conserved within each subgroup but were less conserved among subgroups than were motifs within the TIR or NBS domains (see supplemental data online). In several members of the TNL-F subgroup, duplications of entire exons resulted in duplicated C-terminal motifs. Although the functional roles of these C-terminal motifs are unclear, their conservation and wide distribution throughout the TNL subgroup suggest that these domains are important for protein function.

A putative nuclear localization signal (NLS) was described by Deslandes et al. (2002) in the C-terminal domain of the Arabidopsis TNL:WRKY resistance protein RRS1 and cited as evidence for the nuclear localization of R genes (Lahaye, 2002). The motif patterns in the C-terminal domain of RRS1 and its putative Col-0 ortholog At5g45050 were similar to those of other TNL-A subgroup members. MEME motif 17 included the putative NLS identified by Deslandes et al. (2002) and was found in the C-terminal domain of most TNL proteins (MEME analysis 3; see supplemental data online). However, the particular amino acids representing the putative NLS sequence were not conserved among TNL proteins, suggesting that the proposed NLS in RRS1 is either spurious or a unique variant of the conserved C-terminal domain found in most TNL proteins.

Nonconserved Domains
Nine TNL proteins had unusual configurations or additions other than the TIR-NBS-LRR C-terminal domain structure described above (Figure 1). Most of these proteins were in either the TNL-A or the TNL-C subgroup. Several of these predicted anomalous domain configurations have been confirmed in previous experimental analyses (Deslandes et al., 2002; Meyers et al., 2002). At1g27170 and At1g27180 encode duplications of the TIR domain; At4g36140 and At4g19500 encode TN:TNL fusions; and At2g17050 and At4g19520 encode TNL:TX fusions. TN or TX proteins have been suggested to play a role as adapter proteins (Meyers et al., 2002). In addition, the R gene RRS1 and its Col-0 homolog At5g45050 encode a WRKY motif fused at the C terminus (Deslandes et al., 2002). At4g12020 is the most unusual TNL protein; it contains a WRKY-related protein domain at the N terminus and a sequence similar to mitogen-activated protein kinase kinase kinases in place of the C-terminal domain. Based on the varied similarities of its 16 exons, At4g12020 appears to be a chimera composed of parts of five other genes, and it shares a predicted promoter region of only 273 bp with At4g12010 (see below) (Figure 3A). At5g17890 encodes a TNL protein with a C-terminal fusion to a neutral zinc metallopeptidase; a similar domain also is present in one unusual CNX protein, At5g66630. The chimeric At5g66630 apparently resulted from a small translocation of the 5' end of At5g66890 and resides within a small cluster of homologs, At5g66610 to At5g66640 (Figure 3B). The neutral zinc metallopeptidase family is encoded by only seven paralogs in the Col-0 genome, and two of these seven are part of either CNX or TNLX proteins (Figure 1). The functional significance of these unusual domain configurations and additions is unknown.



Figure 3. Modifications of Two TNL Proteins Caused by Genic Rearrangements.

(A) Gene At4g12020 encodes protein domains similar to five different genes. Exons (Ex) 2 and 9 encode in-frame fusions of distinct protein domains. Based on sequence homologies, exons 2 and 3 apparently were inserted into exons 1, 4, and 5. Exons 6 to 9 encode TNL domains fused at the 3' end to a mitogen-activated protein kinase kinase kinase homolog. The complete gene was found in a head-to-head orientation with TNL At4g12010; 273 bp separates the predicted translational start codons of these genes.

(B) Gene At5g66630 encodes an NBS fused to neutral zinc metallopeptidase motifs; the NBS of this gene is related most closely to a nearby family of CNL genes, one of which is lacking the NBS region, suggesting a translocation of this domain. At5g17890 is a TNL fused to neutral zinc metallopeptidase motifs homologous with At5g66630 (BLAST E value = 3e-82).

 
Phylogenetic Analysis of Predicted Proteins Containing NBS Sequences Related to R Genes
We assessed sequence diversity and relationships by generating two phylogenetic trees, one for the CNL proteins and one for the TNL proteins (Figures 4A and 4B). NBS sequences were used because the NBS domain is present in both CNL and TNL proteins and contains numerous conserved motifs that assist proper alignment. The availability of full-length sequences allowed the use of the entire NBS domain (from ~10 amino acids N terminal to the first Gly in the P-loop motif to ~30 amino acids beyond the MHDV motif), in contrast to the earlier analysis of Meyers et al. (1999), which used only the region between the P-loop and GLPL motifs. Both CNL and TNL trees showed long branch lengths and closely clustered nodes, reflecting a high level of sequence divergence (Figures 4A and 4B). The nodes closest to the branch tips were supported most highly, although increased support would have been found for more of the internal nodes if the number of sequences had been reduced. The trees are robust, however, because phylogenetic analysis using both distance and parsimony algorithms produced similar trees (data not shown).




Figure 4. Phylogenetic Relationship of NBS-Containing Predicted Proteins from the Complete Arabidopsis Genome.

(A) Tree of CN and CNL proteins.

(B) Tree of TN and TNL proteins.

Neighbor-joining trees from distance matrices constructed according to the two-parameter method of Kimura (1980) using the aligned NBS protein sequences. Branch lengths are proportional to genetic distance. Sequence identifiers are given for each sequence as designated by the Arabidopsis Genome Initiative (2000). Names of known resistance gene products are indicated in boldface. The number of exons for each gene is indicated at right by gray brackets. Asterisks indicate that our gene prediction differed from that in MIPS and TIGR; superscript "p" indicates a predicted or potential pseudogene (see text). The Streptomyces sequence rooted both trees as the outgroup. Numbers on branches indicate the percentage of 1000 bootstrap replicates that support the adjacent node; bootstrap results were not reported if the support was <50%. Black braces at right in each tree indicate the subgroup names; subgroups were defined based on phylogeny and intron position/number (see text). Proteins that contained either more or less than the CC-NBS-LRR domains (in [A]) or the TIR-NBS-LRR domains (in [B]) are indicated with a code after the identifier that refers to protein configurations in Table 1. Two sequences each had two NBS domains; these domains were included in the analysis with the primary subgroup (TNL-A) indicated in parentheses by the position of the second NBS. The trees are available at http://niblrrs.ucdavis.edu with links to data for each gene.

 
The phylogenetic relationships based on the NBS predominantly recapitulated patterns of protein and gene structure (Figures 4A and 4B). The motif patterns defined by MEME for each of the domains identified monophyletic clades within each of the CNL and TNL groups. In addition, genes that encode sequences in these clades shared intron positions and to a lesser extent numbers (Figures 1, 4A, and 4B). Together, intron numbers and positions, protein motifs, and phylogenetic analyses defined four subgroups of CNL proteins, eight subgroups of TNL proteins, and a pair of divergent NL proteins (Figures 1, 4A, and 4B). Among the CNL and TNL subgroups, only CNL-C was not monophyletic; phylogenetic analysis suggested that the CNL-D subgroup was derived from the CNL-C subgroup (Figure 4A). TNL subgroups were consistent with our previous phylogenetic analysis using the TIR domain (Meyers et al., 2002). The consistency among these three distinct sources of data (protein motifs, intron positions, and sequence diversity for the NBS and TIR regions) suggests that shuffling of protein domains has been rare among distantly related CNL or TNL sequences.

Although TX, TN, and TNL sequences all contain TIR domains and presumably share an ancient ancestor, previous phylogenetic analyses of only the TIR-encoding domain demonstrated the diversification of two monophyletic clades of TN sequences and one clade of TX sequences (Meyers et al., 2002). Therefore, TIR domain relationships indicate that TNL genes evolved independently of most TX and TN genes. Phylogenetic analysis of the NBS region confirmed the existence of two major TN clades distinct from the TNL clades (Figure 4B). The NBS analysis also was consistent with several TN sequences being most closely related to TNL sequences rather than to other TN sequences (Meyers et al., 2002).

The known Col-0 R proteins and the closest homologs of the known Arabidopsis R proteins identified in ecotypes other than Col-0 were mapped onto the phylogenetic trees. Known R proteins were found in clades distributed throughout both trees. The TNL tree included RPS4, RPP4, RPP2A, and RPP28 from Col-0 as well as the closest Col-0 homologs of RPP1, RPP5, and RRS1. The CNL tree included RPM1, RPS2, and RPS5 from Col-0 and the closest Col-0 homologs of RPP8 and RPP13. Only five subgroups, NL-A, CNL-A, TNL-C, TNL-D, and TNL-H, did not include a known R protein. Therefore, more than two-thirds of all Arabidopsis Col-0 NBS-LRR proteins were within the same subgroup as at least one protein with a demonstrated role in disease resistance.

Genetic Events Resulting in the Expansion of the NBS-LRR Gene Family in Col-0
The physical distribution of NBS-LRR–encoding genes across the Col-0 genome was investigated to illustrate the genetic events that shaped the complexity and diversity of these genes. Both CNL and TNL genes showed obvious clustering in the genome (Figure 5). We also examined the distribution of TX, TN, and CN genes because these related genes are linked closely to some TNL genes (Meyers et al., 2002). We used the same parameters to define a cluster as Richly et al. (2002); two or more CNL, TNL, TX, TN, or CN genes that occurred within a maximum of eight ORFs were considered to be clustered. This is a useful operational definition because the numbers or sizes of clusters changed little when the maximum number of intervening ORFs was increased to 25 or even 50. In most cases, the function is not known for the other genes in the clusters that do not encode NBS-LRR proteins. Approximately two-thirds of CNL and TNL genes (109 of 149) were distributed in 43 clusters; the remaining 40 CNL and TNL genes were singletons (Table 4, Figure 5; see supplemental data online). The largest cluster consisting of only NBS-LRR–encoding genes was the RPP4/RPP5 cluster, which constituted seven TNL sequences on chromosome IV (see supplemental data online). Sixteen clusters contained combinations of TNL or CNL genes with TX-, TN-, or CN-encoding genes (Table 4; see supplemental data online); the largest of these clusters contain TNL and TN genes or TNL and TX genes and have been described previously (Meyers et al., 2002). Of these 16 clusters, 12 contained TNL genes paired with TX or TN genes, one contained four CNL genes with a TX gene, and one contained three TNL genes with a CN gene (see supplemental data online). The two diverse NL genes, At4g19050 and At5g45510, were adjacent to one and two CN genes, respectively.
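The operational cluster definition can be illustrated with a minimal sketch. It assumes one reading of the rule, namely that consecutive NBS-related genes belong to the same cluster when no more than a set number of other ORFs lie between them; the function name, the `max_gap` parameter, and the input layout (a list of gene identifiers in chromosomal order) are hypothetical and are not taken from the analysis described here.

```python
from typing import List, Sequence, Set


def find_clusters(ordered_genes: Sequence[str],
                  targets: Set[str],
                  max_gap: int = 8) -> List[List[str]]:
    """Group target genes into clusters along one chromosome.

    ordered_genes: all annotated ORFs in chromosomal order (assumed input).
    targets: identifiers of the NBS-related genes of interest.
    max_gap: largest number of intervening non-target ORFs allowed between
             two consecutive target genes of the same cluster (assumption).
    Returns only groups of two or more genes; singletons are omitted.
    """
    clusters: List[List[str]] = []
    current: List[str] = []
    last_idx = None
    for idx, gene in enumerate(ordered_genes):
        if gene not in targets:
            continue
        # Number of ORFs lying strictly between this target and the last one.
        if last_idx is not None and idx - last_idx - 1 <= max_gap:
            current.append(gene)
        else:
            if len(current) >= 2:
                clusters.append(current)
            current = [gene]
        last_idx = idx
    if len(current) >= 2:
        clusters.append(current)
    return clusters
```

Rerunning such a function with `max_gap` set to 25 or 50 is one way to check how sensitive the cluster count is to the threshold.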



Figure 5. Physical Locations of Arabidopsis Sequences That Encode NBS Proteins Similar to Plant R Genes.

Boxes above and below each Arabidopsis chromosome (chrm; gray bars) designate the approximate locations of each gene. Chromosome lengths are shown in megabase pairs on the scale at top. A list of the clusters is given in the supplemental data online. Similar figures are available at http://niblrrs.ucdavis.edu with links to data for each gene.

 

Table 4. Clusters of CNL- and TNL-Encoding Genes in Arabidopsis Col-0

 
We compared the phylogenetic analysis and the physical clustering data to determine if clusters were composed solely of monophyletic clades (Figures 4A and 4B; see supplemental data online). Four clusters contained CNL and TNL genes from diverse subgroups, excluding the TNL-A/B pairs (see above). The clusters were At5g17880 to At5g17970 (representing subgroups TNL-A, -B, and -H), At5g18350 to At5g18370 (TNL-G and -H), At5g40060 to At5g40100 (TNL-F and -D), and At5g47250 to At5g47280 (CNL-A and -B). These clusters of mixed subgroups could have arisen as a result of either selective pressures (Richly et al., 2002) or chance events that colocalized the genes. Richly et al. (2002) estimated the number of heterogeneous clusters expected if the genes were arranged randomly in the genome, based on the total number of genes within the boundaries of the cluster. Using the same formula with the current estimated total of 29,028 genes in Arabidopsis (http://www.tigr.org), the number of mixed clusters predicted to occur at random was greater than the four that we identified. Therefore, in contrast to Richly et al. (2002), we conclude that these four mixed clusters are likely the result of random associations among the 149 NBS-LRR–encoding genes in the Col-0 genome and do not provide evidence for selection for mixed clusters.

The genes that encode the TNL-A and TNL-B proteins showed an unusual pattern of clustering. Seven clusters were identified that contained 11 paired sets of genes encoding members of the TNL-A and TNL-B subgroups (Figure 6A). Five clusters encoded one representative of each subgroup, and one cluster encoded 17 TNL and TX genes. Because the TNL-A and TNL-B genes each form a monophyletic group, the duplication of these genes took place after an ancestral pairing event and preserved their orientation. Ten of the 11 pairs of TNL-A and TNL-B genes maintained a head-to-head configuration (At4g19500 was inverted; Figure 6A). The most complex cluster included 17 TNL and TX genes (Meyers et al., 2002) and spanned a 246-kb region on chromosome V that included 39 predicted genes (Figure 6A). This cluster includes the known R genes RPS4 (Gassmann et al., 1999) and RRS1 (Deslandes et al., 2002). It is not known if the complexity of this cluster or the pairing of the TNL-A and TNL-B genes reflects selective pressure to maintain functional pairs of genes. It also is interesting that 9 of the 11 genes in the TNL-A subgroup encode proteins with very different and unusual additional domains (see above; Figures 1 and 6A). The additional domains do not share high sequence similarity and therefore apparently were acquired independently. The importance of these additional domains to the functions of most of these proteins is unknown; however, At5g45050 confers recessive resistance to Ralstonia solanacearum (Deslandes et al., 2002), and At4g19500 was identified recently as the Peronospora parasitica resistance gene RPP2A (E. Sinapidou, K. Williams, and J.L. Beynon, unpublished data).



Figure 6. Multiple Localized Duplication Events That Resulted in Clusters of NBS-LRR–Encoding Genes.

Dotted lines designate the boundaries of duplication events inferred from closely related sequences. Triangles indicate the insertion site of a gene, transposon, or retrotransposon.

(A) An ancient pairing of genes that is present in ~11 occurrences in the Col-0 genomic sequence. Genes labeled A belong to the monophyletic subgroup TNL-A, and genes labeled B belong to the monophyletic subgroup TNL-B. See Figure 4 for more detailed phylogenetic relationships. B genes encode predicted TNLs, whereas A genes encode modified TNLs with additional protein motifs, as indicated below the gene identifier.

(B) A complex family of CNLs and unrelated genes on chromosome I. The evolutionary history of the cluster was inferred based on observed sequence homologies in the Col-0 genomic sequence. Boldface numerals indicate the order of events predicted in this region, as inferred from relationships of pairs of genes and gene fragments. Dashed lines that connect the ends of the clusters indicate the boundaries of a single region shown at different inferred evolutionary time points. The scheme at bottom represents the extant Col-0 sequence. The black arrows indicate that evidence of multiple duplication events was identified, but the order of these events could not be distinguished. ncRNA, noncoding RNA identified in the gene annotation.

 
Some of the CNL and TNL genes that were not in clusters (singletons) were related closely to clustered genes (Figures 4A and 4B; see supplemental data online). Small translocations apparently have separated these members of monophyletic clades and may have occurred quite frequently in the evolution of the Arabidopsis genome. These rearrangements have been local, to positions elsewhere on the same chromosome, or to other chromosomes. For example, two singletons, At1g59620 and At1g59780, are separated by ~17 and ~33 genes from the large cluster shown in Figure 6B on chromosome I. In the TNL-H subgroup, closely related sequences At1g63730 to At1g63750 are found as a cluster; however, the most closely related TNL-H homologs of these genes are found on chromosomes II, IV, and V (Figure 4B).

A comparison of the physical positions and the phylogenetic analysis revealed both local and distant duplications of CNL and TNL genes. The majority of the clusters contained closely related sequences from within the same CNL or TNL subgroup, indicating localized duplication events, most likely tandem duplications resulting from unequal crossing over. Several of these clusters have been noted previously and correspond to clusters of R genes defined by classic genetics (Holub, 2001). Expansion of a TNL cluster by tandem duplications and insertions of retrotransposons has been described for the RPP4/RPP5 family (Noel et al., 1999). We examined the patterns of sequence similarity to infer the complex pattern of localized duplications and insertions that resulted in the expansion of two related CNL clusters on chromosome I (Figure 6B). The locations of gene fragments allowed us to infer the direction and boundaries of some of the duplication events. One of these clusters is a tightly clustered array of three CNL genes, whereas the other includes five CNL genes and numerous unrelated genes (Figure 6B). Early events in the expansion of these clusters included a distal duplication of single CNL genes and localized duplications of single genes, pairs of genes, and/or gene fragments. Later events included insertions of single genes and retrotransposons and finally a recent duplication of approximately eight genes, including two CNL genes (Figure 6B).

To investigate the role of large segmental duplications in the expansion of NBS-encoding genes, we analyzed the positions of CNL, TNL, and related genes relative to segmental duplications detected in the Col-0 genome. Boundaries of 81 previously described duplicated regions were derived as gene identifier numbers from http://www.psb.rug.ac.be/bioinformatics/simillion_pnas02/ (Simillion et al., 2002). These 81 duplications were restricted to those that contained at least 10 genes in common. We confirmed these genome duplications by BLAST comparison of all predicted Arabidopsis proteins against each other and displayed sequence similarities as a diagonal plot along each chromosome (see supplemental data online). Chromosomal positions using coordinates corresponding to the current annotation for each boundary gene as well as all of the CNL- and TNL-related genes also were displayed linearly using GenomePixelizer (see supplemental data online) (Kozik et al., 2002). The boundaries of the duplicated segments were joined by lines, as were CNL, TNL, and related genes with >60% amino acid identity.
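As an illustration of the >60% amino acid identity criterion, the following sketch computes percentage identity from two sequences that have already been aligned. It is not the procedure used in this study: the denominator chosen here (aligned columns in which neither sequence has a gap) is only one of several common conventions, and the function name and gap character are assumptions.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues between two pre-aligned sequences.

    Columns containing a gap in either sequence are ignored (an assumed
    convention; other definitions divide by alignment or sequence length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def linked(aligned_a: str, aligned_b: str, threshold: float = 60.0) -> bool:
    """True when identity strictly exceeds the threshold (here >60%)."""
    return percent_identity(aligned_a, aligned_b) > threshold
```

A pair of genes would then be joined by a line in the display only when `linked` returns true for their aligned protein sequences.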

The locations of CNL- and TNL-related genes relative to duplicated segments and their persistence in the duplicated regions then were assessed by visual inspection of the diagonal plot and the linear GenomePixelizer display. A total of 124 CNL- and TNL-encoding genes were located in duplicated regions (Table 5; see supplemental data online). These were distributed in 43 of the 162 segments involved in the 81 duplications. Twenty-five CNL- and TNL-related genes were not located in any of the 162 duplicated regions; however, some of these genes had paralogs with >60% identity that did reside in one segment of a pair of duplicated regions (e.g., At4g04110 and At5g58120). In 25 cases, the CNL- and TNL-related genes were present in only one of the two segments involved in the duplication (e.g., duplications 1.1.4 and 3.4.13; Table 6; see supplemental data online). In only nine cases were the CNL- and TNL-related genes present in both segments involved in the duplication (e.g., duplications 1.1.2 and 3.5.1; Table 6; see supplemental data online). However, close inspection of the diagonal plot revealed a more complex situation than simple duplication of a chromosomal region. Even when the genes resided in both members of a segmental duplication, only rarely were the NBS-LRR genes flanked by syntenic genes and therefore located along the diagonal line of the diagonal plot (see supplemental data online). Therefore, although some of the amplification of CNL- and TNL-encoding genes occurred as a result of segmental duplications that involved 10 or more genes, much of the amplification occurred independently of such duplications. The frequent presence of CNL- and TNL-encoding genes in only one segment of a duplication and at nonduplicated positions and their variable positions within duplicated segments suggest that microscale events involving translocations of NBS-LRR–encoding genes around the genome as well as deletions occurred after the segmental duplications by as yet undefined genetic mechanism(s).
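The distinction between duplications that carry NBS-encoding genes in one segment and those that carry them in both can be sketched as follows. The sketch assumes that each segment is given by the identifiers of its two boundary genes and that the numeric part of an Arabidopsis Genome Initiative identifier reflects gene order along the chromosome; the helper names are hypothetical, and the assessment reported here was made by visual inspection rather than by code of this kind.

```python
import re
from typing import Iterable, Tuple

# Matches identifiers such as At5g43470 (chromosome 5, locus number 43470).
AGI_PATTERN = re.compile(r"^At([1-5])g(\d{5})$", re.IGNORECASE)

Segment = Tuple[str, str]  # (first boundary gene, last boundary gene)


def parse_agi(gene_id: str) -> Tuple[int, int]:
    """Split an identifier into (chromosome, locus number)."""
    match = AGI_PATTERN.match(gene_id)
    if match is None:
        raise ValueError("not a nuclear AGI identifier: %r" % gene_id)
    return int(match.group(1)), int(match.group(2))


def in_segment(gene_id: str, segment: Segment) -> bool:
    """True if the gene lies between the segment's boundary genes."""
    chrom, pos = parse_agi(gene_id)
    chrom_a, start = parse_agi(segment[0])
    chrom_b, end = parse_agi(segment[1])
    if chrom_a != chrom_b:
        raise ValueError("segment boundaries are on different chromosomes")
    return chrom == chrom_a and min(start, end) <= pos <= max(start, end)


def classify_duplication(seg_a: Segment, seg_b: Segment,
                         genes: Iterable[str]) -> str:
    """Return 'both', 'one', or 'none' for a pair of duplicated segments."""
    genes = list(genes)
    in_a = any(in_segment(g, seg_a) for g in genes)
    in_b = any(in_segment(g, seg_b) for g in genes)
    if in_a and in_b:
        return "both"
    if in_a or in_b:
        return "one"
    return "none"
```

For example, with made-up boundaries `("At5g43000", "At5g44000")` and `("At5g48000", "At5g49000")`, the gene list `["At5g43470", "At5g48620"]` would be classified as present in both segments; these boundaries are illustrative only and do not correspond to any duplication in Table 6.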


Table 5. Distribution of Three Multigene Families That Encode NBS-LRR, Cytochrome P450, and LRR Kinase Proteins in the Arabidopsis Col-0 Genome Relative to Segmental Duplications

 

Table 6. Relationships between Segmental Duplications and NBS-Encoding Genes

 
We also analyzed sequence data from the Arabidopsis ecotype Landsberg erecta (Ler) to examine the types of genetic events that shaped NBS-LRR gene clusters observed through intergenomic comparisons. In Col-0, the absence of clustering of the two CNL singletons (At5g43470 and At5g48620) belies the complexity of events that led to the Col-0 haplotype. In Ler, there are four syntenic CNL genes that include RPP8 (McDowell et al., 1998). Based on flanking genes and gene fragments, we were able to infer the history of rearrangements involving these CNL sequences (Figure 7). The initial event generating the locus that includes At5g43470 likely involved a small duplication from the locus that includes At5g48620 to a position ~2.3 Mb away on the same chromosome. A subsequent duplication event produced the functional RPP8 gene and the homolog RPH8 to generate the extant Ler haplotype. This haplotype then underwent an unequal crossing-over event to produce the extant Col-0 haplotype (McDowell et al., 1998; Cooley et al., 2000). We sequenced 12.8 kb around the locus in Ler syntenic with At5g48620 and found evidence of a duplication event that produced the pair of CNL genes in Ler (Figure 7). These inferred complex histories demonstrate that gene duplications, translocations, and insertions of genes and mobile elements all have contributed to the configuration of several CNL and TNL clusters and singletons (Figures 6 and 7). As additional genomic sequence from other Arabidopsis ecotypes becomes available, it will become possible to infer the evolutionary history of many CNL and TNL genes and to determine the relative frequencies with which rearrangements, duplications, and deletions occurred.



Figure 7. Rearrangements among RPP8 Homologs in Arabidopsis Ecotypes.

Two clusters were analyzed in Col-0 and Ler to determine the genetic rearrangements in their evolutionary history. The inferred ancient arrangement of the cluster and the earliest events are indicated at top. Below, later events and the extant genomic arrangement in Col-0 and Ler are shown. Dotted lines designate the boundaries of duplication events inferred from closely related sequences. Dashed lines that connect the ends of the clusters indicate the boundaries of a single region shown at different inferred evolutionary time points. Sequences for the Ler RPP8 cluster were obtained from GenBank (McDowell et al., 1998).