An improved mitochondrial reference genome for Arabidopsis thaliana Col-0

Arabidopsis thaliana remains the foremost model system for plant genetics and genomics, and researchers rely on the accuracy of its genomic resources. The first completely sequenced angiosperm mitochondrial genome was obtained from A. thaliana C24 (Unseld et al., 1997), and more recent efforts have produced additional A. thaliana reference genomes, including one for Col-0, the most widely used ecotype (Davila et al., 2011). These studies were based on older DNA sequencing methods, making them subject to errors associated with lower levels of sequencing coverage or the extremely short read lengths produced by early-generation Illumina technologies. Indeed, although the more recently published A. thaliana mitochondrial reference genome sequences made substantial progress in improving upon earlier versions, they still have high error rates. By comparing publicly available Illumina sequence data to the A. thaliana Col-0 reference genome, we found that it contains a sequence error every 2.4 kb on average, including 57 SNPs, 96 indels (up to 901 bp in size), and a large repeat-mediated rearrangement. Most of these errors appear to have been carried over from the original A. thaliana mitochondrial genome sequence by reference-based assembly approaches, which has misled subsequent studies of plant mitochondrial mutation and molecular evolution by giving the false impression that the errors are naturally occurring variants present in multiple ecotypes. Building on the progress made by previous researchers, we provide a corrected reference sequence that we hope will serve as a useful community resource for future investigations in the field of plant mitochondrial genetics.

variants represented sequencing artefacts or actual biological differences between the two Col-0 samples, we extracted diagnostic k-mers from the raw reads used in our analysis and those from the original A. thaliana Col-0 sequencing effort (SRA SRR307226). We confirmed that all the variants identified in our assembly were strongly supported in both sets of sequencing reads (Table 1), suggesting that the differences represent assembly errors in the published Col-0 reference sequence rather than real polymorphisms. We further validated these variants calls using the double-stranded consensus sequence from a dataset (SRA SRR6420475) that was generated with a highly accurate technique known as duplex sequencing (Schmitt et al., 2012).
By comparing the same set of 57 SNPs and 96 indels to the raw reads in the resequenced C24 dataset (SRA SRR307231), we identified 28 variants for which the original reference allele was supported in C24 (Table 1). These cases, therefore, represent true polymorphisms that distinguish the C24 and Col-0 ecotypes but were not detected in the original reference-based assembly of the Col-0 mitochondrial genome such that the published Col-0 sequence improperly retains the C24 allele. In contrast, we found that the raw C24 sequence reads did not support the original reference allele in the remaining 125 variants (82%) ( Table 1).
These cases appear to result from errors in the original C24 genome sequence (Unseld et al., 1997) that were not detected in either the resequencing of C24 or the reference-based assembly of Col-0 and, thus, have been propagated across reported genome sequences from multiple ecotypes (Davila et al., 2011). Many of these errors are found in regions differing by multiple SNPs or by multi-nucleotide indels, so it is not surprising that they were difficult to detect with short-read sequencing data. However, there are also many individual SNPs and 1bp indels in this set (Table 1), so the source of the assembly artefacts is unclear in some cases.
Our newly assembled A. thaliana Col-0 reference sequence also differs from the published Col-0 sequence in two major structural variants. First, it includes a 901-bp sequence that is absent from the published Col-0 genome. The full-length of this sequence is clearly detectable in the raw reads of the original Col-0 study (SRA SRR307226). It would be inserted after position 48,895 in the published Col-0 genome (JF729201) and would correspond precisely to the last 901 bp of the C24 reference genomes. The fact that this deletion occurs exactly at the point where the circular reference genome map had been arbitrarily "cut" for reporting as a linearized sequence suggests that it might have resulted from an inadvertent byproduct of sequence handling and reorientation. Second, our newly assembled A. thaliana Col-0 reference sequence differs in a large rearrangement, apparently resulting from recombination between a pair of identical 453-bp inverted repeats at positions 36,362-36,818 and 143,953-144,409. The clear majority (30 of 33; 91%) of read-pairs spanning these repeats support our reported conformation. We are not able to test for similar support in the raw Col-0 reads from Davila et al. (2011) because their insert sizes are too short to span the repeat copies, but we did verify that our reported configuration predominates in Illumina paired-end and PacBio sequencing reads from four other Col-0 datasets (NCBI SRA SRR1581142, SRR5012968, SRR5882797, and SRP073602). Therefore, this configuration is likely the most common among different Col-0 seed stocks.

Subsequent research in Arabidopsis mitochondrial genetics
For good reason, A. thaliana is the "go-to" model for studies of plant mitochondrial genome function, stability, mutation, and molecular evolution (Davila et al., 2011;Christensen, 2013;Cupp and Nielsen, 2014;Zampini et al., 2015;Gualberto and Newton, 2017). As such, there is great incentive to make the Arabidopsis reference mitochondrial genomes the gold standard in the field. Indeed, the extensive characterization of structural variation in these genomes has gone a long way to accomplish this goal (Arrieta-Montiel et al., 2009). However, sequence errors still exist in the reported reference genomes with potentially detrimental and far-reaching effects on related research efforts. This is especially true because the actual rate of sequence evolution in plant mtDNA is usually very low (Wolfe et al., 1987), so even a modest amount of sequencing errors can result in a problematic signal-to-noise ratio. For example, a recent study was performed to infer the distribution and spectrum of mutations across the Arabidopsis mitochondrial genome and used the sequence variants that distinguish published C24 and Col-0 mtDNA sequences (Christensen, 2013). Such comparative analyses of published genomic data are commonplace and can make substantial contributions to the field, but it is now clear based on our reexamination of the Col-0 sequence that approximately 40% of the analyzed variants in that study were artefacts (Table 2).
Another recent investigation was conducted to detect de novo mutations in A. thaliana organelle genomes using deep sequencing (Zampini et al., 2015). The authors applied a natural and seemingly conservative approach by rejecting any identified mitochondrial variant that did not differ from "both" published Col-0 mitochondrial genomes, but this choice highlights two pressing concerns. First, it illustrates the continued confusion in the field about the fact that original A. thaliana reference mitochondrial genome is derived from C24 and not Col-0. Second, it reflects a misunderstanding about the extent to which the multiple available reference genomes constitute independent data points. The reference-guided approach used to assemble mtDNA sequences from C24, Col-0, and Ler (Davila et al., 2011) appears to have incorporated many errors and allelic variants from the reference genome into the new assemblies.
Nevertheless, those new assemblies are still reported as separate accessions on GenBank rather than as a set of variant calls, so there is a risk that the many errors shared between them will be falsely perceived as having been independently validated in two or more sequencing datasets. This concern is particularly relevant for the Ler sequence available on GenBank because it was generated with the same short 35-bp reads but a much lower level of sequence coverage -only 19´ compared to 230´ and 371´ for Col-0 and C24, respectively (Davila et al., 2011). For these reasons, it is important that researchers in the field of plant mitochondrial genetics be more broadly aware of the history and methodologies that produced the currently available reference mitochondrial genome sequence for A. thaliana.