COMMENTARY
Open Access

Is It Ordered Correctly? Validating Genome Assemblies by Optical Mapping

Joshua A. Udall (a), R. Kelly Dawe (b)

(a) Plant and Wildlife Science Department, Brigham Young University, Provo, Utah 84602
(b) Department of Genetics, University of Georgia, Athens, Georgia 30602

For correspondence (J.A.U.): jaudall@gmail.com

Published January 2018. DOI: https://doi.org/10.1105/tpc.17.00514

  • © 2018 American Society of Plant Biologists. All rights reserved.

Abstract

Long-read single-molecule sequencing, Hi-C sequencing, and improved bioinformatic tools are ushering in an era where complete genome assembly will become common for species with few or no classical genetic resources. There are no guidelines for how to proceed in such cases. Ideally, such genomes would be sequenced by two different methods so that one assembly serves as confirmation of the other; however, cost constraints make this approach unlikely. Overreliance on synteny as a means of confirming and ordering contigs will lead to compounded errors. Optical mapping is an accessible and relatively mature technology that can be used for genome assembly validation. We discuss how optical mapping can be used as a validation tool for genome assemblies and how to interpret the results. In addition, we discuss methods for using optical map data to enhance genome assemblies derived from both traditional sequence contigs and Hi-C pseudomolecules.

A paradigm shift occurs within each research community when the genome of their study organism is sequenced. Designing and executing the assembly is generally a collaborative effort, and the display and annotation of the sequence can become a foundation for the research community. The genome sequence opens doors to subsequent comparative, evolutionary, and translational research efforts. This process has unfolded in community after community, starting with Arabidopsis thaliana, rice (Oryza sativa), and maize (Zea mays) and rapidly extending into all the major food, fiber, and energy crops. As sequencing costs continue to drop, communities that began with one reference assembly are moving to multiple assemblies (Bolger et al., 2014; Kawakatsu et al., 2016). High-quality assemblies dramatically expand the repertoire and robustness of analyses that can be performed and provide the foundation for subsequent laboratory experimentation.

The process of genome assembly can be divided into two phases: (1) sequence assembly (Berlin et al., 2015; Antipov et al., 2016) and (2) genome assembly (Goff et al., 2002; Schnable et al., 2009; Schmutz et al., 2010; Paterson et al., 2012). The sequence assembly phase includes using all of the available sequence data from short reads, mate pairs, and long reads to create contigs and scaffolds. Long-read single molecule sequencing technologies have made it possible to dramatically extend the length of sequence contigs, often including large portions of entire chromosomes (Michael et al., 2017). The genome assembly phase includes the integration of additional information such as prior assemblies, genetic maps, or Hi-C read pairs to order and orient the contigs into “pseudomolecules” that are representations of the chromosomes. The genome assembly phase is particularly challenging and is difficult to validate. Full replication of de novo sequence assembly using independent efforts could in principle be used for assembly validation. However, genome sequencing projects generally consume all available time and resources generating high-quality data for a single assembly. Standards for assembled genomes have been previously proposed ad interim (Blakesley et al., 2004; Chain et al., 2009), though low-quality draft sequences continue to be published.

The problem of genome validation takes on additional importance as researchers move into species that lack basic genetic resources. Indeed, the sequencing and assembly methods required to achieve a high level of contiguity are well within reach of many laboratories, including those working with trees (Neale et al., 2014), minor crops (Clouse et al., 2016; Jarvis et al., 2017), or ecological systems (Martínez-García et al., 2016; Olsen et al., 2016; Tang et al., 2016; Vining et al., 2017). If a research group invests in mate-pair or long-read sequencing at high depth, it is natural to proceed to genome assembly, even though the assembly may never be used for map-based cloning or genetic analysis in the traditional sense. In these cases, synteny relationships have been used for gross genome assembly validation (Gan et al., 2016; Jin et al., 2016), yet when there are differences between two genome sequences, it is unclear if they arose from assembly error or biological differences. There is also a real risk that a synteny-assisted genome assembly will be used as a reference to create another synteny-assisted assembly in a third species, compounding errors and drifting further from biological reality. Comparison of assemblies from different accessions of a single species (pan-genome analyses) addresses the issue of validation to a degree, but a small number of genomes (e.g., 2–15 genomes within a monophyletic branch) still have limited inference power for untangling technical and biological differences (Li et al., 2010; Gan et al., 2011; Hirsch et al., 2014; Li et al., 2014). Here, we highlight the use of optical mapping as an alternative, affordable method for sequence assembly validation that is independent of traditional sequencing and synteny-based methods.

VALIDATION OF SEQUENCE ASSEMBLY

In this Commentary, we do not present how to create a de novo genome assembly or describe when a genome is completed (Veeckman et al., 2016). Instead, we look forward to continued technical innovations in genome assembly and propose that next-generation optical maps may be used as a standard for assembly validation. Optical maps derived from high-throughput nanochannels (Bionano optical detection or Nabsys electronic detection) offer a relatively straightforward and independent assessment of any genome assembly that claims to provide chromosome-level contiguity. The resulting whole-genome alignment metric can be used by reviewers and readers alike to quickly assess sequence assembly quality.

Two current technologies (Bionano Genomics and Nabsys) use nick-based labeling to generate maps of individual DNA molecules. The characteristics and features of Bionano technology have been reviewed elsewhere (Levy-Sakin and Ebenstein, 2013; Tang et al., 2015; Chaney et al., 2016; Yuan et al., 2017). Briefly, the method involves purifying high molecular weight (HMW) DNA and treating it with a nickase, a modified restriction enzyme that creates single-stranded nicks. Nickases target specific 6- or 7-bp nucleotide recognition sites, and these sites are strand-repaired with fluorescent nucleotides. The labeled DNA molecules are then passed through nanochannel arrays where images are iteratively collected and converted into a digital format that reflects the nicking patterns on each molecule. Data are usually collected at 50 to 150× coverage and subsequently assembled to create restriction map contigs. The assembly algorithm identifies overlapping fingerprints in an overlap-layout-consensus approach, yet the assembly process is one step removed from DNA sequence because only nick-length patterns (not sequence) are used to find matches between molecules. Similar to sequence analysis, matched and overlapping nick patterns can be condensed into an aligned consensus pattern. The challenge for optical map technologies, as well as other long-read technologies, is generating high-quality HMW DNA. Tissue quality is very important in these protocols (young, unstressed tissue) and each DNA preparation will differ slightly in its labeling efficiency. This differs from Hi-C-based scaffolding technologies (see below), which do not require purified HMW DNA and do not suffer from these limitations.

The characteristics and features of the Nabsys platform have not been reviewed elsewhere, though descriptions of the technology (Oliver et al., 2017) and verification of structural variants in the human genome are publicly available (Kaiser et al., 2017). Nicking enzymes are used to attach a proprietary tag to the HMW DNA and the DNA+tags are coated with RecA protein. The RecA-coated DNA is moved through a nanochannel with detectors that measure the change in electrical resistance in the nanochannel. A spike in resistance identifies the tag as it passes the detector and the time between spikes measures the base-pair distance between tags. This platform has an expected availability date in 2018. Both the Bionano and Nabsys systems produce nick-based physical map assemblies based on overlapping nick patterns (for simplicity, we retain the term optical map for both technologies in this commentary). To match optical map contigs with DNA sequence, the sequence is converted into a restriction map format. Through positive matches between the nick patterns of the optical map contigs and in silico nick patterns from the DNA sequence, the optical map can independently validate the base-pair distances between nick sites in the DNA sequence.
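
The conversion of a sequence into a restriction map format can be illustrated with a minimal Python sketch. It is not the vendor software: the function names are hypothetical, the recognition motifs listed for BspQI and BssSI are assumptions that should be checked against the enzyme supplier, and positions are reported at the start of each motif rather than at the exact nick.

```python
import re

# Assumed recognition motifs for the two nickases named in Figure 1;
# verify against the enzyme supplier before relying on them.
MOTIFS = {"BspQI": "GCTCTTC", "BssSI": "CACGAG"}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def in_silico_nick_sites(sequence, motif):
    """Return sorted (position, strand) tuples for every motif occurrence.

    Positions are 0-based starts of the motif on the forward strand.
    A hit of the reverse-complemented motif is reported as strand "-".
    """
    seq = sequence.upper()
    motif = motif.upper()
    sites = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        if strand == "-" and pattern == motif:
            continue  # palindromic motif: avoid counting each site twice
        # A lookahead lets overlapping occurrences be reported as well.
        for match in re.finditer("(?=" + re.escape(pattern) + ")", seq):
            sites.append((match.start(), strand))
    return sorted(sites)


def label_intervals(sites):
    """Distances (bp) between consecutive sites: the pattern an optical map sees."""
    positions = [pos for pos, _ in sites]
    return [b - a for a, b in zip(positions, positions[1:])]
```

The list returned by label_intervals is the kind of distance pattern that is compared with the optical map; the sequence letters themselves play no role in that comparison.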

The widely used N50 parameter describes how much of an assembly is composed of segments larger than a certain size, where “N” is the contig or scaffold size and “50” is the percentage of the assembly length. The N50 term can be applied to contigs (segments with continuous sequence), scaffolds (contiguous but with N-filled gaps), or optical map contigs (no sequence at all). An N50 of 1 Mb indicates that 50% of the assembly is contained in contigs (or scaffolds) of 1 Mb or larger; a few contigs will be larger than 1 Mb, but a much greater number will be smaller. Aligning two assemblies is a process of comparing the nicking site distribution of the optical map to the in silico distributions from the sequence assembly. The degree of alignment generated by such a comparison provides a direct measure of the accuracy of both assemblies, although the power of the comparison increases with N50. In practice, only assemblies with megabase-scale N50s can be validated using optical mapping technology since contigs/scaffolds smaller than 100 kb generally do not have enough nick information to be confidently aligned.
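
As a worked illustration of the N50 definition, the following minimal Python sketch computes it from a list of contig, scaffold, or optical map contig lengths. The function name is ours and the input lengths in the usage note are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of segment lengths.

    Segments are sorted from longest to shortest and accumulated until
    at least half of the total length is covered; the length of the
    segment that crosses that point is the N50.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

For example, for segment lengths of 5, 4, 3, 2, and 1 Mb (15 Mb in total), the two longest segments together cover 9 Mb, which exceeds half of the total, so the N50 is 4 Mb.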

The quality of the optical map is heavily dependent on the quality of the HMW DNA used to prepare it, and the purity of the extracted HMW DNA affects the efficiency of the nicking reaction. If two recognition sites of a nicking enzyme happen to be closely positioned on opposite strands of a DNA molecule, the enzyme can create a double-strand break instead of two nicks, with the effect of truncating contigs (these are called “fragile sites”). The impact of double-strand breaks can be minimized by generating two different optical maps with different nicking enzymes to create a more complete hybrid scaffold (Figure 1). Dual-nick assemblies have been shown to increase the assembly N50 of hummingbirds and humans by 2- to 3-fold (Bionano Genomics, 2017).
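
Candidate fragile sites can be flagged from a list of predicted nick sites, as in the minimal Python sketch below. It assumes each site is given as a (position, strand) tuple; the 1000-bp default for max_gap is purely illustrative, since the text does not state how close two opposite-strand sites must be to cause a break.

```python
def candidate_fragile_sites(sites, max_gap=1000):
    """Return (pos_a, pos_b) pairs of neighboring nick sites that lie on
    opposite strands and are no more than max_gap bp apart.

    sites   -- iterable of (position, strand) tuples, strand "+" or "-"
    max_gap -- illustrative distance threshold in bp (an assumption)
    """
    ordered = sorted(sites)
    pairs = []
    for (pos_a, strand_a), (pos_b, strand_b) in zip(ordered, ordered[1:]):
        if strand_a != strand_b and pos_b - pos_a <= max_gap:
            pairs.append((pos_a, pos_b))
    return pairs
```

Running the same check on the predicted sites of a second enzyme would suggest where that enzyme could bridge the breaks expected from the first one.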

Figure 1.

An 8.5-Mb Region of Chromosome 1 from the Gossypium raimondii Genome Assembly and Associated Alignment Data.

At the top, a hybrid assembly is aligned to two scaffolds of genomic sequence, six BspQI Bionano contigs, and eight BssSI Bionano contigs. Horizontal bars from left to right represent the chromosome and contigs, while colored vertical bands within the bars represent nick sites (BspQI = orange [matched]/red [unmatched], BssSI = red [matched]/green [unmatched]). Gray lines connect matched nick sites between the hybrid scaffold and the contigs. A region of a Bionano contig (upper red box) is expanded to illustrate its individual alignment to the genome sequence (green bar, labeled sequence contig). A region of the same Bionano contig (lower red box) is further expanded to illustrate the consensus contig containing individually nicked, labeled, and assembled DNA molecules. Tick marks represent 400, 100, and 50 kb on the top, middle, and bottom scales, respectively. The blue individual molecules overlap a single BssSI nick site (asterisk). Red individual molecules do not overlap the selected nick site.

We have aligned Bionano genome assemblies to several sequenced genomes using the runCharacterize method from Bionano (Table 1). For example, aligning optical maps to the genome assemblies from two maize inbreds resulted in a high level of congruence between uniquely mapped Bionano consensus molecules and the assembled sequence (96% and 98% mapping rate of B73 and W22, respectively; Jiao et al., 2017). The very high percentage of alignment between the maize optical and sequence maps can in part be attributed to the high N50 of their respective sequence assemblies. In rice, Bionano contigs aligned to the Nipponbare reference sequence with a 96% mapping rate (Chen et al., 2017). However, these are exceptional cases, and most whole-genome alignments using Bionano data are closer to 85%. Comparing the optical map data of tetraploid cotton (Gossypium hirsutum) TM-1 to one draft genome sequence (Zhang et al., 2015) resulted in 85% alignment, and aligning to another draft genome of the same line (Li et al., 2015) resulted in 75%. These imperfect validations are the result of errors in the Bionano assembly, the sequence assembly, or both.

Table 1. Recent Results of Physical Maps Aligned to Their Respective Reference Genomes

The comparison of data from different genome assembly projects also highlights some of the limitations associated with using nick-based physical map data for validation (Table 1). Note that the optical map length often differs slightly in size from the assembled genome sequence. The differences can generally be ascribed to repetitive regions such as nucleolus organizing regions and tandem repeat arrays (centromeres or telomeres), but may also be caused by low quality assemblies on either side of the comparison. For example, in rice (MH63), we found that five large Bionano contigs erroneously mapped to a single repetitive region of centromere 8. Correctly mapping these molecules to the genome increased the overall alignment from 85.6 to 87.1%. Similarly, in Gossypium herbaceum (A-genome cotton), nearly all of the centromere-spanning physical map contigs initially mapped to a single chromosome (Figure 2). This was because (1) many of the centromeric repeats were collapsed during sequence assembly and (2) the regular spacing of BssSI sites in the cotton regions had the lowest P value match scores of any local match between those Bionano contigs and the genome sequence. To accurately map these contigs, we masked the repetitive regions and used the flanking regions for appropriate placement of the physical map contigs.

Figure 2.

An Illustration of Bionano Contigs Likely Spanning Centromeric Regions in the G. herbaceum Reference.

The best match between the reference sequence and the Bionano molecules is determined by the lowest local P values. The repeats in one sequence contig on this chromosome are very regularly spaced, resulting in significant matches with several Bionano contigs, despite those same contigs also having flanking regions that match elsewhere to the genome sequence. Consequently, Bionano molecules spanning the centromeric region were mapped to this genomic location. The top colored bar illustrates the sequence contigs that were concatenated to form the pseudomolecule chromosome. Bionano contigs are illustrated as cyan bars with a light blue coverage plot and many dark blue vertical BssSI matches to the genome sequence. Pink ellipses illustrate the regions matching the repeat (inset right). Gray lines connect the matched nick sites between the reference sequence and Bionano contigs. A closer view of the putative centromeric regions illustrates the match of one BssSI site to multiple matches (red lines) in the different contigs (inset left).

In many cases, plants being considered for genome assembly retain a significant amount of heterozygosity. Although most current technologies have been designed to accommodate the possibility of heterozygosity, it remains a significant challenge to identify heterozygous regions and incorporate them into a final assembled product. Where there is sufficient coverage and polymorphism to differentiate heterozygous regions, two haplotypes will be assembled separately (sequence or optical map data). This has the effect of inflating the size of the assembled genome by creating a separate contig for each heterozygous polymorphic region. However, in practice, some polymorphic regions will be collapsed into a single haplotype depending on specific assembly parameters. Both FALCON (PacBio sequence assembly) and Bionano are developing haplotype detection methods, but sorting out allelic contigs remains difficult and labor-intensive.

ENHANCEMENT BY COMBINATION WITH OTHER METHODS INCLUDING Hi-C PROXIMITY LIGATION

In cases where a sequence and an optical assembly are available, it makes sense to integrate the assemblies into a single “hybrid” assembly using “hybridScaffold.pl.” This process joins contigs and creates gaps with approximated sizes based on nick distances. Hybrid scaffolding also identifies assembly conflicts, which are the result of either improperly assembled contigs or actual mismatches between the optical assembly and sequence assembly when different accessions/species are used. Sequence chimeras are an unavoidable outcome of assembly in large-genome species. Chimeric contigs can be resolved (manually or automatically) by cutting or trimming the sequence contigs or the optical maps, or both. The resulting hybrid assembly is a more accurate representation of the genome than either individual assembly alone, as it includes structural information from the optical map, corrected chimeric scaffolds, and generally a longer N50 than either input assembly. For example, hybrid scaffolding was used to enhance the genome assemblies of amaranth (Amaranthus hypochondriacus; Clouse et al., 2016), barley (Hordeum vulgare; Mascher et al., 2017), and quinoa (Chenopodium quinoa; Jarvis et al., 2017). In amaranth, the hybrid assembly reduced the number of scaffolds from 343 to 241 and nearly doubled the final scaffold N50 by making several key connections between existing large scaffolds. Similar outcomes were observed in other genomes.

Hybrid scaffolding becomes even more powerful when combined with other technologies, such as the 10× Chromium system (Weisenfeld et al., 2017) or Hi-C-based methods (Burton et al., 2013; Korbel and Lee, 2013). Hi-C is a relatively new approach that is gaining rapid acceptance because of the resulting useful arrangements of contigs in chromosome-scale scaffolds (Korbel and Lee, 2013; Bickhart et al., 2017; Dudchenko et al., 2017; Mascher et al., 2017). Hi-C was originally developed to detect intra- and interchromosomal interactions such as those between enhancers and promoter regions, but it has proven to be useful for long-range scaffolding of sequence contigs as well. Its use in scaffolding relies on the distance-dependent decay of physical interaction frequencies that explains much (though not all) of the observed interaction patterns. The process involves cutting chromatin with restriction enzymes, biotin labeling, and religating the ends, then sequencing the biotin-labeled regions. Hi-C identifies physical contacts only and is a powerful complement to a sequence assembly with an excellent contig N50. Unlike mate-pair sequencing, Hi-C has the ability to scaffold a continuous range of distances using a log-likelihood (LOD) function that compares scaffolding results on a statistical basis. The results are presented as concatenations of contigs in the most likely order and orientation based on the highest LOD scores. Because the process is likelihood based, heterozygous regions can be represented by separate, adjacent contigs.

Hi-C data can be generated in-house or through a service provider such as Dovetail Genomics or Phase Genomics. While it is a powerful method, it is important to note that Hi-C scaffolding data include structural DNA linkages (i.e., mate-pair-like linkages) and biological linkages (i.e., loci colocalized in the cell on the same or different chromosomes) from a collection of millions of nuclei from different tissue types and cell division stages. Often telomeric and centromeric regions of different chromosomes are colocalized in the nucleus. Such data are inherently messy since the highest likelihood based on colocalization frequency is assumed to be correct for pseudomolecule construction, while several nearly-as-likely orders or orientations may have also been calculated. This is particularly true for small contigs (<100 kb) that have a small number of Hi-C linkages and limited power for the respective LOD scores. One key difference between Hi-C-based scaffolding and optical map scaffolding lies in how sequence gaps are handled. In Hi-C scaffolding, the sequence gaps are marked by arbitrarily sized and N-filled gaps (Hi-C provides proximity information only), whereas in optical maps, the gaps are filled with Ns to lengths that estimate their actual size. Since there are no gaps of significance in a Hi-C scaffold, optical maps align to them well as long as the contigs are ordered correctly (Bionano software will call insertions where the gap sizes are not congruent).
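
The difference in gap handling can be made concrete with a small Python sketch. The function is hypothetical and the gap sizes in the usage note are invented; it only shows that a Hi-C scaffold joins contigs with one arbitrary gap length, whereas an optical map supplies a separate size estimate for each gap.

```python
def join_with_gaps(contigs, gaps):
    """Concatenate contig sequences, separating neighbors with runs of N.

    contigs -- list of sequence strings in scaffold order
    gaps    -- either one int (the same arbitrary gap everywhere, as in
               Hi-C scaffolding) or a list with one estimated size per
               junction (as supplied by an optical map)
    """
    n_junctions = max(len(contigs) - 1, 0)
    if isinstance(gaps, int):
        gaps = [gaps] * n_junctions
    if len(gaps) != n_junctions:
        raise ValueError("need exactly one gap size per junction")
    pieces = []
    for index, contig in enumerate(contigs):
        pieces.append(contig)
        if index < n_junctions:
            pieces.append("N" * gaps[index])
    return "".join(pieces)
```

Calling join_with_gaps(["ACGT", "GG", "TT"], 3) inserts three Ns at both junctions, while join_with_gaps(["ACGT", "GG", "TT"], [2, 5]) inserts two and five Ns, respectively, so only the second scaffold carries distance information between contigs.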

The combination of large sequence contigs (3–10 Mb), optical maps, and Hi-C scaffolding data provides a very powerful set of resources. When integrating these data, it is best to scaffold the sequence contigs with Hi-C data first to create initial pseudomolecules, then to verify and modify the result using the optical map data. Optical maps generally do not align to short contigs unless they are scaffolded with additional sequence. Placing Hi-C scaffolding first in the workflow allows a proportion of correctly placed short contigs to be confirmed or adjusted by the optical maps. Generating a hybrid scaffold from this final product accommodates the best features of both systems (Bickhart et al., 2017). However, even in the best assemblies, some discrepancies will require adjustments or corrections. Appropriate adjustments have strong global support from Hi-C data (i.e., clusters of linkage data) and strong local support for order and orientation from Bionano contigs (Figure 3). The number of adjustments generally is proportional to the N50 of the underlying sequence contigs. Some genomic regions are easily corrected, others require multiple iterations to untangle and resolve discrepancies (Figure 3), and others will likely remain unresolved and may require local reassembly of the underlying DNA sequence.

Figure 3.

Sequence Contigs from G. herbaceum Chromosome 4 Ordered and Oriented into Pseudomolecules by the Hi-C Methodology (as Assembled by PhaseGenomics).

(A) The first row is a colored bar that represents the concatenated contigs based on clustering and orientation likelihood ratios of Hi-C data. The second row is the same sequence that has been digested in silico and displayed as the “Reference with nicks” with a ruler in megabases. The two following rows represent Bionano molecules aligned to the nick sites of the Reference sequence. Matches are illustrated by gray lines connecting the nick sites between the reference sequence and the Bionano molecules.

(B) Three steps of rearrangement can locally reorganize the sequence contigs so that they agree with the Bionano contigs that have substantive evidence of contig order. The contigs are treated as blocks (of one or more contigs) and the blocks can be inverted, moved, or both.

(C) Once the contigs have been reordered and oriented based on the Bionano evidence, the sequence is redigested in silico and the Bionano contigs are realigned to the final version of the sequence. Both sequence and Bionano contigs agree in order and orientation after corrections to the genome sequence.
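
The block operations described in (B) can be sketched in Python as list manipulations. This is a minimal illustration under our own assumptions, not the procedure used to produce the figure: a pseudomolecule is represented as a list of (contig name, orientation) tuples, and the function names are hypothetical.

```python
_FLIP = {"+": "-", "-": "+"}


def invert_block(order, start, end):
    """Invert the block order[start:end]: reverse the contig order within
    the block and flip the orientation of each contig in it."""
    block = [(name, _FLIP[orient]) for name, orient in reversed(order[start:end])]
    return order[:start] + block + order[end:]


def move_block(order, start, end, new_index):
    """Cut the block order[start:end] and reinsert it at new_index,
    where new_index refers to the list after the block has been removed."""
    block = order[start:end]
    rest = order[:start] + order[end:]
    return rest[:new_index] + block + rest[new_index:]
```

Both functions return a new list and leave the input untouched, so each step of an iterative correction can be kept and compared against the optical map alignment before the next step is applied.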

To obtain the highest quality assemblies, users can opt to include high coverage of long-read technologies such as those offered by Oxford Nanopore or PacBio, which can generate read lengths with N50s of 10 kb or higher, and contig assemblies with N50s over 10 Mb. Michael et al. (2017) describe the assembly of an entire Arabidopsis genome using Oxford Nanopore technology and confirm the assembly with Bionano optical mapping. PacBio frequently has been used to assemble the genomes of model species, and it combines well with optical mapping (Jiao et al., 2017), but sufficient coverage can be cost-limiting for many projects.

HOW TO INTERPRET WHOLE-GENOME VALIDATION DATA

What are the metrics to consider for genome assembly validation? As previously discussed, errors and chimeras may exist for both optical maps and sequence assemblies; thus, 100% congruence is not expected. By aligning Bionano assemblies with different published genomes, we empirically identified ∼85% to be a reasonable level of initial alignment between the two assemblies (Bionano data and DNA sequence). We do not suggest that researchers and reviewers use this number as a hard threshold; rather, we suggest it be used as a soft threshold with a significant amount of subjectivity. For example, percentages higher than 85% should be encouraged, while percentages between 70 and 85% should be reasonably justified. Good reasons for low alignment might be that different accessions were used for the two maps or that the genome has exceptionally high repeat content.

Percentages below the 70 to 85% range could provide a reviewer the basis to suggest sequence assembly improvement, optical map assembly improvement, or further justification. In genomes with low (<70%) alignment between the optical map and sequence assembly, researchers can look at other parameters to identify the potential sources of error. Chimeric contigs and the resulting conflicts detected during hybrid scaffolding are one reason for low alignment. Conflict resolution can be ignored (i.e., flagged only), manually adjusted, or automatically resolved. Chimeric contigs can occur in either the sequence or optical assemblies. When one assembly is inferior to the other, there will be more conflicts assigned to it than the other. Each assembly comparison is unique and neither assembly is perfect in eukaryotic plant genomes. For example, a G. herbaceum Bionano assembly aligned to its draft sequence assembly revealed 923 Bionano conflicts and 53 sequence conflicts, suggesting that the Bionano assembly could be improved by increasing specificity (decreasing P value for matches during assembly) or by closely reviewing and omitting unresolved conflicts. Manual editing of the G. herbaceum scaffolds reduced the conflicts (to 609 Bionano conflicts and 29 sequence conflicts) and improved the percent mapping (from 89.0 to 90.3%). Several hundred or thousands of conflicts would be a red flag indicating that one or both assemblies are of poor quality. Consequently, this information could be used to assess the need for researchers to revisit HMW DNA preparation, data collection, or assembly. If Hi-C data are used to create scaffolded pseudomolecules, conflicts between the Hi-C and optical map data are expected during the initial alignment. An iterative process of local contig adjustments (where groups of one or more contigs are considered individually) can be used to order and orient the Hi-C sequence contigs based on the optical map (Figure 3). Currently, these adjustments are made manually but it is likely that automated pipelines will be developed in the future.
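
The soft thresholds discussed above can be summarized in a short Python sketch. The function name and the returned wording are ours, and the treatment of a value of exactly 85% or 70% is an arbitrary choice; the output is meant as a prompt for discussion, not as a pass/fail test.

```python
def interpret_alignment(percent_aligned):
    """Map a whole-genome alignment percentage onto the soft guidance
    discussed in the text (about 85% and 70% as approximate boundaries)."""
    if not 0 <= percent_aligned <= 100:
        raise ValueError("percent_aligned must be between 0 and 100")
    if percent_aligned >= 85:
        return "encouraged: consistent with well-validated assemblies"
    if percent_aligned >= 70:
        return "borderline: should be reasonably justified"
    return "low: consider improving the sequence or optical map assembly"
```

Under this sketch, the 96% reported for rice Nipponbare falls in the first category and the 75% reported for one tetraploid cotton draft falls in the second.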

In summary, a respectable draft genome sequence will have matches to a respectable optical map assembly. Because both assemblies are independently constructed, each has its own sources of assembly limitations and errors. The optical map assembly can be used to independently validate the distances between nick sites in the DNA sequence assembly. If the validation percentage is high, optical map data can be directly used in hybrid scaffolding to improve the overall assembly. If the validation percentage is low, researchers can use the data to assess the need for reassembly, additional data collection, or both. If the validation is borderline (70–85%), the lack of congruence might be justified (e.g., accession or species differences), or it might be addressed through reassembly and conflict resolution. Resolving conflicts might be sufficient to improve the overall alignment, although this depends on the genome and the quality of the assemblies. If Hi-C scaffolding was used, adjustments to the local order of contigs leverage the strengths of the optical map and do not necessarily invalidate the ordering of likelihoods used for pseudomolecule construction. We anticipate dramatic improvements in all areas of sequencing and genome assembly in the coming years, but until chromosomes can be confidently assembled from end to end from sequence data alone, optical mapping will continue to have an important niche.

Acknowledgments

R.K.D. thanks Florian Jupe, Jinghua Shi, and Joseph Ecker for providing training in optical mapping. Alex Freeman identified and adjusted the genomic region described in Figures 2 and 3. Graduate students Alex, Chris Hanson, and Evan Long created the Bionano genome assemblies in Table 1 (except for maize, created by R.K.D.). We thank Mingcheng Luo, Jonathan Gent, Jianing Lui, Shawn Sullivan (Phase Genomics), and Sven Bocklandt (Bionano) for valuable comments on an early version of this manuscript. Work in the Udall laboratory was funded by a grant from the National Science Foundation (1339412). Another grant from the National Science Foundation (1444514) funds work in the Dawe laboratory.

AUTHOR CONTRIBUTIONS

J.A.U. and R.K.D. jointly conceived and wrote the article. J.A.U. prepared the figures.

Footnotes

  • www.plantcell.org/cgi/doi/10.1105/tpc.17.00514

  • [OPEN] Articles can be viewed without a subscription.

  • Received July 6, 2017.
  • Revised November 21, 2017.
  • Accepted December 20, 2017.
  • Published December 20, 2017.

Citation Tools
Is It Ordered Correctly? Validating Genome Assemblies by Optical Mapping
Joshua A. Udall, R. Kelly Dawe
The Plant Cell Jan 2018, 30 (1) 7-14; DOI: 10.1105/tpc.17.00514

Copyright © 2021 by The American Society of Plant Biologists
