Translation of the third chapter of the book “Non-Invasive Prenatal Testing (NIPT) (Subtitle: Application of Genomics to Prenatal Testing and Diagnosis)” edited by Lieve Page-Christiaens and Hanns-Georg Klein (Academic Press, 2018) (Hiro Clinic NIPT tentative translation)
Chapter 3. Cell-free DNA-based NIPT technology and bioinformatics
Dale Muzzey, University of California, San Francisco
Introduction: Evolution of sequencing technology
The term “next-generation sequencing” (NGS) raises obvious questions: what was “first-generation sequencing,” and how is NGS similar to and different from it? Sanger pioneered first-generation DNA sequencing in the 1970s (References 1, 2). The method for which he became famous exploited the in vitro extension of DNA in which non-extendable (chain-terminating) bases are incorporated by a DNA polymerase. These modified bases were added at low concentration to a reaction containing: (1) a high concentration of extendable bases, (2) the single-stranded DNA to be sequenced, (3) a short oligonucleotide primer complementary to the DNA template (from which new bases are synthesized), and (4) a DNA polymerase enzyme that carries out the extension reaction. In Sanger’s early sequencing experiments, four such reactions were run independently, each containing a single species of non-extendable base (A, T, G, or C). Whenever the polymerase randomly incorporates a non-extendable base into the nascent DNA molecule (e.g., a non-extendable G incorporated opposite a C in the template), synthesis stops, permanently terminating that strand. Crucially, because all nascent strands are anchored to and extended from the same oligonucleotide primer, the point at which extension stops, and therefore the length of the nascent strand, is a direct proxy for the identity of the base at the molecule’s 3′ end. Electrophoresis gels that resolve the lengths of the terminated molecules in each of the four reactions therefore allow the sequence of the entire template to be inferred.
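The logic of reading a Sanger gel, where fragment length indexes the position of the terminating base, can be sketched in a few lines of code. This is a toy illustration, not part of the original chapter: the sequence and fragment lengths below are invented for demonstration.

```python
# Toy reconstruction of a Sanger gel readout. Each of the four reactions
# (one per non-extendable base) yields a set of terminated-fragment lengths.
# Because all fragments extend from the same primer, a fragment's length
# directly indexes the position of its terminating base.

def infer_sequence(termination_lengths):
    """termination_lengths: dict mapping base -> set of fragment lengths."""
    by_length = {}
    for base, lengths in termination_lengths.items():
        for n in lengths:
            by_length[n] = base
    # Reading the "gel" from shortest to longest fragment spells the sequence.
    return "".join(by_length[n] for n in sorted(by_length))

# Example: a nascent strand 5'-ATGCC-3' terminates at these lengths.
reactions = {"A": {1}, "T": {2}, "G": {3}, "C": {4, 5}}
print(infer_sequence(reactions))  # -> ATGCC
```

Reading the sorted lengths recapitulates exactly what the electrophoresis gel does physically: ordering the terminated molecules by size.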
Sanger sequencing became considerably more scalable with the introduction of non-extendable bases, each labeled with a unique fluorescent dye (Fig. 1). Rather than partitioning the reaction into four tubes to obtain base-specific information, capillary electrophoresis combined with fluorescence detectors allowed both the relative sizes of DNA fragments and the identities of the terminating bases to be read in a single reaction (Refs. 3-5). The achievements of these instruments were not inconsiderable: they were the very machines that drove the first human genome sequence in the 1990s (Refs. 6-9). However, given the billions of dollars and years that effort required, genome sequencing would have remained largely unusable in clinical practice without a major technological breakthrough.
Although NGS has revolutionized genome sequencing by overcoming many of the technical limitations of Sanger sequencing (10), the most well-known NGS methodology still shares many features with its predecessor. As we will discuss in more detail below, NGS also employs chain termination and fluorescent bases, and it likewise relies on the ability of DNA polymerase to add a single base at a time to a nascent DNA molecule. Indeed, in many ways an NGS experiment is like running millions or even billions of Sanger reactions in parallel (hence the nickname “massively parallel sequencing”).
The role of next-generation sequencing
Next-generation sequencing (NGS) machines are responsible for distilling specially prepared libraries of DNA molecules into long text files of sequences, with one line per sequenced molecule. This molecule-to-text conversion is deployed across a range of research and clinical applications, from RNA sequencing in broccoli (11) to ribosome profiling (deep sequencing of ribosome-protected mRNA fragments; 12) to DNA sequencing of pregnant women’s plasma for NIPT (13). These NGS applications are differentiated primarily by the type of DNA fed into the sequencer, produced by an upstream process known as “library preparation.” Mirroring the variety of upstream preparation methods is a broad diversity of downstream analyses, among them the analytical techniques used in NIPT, which are covered in detail in Chapter 3. In this chapter, in addition to explaining how NGS instruments sequence DNA, we focus on NIPT and trace its workflow from upstream to downstream.
Upstream of the sequencer
DNA extraction
cfDNA, as the name suggests, is not contained within blood cells and must be extracted from plasma. cfDNA fragments are remnants of dead cells (Ref. 14). When cells undergo programmed cell death (known as apoptosis), a set of enzymes is activated that digests genomic DNA (Ref. 15). These enzymes can only access DNA that is not wrapped around nucleosomes, the octamers of histone proteins that govern gene expression and genome topology in cells (Ref. 17). Because nucleosomal DNA is inaccessible, the roughly 150-nucleotide stretches of DNA wrapped around nucleosomes survive the apoptotic process; the fragments that escape from dying cells into the circulation form cfDNA, which can be sequenced to produce what are called next-generation sequencing (NGS) reads, discussed in more detail later.
To extract cfDNA from plasma, blood must first be centrifuged to separate it into plasma, buffy coat (containing the white blood cells), and red blood cells. Plasma makes up approximately 55% of total blood. When removing the plasma after centrifugation, it is important to avoid disturbing the buffy coat, because excess maternal DNA from the white blood cells would dilute the rare placental cfDNA, reducing or eliminating the sensitivity for detecting fetal aneuploidies.
Standard commercial DNA extraction techniques can purify sufficient cfDNA from plasma samples for analysis (18, 19). Plasma typically contains only 5-50 ng of cfDNA per mL, and this low concentration is noteworthy because it establishes the minimum blood volume required for cfDNA-based prenatal testing. If the blood volume is too low or the extraction is inefficient, the extracted sample will contain few genome copies, which may prevent detection of small changes in fetal gene dosage. For example, an extracted sample containing only 10 genome copies cannot reveal a 2% change in the dosage of chromosome 21. Conversely, an efficient extraction method can recover enough genome equivalents to detect fetal chromosomal abnormalities even when the fetal fraction is low. The number of genome equivalents that must be recovered depends on the downstream NIPT method. For whole-genome sequencing (WGS), only a small number of genome equivalents is needed from any given region, so a relatively low volume of blood can be drawn from a patient (13). This also allows multiple extraction attempts from a single blood sample. In contrast, targeted techniques such as single nucleotide polymorphism (SNP)-based NIPT require hundreds of genome equivalents per targeted region so that allelic balance can be measured with high precision (WGS- and SNP-based methods are discussed in more detail in Chapter 3) (20). This means that SNP-based NIPT typically requires a larger volume of blood than WGS.
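A back-of-envelope calculation makes the genome-copy argument concrete. The sketch below is an illustration added for this translation, assuming the commonly cited figure of roughly 6.6 pg of DNA per diploid human genome copy and a simple Poisson counting-noise argument; all numbers are illustrative.

```python
import math

# How many genome copies does an extraction yield, and can counting noise
# swamp a small (e.g., 2%) change in chromosome-21 dosage?
# Assumes ~6.6 pg of DNA per diploid human genome copy (approximate figure).

PG_PER_DIPLOID_GENOME = 6.6

def genome_copies(cfdna_ng):
    """Approximate genome copies contained in a given mass of cfDNA (ng)."""
    return cfdna_ng * 1000.0 / PG_PER_DIPLOID_GENOME

def detectable(copies, fractional_change=0.02):
    # Poisson counting noise on N molecules has relative size ~1/sqrt(N);
    # a fractional dosage change is only resolvable if it exceeds that noise.
    return fractional_change > 1.0 / math.sqrt(copies)

print(genome_copies(10))   # ~1515 copies from 10 ng of cfDNA
print(detectable(10))      # False: 10 copies cannot reveal a 2% change
print(detectable(10_000))  # True: 2% exceeds 1/sqrt(10000) = 1%
```

The same reasoning explains why targeted methods demand hundreds of genome equivalents per site: precision scales only as the square root of the number of counted molecules.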
Because the concentration of cfDNA is so low, it is not trivial to verify that enough cfDNA has been extracted to perform NIPT. The amount of DNA can usually be increased by performing polymerase chain reaction (PCR) before NGS, so even an inefficient extraction can yield abundant DNA for sequencing. This means, however, that the quantity of DNA entering the sequencer says nothing about whether the extraction was inefficient. Fortunately, the “complexity” of the sequenced data can reveal extraction efficiency. In WGS, for example, efficient extraction yields data in which each unique cfDNA fragment is sequenced zero or one times (usually zero), because sequencing is a Poisson sample from a source pool rich in genome equivalents (Reference 21). If the source pool is thin due to inefficient extraction, by contrast, the same fragments will be sequenced repeatedly, producing low-complexity data. Conversely, if extraction is highly efficient, PCR may not be required at all to generate sufficient DNA for sequencing; such “PCR-free” library preparations tend to have high library complexity. It is important to monitor the complexity of NGS data so that fetal aneuploidy calls rest on statistically sound footing.
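The Poisson-sampling picture above can be made quantitative. The sketch below is an illustration added for this translation: it models sequencing as drawing R reads from a pool of M unique molecules, so the expected number of distinct molecules observed is M(1 - e^(-R/M)); the pool sizes are invented for demonstration.

```python
import math

# Why library "complexity" reveals extraction efficiency: sequencing is
# modeled as a Poisson sample of R reads drawn from a pool of M unique
# molecules. A thin pool (inefficient extraction) yields many duplicates.

def expected_unique(reads, pool_size):
    """Expected number of distinct molecules seen among `reads` draws."""
    return pool_size * (1.0 - math.exp(-reads / pool_size))

def duplicate_fraction(reads, pool_size):
    return 1.0 - expected_unique(reads, pool_size) / reads

# 10 million reads from a rich pool vs. a thin pool (illustrative numbers):
print(round(duplicate_fraction(10e6, 1e9), 3))  # rich pool: ~0.005 duplicates
print(round(duplicate_fraction(10e6, 1e6), 3))  # thin pool: ~0.9 duplicates
```

In the rich-pool case almost every read is a distinct molecule, exactly the high-complexity signature described in the text; in the thin-pool case 90% of reads are redundant copies.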
Library preparation
NGS machines can only sequence DNA molecules that have been properly prepared into what is called a library. In particular, on the Illumina platform that dominates the clinical field, the DNA molecules in a library must carry a common adapter on each side (sequences of up to 50 nt specified by the manufacturer): all 5′ ends share one common sequence and all 3′ ends share another (22). Because common adapters flank every DNA molecule, the entire library can be amplified or extended with just a single pair of primers. Such amplification and extension can occur (1) upstream of the sequencer, to ensure the input DNA is sufficiently concentrated (this step is optional); (2) within the NGS machine just before sequencing, during the “cluster generation” step (more on this later); or (3) during the sequencing reactions themselves (also more on this later).
The process by which every molecule acquires distinct 5′ and 3′ adapters involves some clever molecular biology. First, cfDNA molecules, which have blunt ends or short single-stranded overhangs, are incubated with polymerase enzymes that trim 3′ overhangs, fill in 5′ overhangs, and add a single adenine (A) base to each 3′ end, creating an A overhang (Figure 2). Next, these DNA fragments are combined with a “Y-adapter” and a ligase. The Y-adapter consists of two DNA strands that are complementary at one end (the stem of the Y) and non-complementary at the other (the arms of the Y). The stem of the Y-adapter terminates in a T overhang, which hybridizes to the A overhang of a cfDNA fragment. Because the two arms of the Y-adapter differ in sequence, ligation of Y-adapters to both ends of a cfDNA molecule leaves each strand with one common adapter sequence at its 5′ end and a different common adapter sequence at its 3′ end.
Sequencing bias is minimized when the DNA fragments subjected to NGS are of similar length (Ref. 24). In most practical applications of NGS, an in vitro fragmentation reaction, often followed by size selection, produces fragments of comparable and acceptable size. In the case of NIPT, however, such a step is unnecessary, because the in vivo fragmentation that occurs during apoptosis produces fragments of remarkably uniform length, around 150 nt (Refs. 25, 26). In fact, in vivo cfDNA fragmentation is so precise that subtle differences between placental nucleosomes and those of other tissues produce cfDNA length differences that can be used to identify fragments of fetal origin (placental fragments are systematically shorter than those from nonplacental tissues) (Ref. 27).
cfDNA length can be captured by WGS-based NIPT but not by SNP-based NIPT, owing to differences in library preparation. WGS simply ligates Y-adapters onto otherwise unaltered cfDNA molecules (beyond the end repair and A-tailing described above). This simple library preparation suits WGS tests, whose goal is to survey cfDNA from across the entire genome. SNP-based NIPT, on the other hand, gains no insight into fetal aneuploidy from cfDNA fragments except those overlapping informative SNP sites. It therefore requires enrichment of the fragments of interest, which is accomplished by multiplex polymerase chain reaction (PCR) (Reference 20). In multiplex PCR, hundreds or even thousands of different primers are mixed together in a single PCR tube containing the cfDNA extracted from a sample. With properly designed primers and reaction conditions, fragments from the targeted sites become highly enriched for sequencing. Adapter sequences can be appended directly during the multiplex PCR (in which case multiplex PCR alone generates an NGS-compatible library), or multiplex PCR can be followed by Y-adapter ligation. Length information is lost in this reaction because what gets sequenced are the amplicons, whose length is dictated by the primer positions rather than by the cfDNA template they amplify.
In NGS-based NIPT, barcoding is an important library preparation step, necessitated by the format in which sequencing is sold (Reference 28). Illumina sequencing capacity is sold in units of flow cells, each of which yields hundreds of millions to billions of reads. This is far more reads than a single sample requires, so it is more economical to load many samples onto one flow cell. This is called “multiplexing.”
However, unlike assay platforms such as qPCR, ELISA, and capillary sequencers that keep samples physically separate throughout the assay, NGS flow cells do not separate samples at all during sequencing. A “demultiplexing” mechanism is therefore needed so that NGS data can be re-separated by sample after sequencing. Demultiplexing relies on sample-specific barcodes: short DNA sequences (typically 6-8 nt) embedded in the set of Y-adapters used for a given sample’s library. Importantly, each sample’s barcode is unique, and all molecules from the same sample carry the same barcode. The NGS instrument produces one text file containing the barcode reads and another containing the cfDNA reads, with corresponding lines describing the same molecule (i.e., the first record of the barcode file and the first record of the cfDNA file derive from the same cfDNA molecule). Using these files, even though the samples are physically intermingled during sequencing, the flow-cell-wide sequencing files can be split computationally into sample-specific files, recapitulating the separation of data by sample.
The role of next-generation sequencing: from molecular libraries to text files
While researchers may debate whether post-Sanger platforms beyond Illumina’s also merit the label “next-generation sequencing (NGS),” in the field of clinical genomics and NIPT the term effectively connotes Illumina-style sequencing, currently the dominant platform. For this reason, we describe the “sequencing-by-synthesis” process that Illumina instruments perform.
Creating a cluster
To get an intuition for the Illumina NGS workflow, it is useful to recall from the earlier comparison with Sanger sequencing that both Sanger and Illumina NGS determine DNA sequence by measuring fluorescent labels one base at a time. At its most fundamental, then, an NGS instrument must be able to resolve individual molecules and capture the fluorescent signals that reveal each molecule’s genomic sequence. The process of cluster generation ensures that individual molecules can be resolved and that their fluorescent signals are bright enough to be captured cleanly.
The first step of cluster generation (22) is to load a DNA library (chemically denatured into single strands) into a glass chamber called a flow cell. The flow cell surface is coated with oligonucleotides complementary to the adapter sequences added to the cfDNA fragments during library preparation. The single-stranded fragments become immobilized at random locations on the flow cell surface. Location matters, because each DNA fragment remains in the same place on the flow cell throughout the entire NGS process. The loading concentration of the library must be carefully measured. Too high a concentration causes multiple library fragments to occupy the same location on the flow cell, compromising the ability to reliably detect the fluorescent signal emitted by any particular fragment. Too low a concentration underutilizes the flow cell’s sequencing capacity, leaving the sequencing depth per sample too shallow to reliably detect aneuploidies.
The second step of cluster generation, called bridge amplification, is a polymerase chain reaction (PCR) that takes place on the flow cell surface, where the hybridized DNA is amplified in place. This amplification is necessary because a single fluorescent base on a single molecule undergoing sequencing emits far too little light to be captured by the instrument’s camera. To boost the fluorescent signal to a detectable level, the original library fragment is copied hundreds of thousands of times to create a dense clonal patch of DNA fragments called a “cluster.” Bridge amplification (shown schematically in Figure 4) uses the oligonucleotides attached to the flow cell surface as the primers for each round of amplification. Because the oligos are anchored to the glass, the single-stranded molecules must bend over and bridge to nearby surface oligos to prime each successive round of extension. The final step of cluster generation uses cleavage enzymes and chemical denaturation to remove one of the two strands of each molecule (e.g., the strands anchored via the pink oligo), leaving behind single-stranded DNA molecules of identical sequence and orientation (see the top of Figure 5, where each single-stranded molecule has the pink primer sequence at its free end and the blue primer anchored to the flow cell).
Sequencing cycle
Once the clusters are generated, the sequencing reaction begins (22). The first step is to load a sequencing primer into the flow cell; this primer hybridizes to the common sequence embedded in the adapter of every fragment. The 3′ end of the primer sits immediately adjacent to the cfDNA insert, so sequencing begins right at the end of the cfDNA fragment. The NGS instrument then floods the flow cell with a reaction mix that includes fluorescently labeled non-extendable nucleotides and DNA polymerase (Figure 5). The polymerase extends from the end of the primer, incorporating the fluorescently labeled base complementary to the template cfDNA molecule. Because these bases are non-extendable, extension proceeds by exactly one base and then stalls. At this point the unincorporated nucleotides and extension mix are flushed out of the flow cell chamber and imaging begins. The entire flow cell is scanned by a camera (in recent Illumina machines both the top and bottom surfaces are scanned), the clusters are identified, and the images are saved. The color of each cluster reflects the identity of the base just incorporated; the clusters are visible at all only because of the amplification performed during cluster generation. After imaging, a chemical mixture enters the flow cell that removes the fluorescent moiety from the just-incorporated bases, restoring their ability to extend. This restoration is crucial, as it allows each molecule to undergo further rounds of extension and imaging. Indeed, the cycle of extension, imaging, and restoration can be repeated hundreds of times (depending on the user’s preference), with each additional cycle decoding one more base of each cluster on the flow cell surface. The number of cycles determines the length of the reads used for mapping.
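The extend-image-restore loop can be summarized in a few lines of code. This is a minimal simulation added for this translation, not Illumina’s actual software; the template sequence is invented.

```python
# Minimal simulation of sequencing-by-synthesis: each cycle incorporates one
# reversibly terminated, fluorescent base complementary to the template,
# "images" it by recording the base, then restores extendability.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def sequence_by_synthesis(template, cycles):
    read = []
    for position in range(min(cycles, len(template))):
        base = COMPLEMENT[template[position]]  # extend by exactly one base
        read.append(base)                      # imaging: record the color/base
        # chemical step: cleave fluorophore and terminator, enabling the
        # next cycle of extension (restoration)
    return "".join(read)

print(sequence_by_synthesis("ACGTACGT", cycles=5))  # -> TGCAT
```

Note how the number of cycles, not the template length, sets the read length, mirroring the user-configurable cycle count described above.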
For NIPT applications, the number of sequencing cycles is usually low (25-36). Unlike genomic tests that scrutinize sequenced molecules for novel variants (for which long reads are valuable), current cfDNA NIPT aneuploidy testing does not aim to discover novel variants at the single-base level. For SNP-based NIPT, the multiplex-PCR amplicons can be designed so that the interrogated SNPs fall immediately adjacent to the sequencing primer. For WGS-based NIPT, sequencing need only continue until reads are long enough to map uniquely. Short reads are attractive for NIPT for two reasons: (1) NGS runtime is proportional to the number of sequencing cycles, so short reads allow faster test reporting; and (2) short reads cost less than long reads, making the test more affordable.
Paired-end sequencing
In NGS machines, sequence can only be determined from the ends of DNA fragments, because the primers that initiate the sequencing reaction anneal to the adapters immediately adjacent to the fragments (Ref. 22). Furthermore, because nucleotide extension proceeds only from 5′ to 3′, the flow-cell-anchored single-stranded fragments produced by cluster generation can be sequenced from only one end. However, a procedure called “paired-end” sequencing allows sequence to be read from both ends of a DNA fragment. As the name suggests, paired-end sequencing consists of two rounds of the single-ended sequencing described above, each driven by a different primer and separated by a strand-switching scheme. In the first step of this scheme (Figure 6), the double-stranded DNA on the flow cell is denatured; the single strands originally present in the cluster are thereby separated from the newly synthesized strands, the latter having been created by extension from the first-round sequencing primer. The nascent strands, which are not anchored to the flow cell, are washed away, effectively returning the flow cell to its state just after cluster generation. However, to capture sequence from the other end of the fragment, the reverse complement of each anchored strand must be synthesized so that the extension reaction can again proceed from 5′ to 3′, this time from the second sequencing primer. This is accomplished by a single round of bridge amplification, which regenerates the complementary strand of each cluster molecule. A cleavage reagent is then introduced that cuts the flow-cell-anchored oligonucleotide holding the original strand, thereby removing it.
This leaves the cluster ready to be extended from the second sequencing primer, allowing sequence information to be obtained from the other end of the DNA fragment.
Image analysis and base calling
We have seen how NGS machines convert molecular information, the DNA sequence itself, into piles of images, but the desired output is a text file of sequence information, not high-resolution images. This final conversion is performed by software built into the sequencer in a process called “base calling” (Fig. 7; Ref. 22). The software’s goal is to find every cluster in every image, tracking each cluster’s location and color across the image stack. In early sequencers, clusters were randomly positioned across the flow cell, and the four nucleotides were labeled with four different colors, so the machine captured and analyzed four images per cycle. Newer machines instead use patterned flow cells: the cluster-growth chemistry differs, with molecular templates filling millions of tiny wells etched into the flow cell surface in a honeycomb pattern. This honeycomb pattern simplifies, and therefore speeds up, image analysis. The decision to encode the four bases with only two colors also speeds up image analysis: by labeling adenine with both green and red fluorophores, all four bases can be distinguished from just the red and green images. Adenine is both red and green, cytosine is red but not green, thymine is green but not red, and guanine is dark.
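The two-color decoding rule just described is simple enough to state as a truth table in code. This sketch, added for this translation, assumes only the red/green encoding given in the text.

```python
# Decoding Illumina-style two-channel chemistry: a cluster's base identity
# is inferred from whether it lights up in the red and/or green images.
# (A: red and green, C: red only, T: green only, G: dark.)

def call_base(red, green):
    if red and green:
        return "A"
    if red:
        return "C"
    if green:
        return "T"
    return "G"

# One (red, green) signal pair per sequencing cycle for a single cluster:
signals = [(True, True), (True, False), (False, True), (False, False)]
print("".join(call_base(r, g) for r, g in signals))  # -> ACTG
```

Two images per cycle instead of four is exactly the savings that makes this encoding faster to analyze.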
For each cluster it analyzes, the software issues both a “base call” (i.e., A, T, C, G, or N if the base cannot be resolved) and a quality score. The quality score, which ranges from roughly 1 to 40, indicates the confidence in the base call; it is an analog companion to the digital base calls. Quality scores matter because they provide a bioinformatic filter on the sequence information. For example, a read composed entirely of poor-quality bases should probably be ignored. Similarly, for a technology such as single nucleotide polymorphism (SNP)-based NIPT, which infers allele balance from the identity of single bases (see Chapter 3 for a detailed description of the SNP-based algorithm), accounting for quality scores can greatly improve aneuploidy detection.
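Quality scores are conventionally Phred-scaled and stored in FASTQ files as ASCII characters offset by 33; a score Q corresponds to an error probability of 10^(-Q/10). The snippet below, added for this translation, shows the conversion.

```python
# Phred quality scores: Q = -10 * log10(p_error), stored in FASTQ as ASCII
# characters offset by 33. Low-quality bases can then be filtered out.

def phred_to_error(qual_char, offset=33):
    """Convert one FASTQ quality character to an error probability."""
    q = ord(qual_char) - offset
    return 10 ** (-q / 10.0)

print(phred_to_error("I"))  # Q40 -> 0.0001 error probability
print(phred_to_error("#"))  # Q2  -> ~0.63 error probability
```

A read whose quality string is mostly "#" characters is the kind of all-poor-quality read the text suggests discarding.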
Once image analysis has determined each cluster’s sequence and quality scores, the information is written to a FASTQ file. Each cluster is also given a name, which usually encodes the cluster’s location on the flow cell. Together, these constitute an NGS “read.” Current NGS machines produce billions of reads per flow cell, so FASTQ files are very large text files. Importantly, the creation of the FASTQ file marks the fulfillment of the NGS machine’s central purpose: converting a molecular library of DNA fragments into text-based sequence information.
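The FASTQ layout itself is simple: each read occupies four lines (name, sequence, a "+" separator, and quality characters). A minimal parser, added here for illustration with invented records, looks like this:

```python
# A FASTQ file stores one read per four lines: name, sequence, "+", quality.
# Minimal parser over an in-memory example (records are invented).

def parse_fastq(lines):
    records = []
    for i in range(0, len(lines), 4):
        name, seq, _, qual = lines[i:i + 4]
        records.append((name[1:], seq, qual))  # strip the leading "@"
    return records

fastq = [
    "@cluster_1:1101:1000", "ACGTAC", "+", "IIIIII",
    "@cluster_1:1101:1001", "TTGACG", "+", "III#II",
]
recs = parse_fastq(fastq)
print(len(recs), recs[0][1])  # -> 2 ACGTAC
```

Real FASTQ files hold billions of such four-line records, which is why downstream tools stream them rather than loading them whole.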
Downstream of the sequencer: demultiplexing and alignment
Once sequencing and base calling are complete, the next steps are demultiplexing and alignment. These processes are common to all NGS-based NIPT and are performed prior to the platform-specific analyses discussed in Chapter 3. The first, demultiplexing, assigns NGS reads back to their original samples based on the sequences of their molecular barcodes. Barcode reads and cfDNA reads are written to separate FASTQ files (in paired-end sequencing there are two cfDNA FASTQ files and one barcode file) whose lines correspond to one another, which makes demultiplexing straightforward (it is usually handled by the sequencer’s own software). The user first supplies the software with a mapping from barcode to sample name. A simple script then walks through the barcode and cfDNA FASTQ files in tandem and copies each cfDNA read into the sample-specific FASTQ file indicated by its barcode. To minimize the discarding of reads due to small NGS errors during barcode sequencing (e.g., an incorrect base call within the barcode), barcodes are typically chosen to be dissimilar enough that even with one or two mismatches they can still be unambiguously distinguished from one another (Reference 30).
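A demultiplexing script of the kind described can be sketched directly. This is an illustration added for this translation; the sample names, barcodes, and reads are all invented, and it tolerates one mismatch per barcode, relying on the barcodes being mutually distant.

```python
# Demultiplexing sketch: assign each cfDNA read to a sample by comparing its
# barcode read against the known sample barcodes, tolerating one mismatch.

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def demultiplex(barcode_reads, cfdna_reads, sample_barcodes, max_mismatch=1):
    bins = {name: [] for name in sample_barcodes}
    bins["undetermined"] = []
    for bc, read in zip(barcode_reads, cfdna_reads):
        hits = [name for name, s in sample_barcodes.items()
                if hamming(bc, s) <= max_mismatch]
        # A read is assigned only if exactly one sample barcode matches.
        bins[hits[0] if len(hits) == 1 else "undetermined"].append(read)
    return bins

samples = {"sampleA": "ACGTAC", "sampleB": "TGCATG"}
bins = demultiplex(["ACGTAC", "ACGAAC", "GGGGGG"],
                   ["read1", "read2", "read3"], samples)
print(bins["sampleA"], bins["undetermined"])  # -> ['read1', 'read2'] ['read3']
```

The second barcode carries one sequencing error yet is still rescued, while an unrecognizable barcode sends its read to the undetermined bin, exactly the error tolerance the barcode design enables.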
The basic premise of cfDNA NIPT, namely determining whether a particular genomic region is present in abnormal amounts, requires that cfDNA molecules be mapped to their regions of origin. This mapping occurs during alignment. The idea of alignment is simple: for a given read of tens to hundreds of characters, find where the same characters appear in a string of approximately 3 billion characters (the human reference genome). Though conceptually simple, doing this efficiently is not easy. NIPT analysis usually considers only uniquely mapped reads, treating sequences that could derive from multiple locations as ambiguous, yet each read must still be searched against the whole genome. Naively scanning the entire genome for each read, checking for a match at every offset, would work but is grossly inefficient: it requires roughly 3 billion comparisons per read (one per offset in the reference), and performing this procedure on the billions of reads from a single flow cell would require up to 10^18 calculations. To complicate matters further, NGS reads often differ from the reference genome at particular positions (e.g., at SNPs or sequencing errors), so deciding whether a read maps to a given location is harder than finding an exact match: the mapping algorithm must instead assess whether a read is approximately similar to the reference at each candidate location.
The viability of NGS as a practical technology, in general and for NIPT in particular, has depended on the development of fast algorithms that can align millions of reads to the human genome in minutes (Refs. 31, 32). A key insight of the developers of these algorithms is that, unlike the experimental data, the reference genome is static. Preprocessing the genome to make it easier to search can therefore pay major dividends in subsequent performance. Indeed, the most popular alignment software packages share an upstream indexing step that creates a set of index files from the reference genome. The primary index file is an elaborately permuted version of the genome that preserves all the information in the original sequence while reordering it in a way that makes it fast to search. This special ordering allows reads to be matched one base at a time: if the read’s next base is, say, a G (guanine), then roughly 75% of the remaining search space in the transformed genome can be ignored (the portions corresponding to A, C, or T). By iterating over successive bases, the search rapidly homes in on the read’s true origin. With this transformed genome, a read can be mapped in as few as 10-20 operations (i.e., 10-20 bases), in stark contrast to the naive algorithm above, which requires about 3 billion comparisons to test a read against every reference position. These alignment algorithms incorporate subtle modifications to be robust to gaps and mismatches, yet these features add minimal overhead. In the end, the preprocessed genome index makes alignment feasible within the timescale required for clinical cfDNA NIPT.
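The payoff of indexing a static reference can be illustrated with a deliberately simplified scheme. Real aligners use a Burrows-Wheeler transform/FM-index rather than the k-mer lookup table below; this sketch, added for this translation with an invented reference string, conveys only the shared idea of preprocessing once and then searching cheaply per read.

```python
# Why preprocessing the (static) reference pays off: index it once, then
# look up each read via a short prefix instead of scanning every offset.
# (A toy k-mer index; production aligners use BWT/FM-index structures.)

def build_index(reference, k=4):
    """Map every k-mer in the reference to its list of start positions."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def map_read(read, reference, index, k=4):
    # Candidate positions come from the index; verify the full read there.
    candidates = index.get(read[:k], [])
    return [p for p in candidates if reference[p:p + len(read)] == read]

ref = "TTACGGATCCGATTACGCGA"
idx = build_index(ref)
print(map_read("GATCCGAT", ref, idx))  # -> [5]
```

Only the handful of positions sharing the read’s leading k-mer are ever inspected, which is the same search-space pruning that lets FM-index aligners map a read in tens of operations rather than billions.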
Alternative sequencing and non-sequencing technologies
While most sequencing-based NIPT tests use Illumina instruments, any technology, sequencing-based or otherwise, that can determine the genomic locations of many cfDNA molecules could in principle serve cfDNA-based NIPT, provided it operates at sufficient speed and throughput. Illumina’s sequencing-by-synthesis technology supplanted related approaches (e.g., pyrosequencing) and sequencing-by-ligation platforms (e.g., SOLiD) because of its low per-base cost and its ability to scale to meet clinical reporting times (10). However, competing sequencing technologies that are faster or cheaper could rapidly reshape the NIPT sequencing landscape. For example, nanopore sequencing is still in its infancy but may eventually gain an advantage over sequencing-by-synthesis approaches (33). Nanopores determine DNA sequences by measuring the voltage signature of long DNA molecules as they thread through a protein pore embedded in a membrane. Each group of nucleotides in the pore (e.g., GCGTA) produces a characteristic voltage level, and a base-calling algorithm deconvolves the full voltage trajectory of a molecule into its DNA sequence. The speed and throughput of nanopores make them attractive for cfDNA-based NIPT applications. Nanopore developers have struggled with the platform’s high error rate, which complicates the identification of variants in a patient’s genome, but even this limitation may matter less for depth-based cfDNA NIPT, which tolerates errors as long as the alignment algorithm can still map reads to their genomic locations of origin (see Chapter 3). Nanopores are best suited to sequencing very long DNA molecules, approaching hundreds of thousands of bases, far longer than the <150 nt of a single cfDNA fragment.
For this reason, an optimal library preparation for nanopore sequencing might involve extensive concatemerization of cfDNA, stitching hundreds of cfDNA molecules into a single long molecule. Importantly, this discussion of using nanopores for cfDNA NIPT is largely speculative; nanopore platforms have not been specifically validated for cfDNA NIPT. Indeed, nanopores are currently better suited to the other genomics applications that motivated their development. If anything, however, nanopores illustrate that sequencing technologies are still evolving and that new techniques may emerge at any time. Today, sequencing-by-synthesis approaches are ubiquitous in sequencing-based NIPT assays not because they are intrinsically superior, but because they currently offer the best per-base cost and turnaround time in a rapidly evolving field.
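A minimal sketch of the concatemerization idea: short cfDNA fragments are joined with a known linker into one long molecule suitable for nanopore sequencing, and the fragments are recovered computationally by splitting the long read at the linker. The linker sequence and fragments here are hypothetical, and real protocols must additionally contend with sequencing errors inside the linker.

```python
# Hypothetical linker, chosen so that it does not occur within the fragments;
# this uniqueness is what makes the computational split unambiguous.
LINKER = "TTTTACGTTTTT"

fragments = ["ACGTGGCATG", "CGTACGGATC", "CGTAGGCTAA"]

# Library preparation: stitch the fragments into a single long molecule.
concatemer = LINKER.join(fragments)

# After sequencing, split the long read back into its constituent fragments.
recovered = concatemer.split(LINKER)
print(recovered == fragments)  # -> True
```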
As noted above, a DNA technology suitable for NIPT need only map cfDNA fragments to genomic locations quickly and cheaply, and microarray results show that this does not strictly require DNA sequencing. Microarray-based cfDNA NIPT tests measure the abundance of hundreds of thousands of cfDNA fragments using hybridization probes specific to the genomic regions of interest (Ref. 34). Microarrays (discussed in more detail in Chapter 3) can also interrogate many specific alleles, and sampling highly polymorphic SNP sites in this way provides information about the fetal-derived fragments. The idea of quantification via cognate DNA (e.g., hybridization probes on a microarray) rather than direct sequencing of the region of interest also underlies recent attempts to perform cfDNA NIPT with quantitative polymerase chain reaction (qPCR) (Ref. 35), in which appropriately chosen primer sets measure the abundance of cfDNA in NIPT-relevant regions.
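The value of highly polymorphic SNP sites can be made concrete with a common back-of-the-envelope calculation: at a site where the mother is homozygous and the fetus is heterozygous for a paternally inherited allele, that paternal allele appears on roughly half of the fetal-derived fragments, so the fetal fraction is approximately twice the observed minor-allele frequency. The allele counts below are invented for illustration.

```python
# Illustrative allele counts at SNP sites where the mother is homozygous
# and the fetus carries a paternally inherited (minor) allele.
sites = [
    # (reads matching maternal allele, reads matching paternal allele)
    (950, 50),
    (960, 40),
    (940, 60),
]

def fetal_fraction(allele_counts):
    """Average fetal-fraction estimate across informative SNP sites."""
    estimates = []
    for maternal, paternal in allele_counts:
        minor_af = paternal / (maternal + paternal)
        # The paternal allele sits on ~half of fetal fragments, hence the factor 2.
        estimates.append(2 * minor_af)
    return sum(estimates) / len(estimates)

print(round(fetal_fraction(sites), 3))  # -> 0.1, i.e., ~10% fetal fraction
```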
Conclusion
Next-generation sequencing (NGS) is particularly well suited to NIPT for reasons both obvious and less obvious. The obvious reason is that NGS provides digital, nucleotide-level data that enables the depth- and allele-based NIPT workflows, both of which require identifying and counting cfDNA fragments (discussed in more detail in Chapter 3). Crucially, NGS instruments can generate these data easily and quickly, helping NIPT meet the practical constraints of laboratories, clinicians, and patients.
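The depth-based counting workflow mentioned here reduces to a simple statistic: count the reads mapping to a chromosome of interest (e.g., chromosome 21) and compare the observed fraction against its distribution in euploid pregnancies. The expected fraction and spread below are illustrative numbers, not clinical reference values.

```python
def depth_zscore(chr21_reads, total_reads, expected_frac, frac_sd):
    """Z-score of the observed chromosome-21 read fraction vs. euploid samples."""
    observed = chr21_reads / total_reads
    return (observed - expected_frac) / frac_sd

# Illustrative run: an excess of chr21 reads yields a high z-score, consistent
# with extra placental chr21 material in a trisomy 21 pregnancy. The euploid
# mean and standard deviation here are hypothetical placeholders.
z = depth_zscore(chr21_reads=14_000, total_reads=1_000_000,
                 expected_frac=0.0135, frac_sd=0.0001)
print(round(z, 1))  # -> 5.0, far above a typical z > 3 calling threshold
```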
A less obvious reason is that NGS captures signals specific to placenta-derived cfDNA. For example, NGS measures fragment length, and placental fragments are generally shorter than maternal fragments. Fetal fragments also carry characteristic DNA methylation patterns, which NGS can detect after bisulfite treatment (Ref. 36): methylated C bases remain unchanged, while unmethylated C bases react with bisulfite and are converted to uracil, which is read as thymine during sequencing. Finally, NGS reports the positions of cfDNA fragment ends at single-nucleotide resolution, and this end information carries an important placental signal, because fragment ends reflect nucleosome positioning, which differs between maternal and placental tissue (Ref. 37). Analysis algorithms that extract and amplify such placental signals can sensitively detect fetal chromosomal abnormalities.
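The fragment-length signal can be illustrated with a simple summary statistic: because placenta-derived fragments tend to be shorter than maternal ones, the share of fragments below a short-length cutoff rises with fetal fraction. The cutoff and lengths below are hypothetical illustrations, not clinical parameters.

```python
SHORT_CUTOFF = 150  # bp; an illustrative threshold, not a validated clinical one

# Invented fragment lengths: maternal-like peaks near ~166 bp, shorter
# placental-like fragments below the cutoff.
fragment_lengths = [166, 143, 167, 140, 168, 166, 139, 167, 145, 166]

def short_fragment_share(lengths, cutoff=SHORT_CUTOFF):
    """Fraction of cfDNA fragments shorter than `cutoff` base pairs."""
    return sum(1 for n in lengths if n < cutoff) / len(lengths)

print(short_fragment_share(fragment_lengths))  # -> 0.4
```

In practice such length information comes for free from paired-end sequencing, and weighting or filtering reads by length is one way algorithms amplify the placental signal.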
cfDNA-based NIPT is now rapidly becoming routine in clinical prenatal care, largely because NGS has matured as a means of reading and counting cfDNA. Widespread clinical adoption of NIPT has in turn stimulated technological development to reduce costs and has generated large data sets that enable the discovery of more nuanced, placenta-specific signals. Efforts to quantify and interpret cfDNA will therefore continue to advance rapidly, and these improvements will both strengthen the results of cfDNA-based NIPT and make it more widely available.