ESTs

Classic DNA Sequencing

The base-by-base sequencing of specific regions of the chromosome or of whole genomes would have been impossible on a large scale without the improvements made in DNA sequence determination techniques. Thanks primarily to Sanger's method of enzymatic sequencing (1977), the discovery of the polymerase chain reaction or PCR (1986), and the development of automated sequencing in the early 1990s, productivity has increased from 1 base sequenced over the course of a year's work by a single operator (1 base/year/man) in 1965 to 1 billion bases/year/man in 2000 and up to 10,000 billion bases/year/man in 2021. The actual sequence of the four nitrogenous bases along the double helix constitutes the physical map with the highest degree of resolution and has been publicly available in electronic form since April 14, 2003, for almost all of the euchromatic regions of human chromosomes.

The ultimate goal of any systematic mapping project is to determine the complete nucleotide sequence of the DNA molecule that makes up the chromosome. Due to technical limitations inherent in the enzymatic sequencing method employed up to the late 2000s, a continuous sequence no longer than 500 to 1,000 bases (typically 700) can be obtained in a single experiment. It is, therefore, necessary to proceed by sequencing small stretches of DNA whose respective sequences are finally assembled into a continuous sequence, termed a "contig." The process works back toward a single final "contig," represented by the entire chromosome, either by exploiting the information on the relative position of isolated DNA fragments or by "de novo" assembly of the fragment sequences.
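The idea of "de novo" assembly described above can be illustrated with a minimal sketch. It assumes error-free reads that all come from the same strand and a sequence without repeats; the function names (overlap, greedy_assemble) and the min_len parameter are hypothetical, and real assemblers are far more elaborate. The sketch repeatedly merges the two reads with the longest suffix-prefix overlap until no overlaps remain.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Greedily merge reads into contigs (illustrative only)."""
    # Drop exact duplicates and reads fully contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs
               if not any(r != s and r in s for s in contigs)]
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            # No remaining overlaps: what is left are the final contigs.
            return contigs
        # Merge the best pair, writing the overlapping stretch only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Toy example: three overlapping "reads" taken from one 24-base sequence.
target = "ATGCGTACCTGAAGTCCGATTAGC"
reads = [target[12:24], target[0:10], target[6:16]]
contigs = greedy_assemble(reads)
print(contigs)
```

In this toy case the three reads collapse into a single contig identical to the starting sequence. With repeated regions or sequencing errors, a greedy strategy like this one can produce wrong joins, which is one reason positional information on the fragments is valuable.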

Nucleotide sequencing was performed in the 1960s and 1970s by traditional biochemical methods, which involved detaching (and analyzing) one nucleotide after another from a nucleic acid strand. These methods were very time-consuming and resource-intensive: Robert Holley's group took a year to identify the sequence of the 77 nucleotides that make up the alanine tRNA of yeast (Holley et al., 1965).

A fundamental breakthrough occurred in 1977 when two new DNA sequencing methods were described: the chemical method, proposed by Maxam and Gilbert (base-specific chemical cleavage method), and the enzymatic method, devised by Frederick Sanger and also known as the "chain termination method" or "dideoxy method." Sanger's method, for which he received the Nobel Prize in Chemistry in 1980 (see Fig. 3 there), prevailed because of its greater simplicity of execution and higher productivity, and it has formed the basis of commonly used DNA sequencing methods to this day.

The enzyme that makes Sanger's method possible is DNA polymerase, which, in the presence of the four deoxynucleotide monomers (dATP, dGTP, dCTP and dTTP), is capable of synthesizing a complementary copy of a single-stranded DNA molecule, provided there is a short initial double-stranded region. In this region, the second strand, paired with the first strand by base complementarity, serves as a primer for the extension of the new strand. In the laboratory, synthetic oligonucleotides, i.e., small single-stranded DNA chains of about 20 nucleotides that possess a sequence complementary to that of the point from which sequencing is to be initiated, are used as primers. The primer anneals to the template, providing a free 3′-OH end from which DNA polymerase successively adds nucleotides complementary to those on the strand to be sequenced, originating a new strand by a polymerization process that proceeds in the 5′-3′ direction. The sequence to which the primer anneals must be known, and in the case of fragments cloned within a vector, the region of the vector bordering the insert of the unknown sequence can be used for this purpose.
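The relationship between template, primer and newly synthesized strand can be sketched in a few lines. This is only an illustration under simplifying assumptions: sequences are plain strings written 5′ to 3′, the primer pairs perfectly at a single site, and the toy primer is 10 nucleotides long rather than the roughly 20 used in practice. The names reverse_complement and extend_from_primer, and the example sequences, are hypothetical.

```python
# Watson-Crick pairing used to derive the complementary strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complementary strand of `seq`, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def extend_from_primer(template: str, primer: str) -> str:
    """Return the new strand (5'->3') made by extending `primer` on `template`.

    `template` is written 5'->3'. The primer pairs with the stretch of the
    template that is its reverse complement; the polymerase then adds
    nucleotides to the primer's 3'-OH end, copying the template bases that
    lie on the 5' side of the annealing site.
    """
    site = reverse_complement(primer)
    pos = template.find(site)
    if pos == -1:
        raise ValueError("primer does not anneal to this template")
    return primer + reverse_complement(template[:pos])


# Unknown insert followed (on its 3' side) by a known vector region.
insert = "GATTACAGGC"
vector_flank = "TTGACCGTAA"
template = insert + vector_flank

# The primer is designed from the known vector sequence.
primer = reverse_complement(vector_flank)
new_strand = extend_from_primer(template, primer)
print(primer, new_strand)
```

The part of the new strand that follows the primer is the reverse complement of the unknown insert, which is why reading the new strand is equivalent to reading the insert itself.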

Sanger's method consists of using a DNA fragment of unknown sequence as a template in a polymerization reaction catalyzed by one of the DNA polymerases known to lack exonuclease activity, i.e., the ability to remove nucleotides from the end of a strand, which is useful in vivo but could degrade DNA in vitro. The variant of Sanger's method commonly in use today involves a polymerization reaction set up with the isolated and purified DNA to be sequenced, DNA polymerase and the four deoxynucleoside triphosphates. The key to the method is the further addition of 2′,3′-dideoxynucleotides (ddNTPs), nucleotides modified by the loss of the hydroxyl group at the 3′ position (ddATP, ddGTP, ddCTP, ddTTP). Dideoxynucleotides can be added to the 3′ end in the course of elongating a DNA strand, but because they lack the 3′-OH group, they cannot accept the addition of the next nucleotide. Each time a dideoxynucleotide is incorporated into the new strand, polymerization therefore stops, and the last base of the chain is the one carried by the dideoxynucleotide itself. If the concentration of the dideoxynucleotide added to each reaction is adjusted so that it is incorporated only occasionally, polymerization will proceed normally using the four ordinary nucleotides and will stop when a dideoxynucleotide is inserted at a random position among those where the template carries the base complementary to the specific dideoxynucleotide in the tube. Four tubes are necessary to carry out separate reactions, one for each of the four nucleotides. The logic of the process is thus that of premature base-specific chain termination. Since this is a random process, a population of fragments of varying lengths, each terminating with a dideoxynucleotide, is generated in each tube. At the end of the reaction, the four tubes taken together will contain a pool of chains with termination points corresponding to each base of the strand to be sequenced.

The sequencing reaction products are then separated according to their length, owing to their different migration rates when subjected to polyacrylamide gel electrophoresis, and visualized thanks to radioactive labelling of the fragments (e.g., by 5′ end-labelling with 33P using polynucleotide kinase, PNK). With this method, a difference in length of only one nucleotide is sufficient to change the migration speed of the DNA chains, so the shorter chains arrive at the bottom of the gel first, followed progressively by the longer ones. The order of the bands arriving one after another at the bottom of the electrophoretic run thus coincides with the sequence of bases in the DNA examined.
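The logic of base-specific chain termination followed by separation by length can be mimicked with a small simulation. This is a sketch under stated assumptions, not a model of the real chemistry: each tube holds n_molecules growing chains, a dideoxynucleotide is incorporated with a fixed probability p_stop wherever the tube's base is due, and the "gel" is simply a sort by fragment length. All names and parameter values are hypothetical.

```python
import random

BASES = "ACGT"


def run_tube(new_strand: str, dd_base: str, n_molecules: int,
             p_stop: float, rng: random.Random) -> set[int]:
    """Simulate the tube containing the ddNTP for `dd_base`.

    `new_strand` is the strand being synthesized after the primer (5'->3').
    Returns the set of fragment lengths produced by chain termination.
    """
    lengths = set()
    for _ in range(n_molecules):
        for pos, base in enumerate(new_strand, start=1):
            # Only this tube's base can be replaced by a dideoxynucleotide.
            if base == dd_base and rng.random() < p_stop:
                lengths.add(pos)  # chain ends here: no 3'-OH to extend
                break
    return lengths


def sanger_read(new_strand: str, n_molecules: int = 2000,
                p_stop: float = 0.1, seed: int = 0) -> str:
    """Run the four tubes and read the bands from shortest to longest."""
    rng = random.Random(seed)
    lanes = {b: run_tube(new_strand, b, n_molecules, p_stop, rng)
             for b in BASES}
    # Electrophoresis: shorter fragments reach the bottom of the gel first.
    bands = sorted((length, base)
                   for base, lens in lanes.items() for length in lens)
    return "".join(base for _, base in bands)


strand = "GCCTGTAATC"
print(sanger_read(strand))
```

With enough molecules per tube, every position ends up represented by at least one terminated chain, so reading the lanes from the shortest band to the longest reproduces the synthesized strand, which is complementary to the template.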

Although the method used today is still based on the rationale originally described by Sanger, since the 1990s, continuous technical improvements have significantly increased its productivity (review in Ciccodicola and D'Urso, 1998). In particular, five improvements allowed a large increase in the throughput of the original method, in which a separate reaction was set up for each nucleotide.

1. Cycle Sequencing. The possibility of increasing the amount of terminated chains through repeated cycles of in vitro DNA replication using a DNA polymerase has made possible the direct sequencing of DNA without in vivo amplification in vector-transfected hosts. In particular, cycle sequencing is the repetition of the steps of denaturation, annealing of the primer and extension in a way similar to that used in PCR. However, in this case, a single primer is used, and a single strand is obtained at each cycle, so that there is no exponential amplification of the product but only a linear amplification. For instance, after 20 cycles, we will obtain 20 times the amount of chains compared to a single round of polymerization, and not 2^20 (about one million) times, because the products of each cycle are not a substrate for the next polymerization reaction, which restarts from the original template bound by the single type of primer available. The final result is a larger quantity of terminated molecules.

2. Better incorporation of ddNTPs. The Sanger reaction itself has been improved by using polymerases specifically designed to incorporate ddNTPs with high efficiency. It should be noted that the incorporation of unusual forms of dNTPs, such as ddNTPs, is not a natural function of DNA polymerases. Some mutated DNA polymerases obtained through genetic engineering have shown the ability to incorporate ddNTPs with higher efficiency, and are ideal for setting up Sanger reactions. For example, Taq DNA polymerases in which the phenylalanine at position 667 is replaced by a tyrosine (Taq F667Y, also known as "FS Taq Polymerase") can incorporate ddNTPs much more efficiently than the wild-type Taq DNA polymerase.

3. Replacing radioactive labelling with fluorescent labelling. The increased amount of interrupted DNA chains (point 1 above) made it possible to detect the result of the Sanger reactions by labelling each dideoxynucleotide with a base-specific fluorescent dye (or fluorochrome). Fluorochromes are compounds that can emit, under certain conditions, visible light of a specific wavelength, and thus of a particular colour. For example, to mark the dideoxynucleotide with base A, one could use a fluorochrome that emits green light when excited by a laser beam, while different fluorochromes could mark the dideoxynucleotides with bases G, C, and T, so that the light emitted is yellow, blue and red, respectively. The use of fluorochromes thus allows the interrupted chains for all four nucleotides to be run in a single lane.

4. Capillary gel and automation. A substantial technological advance has been the automation of the laboratory procedure, thanks to the spread of automated DNA sequencing machines. These machines are based on a capillary tube that is filled with small amounts of a specially formulated gel. A robotic system then loads the sequencing reaction products into one end of the capillary, which is then subjected to an electric field to separate the molecules. A laser positioned at the opposite end of the capillary induces fluorochrome light emission as DNA molecules of different lengths finish their run, and a detector system identifies and records which dye they bear.

5. Parallelization. Running many capillaries in parallel further increases the throughput of the technique: with a 96- or 384-capillary instrument, if each run yields a sequence of 700 nucleotides and each capillary performs three runs a day, about 200,000 or 800,000 bases per day, respectively, can be sequenced. Even in the ideal case of uninterrupted use of a 384-capillary sequencer, it would still take about ten years to complete the sequencing of the approximately 3 billion base pairs of a single haploid human genome. This time can, of course, be reduced if one has many of these expensive machines operating in parallel, as in the case of the initial map of the human genome in 2001.
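The figures quoted above for cycle sequencing and for capillary instruments can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text (20 cycles, 700 bases per run, three runs per day, a 3-billion-base genome); the function names are hypothetical, and the genome estimate assumes every base is read exactly once, with no redundancy.

```python
def linear_yield(cycles: int) -> int:
    """Fold increase with one primer: one new strand per template per cycle."""
    return cycles


def exponential_yield(cycles: int) -> int:
    """Fold increase in PCR, where products are templates for later cycles."""
    return 2 ** cycles


def bases_per_day(capillaries: int, read_length: int = 700,
                  runs_per_day: int = 3) -> int:
    """Daily output of one instrument used without interruption."""
    return capillaries * read_length * runs_per_day


def years_for_genome(capillaries: int,
                     genome_size: int = 3_000_000_000) -> float:
    """Years needed to read `genome_size` bases once on one instrument."""
    return genome_size / bases_per_day(capillaries) / 365


print(linear_yield(20), exponential_yield(20))   # 20 versus 1048576
print(bases_per_day(96), bases_per_day(384))     # 201600 and 806400
print(round(years_for_genome(384), 1))           # 10.2
```

The figure of about 800,000 bases per day and the ten-year estimate therefore correspond to the 384-capillary instrument; a 96-capillary machine under the same assumptions would need roughly four times as long.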

Several groups of researchers then worked on developing techniques that do not require enzymes and are based on innovative principles. Their success since 2008 ("Next Generation Sequencing", NGS) led to a significant reduction in the time required to obtain the results of an experiment, or test, based on DNA sequencing, with important practical implications for the development of genomic medicine.

Genome Research
Special Issue on Long-read DNA Sequencing Applications in Biology and Medicine
https://genome.cshlp.org/site/misc/special_issue.xhtml