GENE SEQUENCING

Introduction

The DNA containing life's instructions is found in a set of chromosomes within each of the cells that make up all organisms. All of the instructions encoded in DNA are spelled out using a shockingly simple alphabet consisting of four 'letters': A, C, G, and T. Each of these letters represents a molecule called a nucleotide that is composed of a sugar, a phosphate, and a base. It is the bases that differentiate each of the four DNA nucleotides.

A = Adenine
T = Thymine
C = Cytosine
G = Guanine

If the classical DNA double helix is untwisted, much as it is each time the molecule self-replicates, a ladder-like structure is revealed. The sides of the ladder are formed by the repeating sugar and phosphate units of DNA's component nucleotides. The rungs are each made of a pair of the nucleotide bases, which can only pair up in a strictly defined way: Adenine is always paired with Thymine, while Cytosine is always paired with Guanine. This base pair complementarity underlies much of DNA's remarkable behavior. It is also the trait that allows molecular biologists to cut, clone, probe, and determine the precise sequence of an organism's genetic material.
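
The pairing rule can be written down in a few lines. The following is a minimal sketch in Python, not part of any sequencing software; the function names are made up for illustration, and the second function assumes the two strands of the helix run in opposite directions, a detail not covered in this article.

```python
# Pairing rules from the text: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the partner base for every base in the strand (one rung each)."""
    return "".join(PAIRS[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own direction (assumed antiparallel)."""
    return complement(strand)[::-1]

print(complement("AAGCTT"))          # TTCGAA
print(reverse_complement("AAGCTT"))  # AAGCTT
```

The second result shows that AAGCTT reads the same on both strands, which is typical of the short sequences recognized by the enzymes described in the next section.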

Modern genetic sequencing of the sort that made possible the completion of the Human Genome Project is a technique used by numerous biotech research teams described on this website. It is a highly automated, computer-controlled process, but the basic techniques can be briefly outlined here.

Step By Step

The process begins with a sample of several cells from a human, or a fruit fly, or a sponge, or whatever organism is being studied. Starting with multiple cells ensures that multiple copies of all the DNA are present. These are added to a buffer solution.

Next, restriction enzymes are added to the mix. Restriction enzymes cut large strands of DNA into smaller fragments by cleaving the molecule only at locations containing a specific short sequence of bases, an ability thought to serve bacteria as a defense against invading bacteriophages (bacterial viruses). Several dozen restriction enzymes have now been discovered, mostly in bacteria, and each is capable of cleaving DNA any time it encounters the particular base sequence it was built to recognize. The restriction enzyme Hind III, for example, cleaves DNA whenever it encounters the base sequence AAGCTT, with the cut being made between the two adjacent Adenine bases in every instance. Because a simple sequence such as AAGCTT is bound to be repeated countless times in the millions of bases in an organism's DNA, addition of restriction enzymes is guaranteed to do plenty of chopping.
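
The Hind III example can be illustrated with a short Python sketch. It treats DNA as a single string of letters and ignores the second strand, which is a simplification; the sample sequence and the function names are invented for illustration.

```python
SITE = "AAGCTT"   # Hind III recognition sequence, from the text
CUT_OFFSET = 1    # the cut falls between the two leading A bases

def cut_positions(dna: str) -> list[int]:
    """Return every index at which the strand would be cut."""
    positions = []
    start = dna.find(SITE)
    while start != -1:
        positions.append(start + CUT_OFFSET)
        start = dna.find(SITE, start + 1)
    return positions

def digest(dna: str) -> list[str]:
    """Cut at every recognition site and return the resulting fragments."""
    bounds = [0] + cut_positions(dna) + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]

sample = "GGAAGCTTCCCAAGCTTGG"   # made-up sequence containing two sites
print(cut_positions(sample))     # [3, 12]
print(digest(sample))            # ['GGA', 'AGCTTCCCA', 'AGCTTGG']
```

Every fragment after the first begins with AGCTT, because each cut leaves one A behind on the preceding fragment.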

Only a limiting amount of the selected restriction enzyme is added because, importantly, the DNA samples must not be cut at all possible cleavage points (e.g., every AAGCTT run if Hind III is used). This is because if all possible cuts were made there would be no fragment overlap between the DNA copies that have been cleaved, and identifying such overlaps is a key part of gene sequencing and mapping. The goal at this stage is instead to end up with fragments that are around 150,000 bases long.
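
A rough simulation shows why a limiting amount of enzyme produces overlapping fragments. In the sketch below, each recognition site on each DNA copy is cut with a fixed probability. That probability, the made-up sequence, and the function name are assumptions chosen for illustration, not measured values.

```python
import random

SITE = "AAGCTT"  # Hind III recognition sequence, from the text

def partial_digest(dna: str, cut_probability: float, rng: random.Random) -> list[str]:
    """Cut one DNA copy at a random subset of its Hind III sites."""
    bounds = [0]
    start = dna.find(SITE)
    while start != -1:
        if rng.random() < cut_probability:   # limiting enzyme: some sites are missed
            bounds.append(start + 1)         # cut between the two leading A bases
        start = dna.find(SITE, start + 1)
    bounds.append(len(dna))
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]

rng = random.Random(0)             # fixed seed so the run is repeatable
genome = "CC" + "AAGCTTGG" * 5     # made-up sequence with five sites
for _ in range(3):                 # three copies of the same DNA
    print(partial_digest(genome, 0.5, rng))
```

Because each copy is usually cut at a different subset of sites, a fragment from one copy tends to span a cut point in another, and those shared stretches are the overlaps used later for mapping.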

Although the process began with multiple sample cells and, hence, several copies of starting genetic material, the sequencing process requires copies in far, far greater abundance. To get the copies needed, molecular biologists turn to bacteria (or sometimes yeast) to do the hard work.

The replication begins with small, circular pieces of DNA called cloning vectors that are capable of replicating on their own when inside a cell. These can be made artificially, but also occur naturally in a variety of forms such as plasmids, which are bacterial DNA segments found outside an organism's chromosomes. These vectors are combined with restriction enzymes that open (straighten) and cut them in such a way that the ends are staggered. Using the same restriction enzyme with the vectors as that used to create the sample DNA fragments ensures that both the cut vectors and the sample fragments will have complementary ends. Once an additional enzyme known as DNA ligase is added to the mixture, these complementary ends will bond, creating "loaded" vectors that are once again closed and circular.

The loaded vectors are then inserted into live bacterial cells, typically by shocking the bacteria with a mild electrical charge, a process called electroporation. Normally, molecules cannot pass easily through a bacterium's cell membranes. However, the electrical shock temporarily breaks bonds between fatty acids found in the membrane, allowing DNA to pass through and allowing introduction of the vectors inside the cells.

The bacteria containing the newly inserted vectors are then thinly plated onto a suitable growth medium and incubated. As the bacteria repeatedly divide, each daughter cell formed contains a new copy of the inserted DNA vector. Soon, visible colonies are formed on the culture plate. As each distinct colony arose from a single bacterial cell, each colony contains many clones of one particular vector-incorporated target DNA fragment.

Once the colonies have grown to contain about a million cells, a single colony is selected and used to inoculate a liquid scale-up culture. Cell division continues in the liquid until several billion copies of the original cell (and inserted vector) are obtained.
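
The cell counts quoted here translate into a modest number of divisions. The sketch below assumes ideal doubling with no cell death and takes "several billion" to mean 3 billion; both are assumptions made only for illustration.

```python
import math

def doublings_needed(target_cells: float, starting_cells: float = 1) -> int:
    """Rounds of division needed to grow from starting_cells to at least target_cells."""
    return math.ceil(math.log2(target_cells / starting_cells))

print(doublings_needed(1e6))        # one cell to a colony of a million: 20
print(doublings_needed(3e9, 1e6))   # a million cells to an assumed 3 billion: 12
```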

The cloned DNA fragments are recovered from the bacteria by using detergent to rupture the cell walls. Sodium hydroxide is then added because it degrades the relatively larger bacterial DNA while leaving the relatively smaller DNA vectors fairly undamaged. The process yields billions of copies of vector-incorporated target DNA all cloned from a single starting DNA fragment.

At around 150,000 base pairs, the target DNA fragment is still too long for current sequencing technology to handle straightaway. This fragment must be broken down further by repeating the entire process, beginning with the addition of more restriction enzyme. The result, again, is billions of copies of a single vector-incorporated DNA fragment; now, however, the fragment length is in the 2,000-4,000 base pair range.

The next step in the process is to create new strands of the target DNA sequence that are fluorescent. The first step in this part of the gene sequencing process is to apply heat to the cloned DNA fragments to separate the double-stranded DNA into single strands.

To promote the synthesis of new strands of DNA complementary to the single-stranded target material, sufficient amounts of free nucleotide bases (A, T, C, G) are added to the reaction vessel along with DNA polymerase. This is an enzyme needed for reading the single-stranded DNA templates and assembling the complementary strands from free nucleotides. DNA primers are also added. These are short DNA chunks of known sequence that bind to complementary sites on the single-stranded DNA vectors and initiate construction of the complementary strands.

Finally, a measured amount of fluorescently tagged dideoxynucleotide bases (ddA, ddT, ddC, ddG) is added. These behave differently than the 'regular' (deoxy) nucleotides in two important ways:

First, the dideoxynucleotides have an altered molecular structure that causes them to halt the construction of complementary DNA strands after they have themselves been added to the strands. This ensures that the tagged nucleotides are always the last bases on any strand they are built into.

Second, the dideoxynucleotides are tagged with a marker that visibly fluoresces when it absorbs energy emitted from a laser. The tags are specifically color-coded so that each nucleotide base can be positively identified:

ddA = green
ddG = yellow
ddC = blue
ddT = red

Recall that in a previous step it was important not to add too much restriction enzyme so that adequate fragment overlap would be ensured. Similarly, it is important here not to add too much dideoxynucleotide base, relative to the amount of regular nucleotide provided. The idea is that complementary DNA chain elongation is supposed to proceed for a while using available free nucleotides and DNA polymerase until, by chance, a ddA, ddG, ddC, or ddT is added to the chain and elongation stops. Using this technique, a set of DNA fragments with the same starting point, which is defined by the primer used, but having many differing total base lengths is generated.
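
The chance-based stopping described above can be sketched as a small simulation. In this Python example, a tagged (dideoxy) base is picked up at each position with a fixed probability and is written in lowercase. The probability, the made-up template, and the function name are assumptions for illustration, and the model ignores strand direction.

```python
import random

PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def extend_until_terminated(template: str, dd_fraction: float, rng: random.Random) -> str:
    """Build one complementary strand; a lowercase final letter marks a tagged dd base."""
    strand = []
    for base in template:
        partner = PAIRS[base]
        if rng.random() < dd_fraction:   # a tagged dideoxynucleotide was picked up
            strand.append(partner.lower())
            break                        # elongation stops at the tagged base
        strand.append(partner)
    return "".join(strand)

rng = random.Random(1)                   # fixed seed so the run is repeatable
template = "TTCGAATGCA"                  # made-up template, read from the primer end
strands = [extend_until_terminated(template, 0.2, rng) for _ in range(1000)]
terminated = {s for s in strands if s[-1].islower()}
print(sorted(terminated, key=len))
```

All strands share the same start, and the tagged strands differ only in where they stopped. If the tagged fraction is set too high, nearly every strand stops within the first few bases and the longer lengths are never produced.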

Heat is again used to separate the newly synthesized strands from their vector templates. Additional free nucleotides, tagged nucleotides, primer, and DNA polymerase can then be added so that the original single-stranded vector templates can be reused as many as 40 times. This allows generation of billions upon billions of strands of complementary DNA of all possible base lengths.

These billions of DNA copies must now be sorted according to length, a task accomplished through gel electrophoresis. This process is most often carried out within small plastic capillary tubes in automated sequencing systems. The entire collection of DNA fragments is placed in one end of a gel-filled capillary, and an electric current is applied to the gel. This causes the negatively charged DNA molecules to migrate toward the positive end of the electrical field. The natural resistance of the gel allows smaller fragments to migrate faster than larger fragments.

A laser is aimed at a fixed location on the capillary. As each subset of DNA fragments of a particular length passes through the beam, energy striking them causes the color-coded dideoxynucleotide tags to fluoresce, revealing the identity of each successive base in the target DNA sequence.
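
Reading the sequence then amounts to listing the colors in the order the fragments arrive. The sketch below assumes one detection per fragment length, supplied as hypothetical (length, color) pairs; real instruments record a continuous signal instead.

```python
# Tag colors from the text
COLOR_TO_BASE = {"green": "A", "yellow": "G", "blue": "C", "red": "T"}

def read_sequence(detections: list[tuple[int, str]]) -> str:
    """Turn (fragment_length, color) pairs, given in any order, into a base sequence."""
    ordered = sorted(detections)   # the shortest fragments reach the laser first
    return "".join(COLOR_TO_BASE[color] for _, color in ordered)

signals = [(3, "yellow"), (1, "green"), (4, "blue"), (2, "green"), (5, "red")]
print(read_sequence(signals))  # AAGCT
```

The result is the sequence of the newly built strand; the template it was copied from is its complement.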

Even though each strand is very, very small, the combined fluorescence of many identical strand copies passing through the beam simultaneously is intense enough to be recorded by a CCD sensor, and the information is sent to an attached computer.

Current technology only allows about 500 bases at a time to be sequenced in this manner. This is only about one-third the average length of the coding region of a gene, and only around 3% of the average total gene size when non-coding regions are also considered.
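
The gene sizes implied by these proportions can be worked out directly. The short calculation below takes "one-third" and "3%" at face value, so the results are rough averages and not measured values.

```python
READ_LENGTH = 500          # bases per sequencing run, from the text
CODING_FRACTION = 1 / 3    # a read covers about one-third of a coding region
GENE_FRACTION = 0.03       # and about 3% of a whole gene

coding_region = round(READ_LENGTH / CODING_FRACTION)
whole_gene = round(READ_LENGTH / GENE_FRACTION)

print(coding_region)  # 1500
print(whole_gene)     # 16667
```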

Overcoming this limitation is where all of those overlapping fragments come into play. Repeating the sequencing procedure using many different 2,000-4,000 base pair target fragments (i.e., obtained from different clonal bacterial colonies) will reveal sequences with varying degrees of overlap. Repeating the procedure with many different 150,000-base starting fragments reveals longer sequences and still more overlap. By painstakingly charting where these overlaps occur, a map can be created showing the proper positions of each fragment with respect to the others.
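
The idea of charting overlaps can be sketched with a simple greedy merge: repeatedly join the two reads that share the longest end-to-start overlap. This is only an illustration under simplifying assumptions (error-free reads, a single strand, no repeated sequences), not the method used by any particular sequencing facility, and the reads and function names are made up.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest end of left that is also the start of right."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def assemble(reads: list[str]) -> str:
    """Greedily merge the pair of reads with the largest overlap until one remains."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)   # fallback: join the first two reads end to end
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j and overlap(a, b) > best[0]:
                    best = (overlap(a, b), i, j)
        size, i, j = best
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

reads = ["GCTTACGT", "AAGCTTAC", "ACGTGGCA"]   # made-up overlapping reads
print(assemble(reads))  # AAGCTTACGTGGCA
```

Real genomes contain long repeated stretches that can mislead a purely greedy merge, which is one reason mapping at the scale of a whole genome is so painstaking.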

Automation of the sequencing process, using robots, supercomputers, and redundant sequencing arrays, now allows sequences to be decoded at rates of thousands of bases per hour. Such automation was at the heart of the Human Genome Project. Multiple highly automated sequencing facilities around the world operated at breakneck speed to sequence human DNA one nucleotide at a time until entire fragments, then entire genes, entire gene regions, chromosomes, and eventually the entire human genome (all 3 billion base pairs of it) were mapped out. Gene sequencing is used in countless less complex ways to advance all aspects of biotechnology, including marine biotechnology.