The ultimate goal of genome analysis is to determine the complete nucleotide sequence of the human genome: 3 × 10⁹ base pairs of DNA. To understand the magnitude of this undertaking (called the Human Genome Project), recall that the human genome is more than ten times larger than that of Drosophila; that the smallest human chromosome is several times larger than the entire yeast genome; and that the DNA that makes up the human genome, if fully extended, is about 1 m long, whereas that of the fully sequenced bacterium Haemophilus influenzae is less than 1 mm long. Nonetheless, the sequencing of human DNA is well underway and its completion is expected in the near future. Identification of the estimated 100,000 human genes will not only open new horizons for research but will also have an immense impact on medicine by illuminating the genetic basis of many human diseases.
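The length comparison above can be checked with a short calculation. This is only a sketch: the rise of 0.34 nm per base pair (B-form DNA) and the H. influenzae genome size of about 1.83 × 10⁶ base pairs are assumed values that do not appear in the text.

```python
# Assumed values (not from the text): B-form DNA rises about 0.34 nm per base
# pair, and the H. influenzae genome is about 1.83 million base pairs.
NM_PER_BP = 0.34
HUMAN_BP = 3e9
H_INFLUENZAE_BP = 1.83e6


def extended_length_m(base_pairs, nm_per_bp=NM_PER_BP):
    """Return the length in meters of fully extended DNA."""
    return base_pairs * nm_per_bp * 1e-9


human_m = extended_length_m(HUMAN_BP)
h_influenzae_mm = extended_length_m(H_INFLUENZAE_BP) * 1e3

print(f"Human genome: about {human_m:.2f} m")
print(f"H. influenzae genome: about {h_influenzae_mm:.2f} mm")
```

Under these assumptions the human genome comes out at roughly 1 m and the bacterial genome at well under 1 mm, in line with the figures quoted above.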
The human genome is distributed among 24 chromosomes (22 autosomes and the 2 sex chromosomes), each containing between 5 × 10⁴ and 26 × 10⁴ kb of DNA (Figure 4.26). The first stage of human gene mapping is the assignment of known genes to a chromosomal locus, usually defined as a metaphase chromosome band. Of the estimated 100,000 human genes, about 6000 genes of known function have been identified and positioned on the human chromosome map, using the basic approaches described in this section.
The first general method developed for localizing human genes to individual chromosomes or subchromosomal regions was somatic cell hybridization (Figure 4.27). Cells in culture can be induced to fuse with each other by treatment with certain viruses or chemical agents, such as polyethylene glycol. The nuclei in such cells also fuse, yielding hybrid cells that contain chromosomes derived from both parents. Such hybrids can be made not only between cells of the same species but also between cells of different species, such as human and mouse. Importantly, the human chromosomes are unstable in such human-mouse hybrids and are progressively lost during growth of the hybrid cells. It is therefore possible to derive hybrid cell lines that contain only one or a few human chromosomes. In some cases, hybrid cell lines containing only fragments of human chromosomes can also be obtained. The chromosomal locus of a human gene can then be determined by screening of a panel of somatic cell hybrids containing different sets of human chromosomes. For example, hybridization of a cloned human gene to DNAs extracted from such a panel of somatic cell hybrids can be used to map the cloned gene to a human chromosome (see Figure 4.27).
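The logic of screening a hybrid panel can be sketched in a few lines of code. The panel below is invented for illustration (the hybrid names, chromosome sets, and hybridization results are all hypothetical): a gene is assigned to the chromosome whose presence or absence matches the hybridization signal in every hybrid line.

```python
def map_gene_to_chromosome(panel, gene_signal):
    """Return the human chromosomes whose presence matches the probe signal.

    panel: dict mapping hybrid name -> set of human chromosomes it retains.
    gene_signal: dict mapping hybrid name -> True if the cloned gene
        hybridized to DNA from that hybrid, False otherwise.
    """
    chromosomes = set().union(*panel.values())
    concordant = []
    for chrom in sorted(chromosomes):
        # A candidate chromosome must be present in every positive hybrid
        # and absent from every negative hybrid.
        if all((chrom in panel[h]) == gene_signal[h] for h in panel):
            concordant.append(chrom)
    return concordant


# Hypothetical panel of human-mouse hybrid cell lines.
panel = {
    "hybrid_A": {"1", "7", "X"},
    "hybrid_B": {"7", "11"},
    "hybrid_C": {"1", "11"},
    "hybrid_D": {"X"},
}
# Hypothetical hybridization results for one cloned gene.
gene_signal = {
    "hybrid_A": True,
    "hybrid_B": True,
    "hybrid_C": False,
    "hybrid_D": False,
}

print(map_gene_to_chromosome(panel, gene_signal))  # ['7']
```

In this made-up example only chromosome 7 is present in both positive hybrids and absent from both negative ones. A real panel would need enough hybrids to distinguish all 24 chromosomes.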
Higher-resolution chromosomal localization of cloned genes is possible with in situ hybridization to chromosomes, usually using fluorescent probes, a method generally referred to as fluorescence in situ hybridization, or FISH. In situ hybridization to metaphase chromosomes allows the mapping of a cloned gene to a locus defined by a chromosome band (Figure 4.28). Although FISH provides greater resolution than analysis of somatic cell hybrids, each band of metaphase chromosomes still contains thousands of kilobases of DNA. In situ hybridization to metaphase chromosomes therefore does not provide the detailed mapping information obtained by hybridization to the polytene chromosomes of Drosophila, which allows the localization of genes to interphase chromosome bands containing only 10 to 20 kb of DNA. Higher resolution can be obtained, however, by hybridization to more extended human chromosomes from prometaphase or interphase cells, allowing the use of fluorescence in situ hybridization to map cloned genes to regions of about 100 kb.
Human genes can also be mapped by genetic linkage analysis, in which the distance between genes is estimated by recombination frequencies (see Figure 3.5). Genetic linkage was first used to map genes to the four Drosophila chromosomes and has subsequently been used to generate genetic maps of various other organisms, including E. coli, S. cerevisiae, C. elegans, and the mouse. However, construction of extensive genetic linkage maps generally requires the isolation of hundreds of single gene mutations, followed by extensive analysis of their inheritance in controlled breeding experiments. Although this approach applies to a variety of model organisms, it clearly cannot be applied directly to humans. This problem was overcome, however, by the realization that not only gene mutations but also any other detectable differences in DNA sequences between individuals can be used as genetic markers in human DNA. This approach greatly extended the range of available markers in the human genome and allowed analysis of recombination frequencies in small groups of families, leading to the development of a complete genetic map of the human genome.
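As a minimal illustration of how recombination frequency translates into map distance, the sketch below counts recombinant offspring among all informative offspring. The conversion of 1% recombination to 1 map unit (centimorgan) is a standard approximation that is not stated in the text and holds only for closely spaced markers; the counts are hypothetical.

```python
def recombination_frequency(recombinants, total):
    """Fraction of informative offspring that are recombinant for two markers."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= recombinants <= total:
        raise ValueError("recombinants must lie between 0 and total")
    return recombinants / total


def approximate_map_distance_cm(frequency):
    """Approximate genetic distance in centimorgans (1% recombination = 1 cM).

    Assumption: valid only for small frequencies, because double crossovers
    between distant markers go undetected.
    """
    return 100.0 * frequency


# Hypothetical data: 12 recombinants among 100 informative offspring.
rf = recombination_frequency(12, 100)
print(f"Recombination frequency: {rf:.2f}")
print(f"Approximate distance: {approximate_map_distance_cm(rf):.0f} cM")
```

Two markers that recombine in 12% of offspring are therefore placed about 12 map units apart, while markers that never recombine are treated as very tightly linked.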
The first sequences used as genetic markers were based on inherited differences in cleavage sites for restriction endonucleases. A single base change within a restriction endonuclease site is a readily detectable genetic marker because the mutated site is no longer cleaved by the enzyme in question. Two chromosomes that differ by such a mutation are then distinguishable on the basis of a restriction fragment length polymorphism (RFLP), which arises because a particular cleavage site is present in only one of the two DNA molecules (Figure 4.29). A mutation that gives rise to an RFLP thus represents a genetic marker that can be detected by Southern blot hybridization of restriction endonuclease-digested DNA with an appropriate probe.
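The way a single base change alters a restriction fragment pattern can be sketched as follows. The sequences are invented for illustration; the enzyme is assumed to be EcoRI, which recognizes GAATTC and cuts after the first G. The sketch reports all fragment lengths, whereas a Southern blot would reveal only the fragments that hybridize to the probe.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Return the fragment lengths produced by cutting at every site.

    Assumption: defaults model EcoRI (G^AATTC), cutting one base into the site.
    """
    lengths = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        lengths.append(cut - start)
        start = cut
        pos = sequence.find(site, pos + 1)
    lengths.append(len(sequence) - start)
    return lengths


# Hypothetical alleles: allele_b carries a single base change (A to G)
# that destroys the first restriction site.
allele_a = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 15
allele_b = "A" * 10 + "GAGTTC" + "C" * 20 + "GAATTC" + "T" * 15

print(digest(allele_a))  # [11, 26, 20]
print(digest(allele_b))  # [37, 20]
```

Loss of the first site fuses the 11- and 26-base fragments into a single 37-base fragment, which is the length difference scored as an RFLP.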
Many RFLPs have been identified simply by hybridization of cloned probes to Southern blots of restriction endonuclease-digested DNAs of a series of individuals. These RFLPs provide a large set of genetic markers that can be used to construct linkage maps by studies of a limited number of families. For example, the first complete linkage map of the human genome was generated in 1987 by studies of the inheritance of 393 RFLPs in 21 families, each of which consisted of grandparents, parents, and children.
In addition to RFLPs, particularly useful genetic markers are tandem repeats of short nucleotide sequences (called microsatellites) distributed throughout the genome (Figure 4.30). For example, the human genome contains about 100,000 blocks of tandem repeats of the dinucleotide CA. Because the number of CA repeats at any given locus varies between individuals, differences in the number of tandem repeats can be used as genetic markers. Such variations can be easily scored by PCR using flanking-sequence primers, facilitating the use of microsatellites as genetic markers. The current linkage map of the human genome consists of more than 8000 loci (defined primarily by short tandem-repeat polymorphisms), which are spaced at an average distance of less than 1 Mb of DNA (Figure 4.31).
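The relationship between repeat number and PCR product size can be illustrated with a short sketch. The flanking sequences and repeat counts below are hypothetical; in practice the product sizes would be measured by gel electrophoresis rather than computed from known sequence.

```python
import re


def longest_repeat_run(sequence, unit="CA"):
    """Return the number of copies in the longest tandem run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    if not runs:
        return 0
    return max(len(run) for run in runs) // len(unit)


# Hypothetical flanking sequences to which the PCR primers would anneal.
LEFT_FLANK = "GGTACCTTAG"
RIGHT_FLANK = "TTGGATCCTT"

# Two hypothetical alleles of one locus that differ only in CA repeat number.
allele_1 = LEFT_FLANK + "CA" * 12 + RIGHT_FLANK
allele_2 = LEFT_FLANK + "CA" * 17 + RIGHT_FLANK

for name, allele in (("allele_1", allele_1), ("allele_2", allele_2)):
    print(name, longest_repeat_run(allele), "repeats,", len(allele), "bp product")
```

Here the two alleles yield products of 44 and 54 base pairs, so an individual carrying both would show two distinct bands.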
Once markers such as RFLPs and short tandem-repeat polymorphisms (STRPs) have been mapped, they become available for use as reference markers for mapping other genes. Thus, the map position of any human gene can be determined by analysis of its linkage to these polymorphic DNA sequences. The most important application of this approach has been the genetic mapping and eventual cloning of genes responsible for inherited human diseases. Such isolation of human disease genes provides much of the impetus for mapping and sequencing the human genome. The basic approach to gene mapping is to analyze a series of polymorphic markers (e.g., RFLPs) in families carrying the disease gene of interest (Figure 4.32). Genetic linkage between an RFLP and the disease gene is then indicated by coinheritance of the disease and the RFLP marker in affected family members. Given that the RFLP has been mapped to a chromosomal locus, this linkage analysis determines the map position of the disease gene on a human chromosome.
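One common way to quantify coinheritance of a marker and a disease gene is the LOD score, the base-10 logarithm of the odds that the observed data arose from linkage at a given recombination fraction rather than from independent assortment. The text does not name this statistic, so the sketch below should be read as an illustration of a standard technique, not as the method used in the studies described. It assumes fully informative meioses in which recombinants can be counted directly; the counts are hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD score for linkage at recombination fraction theta.

    Assumption: every meiosis is informative and phase is known, so
    recombinant and nonrecombinant offspring can be counted directly.
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in the interval (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    nonrecombinants = meioses - recombinants
    log_linked = (recombinants * math.log10(theta)
                  + nonrecombinants * math.log10(1 - theta))
    log_unlinked = meioses * math.log10(0.5)
    return log_linked - log_unlinked


# Hypothetical family data: 1 recombinant among 20 informative meioses.
score = lod_score(recombinants=1, meioses=20, theta=0.05)
print(f"LOD score at theta = 0.05: {score:.2f}")
```

By convention, a LOD score of 3 or more (odds of at least 1000 to 1) is usually taken as evidence of linkage. The hypothetical data above give a score above 4.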
Once the map position of a gene has been determined, that information can be used to isolate the gene as a molecular clone, a strategy known as positional cloning. This approach has been used to clone the genes for several human diseases; the gene for cystic fibrosis was one of the first examples. Cystic fibrosis is an inherited disease that affects approximately one in 2500 newborn Caucasians, although it is rare in other races. The disease is diagnosed in childhood and is inevitably fatal by about age 30. Most patients die from respiratory complications. The cystic fibrosis gene was first mapped by linkage to an RFLP, which was localized to chromosome 7 (Figure 4.33). Further mapping studies established that the cystic fibrosis gene is located between two RFLPs separated by approximately 1 Mb. Molecular clones from this region of DNA were isolated by the use of RFLPs as probes. These clones were then used as probes to isolate clones containing adjacent DNA sequences, leading to the isolation of a series of clones covering the region containing the cystic fibrosis gene. Analysis of these genomic clones, followed by cDNA cloning and sequencing, eventually identified a candidate gene that spans more than 250 kb of DNA and is transcribed to yield a 6-kb cDNA containing 24 exons. DNA sequencing confirmed that this gene is the cystic fibrosis gene by demonstrating that it is mutated in affected individuals.
The isolation of human disease genes has important implications not only for understanding the molecular and cellular basis of the disease, but also for prevention and treatment. An immediate benefit of cloning the gene responsible for an inherited disease is the potential for developing sensitive diagnostic tests for mutant alleles of the gene, such as the detection of mutations by PCR. These tests can identify individuals at risk for the disease and, at least for some diseases, enable appropriate preventive measures to be instituted before the disease develops. In addition, the ability to detect mutant alleles of disease genes opens the possibility of prenatal diagnosis to prevent transmission of the disease to future generations. Couples carrying mutant alleles of the gene can be identified, and their pregnancies monitored to determine if the mutant gene has been transmitted to the fetus. Because of the sensitivity of PCR analysis, mutations can be detected even in early embryos that have been derived by in vitro fertilization procedures prior to their return to the uterus.
In addition to these diagnostic measures, the cloning of disease genes suggests the potential use of gene therapy to correct the defect. The general strategy of gene therapy is to introduce a normal cDNA allele into the affected cells of patients. In some cases, genes can be introduced into target cells (e.g., lymphocytes) that can be maintained in culture and then returned to the patient. Alternatively, appropriate vectors can be used to introduce genes into some types of cells in vivo. For example, gene therapy trials for cystic fibrosis are evaluating the use of aerosols to deliver viral vectors to the affected epithelial cells lining the airways of the respiratory tract. Gene therapy is still in its early, experimental stages of development. However, the progress thus far clearly illustrates the general feasibility of the approach, suggesting that gene therapy will eventually become an important method of treating at least some human diseases.
A physical map of cloned DNA segments is a necessary intermediate between the genetic map and the nucleotide sequence of the human genome. Because of the large size of the genome, YACs provide the obvious vector for physical mapping of cloned human DNA. The most useful markers for analyzing these clones appear to be sequence-tagged sites (STSs), which are readily detected by PCR analysis. STSs are short segments (usually 200 to 500 base pairs) of nonrepetitive DNA whose nucleotide sequences have been determined. Many STSs, each located at a unique site in the genome, can thus be defined. Since an STS corresponds to a segment of sequenced DNA, it is detectable in any sample of DNA by PCR amplification using the appropriate primers (Figure 4.34). This ease of detection provides a considerable advantage over other methods, such as nucleic acid hybridization, for large-scale analysis of cloned DNAs. Moreover, since STSs are obtained directly from sequence data, they are available for any investigator to use; laboratories need not exchange cloned DNAs. STSs thus provide an accessible common language for genome mapping and sequencing.
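Detection of an STS by PCR can be mimicked computationally: a clone scores positive if both primer sites are present in the correct orientation, and the predicted product size follows from their positions. The primer and clone sequences below are invented for illustration, and the search requires exact matches, which real PCR does not.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def sts_product_size(template, forward_primer, reverse_primer):
    """Return the predicted PCR product size, or None if the STS is absent.

    Assumption: primers must match the top strand of the template exactly.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return end + len(reverse_site) - start


# Hypothetical STS primers and cloned inserts.
FORWARD = "ATGGCCTTAGCA"
REVERSE = "TTACGGATCCGA"
sts_region = FORWARD + "GATTACA" * 5 + reverse_complement(REVERSE)

clones = {
    "clone_1": "CCCC" + sts_region + "GGGG",
    "clone_2": "ACGT" * 20,
}

for name, insert in clones.items():
    print(name, sts_product_size(insert, FORWARD, REVERSE))
```

In this example clone_1 gives a 59-base-pair product and clone_2 gives none. Clones that score positive for the same STS must overlap, which is what allows STSs to order cloned segments into a physical map.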
An alternative approach to analyzing the entire genome has been to focus on protein-coding genes by sequencing cDNAs. Assuming that only about 3% of the genome corresponds to protein-coding sequences, this task is obviously much more modest than sequencing the entire genome. The goal of sequencing protein-coding regions has been approached by the random sequencing of short regions (200 to 300 base pairs) of cDNAs. These short cDNA sequences are known as expressed sequence tags (ESTs) and can be used as markers for sequences in mRNAs. More than a million EST sequences, corresponding to a significant fraction of the total expected sequence present in human cDNA, have now been obtained. This database of available ESTs provides a valuable resource; for example, it enables the ready identification of new genes related to a cloned gene of interest. In addition, STS markers derived from cDNAs can be used directly for mapping human genes. This approach has allowed construction of a high-resolution map in which cDNA markers derived from over 30,000 human genes have been positioned on a physical map of the human genome, which has been integrated with the genetic map (Figure 4.35). Thus, a comprehensive map of one-third to one-half of all human genes is now available.
The ultimate genetic map, of course, will be the complete sequence of human DNA. A major step toward this goal was accomplished in 1999 with the sequencing of human chromosome 22. Chromosome 22 is one of the smallest human chromosomes, containing approximately 33.4 × 10⁶ base pairs of DNA. Analysis of the sequence indicates that chromosome 22 contains at least 545 genes (Figure 4.36), although this is likely to be an underestimate because the presence of extensive introns makes it more difficult to identify genes in human DNA than in the DNA of lower eukaryotes. Thus, the sequence of chromosome 22 is consistent with the estimate of approximately 100,000 genes in the human genome, the complete sequence of which will be available in the near future.