Some of the most exciting endeavors in molecular biology are the current efforts directed at obtaining and analyzing the complete nucleotide sequences of both the human genome and the genomes of several model organisms, including E. coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila, Arabidopsis, and the mouse. Recombinant DNA has had an enormous impact on our understanding of the molecular basis of cell biology by allowing the isolation and sequencing of a wide variety of important genes. Until the last few years, scientists had generally focused on the cloning, sequencing, and further characterization of specific genes. Now, however, recombinant DNA technology allows the alternative approach of cloning and sequencing entire genomes. In principle, this approach has the potential of identifying all the genes in an organism, which then become accessible for investigations of their structure and function. Genome sequencing is thus providing scientists with a unique database, consisting of the nucleotide sequences of complete sets of genes. Since many of these genes have not been previously identified, determination of their functions will form the basis of many future studies in cell biology.
We now know the complete genome sequences of nearly two dozen different bacteria, and many others are in the process of being determined. The first complete sequence of a cellular genome, reported in 1995 by a team of researchers led by J. Craig Venter, was that of the bacterium Haemophilus influenzae, a common inhabitant of the human respiratory tract. The genome of H. influenzae is approximately 1.8 × 10⁶ base pairs (1.8 megabases, or Mb), slightly less than half the size of the E. coli genome. The complete nucleotide sequence indicated that the H. influenzae genome is a circular molecule containing 1,830,137 base pairs of DNA. The sequence was then analyzed to identify the genes encoding rRNAs, tRNAs, and proteins. Potential protein-coding regions were identified by computer analysis of the DNA sequence to detect open-reading frames: long stretches of nucleotide sequence that can encode polypeptides because they contain none of the three chain-terminating codons (UAA, UAG, and UGA). Since these chain-terminating codons occur randomly once in every 21 codons (3 chain-terminating codons out of 64 total), open-reading frames that extend for more than a hundred codons usually represent functional genes.
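The open-reading-frame scan described above can be sketched in Python. This is a minimal illustration, not the procedure used in the H. influenzae analysis: the names `find_orfs`, `prob_open`, and `min_codons` are assumptions made for the example. The sketch scans only the three forward reading frames of a DNA string (so the stop codons appear as TAA, TAG, and TGA) and does not require a start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA equivalents of UAA, UAG, UGA


def find_orfs(dna, min_codons=100):
    """Return (frame, start, end) for stop-free runs of at least min_codons codons.

    Only the three forward reading frames are scanned; a fuller analysis would
    also scan the reverse complement and look for start codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        run_start = frame  # index where the current stop-free run begins
        for i in range(frame, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i - run_start) // 3 >= min_codons:
                    orfs.append((frame, run_start, i))
                run_start = i + 3
        # Close a run that reaches the end of the sequence.
        last = frame + ((len(dna) - frame) // 3) * 3
        if (last - run_start) // 3 >= min_codons:
            orfs.append((frame, run_start, last))
    return orfs


def prob_open(n_codons):
    """Chance that n random codons contain no stop codon (61 of 64 codons are not stops)."""
    return (61 / 64) ** n_codons
```

Under the simplifying assumption that all 64 codons are equally likely, `prob_open(100)` comes to a little under 1%, which is why a stop-free run of more than a hundred codons is unlikely to arise by chance.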
This analysis identified six copies of rRNA genes, 54 different tRNA genes, and 1743 potential protein-coding regions in the H. influenzae genome (Figure 4.20). More than a thousand of these could be assigned a biological role (e.g., an enzyme of the citric acid cycle) on the basis of their relationships to known protein sequences, but the others represent genes of currently unknown function. The predicted coding sequences have an average size of approximately 900 base pairs, so they cover about 1.6 Mb of DNA. Protein-coding sequences thus correspond to more than 85% of the genome of H. influenzae, consistent with previous estimates that protein-coding sequences account for most of the DNA in bacterial genomes.
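The coding-fraction estimate above is simple arithmetic, reproduced here as a short Python check. All three inputs are the figures quoted in the text; the variable names are chosen only for this example, and the 900-base-pair average is itself approximate, so the result is a rough estimate.

```python
GENOME_BP = 1_830_137   # H. influenzae genome size
N_CODING = 1743         # predicted protein-coding regions
MEAN_CODING_BP = 900    # approximate average coding-sequence length

coding_bp = N_CODING * MEAN_CODING_BP
coding_fraction = coding_bp / GENOME_BP

print(f"coding DNA: {coding_bp / 1e6:.2f} Mb")          # 1.57 Mb
print(f"fraction of genome: {coding_fraction:.0%}")     # 86%
```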
The complete sequence of the genome of Mycoplasma genitalium is of particular interest because mycoplasmas are the simplest present-day bacteria and contain the smallest genomes of all known cells. The genome of M. genitalium is only 580 kb (0.58 Mb) long and may represent the minimal set of genes required to maintain a self-replicating organism. Analysis of its DNA sequence indicates that M. genitalium contains only 470 predicted protein-coding sequences, which correspond to approximately 88% of genomic DNA (Figure 4.21). Many of these sequences were identified as genes encoding proteins involved in DNA replication, transcription, translation, membrane transport, and energy metabolism. However, M. genitalium contains many fewer genes for metabolic enzymes than does H. influenzae, reflecting its more limited metabolism. For example, many genes known to encode components of biosynthetic pathways are lacking in the genome of M. genitalium, consistent with its need to obtain amino acids and nucleotide precursors from a host organism. Interestingly, the Mycoplasma genome also includes approximately 150 genes of currently unknown function. Thus, even in the simplest of cells, the biological roles of many genes remain to be determined.
The sequence of the genome of the archaebacterium Methanococcus jannaschii, reported in 1996, provided major insights into the evolutionary relationships between the archaebacteria, eubacteria, and eukaryotes. The genome of M. jannaschii is 1.7 Mb and contains 1738 predicted protein-coding sequences, similar in size to the genome of H. influenzae. However, only about one-third of the protein-coding sequences identified in M. jannaschii were related to known genes of either eubacteria or eukaryotes, indicating the distinct genetic composition of the archaebacteria. The genes of M. jannaschii encoding proteins involved in energy production and biosynthesis of cell constituents are related to those of eubacteria, suggesting that basic metabolic processes evolved in a common ancestor of both the archaebacteria and the eubacteria. Importantly, however, the M. jannaschii genes encoding proteins involved in DNA replication, transcription, and translation are more closely related to those of eukaryotes than to those of eubacteria. Genomic sequencing of this archaebacterium thus indicates that the archaebacteria and eukaryotes share a common line of evolutionary descent and are more closely related to each other than either is to the eubacteria (see Figure 1.8).
Although the relative simplicity and facile genetics of E. coli have made it a favored organism of molecular biologists, the 4.6-Mb E. coli genome was not completely sequenced until 1997. Analysis of the E. coli sequence revealed a total of 4288 genes, with protein-coding sequences accounting for 88% of the E. coli genome. Of the 4288 genes revealed by sequencing, 1835 had been previously identified and the functions of an additional 821 could be deduced by comparisons to the sequences of characterized genes of other organisms. However, the functions of 1632 E. coli genes (nearly 40% of the total) could not be determined. Thus, even for an organism as thoroughly studied as E. coli, genomic sequencing demonstrates that a great deal remains to be learned about prokaryotic cell biology.
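The breakdown of E. coli genes given above can be tallied in a few lines of Python. The counts are those quoted in the text; the dictionary labels are paraphrases chosen for this example.

```python
TOTAL_GENES = 4288
categories = {
    "previously identified": 1835,
    "function deduced by sequence comparison": 821,
    "function unknown": 1632,
}

# The three categories account for every gene found by sequencing.
assert sum(categories.values()) == TOTAL_GENES

shares = {name: count / TOTAL_GENES for name, count in categories.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The unknown-function share works out to about 38% of the genes, which matches the "nearly 40%" figure.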
As noted already, the simplest eukaryotic genome (1.2 × 10⁷ base pairs of DNA) is found in the yeast Saccharomyces cerevisiae. Moreover, yeasts grow rapidly and are subject to simple genetic manipulations. Thus, in many ways yeasts are model eukaryotic cells that can be studied much more readily than the cells of mammals or other higher eukaryotes. Consequently, the complete sequencing of an entire yeast chromosome in 1992 (Figure 4.22), followed by determination of the sequence of the complete yeast genome in 1996, were major steps in understanding the molecular biology of eukaryotic cells.
The S. cerevisiae genome contains about 6000 genes, including 5885 predicted protein-coding sequences, 140 ribosomal RNA genes, 275 transfer RNA genes, and 40 genes encoding small nuclear RNAs involved in RNA processing (discussed in Chapter 6). Yeast thus have a high density of protein-coding sequences, similar to bacterial genomes, with protein-coding sequences accounting for approximately 70% of total yeast DNA. Consistent with this, only 4% of yeast genes were found to contain introns. Moreover, those S. cerevisiae genes that do contain introns usually have only a single small intron near the beginning of the gene.
Computer analysis was able to assign a predicted function to approximately 3000 of the S. cerevisiae protein-coding sequences based on similarities to the sequences of known genes. Based on analysis of these genes, it appears that approximately 11% of yeast proteins function in metabolism, 3% in the production and storage of metabolic energy, 3% in DNA replication, repair, and recombination, 7% in transcription, 6% in translation, and 14% in protein sorting and transport. However, the functions of many of these genes are only known in general terms (such as “transcription factor”), so their precise roles within the cell still need to be determined. Moreover, since half of the proteins encoded by the yeast genome were unrelated to previously described genes, the functions of an additional 3000 unknown proteins remain to be elucidated by genetic and biochemical analyses.
Now that the sequence of the yeast genome has been completed, determination of the functions of these many new genes has become a major goal. Fortunately, yeasts are particularly amenable to functional analyses of unknown genes because of the facility with which normal chromosomal loci can be inactivated by homologous recombination with cloned sequences (see Figure 3.42). Therefore, direct functional analysis of yeast genes that were initially identified only on the basis of their nucleotide sequence can be systematically undertaken. Sequencing the yeast genome has thus opened the door to studying many new areas of the biology of this simple eukaryotic cell. Such studies are expected to reveal the functions of many new genes that are not restricted to yeasts but are common to all eukaryotes, including humans.
The genomes of C. elegans, Drosophila, and Arabidopsis are intermediate in size and complexity between those of yeasts and humans. Distinctive features of each of these organisms make them important models for genome analysis: Arabidopsis provides a simple model for a plant genome, C. elegans is widely used for studies of animal development, and Drosophila has been especially well analyzed genetically. The genomes of these organisms, however, are about tenfold larger than those of yeasts, introducing a new order of difficulty in genome mapping and sequencing. Nonetheless, the complete genome sequences of C. elegans and Drosophila have been determined, and sequencing of Arabidopsis is rapidly progressing.
Determination of the sequence of C. elegans in 1998 extended genome sequencing from unicellular organisms (bacteria and yeast) to a multicellular organism recognized as an important model for animal development. The initial phases of analysis of the C. elegans genome used DNA fragments cloned in cosmids, which accommodate DNA inserts of approximately 40 kb (see Figure 3.23). This approach, however, was unable to cover the complete genome, which was accomplished by the cloning of much larger pieces of DNA in yeast artificial chromosome (YAC) vectors (Figure 4.23). The unique feature of YACs is that they contain centromeres and telomeres, allowing them to replicate as linear chromosome-like molecules in yeasts. They can therefore be used to clone DNA fragments the size of yeast chromosomal DNAs, up to thousands of kilobases in length. The large DNA inserts that can be cloned in YACs are critically important for mapping complex genomes, as will be even more apparent in our following discussion of mapping genomes of still greater complexity, such as that of humans.
The C. elegans genome is 97 × 10⁶ base pairs and contains 19,099 predicted protein-coding sequences, approximately three times the number of genes in yeast and one-fifth the number of genes predicted in humans. In contrast to the compact genome organization of yeast, genes in C. elegans span about 5 kilobases and contain an average of five introns. Protein-coding sequences thus account for only about 25% of the C. elegans genome, as compared to 70% of yeast and nearly 90% of bacterial genomes.
Approximately 40% of the predicted C. elegans proteins displayed significant similarity to known proteins of other organisms. As expected, there are substantially more similarities between the proteins of C. elegans and humans than between C. elegans and either yeast or bacteria. Proteins that are common between C. elegans and yeast may function in the basic cellular processes shared by these organisms, such as metabolism, DNA replication, transcription, translation, and protein sorting. These core biological processes appear to be carried out by a similar number of genes in both organisms, and it is likely that these genes will be shared by all eukaryotic cells. In contrast, the majority of C. elegans genes are not found in yeast and are likely to function in the more intricate regulatory activities required for the development of multicellular organisms. Elucidating the functions of these genes is likely to be particularly exciting in terms of understanding animal development. Although adult C. elegans are about 1 mm long and contain only 959 somatic cells in the entire body, they have all of the specialized cell types found in more complicated animals. Moreover, the complete pattern of cell divisions leading to C. elegans development has been described, including analysis of the connections made by all 302 neurons in the adult animal. Many of the genes involved in C. elegans development and differentiation have already been found to be related to genes involved in controlling the proliferation and differentiation of mammalian cells, substantiating the validity of C. elegans as a model for more complex animals. With little doubt, many more critical developmental control genes will be uncovered from studies of the C. elegans genomic sequence.
Drosophila is another key model for animal development, which has been particularly well-characterized genetically. The advantages of Drosophila for genetic analysis include its relatively simple genome and the fact that it can be easily maintained and bred in the laboratory. In addition, a special tool for genetic analysis in Drosophila is provided by the giant polytene chromosomes that are found in some tissues, such as the salivary glands of larvae. These chromosomes arise in nondividing cells as a consequence of repeated replication of DNA strands that fail to separate from each other. Thus, each polytene chromosome contains hundreds of identical DNA molecules aligned in parallel. Because of their size, these polytene chromosomes are visible in the light microscope, and appropriate staining procedures reveal a distinct banding pattern (Figure 4.24). The banding of polytene chromosomes provides a much greater degree of resolution than that achieved with metaphase chromosomes (e.g., see Figure 4.14). The polytene chromosomes are decondensed interphase chromosomes that contain actively expressed genes. More than 5000 bands are visible, each corresponding to an average length of approximately 20 kb of DNA. In contrast, the bands identified in human metaphase chromosomes contain several megabases of DNA.
The banding pattern of polytene chromosomes thus provides a high-resolution physical map of the Drosophila genome. Gene deletions can often be correlated with the loss of a specific chromosomal band, thereby defining the physical location of the gene on the chromosome. In addition, cloned DNAs can be mapped by in situ hybridization to polytene chromosomes, often with sufficient resolution to localize cloned genes to specific bands (Figure 4.25). Thus, the map positions of cosmid or YAC clones (which span many bands) can readily be determined, providing the base for genomic sequence analysis.
Because of the power of Drosophila genetics, the complete sequencing of the Drosophila genome early in 2000 was a major milestone in genomic analysis. The genome of Drosophila consists of approximately 180 × 10⁶ base pairs, of which about one-third is heterochromatin. The remaining 120 × 10⁶ base pairs of the Drosophila genome contain approximately 13,600 genes, somewhat fewer than the number of genes in C. elegans. It is especially striking that a complex animal like Drosophila has only about twice the number of genes found in yeast, and the functional analysis of new genes that have been uncovered by sequencing the Drosophila genome will undoubtedly play a major role in understanding the molecular basis of animal development.
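The differences in gene density among the genomes discussed so far can be made concrete with a rough Python comparison. The sizes and gene counts are the approximate figures quoted in the text. The comparison is only indicative, since the bacterial counts are predicted protein-coding sequences while the yeast figure is a rounded total, and the Drosophila size includes heterochromatin. The dictionary layout and the name `genes_per_mb` are choices made for this example.

```python
# (genome size in Mb, approximate gene count), as quoted in the text
genomes = {
    "M. genitalium": (0.58, 470),
    "H. influenzae": (1.83, 1743),
    "E. coli": (4.6, 4288),
    "S. cerevisiae": (12.0, 6000),
    "C. elegans": (97.0, 19099),
    "Drosophila": (180.0, 13600),
}

genes_per_mb = {name: count / size for name, (size, count) in genomes.items()}

# List the genomes from most to least gene-dense.
for name, density in sorted(genes_per_mb.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {density:7.0f} genes per Mb")
```

On these figures the bacterial genomes carry roughly 800 to 950 genes per megabase, yeast about 500, C. elegans about 200, and Drosophila fewer than 100.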
The complete sequences of Arabidopsis thaliana chromosomes 2 and 4 were reported in 1999. Together, these two chromosomes represent about 32% of the genome and contain approximately 7,800 genes. Some of these genes are related to genes of yeast and C. elegans, whereas others are novel and may represent genes that are specific for plants. The emerging sequence of the Arabidopsis genome is thus likely to reveal a total of about 20,000 genes, analysis of which can be expected to provide a new foundation for studies of plant biology.