The genomes of most eukaryotes are larger and more complex than those of prokaryotes (Figure 4.1). This larger size of eukaryotic genomes is not inherently surprising, since one would expect to find more genes in organisms that are more complex. However, the genome size of many eukaryotes does not appear to be related to genetic complexity. For example, the genomes of salamanders and lilies contain more than ten times the amount of DNA that is in the human genome, yet these organisms are clearly not ten times more complex than humans.
This apparent paradox was resolved by the discovery that the genomes of most eukaryotic cells contain not only functional genes but also large amounts of DNA sequences that do not code for proteins. The difference in the sizes of the salamander and human genomes thus reflects larger amounts of noncoding DNA, rather than more genes, in the genome of the salamander. The presence of large amounts of noncoding sequences is a general property of the genomes of complex eukaryotes. Thus, the thousandfold greater size of the human genome compared to that of E. coli is not due solely to a larger number of human genes. The human genome is thought to contain approximately 100,000 genes, only about 25 times more than E. coli has. Much of the complexity of eukaryotic genomes thus results from the abundance of several different types of noncoding sequences, which constitute most of the DNA of higher eukaryotic cells.
In molecular terms, a gene can be defined as a segment of DNA that is expressed to yield a functional product, which may be either an RNA (e.g., ribosomal and transfer RNAs) or a polypeptide. Some of the noncoding DNA in eukaryotes is accounted for by long DNA sequences that lie between genes (spacer sequences). However, large amounts of noncoding DNA are also found within most eukaryotic genes. Such genes have a split structure in which segments of coding sequence (called exons) are separated by noncoding sequences (intervening sequences, or introns) (Figure 4.2). The entire gene is transcribed to yield a long RNA molecule and the introns are then removed by splicing, so only exons are included in the mRNA. Although most introns have no known function, they account for a substantial fraction of DNA in the genomes of higher eukaryotes.
Introns were first discovered in 1977, independently in the laboratories of Phillip Sharp and Richard Roberts, during studies of the replication of adenovirus in cultured human cells. Adenovirus is a useful model for studies of gene expression, both because the viral genome is only about 3.5 × 10⁴ base pairs long and because adenovirus mRNAs are produced at high levels in infected cells. One approach used to characterize the adenovirus mRNAs was to determine the locations of the corresponding viral genes by examination of RNA-DNA hybrids in the electron microscope. Because RNA-DNA hybrids are distinguishable from single-stranded DNA, the positions of RNA transcripts on a DNA molecule can be determined. Surprisingly, such experiments revealed that adenovirus mRNAs do not hybridize to only a single region of viral DNA (Figure 4.3). Instead, a single mRNA molecule hybridizes to several separated regions of the viral genome. Thus, the adenovirus mRNA does not correspond to an uninterrupted transcript of the template DNA; rather, the mRNA is assembled from several distinct blocks of sequences that originated from different parts of the viral DNA. This was subsequently shown to occur by RNA splicing, which will be discussed in detail in Chapter 6.
Soon after the discovery of introns in adenovirus, similar observations were made on cloned genes of eukaryotic cells. For example, electron microscopic analysis of RNA-DNA hybrids and subsequent nucleotide sequencing of cloned genomic DNAs and cDNAs indicated that the coding region of the mouse β-globin gene (which encodes the β subunit of hemoglobin) is interrupted by two introns that are removed from the mRNA by splicing (Figure 4.4). The intron-exon structure of many eukaryotic genes is quite complicated, and the amount of DNA in the intron sequences is often greater than that in the exons. The chicken ovalbumin gene, for example, contains eight exons and seven introns distributed over approximately 7700 base pairs (7.7 kilobases, or kb) of genomic DNA. The exons total only about 1.9 kb, so approximately 75% of the gene consists of introns. An extreme example is the human gene that encodes the blood clotting protein factor VIII. This gene spans approximately 186 kb of DNA and is divided into 26 exons. The mRNA is only about 9 kb long, so the gene contains introns totaling more than 175 kb. On average, introns are estimated to account for about ten times more DNA than exons in the genes of higher eukaryotes.
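The intron fractions quoted above follow from simple arithmetic on gene span and exon length. The short Python sketch below reproduces them. The function name `intron_fraction` is illustrative only, and for factor VIII the mRNA length (about 9 kb) is assumed to approximate the summed length of the exons.

```python
def intron_fraction(gene_span_bp: float, exon_total_bp: float) -> float:
    """Return the fraction of a gene's genomic span made up of introns."""
    if gene_span_bp <= 0 or not (0 <= exon_total_bp <= gene_span_bp):
        raise ValueError("exon length must lie between 0 and the gene span")
    return (gene_span_bp - exon_total_bp) / gene_span_bp


# Approximate figures quoted in the text.
ovalbumin = intron_fraction(7_700, 1_900)

# Assumption: the ~9 kb mRNA length stands in for the summed exon length.
factor_viii = intron_fraction(186_000, 9_000)

print(f"ovalbumin:   {ovalbumin:.0%} intron")
print(f"factor VIII: {factor_viii:.0%} intron")
```

Under these assumptions the ovalbumin gene comes out at roughly three-quarters intron and the factor VIII gene at roughly 95% intron, in line with the figures given in the text.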
Introns are present in most genes of complex eukaryotes, although they are not universal. Almost all histone genes, for example, lack introns, so introns are clearly not required for gene function in eukaryotic cells. In addition, introns are not found in most genes of simple eukaryotes, such as yeasts. Conversely, introns are present in rare genes of prokaryotes. The presence or absence of introns is therefore not an absolute distinction between prokaryotic and eukaryotic genes, although introns are much more prevalent in higher eukaryotes (both plants and animals), where they account for a substantial amount of total genomic DNA.
Most introns have no known cellular function, although a few have been found to encode functional RNAs or proteins. Introns are generally thought to represent remnants of sequences that were important earlier in evolution. In particular, introns may have helped accelerate evolution by facilitating recombination between protein-coding regions (exons) of different genes, a process known as exon shuffling. Exons frequently encode functionally distinct protein domains, so recombination between introns of different genes would result in new genes containing novel combinations of protein-coding sequences. As predicted by this hypothesis, DNA sequencing studies have demonstrated that some genes are chimeras of exons derived from several other genes, providing direct evidence that new genes can be formed by recombination between intron sequences.
It appears most likely that introns were present early in evolution, prior to the divergence of prokaryotic and eukaryotic cells. According to this hypothesis, introns played an important role in the initial assembly of protein-coding sequences in the ancient ancestors of present-day cells. Introns were subsequently lost from most genes of prokaryotes and simpler eukaryotes (e.g., yeasts) in response to evolutionary selection for rapid replication, which led to streamlining the genomes of these organisms. However, since rapid cell division is not an advantage to higher eukaryotes, introns have been retained in their genomes. Alternatively, introns may have arisen later in evolution as a result of the insertion of DNA sequences into genes that had already been formed as continuous protein-coding sequences. Exon shuffling would then have played an important role in the further evolution of genes in higher eukaryotes but would not account for the initial assembly of protein-coding sequences prior to the evolutionary divergence of prokaryotic and eukaryotic cells.
Another factor contributing to the large size of eukaryotic genomes is that some genes are repeated many times. Whereas most prokaryotic genes are represented only once in the genome, many eukaryotic genes are present in multiple copies, called gene families. In some cases, multiple copies of genes are needed to produce RNAs or proteins required in large quantities, such as ribosomal RNAs or histones. In other cases, distinct members of a gene family may be transcribed in different tissues or at different stages of development. For example, the α and β subunits of hemoglobin are both encoded by gene families in the human genome, with different members of these families being expressed in embryonic, fetal, and adult tissues (Figure 4.5). Members of many gene families (e.g., the globin genes) are clustered within a region of DNA; members of other gene families are dispersed to different chromosomes.
Gene families are thought to have arisen by duplication of an original ancestral gene, with different members of the family then diverging as a consequence of mutations during evolution. Such divergence can lead to the evolution of related proteins that are optimized to function in different tissues or at different stages of development. For example, fetal globins have a higher affinity for O₂ than do adult globins, a difference that allows the fetus to obtain O₂ from the maternal circulation.
As might be expected, however, not all mutations enhance gene function. Some gene copies have instead sustained mutations that result in their loss of ability to produce a functional gene product. For example, the human α- and β-globin gene families each contain two genes that have been inactivated by mutations. Such nonfunctional gene copies (called pseudogenes) represent evolutionary relics that significantly increase the size of eukaryotic genomes without making a functional genetic contribution.
A substantial portion of eukaryotic genomes consists of highly repeated noncoding DNA sequences. These sequences, sometimes present in hundreds of thousands of copies per genome, were first demonstrated by Roy Britten and David Kohne during studies of the rates of reassociation of denatured fragments of cellular DNAs (Figure 4.6). Denatured strands of DNA hybridize to each other (reassociate), re-forming double-stranded molecules (see Figure 3.28). Since DNA reassociation is a bimolecular reaction (two separated strands of denatured DNA must collide with each other in order to hybridize), the rate of reassociation depends on the concentration of DNA strands. When fragments of E. coli DNA were denatured and allowed to hybridize with each other, all of the DNA reassociated at the same rate, as expected if each DNA sequence were represented once per genome. However, reassociation of fragments of DNA extracted from mammalian cells showed a very different pattern. Approximately 60% of the DNA fragments reassociated at the rate expected for sequences present once per genome, but the remainder reassociated much more rapidly than expected. The interpretation of these results was that some sequences were present in multiple copies and therefore reassociated more rapidly than those sequences that were represented only once per genome. In particular, these experiments indicated that approximately 40% of mammalian DNA consists of highly repetitive sequences, some of which are repeated 10⁵ to 10⁶ times.
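The dependence of reassociation rate on strand concentration can be made concrete with the ideal second-order rate law commonly used to analyze such experiments, in which the fraction of DNA remaining single-stranded is 1 / (1 + k × C0 × t), where C0 is the initial DNA concentration and t is time. The Python sketch below assumes this idealized model, and further assumes that a sequence present in n copies per genome behaves as though its rate constant were n times larger. The function names, the constant `K_SINGLE_COPY`, and the arbitrary units are illustrative and are not taken from the original experiments.

```python
def fraction_single_stranded(c0t: float, k: float) -> float:
    """Ideal second-order kinetics: fraction of strands still denatured."""
    return 1.0 / (1.0 + k * c0t)


def c0t_half(k: float) -> float:
    """C0*t value at which half of the strands have reassociated."""
    return 1.0 / k


K_SINGLE_COPY = 1.0  # arbitrary units; hypothetical value for illustration


def rate_constant(copies_per_genome: float) -> float:
    """Assumption: the effective rate constant scales with copy number."""
    return K_SINGLE_COPY * copies_per_genome


for copies in (1, 1e5, 1e6):
    k = rate_constant(copies)
    print(f"{copies:>9.0f} copies: half-reassociated at C0t = {c0t_half(k):.1e}")
```

In this simplified picture, a sequence repeated 10⁵ times reaches the half-reassociated point at a C0 × t value 10⁵ times lower than a single-copy sequence, which is why repeated sequences appear as a rapidly reassociating fraction.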
Further analysis has identified several types of these highly repeated sequences. One class (called simple-sequence DNA) contains tandem arrays of thousands of copies of short sequences, each 5 to 200 nucleotides long. For example, one type of simple-sequence DNA in Drosophila consists of tandem repeats of the seven-nucleotide unit ACAAACT. Because of their distinct base compositions, many simple-sequence DNAs can be separated from the rest of the genomic DNA by equilibrium centrifugation in CsCl density gradients. The density of DNA is determined by its base composition, with AT-rich sequences being less dense than GC-rich sequences. Therefore, an AT-rich simple-sequence DNA bands in CsCl gradients at a lower density than the bulk of Drosophila genomic DNA (Figure 4.7). Since such repeat-sequence DNAs band as “satellites” separate from the main band of DNA, they are frequently referred to as satellite DNAs. These sequences are repeated millions of times per genome, accounting for 10 to 20% of the DNA of most higher eukaryotes. Simple-sequence DNAs are not transcribed and do not convey functional genetic information. Some, however, may play important roles in chromosome structure.
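The base composition of the Drosophila repeat unit can be checked directly. The Python sketch below counts G and C residues in ACAAACT; the function name `gc_fraction` is illustrative. The text does not give the GC content of bulk Drosophila DNA, so no comparison value is included here.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


repeat_unit = "ACAAACT"

# A tandem array has the same base composition as its repeat unit.
tandem_array = repeat_unit * 1000

print(f"GC content of repeat unit:  {gc_fraction(repeat_unit):.1%}")
print(f"GC content of tandem array: {gc_fraction(tandem_array):.1%}")
```

Only two of the seven bases are G or C, so an array built from this unit is strongly AT-rich, consistent with its banding at a lower density than the main band.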
Other repetitive DNA sequences are scattered throughout the genome rather than being clustered as tandem repeats. These sequences are classified as SINEs (short interspersed elements) or LINEs (long interspersed elements). The major SINEs in mammalian genomes are Alu sequences, so called because they usually contain a single site for the restriction endonuclease AluI. Alu sequences are approximately 300 base pairs long, and about a million such sequences are dispersed throughout the genome, accounting for nearly 10% of the total cellular DNA. Although Alu sequences are transcribed into RNA, they do not encode proteins and their function is unknown. The major human LINEs (which belong to the LINE 1, or L1, family) are about 6000 base pairs long and repeat approximately 50,000 times in the genome. L1 sequences are transcribed and at least some encode proteins, but like Alu sequences, they have no known function in cell physiology.
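The contribution of these interspersed repeats to total DNA follows from element length multiplied by copy number. The Python sketch below works through the figures quoted above, using the approximate human genome size of 3 × 10⁹ base pairs given elsewhere in this chapter. It assumes that every copy is full length, so the results are best read as rough upper estimates; the function name `genome_fraction` is illustrative.

```python
HUMAN_GENOME_BP = 3e9  # approximate genome size used elsewhere in this chapter


def genome_fraction(element_length_bp: float, copy_number: float,
                    genome_bp: float = HUMAN_GENOME_BP) -> float:
    """Fraction of the genome occupied by a repeat family.

    Assumption: every copy is full length.
    """
    return element_length_bp * copy_number / genome_bp


alu = genome_fraction(300, 1_000_000)
l1 = genome_fraction(6_000, 50_000)

print(f"Alu: {alu:.0%} of the genome")
print(f"L1:  {l1:.0%} of the genome")
```

Both families come out at about 10% of the genome under this assumption, which matches the figure the text gives for Alu sequences.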
Both Alu and L1 sequences are examples of transposable elements, which are capable of moving to different sites in genomic DNA (see Chapter 5). Some of these sequences may help regulate gene expression, but most Alu and L1 sequences appear not to make a useful contribution to the cell. They may, however, have played important evolutionary roles by contributing to the generation of genetic diversity.
Having discussed several kinds of noncoding DNA that contribute to the genomic complexity of higher eukaryotes, it is of interest to consider the total number of genes in eukaryotic genomes (Table 4.1). Assuming that the average polypeptide is approximately 400 amino acids long, the average size of the coding sequence of a gene is 1200 base pairs. In bacterial genomes, most of the DNA encodes proteins. For example, the genome of E. coli is approximately 4.6 × 10⁶ base pairs long and contains 4288 genes, with nearly 90% of the DNA used as protein-coding sequence.
Table 4.1. The Numbers of Genes in Cellular Genomes
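|Organism|Genome size (Mb)ᵃ|Protein-coding sequence (percent)ᵇ|Number of genesᵇ|
|E. coli|4.6|~90|4288|
|S. cerevisiae|12|~70|~6000|
|C. elegans|97|~25|~19,000|
|Drosophila|180|~13|~13,600|
|Human|3000|~3|~100,000|
ᵃ Mb = millions of base pairs.
ᵇ Determined from the E. coli, S. cerevisiae, C. elegans, and Drosophila genomic sequences and estimated for the human genome. Values shown are those quoted in the accompanying text.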
The yeast genome, which consists of 12 × 10⁶ base pairs, is about 2.5 times the size of the genome of E. coli, but is still extremely compact. Only 4% of the genes of Saccharomyces cerevisiae contain introns, and these usually have only a single small intron near the start of the coding sequence. The average gene in yeast spans about 2000 base pairs, and approximately 70% of the yeast genome is used as protein-coding sequence, specifying a total of about 6000 proteins.
The genome of the nematode C. elegans, a relatively simple animal genome, is intermediate in size and complexity between the genomes of yeast and mammals. The C. elegans genome is 97 × 10⁶ base pairs and contains approximately 19,000 protein-coding genes. Thus, while the genome of C. elegans is about eight times larger than that of yeast, it contains only about three times the number of genes. This correlates with the presence of a substantial number of introns in C. elegans. Each gene in C. elegans contains an average of five introns and spans an average of 5000 bases. Consistent with this, only about 25% of the C. elegans sequence corresponds to exons, versus 70% protein-coding sequence in the yeast genome. Although the genome of Drosophila is 180 × 10⁶ base pairs, Drosophila contains fewer genes than C. elegans (about 13,600). Protein-coding sequence thus corresponds to only about 13% of the Drosophila genome.
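The trend described above, in which genome size grows faster than gene number, can be summarized as the average amount of DNA per gene. The Python sketch below uses the approximate genome sizes and gene counts quoted in the text; the dictionary `genomes` and the function `bp_per_gene` are illustrative names, and the yeast and C. elegans gene counts are the rounded values given above.

```python
# (genome size in base pairs, number of genes), as quoted in the text
genomes = {
    "E. coli": (4.6e6, 4288),
    "S. cerevisiae": (12e6, 6000),
    "C. elegans": (97e6, 19000),
    "Drosophila": (180e6, 13600),
}


def bp_per_gene(genome_bp: float, gene_count: int) -> float:
    """Average genomic DNA per gene, including introns and spacers."""
    return genome_bp / gene_count


for name, (size, count) in genomes.items():
    print(f"{name:<14} {bp_per_gene(size, count):>8,.0f} bp per gene")
```

On these figures the average DNA per gene rises from roughly 1 kb in E. coli to about 2 kb in yeast, 5 kb in C. elegans, and 13 kb in Drosophila, reflecting the increasing share of introns and other noncoding sequence.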
The genomes of higher animals (such as humans) are still more complex and contain large amounts of noncoding DNA. Thus, only a small fraction of the 3 × 10⁹ base pairs of the human genome is expected to correspond to protein-coding sequence. Approximately one-third of the genome corresponds to highly repetitive sequences, leaving an estimated 2 × 10⁹ base pairs for functional genes, pseudogenes, and nonrepetitive spacer sequences. If the average gene spans 10,000 to 20,000 base pairs (including introns), one might expect the human genome to consist of about 100,000 genes, with protein-coding sequences corresponding to only about 3% of human DNA. Although this estimate is generally accepted as plausible, it remains to be verified or corrected by the final results of human genome sequencing.
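The reasoning behind this estimate can be laid out as a short calculation. The Python sketch below takes the figures given in the text (a genome of 3 × 10⁹ base pairs, 2 × 10⁹ base pairs of nonrepetitive DNA, gene spans of 10,000 to 20,000 base pairs, and 1200 base pairs of coding sequence per gene) and treats all nonrepetitive DNA as available for genes, which is an assumption that ignores pseudogenes and spacer sequences. The constant and function names are illustrative.

```python
GENOME_BP = 3e9            # total human genome, as given in the text
NONREPETITIVE_BP = 2e9     # remaining after highly repetitive sequences
CODING_BP_PER_GENE = 1200  # 400 amino acids x 3 bases per codon


def max_gene_count(available_bp: float, gene_span_bp: float) -> float:
    """Upper bound on gene number if all available DNA were genes."""
    return available_bp / gene_span_bp


genes_if_20kb = max_gene_count(NONREPETITIVE_BP, 20_000)
genes_if_10kb = max_gene_count(NONREPETITIVE_BP, 10_000)

# Coding fraction of the whole genome for the lower gene count.
coding_fraction = genes_if_20kb * CODING_BP_PER_GENE / GENOME_BP

print(f"genes at 20 kb per gene: {genes_if_20kb:,.0f}")
print(f"genes at 10 kb per gene: {genes_if_10kb:,.0f}")
print(f"coding fraction at the lower count: {coding_fraction:.0%}")
```

Under these assumptions the gene count falls between 100,000 and 200,000, and the lower figure implies a coding fraction of a few percent, in rough agreement with the estimate of about 3% given in the text.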