- My name is Bob Tjian, I'm a Professor of MCB at the University of California at Berkeley, and I'm also serving as the President of the Howard Hughes Medical Institute. I'm going to spend the next 25 or 30 minutes telling you about some fundamentals of one of the most important molecular processes in living cells, which is the expression of genes through a process called transcription. Now, first, to understand what gene expression means, you have to have a sense of what we tend to refer to in the field as the central dogma of molecular biology. Another way to think about this is the flow of biological information from DNA, in other words, our chromosomes, of which every cell has its complement, to be transcribed into a sister molecule called RNA. So, this process of converting DNA into RNA is called transcription, and that is the topic of this lecture. This process is very complicated, as you'll see by the end of my two lectures, and it is very important for many, many fundamental processes in biology. So, what I'm gonna spend today's lecture on is the discovery of a large family of transcription proteins. These factors, as we call them, are key molecules that regulate the use of genetic information that has been encoded in the genome. Now, transcription factors or proteins are involved in many fundamental aspects of biology, including embryonic development, cellular differentiation, and cell fate. In other words, pretty much what your cells are doing, how a tissue works, and how an organism survives and reproduces is dependent on the process of gene expression. And the first step in this process is transcription. Now, there are many other reasons why a large group of people and scientists are interested in transcription, and another reason is that understanding the fundamental molecular mechanisms that control transcription in humans or in any other organism can inform us and teach us about what happens when something goes wrong, for example, in diseases.
And I list here just a few diseases that we could study as a result of understanding the structure and function of these transcription factor proteins that I'm going to be telling you about. And of course, the hope is that in understanding the molecular underpinnings of complex diseases like cancer, diabetes, Parkinson's, and so forth, we will be able to develop and use better, more specific therapeutic drugs and also to develop more accurate and rapid diagnostic tools. So, those are a couple of the reasons why many of us have spent, in my case, over 30 years studying this process of transcriptional regulation. Now, to get the whole thing started, I have to give you a sense of what the magnitude of the problem is. So, imagine that one would really like to understand how this process of decoding the genome happens in humans. So, as you may know, the human genome has some 3 billion base pairs or bits of genetic information, and that encodes roughly 22,000 genes. These are stretches of DNA sequence that encode ultimately a product that is a protein which actually makes the cells function. So, as I already explained to you, there's this flow of biological information where you have to extract the information buried in DNA and convert it into RNA. And what I'm not gonna tell you about today is the process of going from RNA to protein, which is a reaction called translation. I'm going to instead just focus on the first step of converting DNA into RNA, which is the process of transcription. Now, one of the most amazing results that we got over the last decade or so came when the human genome was entirely sequenced. Along with the first few other genomes that were sequenced, it made us realize that the number of genes in humans is actually not vastly different from many other organisms, even simple organisms like little worms or fruit flies and so forth. That is, roughly 22,000 to 25,000 genes is all that any of these different organisms have.
And yet, anybody looking at us versus a little roundworm in the soil or a fruit fly can tell that we're a much more complex organism with a much bigger brain, much more complex behavior, and so forth. So, how does this happen? Part of the answer to this very interesting mystery or paradox lies in the way that genes are organized and how they're regulated. And one of the most striking results of the genome sequencing project was to realize that a vast, vast majority of the DNA in our chromosomes is actually not coding for specific gene products, and that only roughly 3% of the DNA is actually coding. Those little arrows that I show you on this purple DNA are the gene-coding regions. So, you'll notice that there's a lot of non-arrow sequences, which I'll show you in this next slide as green. These are non-coding regions. So, the vast majority, 97% or greater, is non-coding, so what are these other sequences doing? And of course, it turns out that these sequences carry very important little fragments of DNA, which we call regulatory sequences. And these are the sequences that actually control whether a gene gets turned on or not. And I'll be spending much of the next 20 minutes telling you about how this process all works and how these little bits of DNA sequence actually function to control gene expression. Now, the other thing that I have to bring you up to date on is this mysterious process we're calling transcription, which reads double-stranded DNA and then makes a related molecule, which is a single-stranded RNA molecule, which is an informational molecule. That reaction is catalyzed by a very complex multi-subunit enzyme called RNA polymerase II. Now, there's the Roman numeral II at the end of this because there are actually three enzymes in most mammals, at least three enzymes that carry out different processes and different types of RNA production.
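As a rough illustration of the two ideas just described, here is a minimal Python sketch. It is only a toy under stated assumptions: the template sequence is made up, the function name `transcribe` is hypothetical, and real transcription depends on the polymerase and many helper factors rather than a lookup table. It shows how an RNA strand is complementary to the DNA template (with U in place of T) and how little DNA the roughly 3% coding fraction amounts to.

```python
# Toy model of transcription: RNA is built complementary to the template
# DNA strand, with uracil (U) taking the place of thymine (T).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (read 5'->3') made from a template strand given 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


# Hypothetical template strand, written 3'->5' (not a real gene).
template = "TACGGCATT"
rna = transcribe(template)
print(rna)

# Back-of-the-envelope arithmetic using the round numbers from the lecture:
# about 3 billion base pairs, of which roughly 3% is coding.
GENOME_BP = 3_000_000_000
coding_bp = GENOME_BP * 3 // 100
noncoding_bp = GENOME_BP - coding_bp
print(coding_bp, noncoding_bp)
```

With these round numbers, only about 90 million of the 3 billion base pairs are coding, which is why the remaining non-coding DNA is such an interesting place to look for regulatory sequences.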
But I'm only gonna tell you about the one that makes the classical messenger RNA, which then ultimately becomes proteins. So, now one of the things that we learned early on in the study of mammalian or other multicellular organism transcription processes is that despite the fact that this enzyme is quite complex in its structure, it turns out to be an enzyme that nevertheless needs a lot of help to do its job. So, on its own, this RNA polymerase II cannot tell the difference between the non-coding regions of the genome and places where it's supposed to be coding or reading to make the appropriate messenger RNAs. So, this sort of leads you to think that there must be a number of other factors that somehow direct RNA polymerase to the right place at the right time in the genome of every cell in your body so that the right products get made so each cell in your body is functioning properly. And this is where things get really interesting: some 25 or 30 years ago, a number of laboratories took on the job of hunting for these elusive and, as it turned out, specialized protein factors that recognize these little stretches of DNA sequences that I've been telling you about that make up the vast majority of the non-coding part of the genome, and of working out how these proteins can then recognize and ultimately physically interact with these little bits of genetic information to turn genes on or off. Now, in this lecture, I can't go into all the details of the types of experiments or the ranges of experiments that many, many laboratories have done over the last two decades to finally work out this molecular puzzle of how transcription works. But I can tell you that there are fundamentally two major approaches that have been taken over the last few decades to kind of get a parts list of the machinery that decodes the genome and carries out the process of transcription.
One is kind of the old style, I'll call it bucket biochemistry: take a live cell, crush it up, spread out all of its parts and then try to figure out how to put it back together again. That's what I call in vitro biochemistry. And the other one is in vivo genetics, where you effectively use genetic tools, mutagenesis, to go in there and selectively remove or knock down or knock out certain genes and gene products and then ask what is the consequence on that cell or that organism? Both of these technologies are very powerful and highly complementary, and they continue to be used. Today, I will focus primarily on the in vitro biochemical techniques which led us to the discovery of the first few classes of transcription factors. And in subsequent lectures, we'll go to more recent technologies that allow us to sort of speed up this whole process of identifying key regulatory molecules and how they work. So, let's go back to the basic unit of gene expression, which is a gene, here shown in the orange arrow, and the non-coding sequences surrounding it. And you'll see that now I've added a few more elements to this purple DNA. You see some symbols, a blue square, a round circle that's pink, and then a yellow triangle. Those are just a way for me to graphically represent the little bits of DNA sequences that I told you about that are the regulatory sequences. So, the little round one happens to be very GC-rich, the triangle one is a classical element that's called a TATA box, which I'll tell you about a little bit later. And the blue one is yet another recognition element. So, why are we so interested in these little stretches of nucleic acid sequence in the genome when they're buried amongst billions of other sequences? Well, these individual little sequences turn out to be very important because of where they sit, you'll notice they're sitting near the top of the arrow, and they are recognized by very special proteins which are the transcription factors.
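To make the idea of short regulatory sequences sitting near a gene a little more concrete, here is a minimal Python sketch that scans a stretch of DNA for two of the elements just mentioned. Everything specific in it is an assumption for illustration: the promoter string is invented, the function name `find_motifs` is hypothetical, and the two patterns (GGGCGG for the GC box, TATA followed by A or T and then A for the TATA box) are simplified consensus sequences. Real binding sites tolerate more variation, and this sketch only looks at one strand.

```python
import re

# Simplified consensus patterns (an assumption for illustration only).
MOTIFS = {
    "GC box": re.compile(r"GGGCGG"),
    "TATA box": re.compile(r"TATA[AT]A"),
}


def find_motifs(seq: str) -> list[tuple[str, int]]:
    """Return (motif name, 0-based start position) for every match, by position."""
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(seq.upper()):
            hits.append((name, match.start()))
    return sorted(hits, key=lambda hit: hit[1])


# Invented promoter-like sequence, not taken from any real genome.
promoter = "ACGTGGGCGGAACTTGGGCGGTTCTATAAAGGCATCG"

for name, position in find_motifs(promoter):
    print(f"{name} at position {position}")
```

A transcription factor does the equivalent of this search physically, by fitting its folded shape against the DNA, rather than by reading letters.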
So, now I'm showing you some symbols with little cutouts which fit into either the square, the circle, or the triangle. So, transcription factors, at least one major family of transcription factors, are proteins whose three-dimensional structure is folded into a shape that allows them to recognize these short stretches of double-stranded DNA, in fact largely through interactions with the major groove of DNA, and I'll show you a structure of one in a little bit. So, now it turns out that there are probably thousands of these transcription factors because the number of genes that we have to control, as I showed you, is on the order of 20,000 or 25,000 genes. And so, it turns out that you need a pretty large percentage of the genome devoted to encoding these regulatory proteins in order for a complex organism like ourselves to survive. Then the other component of this, let's call it the transcriptional apparatus, is, of course, the enzyme that catalyzes RNA synthesis. And I already told you that this enzyme on its own can't tell the difference between random DNA sequence and a gene or a promoter. These other sequence-specific DNA-binding proteins are the ones that must recruit or otherwise direct RNA polymerase to essentially land on the right place and at the right time in the genome to turn on a certain subset of genes that are specifically required in a specialized cell type, whatever cell you happen to be looking at. So, that is kind of the first level of complexity of informational interactions between the transcription factors and the more ubiquitous, and I would call it promiscuous, RNA polymerase II enzyme. Well, as it turns out, it took several decades to work out most if not all of the components of this so-called transcriptional machinery. And it turns out in this slide I'm showing you things are already starting to get more complicated.
So, not only do you have RNA polymerase, but you have a bunch of other proteins that go by names like TFIIA, B, you know, D, E, H, F, and so forth. So, it looks like there are going to be many, many proteins that are necessary to form the transcriptional apparatus. And then on top of that you need sequence-specific DNA-binding proteins, which I already described to you, to further inform or otherwise regulate the process of when a particular RNA polymerase molecule should be binding to a particular gene. So, that's the sort of overview, now let me get into the specifics and how we actually discovered this family of proteins. And it'll be interesting for you to see how science in this field evolved. Now, as is often the case when you first try to tackle a very complex problem, and, of course, we didn't really know how complex it was when we began these studies, we assumed it might be complicated, certainly more complicated than systems that we already had some idea about, for example, in bacteria or in bacteriophages. We took a lesson from our studies of bacteriophages and decided that to begin to dissect the molecular complexities of the transcription process in animal cells, we should start with viruses because we knew that viruses will enter these host cells, these complex cells that we ultimately want to study, and have to use the same molecular machinery to transcribe their genes as the host mammalian cell would do. So, this was kind of a trick or a way to look through a molecular window into a complex system and try to simplify it. And in our case, the early studies of the late '70s and early '80s involved one of the simplest double-stranded DNA viruses, called Simian virus 40.
And Simian virus 40, of course, is a monkey virus, which was nice because it's very close to humans and many things that we could learn about the way this virus uses its host, which are monkey cells, to replicate and to express its RNAs and genes would be applicable to our studies of humans, as you'll see. And this virus was one of the first whose DNA, its double-stranded DNA of about 5,000 base pairs, was fully sequenced. This was long before rapid modern-day sequencing was available, so this gave us a very powerful tool. It basically allowed us to look at the entire genome of this virus, which was tiny by comparison, only 5,243 base pairs. But just that information was already very important 'cause it very quickly allowed us, for example, to map where the genes are. And one of the genes encoded a protein called a tumor antigen, which turns out to be a transcription factor. This then allowed us to basically do biochemistry and genetics on the very first eukaryotic transcription factor, which in this case happens to be a repressor. That is a protein that, when it binds the DNA, just the same way as I showed you for the model case, binds through specific protein-DNA interactions. But in this case, it actually shuts transcription down rather than turning it up. In the process of studying the way that this little virus, when it infects a mammalian cell, uses proteins like T-antigen to regulate its gene expression, it became clear that it had to use the host machinery to do the process. And that meant that there must be monkey proteins that are also involved in activating or repressing genes of this virus. And this then led us to the most important step, which is to transfer the technology we learned about viruses and how to work with a viral transcription factor like T-antigen to the cellular ones. And I'm gonna give you just one example of how the simple jump into the host cell allowed us to discover the first human transcription factor.
So, the question that we then asked back in the early 1980s was what host molecule is regulating the transcription of this virus when the virus is in the host? And we knew from the DNA sequence of the virus that there were these six very GC-rich snippets of DNA that were regulatory 'cause if we deleted them, the virus no longer would express the gene of interest. So, we knew that something was probably responsible for recognizing these GC boxes, and we knew that it wasn't a virally encoded gene because we had tested all of the viral genes, of which there were only six to begin with. So, we knew it had to be a host gene and that led us to a whole, I would say, family of experiments that led to the discovery of sequence-specific mammalian transcription factors. And as I said, we could have taken multiple approaches to try to address this complicated issue. I'll just give you one example of using in vitro biochemistry to finally get our hands on this key sequence-specific human transcription factor, which, of course, has a homologue in the monkey. And the way we did it was very interesting and simple in retrospect, and that is recognizing the fact that whatever this protein was, it had to have the property of recognizing those GC boxes that were sitting next to the viral gene. We assumed that it must be a sequence-specific DNA-binding protein, so all we had to do was figure out a way to extract proteins from human cells or monkey cells and then try to fish out, from the many thousands of different proteins that were in this gemisch of cellular extract, those specific proteins that would be responsible for discriminating between random DNA sequences and the specific GC box. And I'll quickly run through sort of the logic behind this. So, what I'm showing you here is a solid surface with DNA coupled to it that is highly enriched for the recognition element, the GC box, which should be the sequence recognized by the protein of interest.
Now, we had no idea what this protein was gonna look like, how many proteins there were gonna be, and so forth, but we knew it had to recognize the GC box. So, we're gonna try to fish this out of a pool of many thousands of other proteins. Now, the key trick here comes from the fact that all cell extracts contain not just one DNA-binding protein but, as I told you, thousands of different DNA-binding proteins. But most of them, or in fact in our case, none of the other several hundred to a thousand proteins that could bind DNA, actually happen to recognize the GC box, they just bind other DNA sequences. So, to kind of favor our protein being able to bind to our GC box and not have to compete with all the other proteins, what we did was to add non-specific DNA in massive stoichiometric excess so that all the other proteins that wouldn't recognize the GC box would still have some partner to hang onto. And this trick worked very well. So, having the specific DNA on the solid resin and the non-specific DNA flowing all over the place, we could capture selectively the pink molecules here, which are the GC box recognition ones, and the blue-green molecules, of course, predominantly bind to non-specific DNA. I show you one little blue one on the column because nothing works perfectly in real science, which tells you that we have to go through this process iteratively to actually finally obtain a preparation that's purely pink molecules with no green-blue ones. Well, that turned out to work very, very well. And that whole process of biochemical fractionation followed by a direct sequence-specific DNA affinity resin gave us the ability to perform a biochemical purification followed by a molecular cloning of the gene that encodes the transcription factor SP1. And then we carried out a bunch of experiments, which I'll tell you about next, to show that this protein actually does activate transcription.
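The logic of adding excess non-specific DNA can be sketched with a simple competitive-binding calculation. This is only a toy model under loud assumptions: every number below is hypothetical, the function name `fraction_on_resin` is invented for illustration, and it assumes simple equilibrium binding with DNA in excess over protein, which is not a description of the actual experiment. It shows why a GC box binder stays on the resin while a generic DNA binder is drawn away by the competitor.

```python
def fraction_on_resin(kd_resin: float, kd_competitor: float,
                      resin_dna: float, competitor_dna: float) -> float:
    """Fraction of a protein bound to the resin DNA at equilibrium.

    Simple competitive binding, assuming DNA is in excess over protein.
    Concentrations and dissociation constants share the same arbitrary units.
    """
    on_resin = resin_dna / kd_resin
    on_competitor = competitor_dna / kd_competitor
    return on_resin / (1.0 + on_resin + on_competitor)


# Hypothetical values, chosen only to illustrate the trend.
RESIN_DNA = 1.0         # GC box DNA coupled to the solid support
COMPETITOR_DNA = 100.0  # non-specific DNA added in large excess

# A GC box binder: binds the resin sequence 1000x tighter than random DNA.
gc_binder = fraction_on_resin(0.001, 1.0, RESIN_DNA, COMPETITOR_DNA)

# A generic DNA binder: the resin DNA looks like any other DNA to it.
generic = fraction_on_resin(1.0, 1.0, RESIN_DNA, COMPETITOR_DNA)
generic_no_competitor = fraction_on_resin(1.0, 1.0, RESIN_DNA, 0.0)

print(f"GC box binder on resin:       {gc_binder:.3f}")
print(f"generic binder on resin:      {generic:.3f}")
print(f"generic binder, no competitor: {generic_no_competitor:.3f}")
```

In this toy model, the competitor removes most of the generic binder from the resin while barely affecting the specific one, and the small amount of generic binder that remains is the reason the procedure has to be repeated.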
And of course, we went back and we proved that this protein, which turned out to be a rather large polypeptide, can indeed recognize the GC box. And it doesn't matter if it's a GC box from the SV40 genome or any other GC box that we could find in the human genome, it would find that sequence and bind to it and then it would generally activate transcription. So, this led to the discovery of the first of a very large family of sequence-specific DNA-binding proteins. Now, I told you that the way these proteins tend to recognize short DNA sequences is to interact with DNA through the major groove. And here's a perfect example. So, the thick blue model there shows the actual three structures that are called zinc fingers. And the reason they're called zinc fingers is because there are amino acids that are organized around a center that contains a zinc molecule which holds the three-dimensional shape of the polypeptide in a position just right for fitting into the major groove of the DNA. And the DNA here is shown in pink, and you can see that that blue outline fits right into the major groove of the DNA, but not into the minor groove. And one of the most important findings was not only the discovery of the first human transcription factor, but the realization that most if not all sequence-specific DNA-binding transcription factors have a similar structural motif. That is to say, some structure is built to recognize sequences in the major groove of DNA. And these three-dimensional motifs are recognizable from the amino acid sequences encoded in the genome. So, we can now much more quickly scan the entire sequence of a genome and identify genes that are likely to encode DNA-binding proteins as a result of understanding the structure-function relationships of these DNA-binding motifs like zinc fingers. So, what I'd like to show you now is that I've only introduced you to one class of transcription factors, which are the sequence-specific DNA-binding proteins.
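Scanning for a DNA-binding motif in a protein sequence can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the pattern is a simplified form of the classical C2H2 zinc finger spacing (two cysteines and two histidines at fixed distances), the protein sequence is synthetic and not taken from any real gene, and the name `find_c2h2_fingers` is hypothetical. Real motif searches use richer profiles than a single regular expression.

```python
import re

# Simplified C2H2 zinc finger spacing: C, 2-4 residues, C, 12 residues,
# H, 3-5 residues, H. A deliberate simplification for illustration.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_c2h2_fingers(protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched segment) for each candidate finger."""
    return [(m.start(), m.group())
            for m in C2H2_PATTERN.finditer(protein.upper())]


# Synthetic toy protein with two finger-like segments joined by a linker.
finger_1 = "C" + "PE" + "C" + "GKAFSRSDELTR" + "H" + "QRT" + "H"
finger_2 = "C" + "DIAG" + "C" + "GRKFARSDERKR" + "H" + "TKI" + "H"
toy_protein = "MA" + finger_1 + "TGEKP" + finger_2 + "LR"

for start, segment in find_c2h2_fingers(toy_protein):
    print(start, segment)
```

The point is only that once the spacing of the key residues is known from the structure, candidate DNA-binding proteins can be flagged from sequence alone.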
Well, I think I gave you a little taste of the level of complexity that's probably going to be needed to be able to build the machine that's ultimately going to be able to allow you to transcribe every gene in every cell of a human body. So, that turns out to be a much more elaborated machine than what I just showed you. So, I wanna show you now what is sort of our state-of-the-art thinking about what is actually needed to build the machinery at a gene to allow it to be expressed and transcribed. And the term I want to introduce you to is the pre-initiation complex. And it's pretty much what it says. It's the complex of multiple subunits that has to essentially land on the promoter of a gene which will be designated for later expression. And this is a process that is probably quite orderly, that is there's an order of events that happens, which we, by the way, are not entirely sure exactly what the order is or even if the order is the same from one gene to the next, but we can kind of see where it starts and where it ends up. And the pathway in between, I would say is still a little bit murky. And the story here again starts with a little snippet of DNA called the TATA box, which I already introduced you to briefly. It's an AT-rich sequence which sits at the five prime end or the beginning of many genes, but not all genes, maybe 20% of the genes might contain this AT-rich region. And that AT sequence is the signal or a landmark, if you like, for a particular protein to bind to it. And that protein is called, not surprisingly, the TATA-binding protein 'cause it's the TATA sequence. And so, this represents a second class of transcription factors. These are not the type that I just introduced you to, which are gonna be different for every gene, the TATA sequence is present in a very large number of genes, so it can't be gene specific, but it turns out to be very crucial for our understanding of how gene regulation works. 
So, you start with a TATA-binding protein finding a TATA box. We later found out that the TATA-binding protein rarely functions on its own and has a bunch of friends that we call TAFs, or TBP-associated factors. And now you're talking about a multi-subunit complex of almost a million daltons. There are somewhere between 12 and 15 subunits in addition to the TATA-binding protein that make up this little complex of proteins that kind of travels around together. And this is found in most cell types, and later on I'll show you in a subsequent lecture that not every cell type might have exactly the same complement of these subunits, but many of them have this prototypic complex. Is this enough for building the pre-initiation complex? Unfortunately not. It turns out that there are a host of other, I'll call them ancillary factors, in addition to the multi-subunit RNA polymerase itself, that are necessary for you to build up an ensemble that forms an active, ready-to-activate transcriptional pre-initiation complex, or the PIC. And this is kind of the picture we're getting to, and even this picture with many, many colors and many, many different polypeptides, you know, that adds up to probably greater than 85 individual proteins that all have to kind of fit together like a jigsaw puzzle. It's probably not even the whole story, you'll notice I still have one big red question mark there because I think as we begin to study specific cell types and specific processes like embryonic development or germ layer formation, additional components that are not present here in this prototypic pre-initiation complex will come into play, and that's the subject of a subsequent lecture. But already you can tell that the transcriptional machinery is anything but simple. So, can we get a better idea of what transcription might actually look like and what's happening when a transcription process takes place?
So, let me first of all say that I'm gonna finish my lecture now with a little cartoon, which is our attempt to imagine the events that take place when you form a pre-initiation complex, you bring regulatory proteins to the activated gene, and what happens during this process. Now, keep in mind that this is at this point mostly a cartoon that is in our imagination and probably only parts of this, if any, are real, but it gives you a sense of the complexity of the transactions that have to take place just for one gene to be transcribed and expressed. So, let me show you the movie, and then we'll finish just by keeping in mind that there's much to be learned. And in my next lecture, we'll go into the selectivity of this process in specialized cell types. So, now let's see what this cartoon of transcription looks like. So, we start off with DNA with a preassembled TFIID molecule, and along comes this other green molecule, which is actually a co-factor, which then forms this very large complex with RNA polymerase. And then a distal activator protein comes in and activates the process. And this molecule, this bluish molecule that's moved away from the complex, is actually the RNA polymerase. And that little yellow sort of bead on a string is actually the RNA product. So, that gives you a sense of how things have to happen quickly and yet involve many, many molecules having to assemble and then disassemble for this reaction to happen. And in my next lecture, we'll go into more specific aspects of this reaction, and particularly during embryonic development and tissue-specific gene expression.