WEBVTT 00:00:00.240 --> 00:00:03.720 - My name is Bob Tjian, I'm a Professor of MCB 00:00:03.720 --> 00:00:06.000 at the University of California at Berkeley, 00:00:06.000 --> 00:00:10.020 and I'm also serving as the President of the Howard Hughes. 00:00:10.020 --> 00:00:13.440 I'm going to spend the next 25 or 30 minutes 00:00:13.440 --> 00:00:15.870 telling you about some fundamentals 00:00:15.870 --> 00:00:19.620 of one of the most important molecular processes 00:00:19.620 --> 00:00:23.490 in living cells, which is the expression of genes 00:00:23.490 --> 00:00:25.593 through a process called transcription. 00:00:26.610 --> 00:00:31.610 Now, first, to understand what gene expression means, 00:00:32.190 --> 00:00:36.360 you have to have a sense of what we tend to refer to 00:00:36.360 --> 00:00:39.900 in the field as a central dogma of molecular biology. 00:00:39.900 --> 00:00:41.280 Another way to think about this 00:00:41.280 --> 00:00:45.877 is the flow of biological information from DNA, 00:00:46.950 --> 00:00:48.480 in other words, our chromosomes, 00:00:48.480 --> 00:00:50.523 which every cell has its complement, 00:00:52.020 --> 00:00:56.730 to be transcribed into a sister molecule called RNA. 00:00:56.730 --> 00:00:59.940 So, this process of converting DNA into RNA 00:00:59.940 --> 00:01:01.800 is called transcription, 00:01:01.800 --> 00:01:05.370 and that is the topic of this lecture. 00:01:05.370 --> 00:01:08.760 This process is very complicated, 00:01:08.760 --> 00:01:11.640 as you'll see by the end of my two lectures, 00:01:11.640 --> 00:01:13.260 and it is very important 00:01:13.260 --> 00:01:18.210 for many, many fundamental processes in biology. 00:01:18.210 --> 00:01:20.910 So, what I'm gonna spend today's lecture on 00:01:20.910 --> 00:01:24.840 is the discovery of a large family 00:01:24.840 --> 00:01:27.030 of transcription proteins. 
00:01:27.030 --> 00:01:31.380 These are factors we call them that are key molecules 00:01:31.380 --> 00:01:35.370 that regulate the use of genetic information 00:01:35.370 --> 00:01:37.863 that has been encoded in the genome. 00:01:38.730 --> 00:01:41.670 Now, transcription factors or proteins 00:01:41.670 --> 00:01:46.050 are involved in many fundamental aspects of biology, 00:01:46.050 --> 00:01:48.270 including embryonic development, 00:01:48.270 --> 00:01:51.660 cellular differentiation, and cell fate. 00:01:51.660 --> 00:01:55.140 In other words, pretty much what your cells are doing, 00:01:55.140 --> 00:01:56.400 how a tissue works, 00:01:56.400 --> 00:01:59.970 and how an organism survives and reproduces 00:01:59.970 --> 00:02:03.480 is dependent on the process of gene expression. 00:02:03.480 --> 00:02:06.873 And the first step in this process is transcription. 00:02:09.150 --> 00:02:11.760 Now, there are many other reasons 00:02:11.760 --> 00:02:14.940 why a large group of people and scientists 00:02:14.940 --> 00:02:16.410 are interested in transcription, 00:02:16.410 --> 00:02:19.350 and another reason is that understanding 00:02:19.350 --> 00:02:21.750 the fundamental molecular mechanisms 00:02:21.750 --> 00:02:24.990 that controls transcription in humans 00:02:24.990 --> 00:02:26.560 or in any other organism 00:02:27.420 --> 00:02:31.800 can inform us and teach us about what happens 00:02:31.800 --> 00:02:34.860 when something goes wrong, for example, in diseases. 00:02:34.860 --> 00:02:38.100 And I list here just a few diseases 00:02:38.100 --> 00:02:40.740 that we could study as a result 00:02:40.740 --> 00:02:43.380 of understanding the structure and function 00:02:43.380 --> 00:02:45.900 of these transcription factor proteins 00:02:45.900 --> 00:02:48.240 that I'm going to be telling you about. 
00:02:48.240 --> 00:02:50.550 And of course, the hope is that in understanding 00:02:50.550 --> 00:02:54.780 the molecular underpinnings of complex diseases like cancer, 00:02:54.780 --> 00:02:57.900 diabetes, Parkinson's, and so forth, 00:02:57.900 --> 00:03:01.120 that we will be able to develop and use 00:03:02.130 --> 00:03:05.040 better, more specific therapeutic drugs 00:03:05.040 --> 00:03:07.890 and also to develop more accurate 00:03:07.890 --> 00:03:10.200 and rapid diagnostic tools. 00:03:10.200 --> 00:03:11.820 So, those are a couple of the reasons 00:03:11.820 --> 00:03:14.070 why many of us have spent, 00:03:14.070 --> 00:03:15.960 in my case, over 30 years 00:03:15.960 --> 00:03:19.173 studying this process of transcriptional regulation. 00:03:20.520 --> 00:03:23.010 Now, to get the whole thing started, 00:03:23.010 --> 00:03:24.360 I have to give you a sense 00:03:24.360 --> 00:03:27.180 of what the magnitude of the problem is. 00:03:27.180 --> 00:03:29.910 So, imagine that one would really like to understand 00:03:29.910 --> 00:03:34.770 how this process of decoding the genome happens in humans. 00:03:34.770 --> 00:03:35.970 So, as you may know, 00:03:35.970 --> 00:03:39.120 the human genome has some 3 billion base pairs 00:03:39.120 --> 00:03:41.400 or bits of genetic information, 00:03:41.400 --> 00:03:45.000 and that encodes roughly 22,000 genes. 00:03:45.000 --> 00:03:48.120 These are stretches of DNA sequence 00:03:48.120 --> 00:03:51.390 that encode ultimately a product 00:03:51.390 --> 00:03:55.650 that is a protein which actually makes the cells function. 00:03:55.650 --> 00:03:57.810 So, as I already explained to you, 00:03:57.810 --> 00:04:00.720 there's this flow of biological information 00:04:00.720 --> 00:04:04.140 where you have to extract the information buried in DNA, 00:04:04.140 --> 00:04:05.460 convert it into RNA. 
00:04:05.460 --> 00:04:07.650 And what I'm not gonna tell you about today 00:04:07.650 --> 00:04:10.140 is the process of going from RNA to protein, 00:04:10.140 --> 00:04:13.950 which is a reaction called a translational reaction. 00:04:13.950 --> 00:04:17.040 I'm going to instead just focus on the first step 00:04:17.040 --> 00:04:18.720 of converting DNA into RNA, 00:04:18.720 --> 00:04:20.620 which is the process of transcription. 00:04:22.980 --> 00:04:26.850 Now, one of the most amazing results 00:04:26.850 --> 00:04:29.160 that we got over the last decade or so 00:04:29.160 --> 00:04:31.890 was when the human genome was entirely sequenced, 00:04:31.890 --> 00:04:34.530 the first few that were sequenced, 00:04:34.530 --> 00:04:38.130 we realized that actually the number of genes in humans 00:04:38.130 --> 00:04:42.420 is not vastly different from many other organisms, 00:04:42.420 --> 00:04:45.690 even simple organisms like little worms 00:04:45.690 --> 00:04:47.400 or fruit flies and so forth. 00:04:47.400 --> 00:04:51.210 That is roughly 22 to 25,000 genes 00:04:51.210 --> 00:04:53.010 is all the number of genes 00:04:53.010 --> 00:04:55.590 that all of these different organisms have. 00:04:55.590 --> 00:04:58.140 And yet, anybody looking at us 00:04:58.140 --> 00:05:02.370 versus a little roundworm in the soil or a fruit fly 00:05:02.370 --> 00:05:04.830 can tell that we're a much more complex organism 00:05:04.830 --> 00:05:06.780 with a much bigger brain, 00:05:06.780 --> 00:05:09.810 much more complex behavior, and so forth. 00:05:09.810 --> 00:05:11.060 So, how does this happen? 00:05:12.090 --> 00:05:16.200 Part of the answer to this very interesting mystery 00:05:16.200 --> 00:05:20.010 or paradox lies in the way that genes are organized 00:05:20.010 --> 00:05:21.690 and how they're regulated. 
00:05:21.690 --> 00:05:23.520 And one of the most striking results 00:05:23.520 --> 00:05:26.310 of the genome sequencing project was to realize 00:05:26.310 --> 00:05:31.200 that a vast, vast majority of the DNA in our chromosomes 00:05:31.200 --> 00:05:35.070 is actually not coding for specific gene products, 00:05:35.070 --> 00:05:39.930 and that only roughly 3% of the DNA is actually encoding 00:05:39.930 --> 00:05:41.850 let's call those little arrows 00:05:41.850 --> 00:05:44.670 that I show you on this purple DNA 00:05:44.670 --> 00:05:46.350 are the gene coding regions. 00:05:46.350 --> 00:05:50.160 So, you'll notice that there's a lot of non-arrow sequences, 00:05:50.160 --> 00:05:52.560 which I'll show you in this next slide as green. 00:05:52.560 --> 00:05:54.870 These are non-coding regions. 00:05:54.870 --> 00:05:58.800 So, the vast majority, 97% or greater is non-coding, 00:05:58.800 --> 00:06:02.550 so what are these other sequences doing? 00:06:02.550 --> 00:06:05.790 And of course, it turns out that these sequences 00:06:05.790 --> 00:06:10.020 carry very important little fragments of DNA, 00:06:10.020 --> 00:06:12.420 which we call regulatory sequences. 00:06:12.420 --> 00:06:15.450 And these are the sequences that actually control 00:06:15.450 --> 00:06:19.020 whether a gene gets turned on or not. 00:06:19.020 --> 00:06:21.630 And I'll be spending much of the next 20 minutes 00:06:21.630 --> 00:06:24.390 telling you about how this process all works 00:06:24.390 --> 00:06:28.500 and what these little bits of DNA sequences 00:06:28.500 --> 00:06:32.283 actually function to control gene expression. 
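NOTE A minimal sketch of the arithmetic behind the figures quoted above (roughly 3 billion base pairs, roughly 22,000 genes, roughly 3% coding). The constant and function names are illustrative assumptions, and the inputs are the lecture's rounded numbers rather than precise genome statistics.

```python
# Rounded figures quoted in the lecture; treat them as approximations.
GENOME_BP = 3_000_000_000   # ~3 billion base pairs
GENE_COUNT = 22_000         # ~22,000 genes
CODING_PERCENT = 3          # ~3% of the DNA is coding


def coding_summary(genome_bp, gene_count, coding_percent):
    """Return (coding bp, non-coding bp, average coding bp per gene)."""
    coding_bp = genome_bp * coding_percent // 100   # integer math avoids float rounding
    noncoding_bp = genome_bp - coding_bp
    avg_coding_bp_per_gene = coding_bp / gene_count
    return coding_bp, noncoding_bp, avg_coding_bp_per_gene


coding_bp, noncoding_bp, avg_per_gene = coding_summary(
    GENOME_BP, GENE_COUNT, CODING_PERCENT
)
print(f"coding:     {coding_bp:,} bp")
print(f"non-coding: {noncoding_bp:,} bp")
print(f"average coding sequence per gene: about {avg_per_gene:,.0f} bp")
```

With these rounded inputs, the non-coding portion outweighs the coding portion by more than thirty to one, which is the scale of the search problem the regulatory sequences have to solve.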
00:06:34.350 --> 00:06:37.290 Now, the other thing that I have to bring you up to date on 00:06:37.290 --> 00:06:40.980 is this mysterious process we're calling transcription, 00:06:40.980 --> 00:06:43.710 which reads double-stranded DNA 00:06:43.710 --> 00:06:45.390 and then makes a related molecule, 00:06:45.390 --> 00:06:47.550 which is a single-stranded RNA molecule, 00:06:47.550 --> 00:06:49.740 which is a informational molecule. 00:06:49.740 --> 00:06:54.300 That reaction is catalyzed by a very complex 00:06:54.300 --> 00:06:59.010 multi-subunit enzyme called RNA polymerase II. 00:06:59.010 --> 00:07:01.290 Now, there's the roman numeral II at the end of this 00:07:01.290 --> 00:07:05.733 because there were actually three enzymes in most mammals, 00:07:06.690 --> 00:07:09.270 at least three enzymes that carry out different processes 00:07:09.270 --> 00:07:11.610 and different types of RNA production. 00:07:11.610 --> 00:07:12.630 But I'm only gonna tell you 00:07:12.630 --> 00:07:16.380 about the ones that make the classical messenger RNA, 00:07:16.380 --> 00:07:19.350 which then ultimately becomes proteins. 00:07:19.350 --> 00:07:22.170 So, now one of the things that we learned 00:07:22.170 --> 00:07:25.860 early on in the study of mammalian 00:07:25.860 --> 00:07:30.090 or other multicellular organism transcription processes 00:07:30.090 --> 00:07:32.640 is that despite the fact that this enzyme 00:07:32.640 --> 00:07:34.893 is quite complex in its structure, 00:07:35.880 --> 00:07:39.480 it turns out to be an enzyme that's nevertheless 00:07:39.480 --> 00:07:42.450 needs a lot of help to do its job. 00:07:42.450 --> 00:07:45.090 So, on its own, this RNA polymerase II 00:07:45.090 --> 00:07:48.960 cannot tell the difference between the non-coding regions 00:07:48.960 --> 00:07:52.470 of the genome and places where it's supposed to be coding 00:07:52.470 --> 00:07:56.940 or reading to make the appropriate messenger RNAs. 
00:07:56.940 --> 00:08:00.480 So, this sort of leads you to think that there must be 00:08:00.480 --> 00:08:04.950 a number of other factors that somehow direct 00:08:04.950 --> 00:08:08.490 RNA polymerase to the right place at the right time 00:08:08.490 --> 00:08:11.190 in the genome of every cell in your body 00:08:11.190 --> 00:08:14.010 so that the right products get made 00:08:14.010 --> 00:08:17.403 so each cell in your body is functioning properly. 00:08:18.480 --> 00:08:21.540 And this is where things get really interesting 00:08:21.540 --> 00:08:25.170 is some 25, 30 years ago, 00:08:25.170 --> 00:08:29.310 a number of laboratories took on the job 00:08:29.310 --> 00:08:32.640 of hunting for these elusive and, as it turned out, 00:08:32.640 --> 00:08:35.340 specialized protein factors 00:08:35.340 --> 00:08:39.390 that recognize these little stretches of DNA sequences 00:08:39.390 --> 00:08:40.410 that I've been telling you about 00:08:40.410 --> 00:08:42.210 that make up the vast majority 00:08:42.210 --> 00:08:45.420 of the non-coding part of the genome. 00:08:45.420 --> 00:08:48.480 And how these proteins then can recognize 00:08:48.480 --> 00:08:51.150 and ultimately physically interact 00:08:51.150 --> 00:08:53.910 with these little bits of genetic information 00:08:53.910 --> 00:08:57.660 to then turn genes on or off. 00:08:57.660 --> 00:09:00.060 Now, in this lecture, 00:09:00.060 --> 00:09:04.320 I can't go into all the details of the types of experiments 00:09:04.320 --> 00:09:07.680 or the ranges of experiments that many, many laboratories 00:09:07.680 --> 00:09:09.330 have done over the last two decades 00:09:09.330 --> 00:09:12.510 to finally work out this molecular puzzle 00:09:12.510 --> 00:09:14.730 of how transcription works. 
00:09:14.730 --> 00:09:17.100 But I can tell you that there are fundamentally 00:09:17.100 --> 00:09:19.320 two major approaches that have been taken 00:09:19.320 --> 00:09:24.320 over the last few decades to kind of get a parts list 00:09:24.510 --> 00:09:27.090 of the machinery that decodes the genome 00:09:27.090 --> 00:09:29.640 and carries out the process of transcription. 00:09:29.640 --> 00:09:32.430 One is kind of the old style, 00:09:32.430 --> 00:09:34.950 I'll call it bucket biochemistry 00:09:34.950 --> 00:09:38.820 or take a live cell, crush it up, 00:09:38.820 --> 00:09:41.640 spread out all of its parts and then try to figure out 00:09:41.640 --> 00:09:43.050 how to put it back together again, 00:09:43.050 --> 00:09:45.540 that's what I call in vitro biochemistry. 00:09:45.540 --> 00:09:47.910 And the other one is in vivo genetics 00:09:47.910 --> 00:09:50.880 where you effectively use genetic tools, 00:09:50.880 --> 00:09:54.570 mutagenesis to go in there and selectively remove 00:09:54.570 --> 00:09:59.100 or knock down or knock out certain genes and gene products 00:09:59.100 --> 00:10:01.620 and then ask what is the consequence on that cell 00:10:01.620 --> 00:10:02.970 or that organism? 00:10:02.970 --> 00:10:07.970 Both of these technologies are very powerful 00:10:08.250 --> 00:10:12.513 and highly complementary, and they continue to be used. 00:10:13.410 --> 00:10:16.410 Today, I will focus primarily 00:10:16.410 --> 00:10:18.510 on the in vitro biochemical techniques 00:10:18.510 --> 00:10:22.740 which led us to the discovery of the first few classes 00:10:22.740 --> 00:10:24.450 of transcription factors. 00:10:24.450 --> 00:10:25.800 And in subsequent lectures, 00:10:25.800 --> 00:10:29.580 we'll go to more recent technologies 00:10:29.580 --> 00:10:32.760 that allows us to sort of speed up this whole process 00:10:32.760 --> 00:10:35.370 of identifying key regulatory molecules 00:10:35.370 --> 00:10:37.323 and how they work. 
00:10:38.790 --> 00:10:42.690 So, let's go back to the sort of the basic unit 00:10:42.690 --> 00:10:45.060 of gene expression, which is a gene, 00:10:45.060 --> 00:10:48.720 here shown in the orange arrow, 00:10:48.720 --> 00:10:51.780 and the non-coding sequences surrounding it. 00:10:51.780 --> 00:10:54.510 And you'll see that now I've added a few more elements 00:10:54.510 --> 00:10:56.100 to this purple DNA. 00:10:56.100 --> 00:10:59.160 You see some symbols, a blue square, 00:10:59.160 --> 00:11:02.580 a round circle that's pink, and then a yellow triangle. 00:11:02.580 --> 00:11:06.897 Those are just a way for me to graphically represent 00:11:06.897 --> 00:11:08.970 the little bits of DNA sequences 00:11:08.970 --> 00:11:11.250 that I told you about that are the regulatory sequences. 00:11:11.250 --> 00:11:14.730 So, the little round one happens to be very GC-rich, 00:11:14.730 --> 00:11:17.970 the triangle one is a classical element 00:11:17.970 --> 00:11:19.200 that's called a TATA box, 00:11:19.200 --> 00:11:20.610 I'll tell you about a little bit later. 00:11:20.610 --> 00:11:23.550 And the blue one is yet another recognition element. 00:11:23.550 --> 00:11:26.520 So, why are we so interested in these little stretches 00:11:26.520 --> 00:11:29.760 of nucleic acid sequence in the genome 00:11:29.760 --> 00:11:33.240 when it's buried amongst billions of other sequences? 00:11:33.240 --> 00:11:35.580 Well, these individual little sequences 00:11:35.580 --> 00:11:38.730 turn out to be very important because of where they sit, 00:11:38.730 --> 00:11:42.180 you'll notice they're sitting near the top of the arrow, 00:11:42.180 --> 00:11:45.930 and they are recognized by very special proteins 00:11:45.930 --> 00:11:48.300 which are the transcription factors. 
00:11:48.300 --> 00:11:50.640 So, now I'm showing you some symbols 00:11:50.640 --> 00:11:53.880 with little cutouts which fit into either the square, 00:11:53.880 --> 00:11:56.040 the circle, or the triangle. 00:11:56.040 --> 00:11:58.560 So, transcription factors, 00:11:58.560 --> 00:12:03.150 at least one major family of transcription factors, 00:12:03.150 --> 00:12:06.660 are proteins whose three-dimensional structure 00:12:06.660 --> 00:12:10.110 is folded into a shape that allows them to recognize 00:12:10.110 --> 00:12:12.663 these short stretches of double-stranded DNA. 00:12:13.530 --> 00:12:15.900 In fact, largely through interactions 00:12:15.900 --> 00:12:17.400 with the major groove of DNA, 00:12:17.400 --> 00:12:20.050 and I'll show you a structure of one in a little bit. 00:12:21.330 --> 00:12:24.210 So, now it turns out that there are probably 00:12:24.210 --> 00:12:26.610 thousands of these transcription factors 00:12:26.610 --> 00:12:29.010 because the number of genes that we have to control, 00:12:29.010 --> 00:12:33.480 as I showed you, is in the order of 20 or 25,000 genes. 00:12:33.480 --> 00:12:37.200 And so, it turns out that you need a pretty large percentage 00:12:37.200 --> 00:12:40.890 of the genome devoted to encoding these regulatory proteins 00:12:40.890 --> 00:12:45.240 in order for a complex organism like ourselves to survive. 00:12:45.240 --> 00:12:47.250 Then the other component of this, 00:12:47.250 --> 00:12:49.500 let's call it the transcriptional apparatus, 00:12:49.500 --> 00:12:51.900 is, of course, the enzyme that catalyzes RNA. 00:12:51.900 --> 00:12:56.490 And I already told you that this enzyme on its own 00:12:56.490 --> 00:12:59.370 can't tell the difference between random DNA sequence 00:12:59.370 --> 00:13:01.440 and a gene or a promoter. 
00:13:01.440 --> 00:13:05.520 These other sequence-specific DNA-binding proteins 00:13:05.520 --> 00:13:08.550 are the ones that must recruit 00:13:08.550 --> 00:13:11.130 or otherwise direct RNA polymerase 00:13:11.130 --> 00:13:15.630 to essentially land on the right place and at the right time 00:13:15.630 --> 00:13:19.290 in the genome to turn on a certain subset of genes 00:13:19.290 --> 00:13:23.640 that are specifically required in a specialized cell type, 00:13:23.640 --> 00:13:26.460 whatever cell you happen to be looking at. 00:13:26.460 --> 00:13:30.090 So, that is kind of the first level of complexity 00:13:30.090 --> 00:13:32.910 of sort of informational interactions 00:13:32.910 --> 00:13:35.250 between the transcription factors 00:13:35.250 --> 00:13:38.250 and the more ubiquitous, 00:13:38.250 --> 00:13:41.853 and I would call it promiscuous RNA polymerase II enzyme. 00:13:43.560 --> 00:13:44.880 Well, as it turns out, 00:13:44.880 --> 00:13:49.020 it took several decades to work out 00:13:49.020 --> 00:13:52.770 most if not all of the components 00:13:52.770 --> 00:13:56.070 of this so-called transcriptional machinery. 00:13:56.070 --> 00:14:00.810 And it turns out in this slide I'm showing you 00:14:00.810 --> 00:14:03.030 things are already starting to get more complicated. 00:14:03.030 --> 00:14:04.740 So, not only do you have RNA polymerase, 00:14:04.740 --> 00:14:07.800 but you have a bunch of other proteins that go by names 00:14:07.800 --> 00:14:10.680 like TFIIA, B, 00:14:10.680 --> 00:14:13.350 you know, D, E, H, F, and so forth. 00:14:13.350 --> 00:14:16.440 So, it looks like there are going to be many, 00:14:16.440 --> 00:14:18.270 many proteins that are necessary 00:14:18.270 --> 00:14:21.930 to form the transcriptional apparatus. 
00:14:21.930 --> 00:14:23.490 And then on top of that 00:14:23.490 --> 00:14:26.370 you need sequence-specific DNA-binding proteins 00:14:26.370 --> 00:14:28.560 which I already described to you 00:14:28.560 --> 00:14:32.730 to further inform or otherwise regulate the process 00:14:32.730 --> 00:14:35.580 of when a particular RNA polymerase molecule 00:14:35.580 --> 00:14:37.860 should be binding to a particular gene. 00:14:37.860 --> 00:14:40.110 So, that's the sort of overview, 00:14:40.110 --> 00:14:41.580 now let me get into the specifics 00:14:41.580 --> 00:14:45.480 and how did we actually discover these family of proteins. 00:14:45.480 --> 00:14:47.280 And it'll be interesting for you to see 00:14:47.280 --> 00:14:51.360 how science in this field evolved. 00:14:51.360 --> 00:14:54.210 Now, as is often the case 00:14:54.210 --> 00:14:56.850 when you first try to tackle a very complex problem, 00:14:56.850 --> 00:14:59.460 and, of course, we didn't really know how complex it was 00:14:59.460 --> 00:15:00.780 when we began these studies, 00:15:00.780 --> 00:15:03.480 but we assumed it might be complicated, 00:15:03.480 --> 00:15:06.660 certainly would be more complicated than systems 00:15:06.660 --> 00:15:09.150 that we had already had some idea about, 00:15:09.150 --> 00:15:13.710 for example, in bacteria or in bacteriophages. 
00:15:13.710 --> 00:15:17.010 We took a lesson from our studies of bacteriophages 00:15:17.010 --> 00:15:20.430 and decided that to begin to dissect 00:15:20.430 --> 00:15:22.080 the molecular complexities 00:15:22.080 --> 00:15:24.750 of the transcription process in animal cells, 00:15:24.750 --> 00:15:26.850 we should start with viruses 00:15:26.850 --> 00:15:30.840 because we knew that viruses will enter these host cells, 00:15:30.840 --> 00:15:33.990 these complex cells that we ultimately want to study 00:15:33.990 --> 00:15:36.480 and have to use the same molecular machinery 00:15:36.480 --> 00:15:38.580 to transcribe their genes 00:15:38.580 --> 00:15:41.640 as the host mammalian cell would do. 00:15:41.640 --> 00:15:43.890 So, this was kind of a trick 00:15:43.890 --> 00:15:46.980 or a way to look at a molecular window 00:15:46.980 --> 00:15:49.710 into a complex system and try to simplify it. 00:15:49.710 --> 00:15:51.060 And in our case, 00:15:51.060 --> 00:15:54.510 the early studies of the late '70s and early '80s 00:15:54.510 --> 00:15:55.920 involved very simple, 00:15:55.920 --> 00:15:58.590 one of these simplest double-stranded DNA viruses 00:15:58.590 --> 00:16:00.840 called Simian virus 40. 00:16:00.840 --> 00:16:03.330 And Simian virus 40, of course, is a monkey virus, 00:16:03.330 --> 00:16:06.450 which was nice because it's very close to humans 00:16:06.450 --> 00:16:07.890 and many things that we could learn 00:16:07.890 --> 00:16:10.770 about the way this virus uses its host, 00:16:10.770 --> 00:16:13.140 which are monkey cells, to replicate 00:16:13.140 --> 00:16:16.440 and to express their RNAs and genes 00:16:16.440 --> 00:16:20.370 would be applicable to our studies of humans, as you'll see. 00:16:20.370 --> 00:16:23.190 And this virus was one of the first 00:16:23.190 --> 00:16:27.930 whose DNA, its double-stranded DNA of about 5,000 base pairs 00:16:27.930 --> 00:16:29.310 was fully sequenced. 
00:16:29.310 --> 00:16:32.670 This was long before rapid modern-day sequencing 00:16:32.670 --> 00:16:35.580 was available, so this gave us a very powerful tool. 00:16:35.580 --> 00:16:38.340 It basically allowed us to look at the entire genome 00:16:38.340 --> 00:16:41.220 of this virus, which was tiny by comparison, 00:16:41.220 --> 00:16:44.880 only 5,243 base pairs. 00:16:44.880 --> 00:16:47.790 But just that information was already very important 00:16:47.790 --> 00:16:49.650 'cause it very quickly allowed us, 00:16:49.650 --> 00:16:52.920 for example, to map where the genes are. 00:16:52.920 --> 00:16:56.190 And one of the genes encoded a protein 00:16:56.190 --> 00:16:57.540 called a tumor antigen, 00:16:57.540 --> 00:17:00.210 which turns out to be a transcription factor. 00:17:00.210 --> 00:17:03.120 This then allowed us to get our hands 00:17:03.120 --> 00:17:06.030 basically to do biochemistry and genetics 00:17:06.030 --> 00:17:09.390 on the very first eukaryotic transcription factor, 00:17:09.390 --> 00:17:12.210 which in this case happens to be a repressor. 00:17:12.210 --> 00:17:15.210 That is a protein that when it binds the DNA 00:17:15.210 --> 00:17:19.173 just the same way as I showed you for the model case, 00:17:20.160 --> 00:17:23.910 it binds through specific protein DNA interactions. 00:17:23.910 --> 00:17:26.610 But in this case, actually shuts transcription down 00:17:26.610 --> 00:17:27.903 rather than turn it up. 00:17:29.580 --> 00:17:33.930 In the process of studying the way that this little virus 00:17:33.930 --> 00:17:36.480 when it infects a mammalian cell 00:17:36.480 --> 00:17:39.180 uses proteins like T-antigen 00:17:39.180 --> 00:17:42.540 to regulate its gene expression, 00:17:42.540 --> 00:17:45.990 it became clear that it had to use the host machinery 00:17:45.990 --> 00:17:48.180 to do the process. 
00:17:48.180 --> 00:17:52.710 And that meant that there must be monkey proteins 00:17:52.710 --> 00:17:55.050 that are also involved in activating 00:17:55.050 --> 00:17:57.540 or repressing genes of this virus. 00:17:57.540 --> 00:18:00.510 And this then led us to the most important step, 00:18:00.510 --> 00:18:04.170 which is to transfer the technology we learned about viruses 00:18:04.170 --> 00:18:06.630 and how to work with the virus transcription factor 00:18:06.630 --> 00:18:09.390 like T-antigen to the cellular ones. 00:18:09.390 --> 00:18:11.220 And I'm gonna give you just one example 00:18:11.220 --> 00:18:13.920 of how the simple jump into the host cell 00:18:13.920 --> 00:18:17.820 allowed us to discover the first human transcription factor. 00:18:17.820 --> 00:18:21.180 So, the question that we then asked 00:18:21.180 --> 00:18:25.620 back in the early 1980s was what host molecule 00:18:25.620 --> 00:18:29.310 is regulating the expression of transcription of this virus 00:18:29.310 --> 00:18:31.260 when the virus is in the host? 00:18:31.260 --> 00:18:33.960 And we knew from the DNA sequence of the virus 00:18:33.960 --> 00:18:38.960 that there were these six very GC-rich snippets of DNA 00:18:39.780 --> 00:18:42.240 that were regulatory 'cause if we deleted them, 00:18:42.240 --> 00:18:45.570 the virus no longer would express the gene of interest. 00:18:45.570 --> 00:18:48.330 So, we knew that something was probably responsible 00:18:48.330 --> 00:18:51.300 for recognizing these GC boxes, 00:18:51.300 --> 00:18:53.970 and we knew that it wasn't a virally encoded gene 00:18:53.970 --> 00:18:56.610 because we had tested all of the viral genes 00:18:56.610 --> 00:18:58.950 of which there were only six to begin with. 
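NOTE A minimal sketch of how one might look for GC boxes in a DNA sequence. The lecture does not spell out the sequence; the core motif GGGCGG used here is the commonly cited GC-box consensus and is an assumption on the editor's part, as are the function names and the toy sequence (which is made up and is not the SV40 promoter).

```python
GC_BOX = "GGGCGG"  # commonly cited GC-box core (assumption; not stated in the lecture)

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string made of A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_gc_boxes(seq, motif=GC_BOX):
    """Return (position, strand) for every exact motif match on either strand.

    Positions are 0-based offsets on the given (top) strand.
    """
    seq = seq.upper()
    motif_rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == motif_rc:
            hits.append((i, "-"))
    return hits


# Made-up toy sequence with one GC box on each strand.
toy_sequence = "ATAT" + "GGGCGG" + "TTAA" + "CCGCCC" + "ATAT"
print(find_gc_boxes(toy_sequence))
```

Deleting the matched stretches from the toy sequence and re-running the scan returns an empty list, which mirrors the deletion experiments described above only in the loosest sense: the code finds motifs, it says nothing about whether they are functional.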
00:18:58.950 --> 00:19:00.960 So, we knew it had to be a host gene 00:19:00.960 --> 00:19:04.740 and that led us to a whole, I would say, 00:19:04.740 --> 00:19:07.740 family of experiments that led to the discovery 00:19:07.740 --> 00:19:10.860 of sequence-specific mammalian transcription factors. 00:19:10.860 --> 00:19:13.860 And as I said, we could have taken multiple approaches 00:19:13.860 --> 00:19:16.830 to try to address this complicated issue. 00:19:16.830 --> 00:19:18.510 I'll just give you one example 00:19:18.510 --> 00:19:20.940 of using in vitro biochemistry 00:19:20.940 --> 00:19:24.780 to finally get our hands on this key sequence 00:19:24.780 --> 00:19:27.300 specific human transcription factor, 00:19:27.300 --> 00:19:30.993 which, of course, has a homologue in the monkey. 00:19:31.950 --> 00:19:35.190 And the way we did it was very interesting 00:19:35.190 --> 00:19:36.930 and simple in retrospect, 00:19:36.930 --> 00:19:39.030 and that is recognizing the fact 00:19:39.030 --> 00:19:41.760 that whatever this protein was, 00:19:41.760 --> 00:19:44.760 it had to have the property of recognizing 00:19:44.760 --> 00:19:49.650 those GC boxes that were sitting next to the viral gene. 00:19:49.650 --> 00:19:52.200 We assume that it must be a sequence-specific 00:19:52.200 --> 00:19:54.060 DNA-binding protein, so all we had to do 00:19:54.060 --> 00:19:57.510 was figure out a way to extract proteins 00:19:57.510 --> 00:20:01.110 from human cells or monkey cells 00:20:01.110 --> 00:20:04.770 and then try to fish out those specific proteins 00:20:04.770 --> 00:20:06.660 out of the many thousands of different proteins 00:20:06.660 --> 00:20:09.720 that were in this gemisch of cellular extract 00:20:09.720 --> 00:20:12.300 that would be responsible for discriminating 00:20:12.300 --> 00:20:17.100 between random DNA sequences and the specific GC box. 00:20:17.100 --> 00:20:20.580 And I'll quickly run through sort of the logic behind this. 
00:20:20.580 --> 00:20:25.440 So, what I'm showing you here is a solid surface 00:20:25.440 --> 00:20:28.830 with DNA coupled to it that is highly enriched 00:20:28.830 --> 00:20:31.440 for the recognition element, the GC box, 00:20:31.440 --> 00:20:33.450 which should be the sequence 00:20:33.450 --> 00:20:35.370 recognized by the protein of interest. 00:20:35.370 --> 00:20:37.440 Now, we had no idea what this protein was gonna look like, 00:20:37.440 --> 00:20:39.510 how many proteins there were gonna be, and so forth, 00:20:39.510 --> 00:20:42.030 but we knew it had to recognize the GC box. 00:20:42.030 --> 00:20:45.090 So, we're gonna try to fish this out of a pool 00:20:45.090 --> 00:20:47.460 of many thousands of other proteins. 00:20:47.460 --> 00:20:49.500 Now, the key trick here 00:20:49.500 --> 00:20:52.380 was that because all cell extracts 00:20:52.380 --> 00:20:54.870 contain not only one DNA binding protein, 00:20:54.870 --> 00:20:56.820 but, as I told you, thousands of different 00:20:56.820 --> 00:20:58.410 DNA binding proteins. 00:20:58.410 --> 00:21:00.870 But most of them, or in fact in our case, 00:21:00.870 --> 00:21:04.950 none of the other of several hundred to a thousand proteins 00:21:04.950 --> 00:21:08.790 that could bind DNA actually happen to recognize the GC box, 00:21:08.790 --> 00:21:11.190 they just bind other DNA sequences. 00:21:11.190 --> 00:21:14.070 So, to kind of favor our protein 00:21:14.070 --> 00:21:16.080 being able to bind to our GC box 00:21:16.080 --> 00:21:18.780 and not have to compete with all the other proteins, 00:21:18.780 --> 00:21:22.920 what we did was to add non-specific DNA 00:21:22.920 --> 00:21:26.700 in vast stoichiometric excess 00:21:26.700 --> 00:21:29.580 so that all the other proteins that wouldn't recognize 00:21:29.580 --> 00:21:33.270 the GC box would still have some partner to hang onto. 00:21:33.270 --> 00:21:34.920 And this trick worked very well. 
00:21:34.920 --> 00:21:39.840 So, having the specific DNA on the solid resin 00:21:39.840 --> 00:21:43.710 and the non-specific DNA flowing all over the place, 00:21:43.710 --> 00:21:47.850 we could capture selectively the pink molecules here, 00:21:47.850 --> 00:21:50.160 which are the GC box recognition ones, 00:21:50.160 --> 00:21:52.560 and the blue-green molecules, 00:21:52.560 --> 00:21:56.040 of course, predominantly bind to non-specific DNA. 00:21:56.040 --> 00:21:58.470 I show you one little blue one on the column 00:21:58.470 --> 00:22:01.290 because nothing works perfectly in real science 00:22:01.290 --> 00:22:03.540 and tells you that we have to go through this process 00:22:03.540 --> 00:22:07.650 iteratively to actually finally obtain a preparation 00:22:07.650 --> 00:22:11.520 that's purely pink molecules with no green-blue ones. 00:22:11.520 --> 00:22:14.340 Well, that turned out to work very, very well. 00:22:14.340 --> 00:22:18.030 And that whole process of biochemical fractionation 00:22:18.030 --> 00:22:23.030 followed by a direct affinity sequence-specific DNA resin 00:22:23.730 --> 00:22:28.110 gave us the ability to perform a biochemical purification 00:22:28.110 --> 00:22:31.620 followed by a molecular cloning of the transcription factor 00:22:31.620 --> 00:22:35.040 that encodes the protein SP1. 00:22:35.040 --> 00:22:37.230 And then we carried out a bunch of experiments, 00:22:37.230 --> 00:22:38.580 which I'll tell you next, 00:22:38.580 --> 00:22:40.140 to show that this protein 00:22:40.140 --> 00:22:42.483 actually does activate transcription. 00:22:43.530 --> 00:22:46.380 And of course, we went back and we proved that this protein, 00:22:46.380 --> 00:22:49.170 which turned out to be a rather large polypeptide, 00:22:49.170 --> 00:22:52.020 can indeed recognize the GC box. 
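NOTE A toy model of why the affinity step described above is repeated. Every number here is an illustrative assumption, not data from the lecture: the sketch supposes that each pass over the GC-box resin retains 90% of the specific protein but only 1% of the other DNA-binding proteins (the rest stay with the non-specific competitor DNA), and that the extract starts with one specific molecule per 10,000 others.

```python
def purity_after_passes(specific, nonspecific, passes,
                        keep_specific=0.9, keep_nonspecific=0.01):
    """Return the fraction of specific protein after each pass over the column.

    keep_specific and keep_nonspecific are hypothetical per-pass retention
    fractions chosen only to illustrate iterative enrichment.
    """
    purities = []
    for _ in range(passes):
        specific *= keep_specific        # most of the GC-box binder stays on the resin
        nonspecific *= keep_nonspecific  # a little contaminant still sticks each time
        purities.append(specific / (specific + nonspecific))
    return purities


# Hypothetical starting extract: 1 specific molecule per 10,000 others.
for n, purity in enumerate(purity_after_passes(1.0, 10_000.0, passes=3), start=1):
    print(f"after pass {n}: {purity:.1%} specific")
```

Under these assumed retention values a single pass leaves the preparation mostly contaminant, while a few passes make it nearly pure, at the cost of losing some of the specific protein each time.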
00:22:52.020 --> 00:22:55.500 And it doesn't matter if it's a GC box from the SV40 genome 00:22:55.500 --> 00:22:59.460 or any other GC box that we could find in the human genome, 00:22:59.460 --> 00:23:02.370 it would find that sequence and bind to it 00:23:02.370 --> 00:23:05.670 and then it would generally activate transcription. 00:23:05.670 --> 00:23:08.280 So, this led to the discovery of the first 00:23:08.280 --> 00:23:10.830 of a very large family 00:23:10.830 --> 00:23:13.680 of sequence-specific DNA-binding proteins. 00:23:13.680 --> 00:23:15.960 Now, I told you that the way these proteins 00:23:15.960 --> 00:23:19.200 tend to recognize short DNA sequences 00:23:19.200 --> 00:23:21.900 is to interact with DNA through the major groove. 00:23:21.900 --> 00:23:23.220 And here's a perfect example. 00:23:23.220 --> 00:23:25.230 So, the thick blue model there 00:23:25.230 --> 00:23:28.590 shows the actual three structures 00:23:28.590 --> 00:23:29.910 that are called zinc fingers. 00:23:29.910 --> 00:23:31.470 And the reason they're called zinc fingers 00:23:31.470 --> 00:23:34.770 is because there are amino acids that are organized 00:23:34.770 --> 00:23:37.860 around a center that contains a zinc molecule 00:23:37.860 --> 00:23:41.190 which holds the three-dimensional shape of the polypeptide 00:23:41.190 --> 00:23:43.560 in a position just right 00:23:43.560 --> 00:23:45.570 for fitting into the major groove of the DNA. 00:23:45.570 --> 00:23:47.700 And the DNA here is shown in pink, 00:23:47.700 --> 00:23:50.160 and you can see that that blue outline 00:23:50.160 --> 00:23:52.710 fits right into the major groove of the DNA, 00:23:52.710 --> 00:23:54.690 but not to the minor groove.
00:23:54.690 --> 00:23:57.540 And one of the most important findings 00:23:57.540 --> 00:23:58.830 was not only the discovery 00:23:58.830 --> 00:24:00.750 of the first human transcription factor, 00:24:00.750 --> 00:24:04.890 but the realization that most if not all sequence-specific 00:24:04.890 --> 00:24:06.570 DNA-binding transcription factors 00:24:06.570 --> 00:24:09.090 have a similar structural motif. 00:24:09.090 --> 00:24:13.170 That is to say some structure is built to recognize 00:24:13.170 --> 00:24:15.750 sequences in the major groove of DNA. 00:24:15.750 --> 00:24:19.170 And these three-dimensional motifs 00:24:19.170 --> 00:24:23.760 are recognizable as amino acid sequences in the genome. 00:24:23.760 --> 00:24:27.810 So, we can now much more quickly scan the entire sequence 00:24:27.810 --> 00:24:29.760 of a genome and identify genes 00:24:29.760 --> 00:24:31.920 that are likely to be DNA-binding proteins 00:24:31.920 --> 00:24:34.860 as a result of understanding the structure-function 00:24:34.860 --> 00:24:38.853 relationships of these DNA-binding motifs like zinc fingers. 00:24:39.960 --> 00:24:42.690 So, what I'd like to show you now 00:24:42.690 --> 00:24:46.260 is that I've only introduced you to one class 00:24:46.260 --> 00:24:48.210 of transcription factors, 00:24:48.210 --> 00:24:51.210 which are the sequence-specific DNA-binding proteins. 00:24:51.210 --> 00:24:53.910 Well, I think I gave you a little taste 00:24:53.910 --> 00:24:55.320 of the level of complexity 00:24:55.320 --> 00:24:57.060 that's probably going to be needed 00:24:57.060 --> 00:24:59.580 to be able to build the machine 00:24:59.580 --> 00:25:02.940 that's ultimately going to be able to allow you 00:25:02.940 --> 00:25:07.050 to transcribe every gene in every cell of a human body. 00:25:07.050 --> 00:25:10.286 So, that turns out to be a much more elaborated machine 00:25:10.286 --> 00:25:12.090 than what I just showed you.
00:25:12.090 --> 00:25:14.400 So, I wanna show you now 00:25:14.400 --> 00:25:16.830 what is sort of our state-of-the-art thinking 00:25:16.830 --> 00:25:20.580 about what is actually needed to build the machinery 00:25:20.580 --> 00:25:25.380 at a gene to allow it to be expressed and transcribed. 00:25:25.380 --> 00:25:27.930 And the term I want to introduce you to 00:25:27.930 --> 00:25:30.870 is the pre-initiation complex. 00:25:30.870 --> 00:25:33.360 And it's pretty much what it says. 00:25:33.360 --> 00:25:35.850 It's the complex of multiple subunits 00:25:35.850 --> 00:25:40.850 that has to essentially land on the promoter of a gene 00:25:40.860 --> 00:25:44.433 which will be designated for later expression. 00:25:45.390 --> 00:25:49.980 And this is a process that is probably quite orderly, 00:25:49.980 --> 00:25:52.410 that is there's an order of events that happens, 00:25:52.410 --> 00:25:55.080 which we, by the way, are not entirely sure 00:25:55.080 --> 00:25:57.330 exactly what the order is or even if the order 00:25:57.330 --> 00:25:59.310 is the same from one gene to the next, 00:25:59.310 --> 00:26:02.070 but we can kind of see where it starts and where it ends up. 00:26:02.070 --> 00:26:03.690 And the pathway in between, 00:26:03.690 --> 00:26:06.540 I would say is still a little bit murky. 00:26:06.540 --> 00:26:10.290 And the story here again starts with a little snippet of DNA 00:26:10.290 --> 00:26:11.220 called the TATA box, 00:26:11.220 --> 00:26:13.560 which I already introduced you to briefly. 00:26:13.560 --> 00:26:18.560 It's an AT-rich sequence which sits at the five prime end 00:26:18.660 --> 00:26:20.970 or the beginning of many genes, but not all genes, 00:26:20.970 --> 00:26:25.383 maybe 20% of the genes might contain this AT-rich region. 00:26:26.460 --> 00:26:29.850 And that AT sequence is the signal 00:26:29.850 --> 00:26:31.500 or a landmark, if you like, 00:26:31.500 --> 00:26:33.930 for a particular protein to bind to it. 
00:26:33.930 --> 00:26:35.850 And that protein is called, 00:26:35.850 --> 00:26:38.190 not surprisingly, the TATA-binding protein 00:26:38.190 --> 00:26:40.290 'cause it's the TATA sequence. 00:26:40.290 --> 00:26:43.620 And so, this represents a second class 00:26:43.620 --> 00:26:45.210 of transcription factors. 00:26:45.210 --> 00:26:48.240 These are not the type that I just introduced you to, 00:26:48.240 --> 00:26:50.640 which are gonna be different for every gene, 00:26:50.640 --> 00:26:52.200 the TATA sequence is present 00:26:52.200 --> 00:26:54.120 in a very large number of genes, 00:26:54.120 --> 00:26:56.700 so it can't be gene specific, 00:26:56.700 --> 00:26:58.680 but it turns out to be very crucial 00:26:58.680 --> 00:27:02.130 for our understanding of how gene regulation works. 00:27:02.130 --> 00:27:06.300 So, you start with a TATA-binding protein 00:27:06.300 --> 00:27:07.950 finding a TATA box. 00:27:07.950 --> 00:27:10.440 We later found out that the TATA-binding protein 00:27:10.440 --> 00:27:13.920 rarely functions on its own and has a bunch of friends 00:27:13.920 --> 00:27:17.010 that we call TAFs or TBP associated factors. 00:27:17.010 --> 00:27:19.260 And now you're talking about an assembly 00:27:19.260 --> 00:27:23.340 of multi-subunit complex of almost a million daltons. 00:27:23.340 --> 00:27:26.070 There are somewhere between 12 to 15 subunits 00:27:26.070 --> 00:27:27.780 in addition to the TATA-binding protein 00:27:27.780 --> 00:27:30.930 that make up this little complex of proteins 00:27:30.930 --> 00:27:33.390 that kind of travels around together. 00:27:33.390 --> 00:27:35.520 And this is found in most cell types, 00:27:35.520 --> 00:27:38.760 and later on I'll show you in a subsequent lecture 00:27:38.760 --> 00:27:40.590 that not every cell type 00:27:40.590 --> 00:27:43.530 might have exactly the same complement of these subunits, 00:27:43.530 --> 00:27:47.850 but many of them have this prototypic complex.
00:27:47.850 --> 00:27:51.960 Is this enough for building the pre-initiation complex? 00:27:51.960 --> 00:27:54.120 Unfortunately not. 00:27:54.120 --> 00:27:57.630 It turns out that there are a host of other, 00:27:57.630 --> 00:28:00.030 I'll call them ancillary factors 00:28:00.030 --> 00:28:03.330 in addition to the multi-subunit RNA polymerase itself 00:28:03.330 --> 00:28:08.330 that are necessary for you to build up an ensemble 00:28:08.490 --> 00:28:12.180 that is necessary to form an active 00:28:12.180 --> 00:28:16.380 ready to activate transcriptional pre-initiation complex 00:28:16.380 --> 00:28:17.213 or the PIC. 00:28:19.890 --> 00:28:23.520 And this is kind of the picture we're getting to, 00:28:23.520 --> 00:28:26.160 and even this picture with many, many colors 00:28:26.160 --> 00:28:28.560 and many, many different polypeptides, 00:28:28.560 --> 00:28:30.330 you know, that adds up to probably greater 00:28:30.330 --> 00:28:33.660 than 85 individual proteins 00:28:33.660 --> 00:28:36.960 that all have to kind of fit together like a jigsaw puzzle. 00:28:36.960 --> 00:28:39.360 It's probably not even the whole story, 00:28:39.360 --> 00:28:42.390 you'll notice I still have one big red question mark there 00:28:42.390 --> 00:28:46.980 because I think as we begin to study specific cell types 00:28:46.980 --> 00:28:50.280 and specific processes like embryonic development 00:28:50.280 --> 00:28:52.890 or germ layer formation, 00:28:52.890 --> 00:28:55.770 additional components that are not present here 00:28:55.770 --> 00:28:58.484 in this prototypic pre-initiation complex 00:28:58.484 --> 00:29:00.210 will come into play, 00:29:00.210 --> 00:29:03.420 and that's a subject of subsequent lecture. 00:29:03.420 --> 00:29:06.360 But already you can tell that the transcriptional machinery 00:29:06.360 --> 00:29:08.673 is anything but simple. 
00:29:09.630 --> 00:29:12.720 So, can we get a better idea of what transcription 00:29:12.720 --> 00:29:16.440 might actually look like and what's happening 00:29:16.440 --> 00:29:18.360 when a transcription process takes place? 00:29:18.360 --> 00:29:21.720 So, let me first of all say that I'm gonna finish my lecture 00:29:21.720 --> 00:29:24.450 now with a little cartoon, 00:29:24.450 --> 00:29:29.160 which is our attempt to imagine 00:29:29.160 --> 00:29:30.990 the events that take place 00:29:30.990 --> 00:29:33.000 when you form a pre-initiation complex, 00:29:33.000 --> 00:29:37.410 you bring regulatory proteins to the activated gene 00:29:37.410 --> 00:29:40.020 and what happens during this process. 00:29:40.020 --> 00:29:43.530 Now, keep in mind that this is at this point 00:29:43.530 --> 00:29:47.700 mostly a cartoon that is in our imagination 00:29:47.700 --> 00:29:52.700 and only parts or if any of this is probably real, 00:29:52.890 --> 00:29:56.340 but it gives you a sense of the complexity 00:29:56.340 --> 00:29:58.800 of the transactions that have to take place 00:29:58.800 --> 00:30:02.040 just for one gene to transcribe and express itself. 00:30:02.040 --> 00:30:04.290 So, let me show you the movie, 00:30:04.290 --> 00:30:07.410 and then we'll finish just by keeping in mind 00:30:07.410 --> 00:30:09.960 that there's much to be learned. 00:30:09.960 --> 00:30:13.290 And in my next lecture, we'll go into the selectivity 00:30:13.290 --> 00:30:16.560 of this process in specialized cell types. 00:30:16.560 --> 00:30:20.280 So, now let's see what this sort of this cartoon 00:30:20.280 --> 00:30:22.140 of transcription looks like. 
00:30:22.140 --> 00:30:23.700 So, we start off with DNA 00:30:23.700 --> 00:30:26.790 with some preassembled TFIID molecule, 00:30:26.790 --> 00:30:28.800 and along comes this other green molecule, 00:30:28.800 --> 00:30:30.690 which is actually a co-factor, 00:30:30.690 --> 00:30:32.760 which then forms this very large complex 00:30:32.760 --> 00:30:34.170 with RNA polymerase. 00:30:34.170 --> 00:30:37.830 And then a distal activator protein came in 00:30:37.830 --> 00:30:39.270 and activated the process. 00:30:39.270 --> 00:30:44.270 And this molecule, this bluish molecule that's moved away 00:30:44.610 --> 00:30:48.060 from the complex is actually the RNA polymerase. 00:30:48.060 --> 00:30:51.810 And that little yellow sort of bead on a string 00:30:51.810 --> 00:30:53.640 is actually the RNA product. 00:30:53.640 --> 00:30:57.660 So, that gives you a sense of things have to happen quickly 00:30:57.660 --> 00:30:59.850 and yet it involves many, many molecules 00:30:59.850 --> 00:31:02.460 having to assemble and then disassemble 00:31:02.460 --> 00:31:04.170 to give you this reaction to happen. 00:31:04.170 --> 00:31:06.810 And in my next lecture, 00:31:06.810 --> 00:31:10.380 we'll go into more specific aspects of this reaction, 00:31:10.380 --> 00:31:13.470 and particularly during embryonic development 00:31:13.470 --> 00:31:16.203 and tissue-specific gene expression.