0:00:00.240,0:00:03.720 - My name is Bob Tjian,[br]I'm a Professor of MCB 0:00:03.720,0:00:06.000 at the University of[br]California at Berkeley, 0:00:06.000,0:00:10.020 and I'm also serving as the[br]President of the Howard Hughes. 0:00:10.020,0:00:13.440 I'm going to spend the[br]next 25 or 30 minutes 0:00:13.440,0:00:15.870 telling you about some fundamentals 0:00:15.870,0:00:19.620 of one of the most important[br]molecular processes 0:00:19.620,0:00:23.490 in living cells, which is[br]the expression of genes 0:00:23.490,0:00:25.593 through a process called transcription. 0:00:26.610,0:00:31.610 Now, first, to understand[br]what gene expression means, 0:00:32.190,0:00:36.360 you have to have a sense[br]of what we tend to refer to 0:00:36.360,0:00:39.900 in the field as a central[br]dogma of molecular biology. 0:00:39.900,0:00:41.280 Another way to think about this 0:00:41.280,0:00:45.877 is the flow of biological[br]information from DNA, 0:00:46.950,0:00:48.480 in other words, our chromosomes, 0:00:48.480,0:00:50.523 which every cell has its complement, 0:00:52.020,0:00:56.730 to be transcribed into a[br]sister molecule called RNA. 0:00:56.730,0:00:59.940 So, this process of[br]converting DNA into RNA 0:00:59.940,0:01:01.800 is called transcription, 0:01:01.800,0:01:05.370 and that is the topic of this lecture. 0:01:05.370,0:01:08.760 This process is very complicated, 0:01:08.760,0:01:11.640 as you'll see by the[br]end of my two lectures, 0:01:11.640,0:01:13.260 and it is very important 0:01:13.260,0:01:18.210 for many, many fundamental[br]processes in biology. 0:01:18.210,0:01:20.910 So, what I'm gonna[br]spend today's lecture on 0:01:20.910,0:01:24.840 is the discovery of a large family 0:01:24.840,0:01:27.030 of transcription proteins. 0:01:27.030,0:01:31.380 These are factors we call[br]them that are key molecules 0:01:31.380,0:01:35.370 that regulate the use[br]of genetic information 0:01:35.370,0:01:37.863 that has been encoded in the genome. 
0:01:38.730,0:01:41.670 Now, transcription factors or proteins 0:01:41.670,0:01:46.050 are involved in many[br]fundamental aspects of biology, 0:01:46.050,0:01:48.270 including embryonic development, 0:01:48.270,0:01:51.660 cellular differentiation, and cell fate. 0:01:51.660,0:01:55.140 In other words, pretty much[br]what your cells are doing, 0:01:55.140,0:01:56.400 how a tissue works, 0:01:56.400,0:01:59.970 and how an organism[br]survives and reproduces 0:01:59.970,0:02:03.480 is dependent on the[br]process of gene expression. 0:02:03.480,0:02:06.873 And the first step in this[br]process is transcription. 0:02:09.150,0:02:11.760 Now, there are many other reasons 0:02:11.760,0:02:14.940 why a large group of people and scientists 0:02:14.940,0:02:16.410 are interested in transcription, 0:02:16.410,0:02:19.350 and another reason is that understanding 0:02:19.350,0:02:21.750 the fundamental molecular mechanisms 0:02:21.750,0:02:24.990 that controls transcription in humans 0:02:24.990,0:02:26.560 or in any other organism 0:02:27.420,0:02:31.800 can inform us and teach[br]us about what happens 0:02:31.800,0:02:34.860 when something goes wrong,[br]for example, in diseases. 0:02:34.860,0:02:38.100 And I list here just a few diseases 0:02:38.100,0:02:40.740 that we could study as a result 0:02:40.740,0:02:43.380 of understanding the[br]structure and function 0:02:43.380,0:02:45.900 of these transcription factor proteins 0:02:45.900,0:02:48.240 that I'm going to be telling you about. 0:02:48.240,0:02:50.550 And of course, the hope[br]is that in understanding 0:02:50.550,0:02:54.780 the molecular underpinnings of[br]complex diseases like cancer, 0:02:54.780,0:02:57.900 diabetes, Parkinson's, and so forth, 0:02:57.900,0:03:01.120 that we will be able to develop and use 0:03:02.130,0:03:05.040 better, more specific therapeutic drugs 0:03:05.040,0:03:07.890 and also to develop more accurate 0:03:07.890,0:03:10.200 and rapid diagnostic tools. 
0:03:10.200,0:03:11.820 So, those are a couple of the reasons 0:03:11.820,0:03:14.070 why many of us have spent, 0:03:14.070,0:03:15.960 in my case, over 30 years 0:03:15.960,0:03:19.173 studying this process of[br]transcriptional regulation. 0:03:20.520,0:03:23.010 Now, to get the whole thing started, 0:03:23.010,0:03:24.360 I have to give you a sense 0:03:24.360,0:03:27.180 of what the magnitude of the problem is. 0:03:27.180,0:03:29.910 So, imagine that one would[br]really like to understand 0:03:29.910,0:03:34.770 how this process of decoding[br]the genome happens in humans. 0:03:34.770,0:03:35.970 So, as you may know, 0:03:35.970,0:03:39.120 the human genome has[br]some 3 billion base pairs 0:03:39.120,0:03:41.400 or bits of genetic information, 0:03:41.400,0:03:45.000 and that encodes roughly 22,000 genes. 0:03:45.000,0:03:48.120 These are stretches of DNA sequence 0:03:48.120,0:03:51.390 that encode ultimately a product 0:03:51.390,0:03:55.650 that is a protein which actually[br]makes the cells function. 0:03:55.650,0:03:57.810 So, as I already explained to you, 0:03:57.810,0:04:00.720 there's this flow of[br]biological information 0:04:00.720,0:04:04.140 where you have to extract the[br]information buried in DNA, 0:04:04.140,0:04:05.460 convert it into RNA. 0:04:05.460,0:04:07.650 And what I'm not gonna[br]tell you about today 0:04:07.650,0:04:10.140 is the process of going[br]from RNA to protein, 0:04:10.140,0:04:13.950 which is a reaction called[br]a translational reaction. 0:04:13.950,0:04:17.040 I'm going to instead just[br]focus on the first step 0:04:17.040,0:04:18.720 of converting DNA into RNA, 0:04:18.720,0:04:20.620 which is the process of transcription. 
0:04:22.980,0:04:26.850 Now, one of the most amazing results 0:04:26.850,0:04:29.160 that we got over the last decade or so 0:04:29.160,0:04:31.890 was when the human genome[br]was entirely sequenced, 0:04:31.890,0:04:34.530 the first few that were sequenced, 0:04:34.530,0:04:38.130 we realized that actually[br]the number of genes in humans 0:04:38.130,0:04:42.420 is not vastly different[br]from many other organisms, 0:04:42.420,0:04:45.690 even simple organisms like little worms 0:04:45.690,0:04:47.400 or fruit flies and so forth. 0:04:47.400,0:04:51.210 That is roughly 22 to 25,000 genes 0:04:51.210,0:04:53.010 is all the number of genes 0:04:53.010,0:04:55.590 that all of these[br]different organisms have. 0:04:55.590,0:04:58.140 And yet, anybody looking at us 0:04:58.140,0:05:02.370 versus a little roundworm[br]in the soil or a fruit fly 0:05:02.370,0:05:04.830 can tell that we're a[br]much more complex organism 0:05:04.830,0:05:06.780 with a much bigger brain, 0:05:06.780,0:05:09.810 much more complex behavior, and so forth. 0:05:09.810,0:05:11.060 So, how does this happen? 0:05:12.090,0:05:16.200 Part of the answer to this[br]very interesting mystery 0:05:16.200,0:05:20.010 or paradox lies in the way[br]that genes are organized 0:05:20.010,0:05:21.690 and how they're regulated. 0:05:21.690,0:05:23.520 And one of the most striking results 0:05:23.520,0:05:26.310 of the genome sequencing[br]project was to realize 0:05:26.310,0:05:31.200 that a vast, vast majority[br]of the DNA in our chromosomes 0:05:31.200,0:05:35.070 is actually not coding for[br]specific gene products, 0:05:35.070,0:05:39.930 and that only roughly 3% of[br]the DNA is actually encoding 0:05:39.930,0:05:41.850 let's call those little arrows 0:05:41.850,0:05:44.670 that I show you on this purple DNA 0:05:44.670,0:05:46.350 are the gene coding regions. 
0:05:46.350,0:05:50.160 So, you'll notice that there's[br]a lot of non-arrow sequences, 0:05:50.160,0:05:52.560 which I'll show you in[br]this next slide as green. 0:05:52.560,0:05:54.870 These are non-coding regions. 0:05:54.870,0:05:58.800 So, the vast majority, 97%[br]or greater is non-coding, 0:05:58.800,0:06:02.550 so what are these other sequences doing? 0:06:02.550,0:06:05.790 And of course, it turns[br]out that these sequences 0:06:05.790,0:06:10.020 carry very important[br]little fragments of DNA, 0:06:10.020,0:06:12.420 which we call regulatory sequences. 0:06:12.420,0:06:15.450 And these are the sequences[br]that actually control 0:06:15.450,0:06:19.020 whether a gene gets turned on or not. 0:06:19.020,0:06:21.630 And I'll be spending much[br]of the next 20 minutes 0:06:21.630,0:06:24.390 telling you about how[br]this process all works 0:06:24.390,0:06:28.500 and what these little[br]bits of DNA sequences 0:06:28.500,0:06:32.283 actually function to[br]control gene expression. 0:06:34.350,0:06:37.290 Now, the other thing that I[br]have to bring you up to date on 0:06:37.290,0:06:40.980 is this mysterious process[br]we're calling transcription, 0:06:40.980,0:06:43.710 which reads double-stranded DNA 0:06:43.710,0:06:45.390 and then makes a related molecule, 0:06:45.390,0:06:47.550 which is a single-stranded RNA molecule, 0:06:47.550,0:06:49.740 which is a informational molecule. 0:06:49.740,0:06:54.300 That reaction is catalyzed[br]by a very complex 0:06:54.300,0:06:59.010 multi-subunit enzyme[br]called RNA polymerase II. 0:06:59.010,0:07:01.290 Now, there's the roman[br]numeral II at the end of this 0:07:01.290,0:07:05.733 because there were actually[br]three enzymes in most mammals, 0:07:06.690,0:07:09.270 at least three enzymes that[br]carry out different processes 0:07:09.270,0:07:11.610 and different types of RNA production. 
0:07:11.610,0:07:12.630 But I'm only gonna tell you 0:07:12.630,0:07:16.380 about the ones that make[br]the classical messenger RNA, 0:07:16.380,0:07:19.350 which then ultimately becomes proteins. 0:07:19.350,0:07:22.170 So, now one of the things that we learned 0:07:22.170,0:07:25.860 early on in the study of mammalian 0:07:25.860,0:07:30.090 or other multicellular organism[br]transcription processes 0:07:30.090,0:07:32.640 is that despite the fact that this enzyme 0:07:32.640,0:07:34.893 is quite complex in its structure, 0:07:35.880,0:07:39.480 it turns out to be an[br]enzyme that's nevertheless 0:07:39.480,0:07:42.450 needs a lot of help to do its job. 0:07:42.450,0:07:45.090 So, on its own, this RNA polymerase II 0:07:45.090,0:07:48.960 cannot tell the difference[br]between the non-coding regions 0:07:48.960,0:07:52.470 of the genome and places where[br]it's supposed to be coding 0:07:52.470,0:07:56.940 or reading to make the[br]appropriate messenger RNAs. 0:07:56.940,0:08:00.480 So, this sort of leads you[br]to think that there must be 0:08:00.480,0:08:04.950 a number of other factors[br]that somehow direct 0:08:04.950,0:08:08.490 RNA polymerase to the right[br]place at the right time 0:08:08.490,0:08:11.190 in the genome of every cell in your body 0:08:11.190,0:08:14.010 so that the right products get made 0:08:14.010,0:08:17.403 so each cell in your body[br]is functioning properly. 0:08:18.480,0:08:21.540 And this is where things[br]get really interesting 0:08:21.540,0:08:25.170 is some 25, 30 years ago, 0:08:25.170,0:08:29.310 a number of laboratories took on the job 0:08:29.310,0:08:32.640 of hunting for these elusive[br]and, as it turned out, 0:08:32.640,0:08:35.340 a specialized protein factors 0:08:35.340,0:08:39.390 that recognize these little[br]stretches of DNA sequences 0:08:39.390,0:08:40.410 that I've been telling you about 0:08:40.410,0:08:42.210 that make up the vast majority 0:08:42.210,0:08:45.420 of the non-coding part of the genome. 
0:08:45.420,0:08:48.480 And how these proteins then can recognize 0:08:48.480,0:08:51.150 and ultimately physically interact 0:08:51.150,0:08:53.910 with these little bits[br]of genetic information 0:08:53.910,0:08:57.660 to then turn genes on or off. 0:08:57.660,0:09:00.060 Now, in this lecture, 0:09:00.060,0:09:04.320 I can't go into all the details[br]of the types of experiments 0:09:04.320,0:09:07.680 or the ranges of experiments[br]that many, many laboratories 0:09:07.680,0:09:09.330 have done over the last two decades 0:09:09.330,0:09:12.510 to finally work out this molecular puzzle 0:09:12.510,0:09:14.730 of how transcription works. 0:09:14.730,0:09:17.100 But I can tell you that[br]there are fundamentally 0:09:17.100,0:09:19.320 two major approaches that have been taken 0:09:19.320,0:09:24.320 over the last few decades[br]to kind of get a parts list 0:09:24.510,0:09:27.090 of the machinery that decodes the genome 0:09:27.090,0:09:29.640 and carries out the[br]process of transcription. 0:09:29.640,0:09:32.430 One is kind of the old style, 0:09:32.430,0:09:34.950 I'll call it bucket biochemistry 0:09:34.950,0:09:38.820 or take a live cell, crush it up, 0:09:38.820,0:09:41.640 spread out all of its parts[br]and then try to figure out 0:09:41.640,0:09:43.050 how to put it back together again, 0:09:43.050,0:09:45.540 that's what I call in vitro biochemistry. 0:09:45.540,0:09:47.910 And the other one is in vivo genetics 0:09:47.910,0:09:50.880 where you effectively use genetic tools, 0:09:50.880,0:09:54.570 mutagenesis to go in there[br]and selectively remove 0:09:54.570,0:09:59.100 or knock down or knock out[br]certain genes and gene products 0:09:59.100,0:10:01.620 and then ask what is the[br]consequence on that cell 0:10:01.620,0:10:02.970 or that organism? 0:10:02.970,0:10:07.970 Both of these technologies[br]are very powerful 0:10:08.250,0:10:12.513 and highly complementary,[br]and they continue to be used. 
0:10:13.410,0:10:16.410 Today, I will focus primarily 0:10:16.410,0:10:18.510 on the in vitro biochemical techniques 0:10:18.510,0:10:22.740 which led us to the discovery[br]of the first few classes 0:10:22.740,0:10:24.450 of transcription factors. 0:10:24.450,0:10:25.800 And in subsequent lectures, 0:10:25.800,0:10:29.580 we'll go to more recent technologies 0:10:29.580,0:10:32.760 that allows us to sort of[br]speed up this whole process 0:10:32.760,0:10:35.370 of identifying key regulatory molecules 0:10:35.370,0:10:37.323 and how they work. 0:10:38.790,0:10:42.690 So, let's go back to the[br]sort of the basic unit 0:10:42.690,0:10:45.060 of gene expression, which is a gene, 0:10:45.060,0:10:48.720 here shown in the orange arrow, 0:10:48.720,0:10:51.780 and the non-coding[br]sequences surrounding it. 0:10:51.780,0:10:54.510 And you'll see that now I've[br]added a few more elements 0:10:54.510,0:10:56.100 to this purple DNA. 0:10:56.100,0:10:59.160 You see some symbols, a blue square, 0:10:59.160,0:11:02.580 a round circle that's pink,[br]and then a yellow triangle. 0:11:02.580,0:11:06.897 Those are just a way for[br]me to graphically represent 0:11:06.897,0:11:08.970 the little bits of DNA sequences 0:11:08.970,0:11:11.250 that I told you about that[br]are the regulatory sequences. 0:11:11.250,0:11:14.730 So, the little round one[br]happens to be very GC-rich, 0:11:14.730,0:11:17.970 the triangle one is a classical element 0:11:17.970,0:11:19.200 that's called a TATA box, 0:11:19.200,0:11:20.610 I'll tell you about a little bit later. 0:11:20.610,0:11:23.550 And the blue one is yet[br]another recognition element. 0:11:23.550,0:11:26.520 So, why are we so interested[br]in these little stretches 0:11:26.520,0:11:29.760 of nucleic acid sequence in the genome 0:11:29.760,0:11:33.240 when it's buried amongst[br]billions of other sequences? 
0:11:33.240,0:11:35.580 Well, these individual little sequences 0:11:35.580,0:11:38.730 turn out to be very important[br]because of where they sit, 0:11:38.730,0:11:42.180 you'll notice they're sitting[br]near the top of the arrow, 0:11:42.180,0:11:45.930 and they are recognized[br]by very special proteins 0:11:45.930,0:11:48.300 which are the transcription factors. 0:11:48.300,0:11:50.640 So, now I'm showing you some symbols 0:11:50.640,0:11:53.880 with little cutouts which[br]fit into either the square, 0:11:53.880,0:11:56.040 the circle, or the triangle. 0:11:56.040,0:11:58.560 So, transcription factors, 0:11:58.560,0:12:03.150 at least one major family[br]of transcription factors, 0:12:03.150,0:12:06.660 are proteins whose[br]three-dimensional structure 0:12:06.660,0:12:10.110 is folded into a shape that[br]allows them to recognize 0:12:10.110,0:12:12.663 these short stretches[br]of double-stranded DNA. 0:12:13.530,0:12:15.900 In fact, largely through interactions 0:12:15.900,0:12:17.400 with the major groove of DNA, 0:12:17.400,0:12:20.050 and I'll show you a structure[br]of one in a little bit. 0:12:21.330,0:12:24.210 So, now it turns out[br]that there are probably 0:12:24.210,0:12:26.610 thousands of these transcription factors 0:12:26.610,0:12:29.010 because the number of genes[br]that we have to control, 0:12:29.010,0:12:33.480 as I showed you, is in the[br]order of 20 or 25,000 genes. 0:12:33.480,0:12:37.200 And so, it turns out that you[br]need a pretty large percentage 0:12:37.200,0:12:40.890 of the genome devoted to encoding[br]these regulatory proteins 0:12:40.890,0:12:45.240 in order for a complex organism[br]like ourselves to survive. 0:12:45.240,0:12:47.250 Then the other component of this, 0:12:47.250,0:12:49.500 let's call it the[br]transcriptional apparatus, 0:12:49.500,0:12:51.900 is, of course, the enzyme[br]that catalyzes RNA. 
0:12:51.900,0:12:56.490 And I already told you[br]that this enzyme on its own 0:12:56.490,0:12:59.370 can't tell the difference[br]between random DNA sequence 0:12:59.370,0:13:01.440 and a gene or a promoter. 0:13:01.440,0:13:05.520 These other sequence-specific[br]DNA-binding proteins 0:13:05.520,0:13:08.550 are the ones that must recruit 0:13:08.550,0:13:11.130 or otherwise direct RNA polymerase 0:13:11.130,0:13:15.630 to essentially land on the right[br]place and at the right time 0:13:15.630,0:13:19.290 in the genome to turn on[br]a certain subset of genes 0:13:19.290,0:13:23.640 that are specifically required[br]in a specialized cell type, 0:13:23.640,0:13:26.460 whatever cell you happen to be looking at. 0:13:26.460,0:13:30.090 So, that is kind of the[br]first level of complexity 0:13:30.090,0:13:32.910 of sort of informational interactions 0:13:32.910,0:13:35.250 between the transcription factors 0:13:35.250,0:13:38.250 and the more ubiquitous, 0:13:38.250,0:13:41.853 and I would call it promiscuous[br]RNA polymerase II enzyme. 0:13:43.560,0:13:44.880 Well, as it turns out, 0:13:44.880,0:13:49.020 it took several decades to work out 0:13:49.020,0:13:52.770 most if not all of the components 0:13:52.770,0:13:56.070 of this so-called[br]transcriptional machinery. 0:13:56.070,0:14:00.810 And it turns out in this[br]slide I'm showing you 0:14:00.810,0:14:03.030 things are already starting[br]to get more complicated. 0:14:03.030,0:14:04.740 So, not only do you have RNA polymerase, 0:14:04.740,0:14:07.800 but you have a bunch of other[br]proteins that go by names 0:14:07.800,0:14:10.680 like TFIIA, B, 0:14:10.680,0:14:13.350 you know, D, E, H, F, and so forth. 0:14:13.350,0:14:16.440 So, it looks like there[br]are going to be many, 0:14:16.440,0:14:18.270 many proteins that are necessary 0:14:18.270,0:14:21.930 to form the transcriptional apparatus. 
0:14:21.930,0:14:23.490 And then on top of that 0:14:23.490,0:14:26.370 you need sequence-specific[br]DNA-binding proteins 0:14:26.370,0:14:28.560 which I already described to you 0:14:28.560,0:14:32.730 to further inform or[br]otherwise regulate the process 0:14:32.730,0:14:35.580 of when a particular[br]RNA polymerase molecule 0:14:35.580,0:14:37.860 should be binding to a particular gene. 0:14:37.860,0:14:40.110 So, that's the sort of overview, 0:14:40.110,0:14:41.580 now let me get into the specifics 0:14:41.580,0:14:45.480 and how did we actually discover[br]this family of proteins. 0:14:45.480,0:14:47.280 And it'll be interesting for you to see 0:14:47.280,0:14:51.360 how science in this field evolved. 0:14:51.360,0:14:54.210 Now, as is often the case 0:14:54.210,0:14:56.850 when you first try to tackle[br]a very complex problem, 0:14:56.850,0:14:59.460 and, of course, we didn't[br]really know how complex it was 0:14:59.460,0:15:00.780 when we began these studies, 0:15:00.780,0:15:03.480 but we assumed it might be complicated, 0:15:03.480,0:15:06.660 certainly would be more[br]complicated than systems 0:15:06.660,0:15:09.150 that we had already had some idea about, 0:15:09.150,0:15:13.710 for example, in bacteria[br]or in bacteriophages. 0:15:13.710,0:15:17.010 We took a lesson from our[br]studies of bacteriophages 0:15:17.010,0:15:20.430 and decided that to begin to dissect 0:15:20.430,0:15:22.080 the molecular complexities 0:15:22.080,0:15:24.750 of the transcription[br]process in animal cells, 0:15:24.750,0:15:26.850 we should start with viruses 0:15:26.850,0:15:30.840 because we knew that viruses[br]will enter these host cells, 0:15:30.840,0:15:33.990 these complex cells that[br]we ultimately want to study 0:15:33.990,0:15:36.480 and have to use the[br]same molecular machinery 0:15:36.480,0:15:38.580 to transcribe their genes 0:15:38.580,0:15:41.640 as the host mammalian cell would do. 
0:15:41.640,0:15:43.890 So, this was kind of a trick 0:15:43.890,0:15:46.980 or a way to look at a molecular window 0:15:46.980,0:15:49.710 into a complex system[br]and try to simplify it. 0:15:49.710,0:15:51.060 And in our case, 0:15:51.060,0:15:54.510 the early studies of the[br]late '70s and early '80s 0:15:54.510,0:15:55.920 involved very simple, 0:15:55.920,0:15:58.590 one of these simplest[br]double-stranded DNA viruses 0:15:58.590,0:16:00.840 called Simian virus 40. 0:16:00.840,0:16:03.330 And Simian virus 40, of[br]course, is a monkey virus, 0:16:03.330,0:16:06.450 which was nice because[br]it's very close to humans 0:16:06.450,0:16:07.890 and many things that we could learn 0:16:07.890,0:16:10.770 about the way this virus uses its host, 0:16:10.770,0:16:13.140 which are monkey cells, to replicate 0:16:13.140,0:16:16.440 and to express their RNAs and genes 0:16:16.440,0:16:20.370 would be applicable to our[br]studies of humans, as you'll see. 0:16:20.370,0:16:23.190 And this virus was one of the first 0:16:23.190,0:16:27.930 whose DNA, its double-stranded[br]DNA of about 5,000 base pairs 0:16:27.930,0:16:29.310 was fully sequenced. 0:16:29.310,0:16:32.670 This was long before a[br]rapid modern day sequencing 0:16:32.670,0:16:35.580 was available, so this gave[br]us a very powerful tool. 0:16:35.580,0:16:38.340 It basically allowed us to[br]look at the entire genome 0:16:38.340,0:16:41.220 of this virus, which[br]was tiny by comparison, 0:16:41.220,0:16:44.880 only 5,243 base pairs. 0:16:44.880,0:16:47.790 But just that information[br]was already very important 0:16:47.790,0:16:49.650 'cause it very quickly allowed us, 0:16:49.650,0:16:52.920 for example, to map where the genes are. 0:16:52.920,0:16:56.190 And one of the genes encoded a protein 0:16:56.190,0:16:57.540 called a tumor antigen, 0:16:57.540,0:17:00.210 which turns out to be[br]a transcription factor. 
0:17:00.210,0:17:03.120 This then allowed us to get our hands 0:17:03.120,0:17:06.030 basically to do biochemistry and genetics 0:17:06.030,0:17:09.390 on the very first eukaryotic[br]transcription factor, 0:17:09.390,0:17:12.210 which in this case[br]happens to be a repressor. 0:17:12.210,0:17:15.210 That is a protein that[br]when it binds the DNA 0:17:15.210,0:17:19.173 just the same way as I showed[br]you for the model case, 0:17:20.160,0:17:23.910 it binds through specific[br]protein DNA interactions. 0:17:23.910,0:17:26.610 But in this case, actually[br]shuts transcription down 0:17:26.610,0:17:27.903 rather than turn it up. 0:17:29.580,0:17:33.930 In the process of studying[br]the way that this little virus 0:17:33.930,0:17:36.480 when it infects a mammalian cell 0:17:36.480,0:17:39.180 uses proteins like T-antigen 0:17:39.180,0:17:42.540 to regulate its gene expression, 0:17:42.540,0:17:45.990 it became clear that it had[br]to use the host machinery 0:17:45.990,0:17:48.180 to do the process. 0:17:48.180,0:17:52.710 And that meant that there[br]must be monkey proteins 0:17:52.710,0:17:55.050 that are also involved in activating 0:17:55.050,0:17:57.540 or repressing genes of this virus. 0:17:57.540,0:18:00.510 And this then led us to[br]the most important step, 0:18:00.510,0:18:04.170 which is to transfer the[br]technology we learned about viruses 0:18:04.170,0:18:06.630 and how to work with the[br]virus transcription factor 0:18:06.630,0:18:09.390 like T-antigen to the cellular ones. 0:18:09.390,0:18:11.220 And I'm gonna give you just one example 0:18:11.220,0:18:13.920 of how the simple jump into the host cell 0:18:13.920,0:18:17.820 allowed us to discover the first[br]human transcription factor. 
0:18:17.820,0:18:21.180 So, the question that we then asked 0:18:21.180,0:18:25.620 back in the early 1980s[br]was what host molecule 0:18:25.620,0:18:29.310 is regulating the expression[br]of transcription of this virus 0:18:29.310,0:18:31.260 when the virus is in the host? 0:18:31.260,0:18:33.960 And we knew from the DNA[br]sequence of the virus 0:18:33.960,0:18:38.960 that there were these six[br]very GC-rich snippets of DNA 0:18:39.780,0:18:42.240 that were regulatory[br]'cause if we deleted them, 0:18:42.240,0:18:45.570 the virus no longer would[br]express the gene of interest. 0:18:45.570,0:18:48.330 So, we knew that something[br]was probably responsible 0:18:48.330,0:18:51.300 for recognizing these GC boxes, 0:18:51.300,0:18:53.970 and we knew that it wasn't[br]a virally encoded gene 0:18:53.970,0:18:56.610 because we had tested[br]all of the viral genes 0:18:56.610,0:18:58.950 of which there were[br]only six to begin with. 0:18:58.950,0:19:00.960 So, we knew it had to be a host gene 0:19:00.960,0:19:04.740 and that led us to a whole, I would say, 0:19:04.740,0:19:07.740 family of experiments[br]that led to the discovery 0:19:07.740,0:19:10.860 of sequence-specific mammalian[br]transcription factors. 0:19:10.860,0:19:13.860 And as I said, we could have[br]taken multiple approaches 0:19:13.860,0:19:16.830 to try to address this complicated issue. 0:19:16.830,0:19:18.510 I'll just give you one example 0:19:18.510,0:19:20.940 of using in vitro biochemistry 0:19:20.940,0:19:24.780 to finally get our hands[br]on this key sequence 0:19:24.780,0:19:27.300 specific human transcription factor, 0:19:27.300,0:19:30.993 which, of course, has a[br]homologue in the monkey. 
0:19:31.950,0:19:35.190 And the way we did it was very interesting 0:19:35.190,0:19:36.930 and simple in retrospect, 0:19:36.930,0:19:39.030 and that is recognizing the fact 0:19:39.030,0:19:41.760 that whatever this protein was, 0:19:41.760,0:19:44.760 it had to have the property of recognizing 0:19:44.760,0:19:49.650 those GC boxes that were sitting[br]next to the viral gene. 0:19:49.650,0:19:52.200 We assume that it must[br]be a sequence-specific 0:19:52.200,0:19:54.060 DNA-binding protein, so all we had to do 0:19:54.060,0:19:57.510 was figure out a way to extract proteins 0:19:57.510,0:20:01.110 from human cells or monkey cells 0:20:01.110,0:20:04.770 and then try to fish out[br]those specific proteins 0:20:04.770,0:20:06.660 out of the many thousands[br]of different proteins 0:20:06.660,0:20:09.720 that were in this gemisch[br]of cellular extract 0:20:09.720,0:20:12.300 that would be responsible[br]for discriminating 0:20:12.300,0:20:17.100 between random DNA sequences[br]and the specific GC box. 0:20:17.100,0:20:20.580 And I'll quickly run through[br]sort of the logic behind this. 0:20:20.580,0:20:25.440 So, what I'm showing you[br]here is a solid surface 0:20:25.440,0:20:28.830 with DNA coupled to it[br]that is highly enriched 0:20:28.830,0:20:31.440 for the recognition element, the GC box, 0:20:31.440,0:20:33.450 which should be the sequence 0:20:33.450,0:20:35.370 recognized by the protein of interest. 0:20:35.370,0:20:37.440 Now, we had no idea what this[br]protein was gonna look like, 0:20:37.440,0:20:39.510 how many proteins there[br]were gonna be, and so forth, 0:20:39.510,0:20:42.030 but we knew it had to[br]recognize the GC box. 0:20:42.030,0:20:45.090 So, we're gonna try to[br]fish this out of a pool 0:20:45.090,0:20:47.460 of many thousands of other proteins. 
0:20:47.460,0:20:49.500 Now, the key trick here 0:20:49.500,0:20:52.380 was that because all cell extracts 0:20:52.380,0:20:54.870 contain not only one DNA binding protein, 0:20:54.870,0:20:56.820 but, as I told you, thousands of different 0:20:56.820,0:20:58.410 DNA binding proteins. 0:20:58.410,0:21:00.870 But most of them, or in fact in our case, 0:21:00.870,0:21:04.950 none of the other of several[br]hundred to a thousand proteins 0:21:04.950,0:21:08.790 that could bind DNA actually[br]happen to recognize the GC box, 0:21:08.790,0:21:11.190 they just bind other DNA sequences. 0:21:11.190,0:21:14.070 So, to kind of favor our protein 0:21:14.070,0:21:16.080 being able to bind to our GC box 0:21:16.080,0:21:18.780 and not have to compete[br]with all the other proteins, 0:21:18.780,0:21:22.920 what we did was to add non-specific DNA 0:21:22.920,0:21:26.700 in vast stoichiometric excess 0:21:26.700,0:21:29.580 so that all the other proteins[br]that wouldn't recognize 0:21:29.580,0:21:33.270 the GC box would still have[br]some partner to hang onto. 0:21:33.270,0:21:34.920 And this trick worked very well. 0:21:34.920,0:21:39.840 So, having the specific[br]DNA on the solid resin 0:21:39.840,0:21:43.710 and the non-specific DNA[br]flowing all over the place, 0:21:43.710,0:21:47.850 we could capture selectively[br]the pink molecules here, 0:21:47.850,0:21:50.160 which are the GC box recognition ones, 0:21:50.160,0:21:52.560 and the blue-green molecules, 0:21:52.560,0:21:56.040 of course, predominantly[br]bind to non-specific DNA. 0:21:56.040,0:21:58.470 I show you one little[br]blue one on the column 0:21:58.470,0:22:01.290 because nothing works[br]perfectly in real science 0:22:01.290,0:22:03.540 and tells you that we have[br]to go through this process 0:22:03.540,0:22:07.650 iteratively to actually[br]finally obtain a preparation 0:22:07.650,0:22:11.520 that's purely pink molecules[br]with no green-blue ones. 
0:22:11.520,0:22:14.340 Well, that turned out[br]to work very, very well. 0:22:14.340,0:22:18.030 And that whole process of[br]biochemical fractionation 0:22:18.030,0:22:23.030 followed by a direct affinity[br]sequence-specific DNA resin 0:22:23.730,0:22:28.110 gave us the ability to perform[br]a biochemical purification 0:22:28.110,0:22:31.620 followed by a molecular cloning[br]of the transcription factor 0:22:31.620,0:22:35.040 that encodes the protein SP1. 0:22:35.040,0:22:37.230 And then we carried out[br]a bunch of experiments, 0:22:37.230,0:22:38.580 which I'll tell you next, 0:22:38.580,0:22:40.140 to show that this protein 0:22:40.140,0:22:42.483 actually does activate transcription. 0:22:43.530,0:22:46.380 And of course, we went back and[br]we proved that this protein, 0:22:46.380,0:22:49.170 which turned out to be a[br]rather large polypeptide, 0:22:49.170,0:22:52.020 can indeed recognize the GC box. 0:22:52.020,0:22:55.500 And it doesn't matter if it's[br]a GC box from the SV40 genome 0:22:55.500,0:22:59.460 or any other GC box that we[br]could find in the human genome, 0:22:59.460,0:23:02.370 it would find that sequence and bind to it 0:23:02.370,0:23:05.670 and then it would generally[br]activate transcription. 0:23:05.670,0:23:08.280 So, this led to the discovery of the first 0:23:08.280,0:23:10.830 of a very large family 0:23:10.830,0:23:13.680 of sequence-specific DNA-binding proteins. 0:23:13.680,0:23:15.960 Now, I told you that[br]the way these proteins 0:23:15.960,0:23:19.200 tend to recognize short DNA sequences 0:23:19.200,0:23:21.900 is to interact with DNA[br]through the major groove. 0:23:21.900,0:23:23.220 And here's a perfect example. 0:23:23.220,0:23:25.230 So, the thick blue model there 0:23:25.230,0:23:28.590 shows the actual three structures 0:23:28.590,0:23:29.910 that are called zinc fingers. 
0:23:29.910,0:23:31.470 And the reason they're called zinc fingers 0:23:31.470,0:23:34.770 is because there are amino[br]acids that are organized 0:23:34.770,0:23:37.860 around a center that[br]contains a zinc molecule 0:23:37.860,0:23:41.190 which holds the three-dimensional[br]shape of the polypeptide 0:23:41.190,0:23:43.560 in a position just right 0:23:43.560,0:23:45.570 for fitting into the[br]major groove of the DNA. 0:23:45.570,0:23:47.700 And the DNA here is shown in pink, 0:23:47.700,0:23:50.160 and you can see that that blue outline 0:23:50.160,0:23:52.710 fits right into the[br]major groove of the DNA, 0:23:52.710,0:23:54.690 but not to the minor groove. 0:23:54.690,0:23:57.540 And one of the most important findings 0:23:57.540,0:23:58.830 was not only the discovery 0:23:58.830,0:24:00.750 of the first human transcription factor, 0:24:00.750,0:24:04.890 but the realization that most[br]if not all sequence-specific 0:24:04.890,0:24:06.570 DNA-binding transcription factors 0:24:06.570,0:24:09.090 have a similar structural motif. 0:24:09.090,0:24:13.170 That is to say some structure[br]is built to recognize 0:24:13.170,0:24:15.750 sequences in the major groove of DNA. 0:24:15.750,0:24:19.170 And these three-dimensional motifs 0:24:19.170,0:24:23.760 are recognizable as amino[br]acid sequences in the genome. 0:24:23.760,0:24:27.810 So, we can now much more[br]quickly scan the entire sequence 0:24:27.810,0:24:29.760 of a genome and identify genes 0:24:29.760,0:24:31.920 that are likely to be DNA-binding proteins 0:24:31.920,0:24:34.860 as a result of understanding[br]the structure-function 0:24:34.860,0:24:38.853 relationships of these DNA-binding[br]motifs like zinc fingers. 0:24:39.960,0:24:42.690 So, what I'd like to show you now 0:24:42.690,0:24:46.260 is that I've only[br]introduced you to one class 0:24:46.260,0:24:48.210 of transcription factors, 0:24:48.210,0:24:51.210 which are the sequence-specific[br]DNA-binding proteins. 
0:24:51.210,0:24:53.910 Well, I think I gave you a little taste 0:24:53.910,0:24:55.320 of the level of complexity 0:24:55.320,0:24:57.060 that's probably going to be needed 0:24:57.060,0:24:59.580 to be able to build the machine 0:24:59.580,0:25:02.940 that's ultimately going[br]to be able to allow you 0:25:02.940,0:25:07.050 to transcribe every gene in[br]every cell of a human body. 0:25:07.050,0:25:10.286 So, that turns out to be a[br]much more elaborated machine 0:25:10.286,0:25:12.090 than what I just showed you. 0:25:12.090,0:25:14.400 So, I wanna show you now 0:25:14.400,0:25:16.830 what is sort of our[br]state-of-the-art thinking 0:25:16.830,0:25:20.580 about what is actually[br]needed to build the machinery 0:25:20.580,0:25:25.380 at a gene to allow it to be[br]expressed and transcribed. 0:25:25.380,0:25:27.930 And the term I want to introduce you to 0:25:27.930,0:25:30.870 is the pre-initiation complex. 0:25:30.870,0:25:33.360 And it's pretty much what it says. 0:25:33.360,0:25:35.850 It's the complex of multiple subunits 0:25:35.850,0:25:40.850 that has to essentially land[br]on the promoter of a gene 0:25:40.860,0:25:44.433 which will be designated[br]for later expression. 0:25:45.390,0:25:49.980 And this is a process that[br]is probably quite orderly, 0:25:49.980,0:25:52.410 that is there's an order[br]of events that happens, 0:25:52.410,0:25:55.080 which we, by the way,[br]are not entirely sure 0:25:55.080,0:25:57.330 exactly what the order[br]is or even if the order 0:25:57.330,0:25:59.310 is the same from one gene to the next, 0:25:59.310,0:26:02.070 but we can kind of see where[br]it starts and where it ends up. 0:26:02.070,0:26:03.690 And the pathway in between, 0:26:03.690,0:26:06.540 I would say is still a little bit murky. 0:26:06.540,0:26:10.290 And the story here again starts[br]with a little snippet of DNA 0:26:10.290,0:26:11.220 called the TATA box, 0:26:11.220,0:26:13.560 which I already introduced you to briefly. 
0:26:13.560,0:26:18.560 It's an AT-rich sequence which[br]sits at the five prime end 0:26:18.660,0:26:20.970 or the beginning of many[br]genes, but not all genes, 0:26:20.970,0:26:25.383 maybe 20% of the genes might[br]contain this AT-rich region. 0:26:26.460,0:26:29.850 And that AT sequence is the signal 0:26:29.850,0:26:31.500 or a landmark, if you like, 0:26:31.500,0:26:33.930 for a particular protein to bind to it. 0:26:33.930,0:26:35.850 And that protein is called, 0:26:35.850,0:26:38.190 not surprisingly, the TATA-binding protein 0:26:38.190,0:26:40.290 'cause it's the TATA sequence. 0:26:40.290,0:26:43.620 And so, this represents a second class 0:26:43.620,0:26:45.210 of transcription factors. 0:26:45.210,0:26:48.240 These are not the type that[br]I just introduced you to, 0:26:48.240,0:26:50.640 which are gonna be[br]different for every gene, 0:26:50.640,0:26:52.200 the TATA sequence is present 0:26:52.200,0:26:54.120 in a very large number of genes, 0:26:54.120,0:26:56.700 so it can't be gene specific, 0:26:56.700,0:26:58.680 but it turns out to be very crucial 0:26:58.680,0:27:02.130 for our understanding of[br]how gene regulation works. 0:27:02.130,0:27:06.300 So, you start with[br]a TATA-binding protein 0:27:06.300,0:27:07.950 finding a TATA box. 0:27:07.950,0:27:10.440 We later found out that[br]the TATA-binding protein 0:27:10.440,0:27:13.920 rarely functions on its own[br]and has a bunch of friends 0:27:13.920,0:27:17.010 that we call TAFs or[br]TBP-associated factors. 0:27:17.010,0:27:19.260 And now you're talking about an assembly 0:27:19.260,0:27:23.340 of multi-subunit complex of[br]almost a million daltons. 0:27:23.340,0:27:26.070 There are somewhere[br]between 12 to 15 subunits 0:27:26.070,0:27:27.780 in addition to the TATA-binding protein 0:27:27.780,0:27:30.930 that make up this little[br]complex of proteins 0:27:30.930,0:27:33.390 that kind of travels around together.
0:27:33.390,0:27:35.520 And this is found in most cell types, 0:27:35.520,0:27:38.760 and later on I'll show you[br]in a subsequent lecture 0:27:38.760,0:27:40.590 that not every cell type 0:27:40.590,0:27:43.530 might have exactly the same[br]complement of these subunits, 0:27:43.530,0:27:47.850 but many of them have[br]this prototypic complex. 0:27:47.850,0:27:51.960 Is this enough for building[br]the pre-initiation complex? 0:27:51.960,0:27:54.120 Unfortunately not. 0:27:54.120,0:27:57.630 It turns out that there[br]are a host of other, 0:27:57.630,0:28:00.030 I'll call them ancillary factors 0:28:00.030,0:28:03.330 in addition to the multi-subunit[br]RNA polymerase itself 0:28:03.330,0:28:08.330 that are necessary for you[br]to build up an ensemble 0:28:08.490,0:28:12.180 that is necessary to form an active 0:28:12.180,0:28:16.380 ready to activate transcriptional[br]pre-initiation complex 0:28:16.380,0:28:17.213 or the PIC. 0:28:19.890,0:28:23.520 And this is kind of the[br]picture we're getting to, 0:28:23.520,0:28:26.160 and even this picture[br]with many, many colors 0:28:26.160,0:28:28.560 and many, many different polypeptides, 0:28:28.560,0:28:30.330 you know, that adds up to probably greater 0:28:30.330,0:28:33.660 than 85 individual proteins 0:28:33.660,0:28:36.960 that all have to kind of fit[br]together like a jigsaw puzzle. 0:28:36.960,0:28:39.360 It's probably not even the whole story, 0:28:39.360,0:28:42.390 you'll notice I still have one[br]big red question mark there 0:28:42.390,0:28:46.980 because I think as we begin[br]to study specific cell types 0:28:46.980,0:28:50.280 and specific processes[br]like embryonic development 0:28:50.280,0:28:52.890 or germ layer formation, 0:28:52.890,0:28:55.770 additional components[br]that are not present here 0:28:55.770,0:28:58.484 in this prototypic pre-initiation complex 0:28:58.484,0:29:00.210 will come into play, 0:29:00.210,0:29:03.420 and that's a subject[br]of a subsequent lecture.
0:29:03.420,0:29:06.360 But already you can tell that[br]the transcriptional machinery 0:29:06.360,0:29:08.673 is anything but simple. 0:29:09.630,0:29:12.720 So, can we get a better[br]idea of what transcription 0:29:12.720,0:29:16.440 might actually look like[br]and what's happening 0:29:16.440,0:29:18.360 when a transcription process takes place? 0:29:18.360,0:29:21.720 So, let me first of all say[br]that I'm gonna finish my lecture 0:29:21.720,0:29:24.450 now with a little cartoon, 0:29:24.450,0:29:29.160 which is our attempt to imagine 0:29:29.160,0:29:30.990 the events that take place 0:29:30.990,0:29:33.000 when you form a pre-initiation complex, 0:29:33.000,0:29:37.410 you bring regulatory proteins[br]to the activated gene 0:29:37.410,0:29:40.020 and what happens during this process. 0:29:40.020,0:29:43.530 Now, keep in mind that[br]this is at this point 0:29:43.530,0:29:47.700 mostly a cartoon that[br]is in our imagination 0:29:47.700,0:29:52.700 and only parts or if any[br]of this is probably real, 0:29:52.890,0:29:56.340 but it gives you a sense of the complexity 0:29:56.340,0:29:58.800 of the transactions[br]that have to take place 0:29:58.800,0:30:02.040 just for one gene to[br]transcribe and express itself. 0:30:02.040,0:30:04.290 So, let me show you the movie, 0:30:04.290,0:30:07.410 and then we'll finish[br]just by keeping in mind 0:30:07.410,0:30:09.960 that there's much to be learned. 0:30:09.960,0:30:13.290 And in my next lecture,[br]we'll go into the selectivity 0:30:13.290,0:30:16.560 of this process in specialized cell types. 0:30:16.560,0:30:20.280 So, now let's see what[br]this sort of this cartoon 0:30:20.280,0:30:22.140 of transcription looks like. 
0:30:22.140,0:30:23.700 So, we start off with DNA 0:30:23.700,0:30:26.790 with some preassembled TFIID molecule, 0:30:26.790,0:30:28.800 and along comes this other green molecule, 0:30:28.800,0:30:30.690 which is actually a co-factor, 0:30:30.690,0:30:32.760 which then forms this very large complex 0:30:32.760,0:30:34.170 with RNA polymerase. 0:30:34.170,0:30:37.830 And then a distal[br]activator protein came in 0:30:37.830,0:30:39.270 and activated the process. 0:30:39.270,0:30:44.270 And this molecule, this bluish[br]molecule that's moved away 0:30:44.610,0:30:48.060 from the complex is[br]actually the RNA polymerase. 0:30:48.060,0:30:51.810 And that little yellow[br]sort of bead on a string 0:30:51.810,0:30:53.640 is actually the RNA product. 0:30:53.640,0:30:57.660 So, that gives you a sense of[br]things have to happen quickly 0:30:57.660,0:30:59.850 and yet it involves many, many molecules 0:30:59.850,0:31:02.460 having to assemble and then disassemble 0:31:02.460,0:31:04.170 to give you this reaction to happen. 0:31:04.170,0:31:06.810 And in my next lecture, 0:31:06.810,0:31:10.380 we'll go into more specific[br]aspects of this reaction, 0:31:10.380,0:31:13.470 and particularly during[br]embryonic development 0:31:13.470,0:31:16.203 and tissue-specific gene expression.