-
- My name is Bob Tjian,
I'm a Professor of MCB
-
at the University of
California at Berkeley,
-
and I'm also serving as the
President of the Howard Hughes Medical Institute.
-
I'm going to spend the
next 25 or 30 minutes
-
telling you about some fundamentals
-
of one of the most important
molecular processes
-
in living cells, which is
the expression of genes
-
through a process called transcription.
-
Now, first, to understand
what gene expression means,
-
you have to have a sense
of what we tend to refer to
-
in the field as a central
dogma of molecular biology.
-
Another way to think about this
-
is the flow of biological
information from DNA,
-
in other words, our chromosomes,
-
of which every cell has its complement,
-
to be transcribed into a
sister molecule called RNA.
-
So, this process of
converting DNA into RNA
-
is called transcription,
-
and that is the topic of this lecture.
-
This process is very complicated,
-
as you'll see by the
end of my two lectures,
-
and it is very important
-
for many, many fundamental
processes in biology.
-
So, what I'm gonna
spend today's lecture on
-
is the discovery of a large family
-
of transcription proteins.
-
These are factors, as we call
them, that are key molecules
-
that regulate the use
of genetic information
-
that has been encoded in the genome.
-
Now, transcription factors or proteins
-
are involved in many
fundamental aspects of biology,
-
including embryonic development,
-
cellular differentiation, and cell fate.
-
In other words, pretty much
what your cells are doing,
-
how a tissue works,
-
and how an organism
survives and reproduces
-
is dependent on the
process of gene expression.
-
And the first step in this
process is transcription.
-
Now, there are many other reasons
-
why a large group of people and scientists
-
are interested in transcription,
-
and another reason is that understanding
-
the fundamental molecular mechanisms
-
that control transcription in humans
-
or in any other organism
-
can inform us and teach
us about what happens
-
when something goes wrong,
for example, in diseases.
-
And I list here just a few diseases
-
that we could study as a result
-
of understanding the
structure and function
-
of these transcription factor proteins
-
that I'm going to be telling you about.
-
And of course, the hope
is that in understanding
-
the molecular underpinnings of
complex diseases like cancer,
-
diabetes, Parkinson's, and so forth,
-
that we will be able to develop and use
-
better, more specific therapeutic drugs
-
and also to develop more accurate
-
and rapid diagnostic tools.
-
So, those are a couple of the reasons
-
why many of us have spent,
-
in my case, over 30 years
-
studying this process of
transcriptional regulation.
-
Now, to get the whole thing started,
-
I have to give you a sense
-
of what the magnitude of the problem is.
-
So, imagine that one would
really like to understand
-
how this process of decoding
the genome happens in humans.
-
So, as you may know,
-
the human genome has
some 3 billion base pairs
-
or bits of genetic information,
-
and that encodes roughly 22,000 genes.
-
These are stretches of DNA sequence
-
that encode ultimately a product
-
that is a protein which actually
makes the cells function.
-
So, as I already explained to you,
-
there's this flow of
biological information
-
where you have to extract the
information buried in DNA,
-
convert it into RNA.
-
And what I'm not gonna
tell you about today
-
is the process of going
from RNA to protein,
-
which is a reaction called
a translational reaction.
-
I'm going to instead just
focus on the first step
-
of converting DNA into RNA,
-
which is the process of transcription.
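As a rough illustration of the DNA-to-RNA step, here is a minimal Python sketch. It assumes the input is the template strand written 3' to 5', so the RNA comes out 5' to 3' by simple base pairing. The function name and the example sequence are made up for illustration, and the sketch ignores everything the rest of this lecture is about (promoters, polymerase, regulation).

```python
# Toy model of transcription: pair each template base with its RNA partner.
# Assumption: the template strand is given 3'->5', so the RNA reads 5'->3'.
RNA_PARTNER = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5: str) -> str:
    """Return the RNA complementary to a DNA template strand."""
    return "".join(RNA_PARTNER[base] for base in template_3_to_5.upper())

# Hypothetical six-base template, not a real gene.
print(transcribe("TACGGT"))  # AUGCCA
```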
-
Now, one of the most amazing results
-
that we got over the last decade or so
-
was when the human genome
was entirely sequenced,
-
the first few that were sequenced,
-
we realized that actually
the number of genes in humans
-
is not vastly different
from many other organisms,
-
even simple organisms like little worms
-
or fruit flies and so forth.
-
That is, roughly 22,000 to 25,000 genes
-
is about the number of genes
-
that all of these
different organisms have.
-
And yet, anybody looking at us
-
versus a little roundworm
in the soil or a fruit fly
-
can tell that we're a
much more complex organism
-
with a much bigger brain,
-
much more complex behavior, and so forth.
-
So, how does this happen?
-
Part of the answer to this
very interesting mystery
-
or paradox lies in the way
that genes are organized
-
and how they're regulated.
-
And one of the most striking results
-
of the genome sequencing
project was to realize
-
that a vast, vast majority
of the DNA in our chromosomes
-
is actually not coding for
specific gene products,
-
and that only roughly 3% of
the DNA is actually coding.
-
The little arrows
-
that I show you on this purple DNA
-
are the gene-coding regions.
-
So, you'll notice that there's
a lot of non-arrow sequences,
-
which I'll show you in
this next slide as green.
-
These are non-coding regions.
-
So, the vast majority, 97%
or greater is non-coding,
-
so what are these other sequences doing?
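To put those round numbers side by side, here is a back-of-the-envelope Python sketch. It uses only the figures quoted in this lecture (about 3 billion base pairs, about 22,000 genes, about 3% coding). The function name is made up, and the per-gene average is just what those round figures imply, not a measured value.

```python
def genome_budget(genome_bp: int, coding_percent: int, n_genes: int):
    """Split a genome into coding and non-coding base pairs."""
    coding_bp = genome_bp * coding_percent // 100  # integer math avoids float noise
    noncoding_bp = genome_bp - coding_bp
    coding_bp_per_gene = coding_bp / n_genes       # implied average, very rough
    return coding_bp, noncoding_bp, coding_bp_per_gene

# Round figures quoted in the lecture.
coding, noncoding, per_gene = genome_budget(3_000_000_000, 3, 22_000)
print(coding)           # 90000000
print(noncoding)        # 2910000000
print(round(per_gene))  # 4091
```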
-
And of course, it turns
out that these sequences
-
carry very important
little fragments of DNA,
-
which we call regulatory sequences.
-
And these are the sequences
that actually control
-
whether a gene gets turned on or not.
-
And I'll be spending much
of the next 20 minutes
-
telling you about how
this process all works
-
and how these little
bits of DNA sequence
-
actually function to
control gene expression.
-
Now, the other thing that I
have to bring you up to date on
-
is this mysterious process
we're calling transcription,
-
which reads double-stranded DNA
-
and then makes a related molecule,
-
which is a single-stranded RNA molecule,
-
which is an informational molecule.
-
That reaction is catalyzed
by a very complex
-
multi-subunit enzyme
called RNA polymerase II.
-
Now, there's the roman
numeral II at the end of this
-
because there are actually
three enzymes in most mammals,
-
at least three enzymes that
carry out different processes
-
and different types of RNA production.
-
But I'm only gonna tell you
-
about the ones that make
the classical messenger RNA,
-
which then ultimately becomes proteins.
-
So, now one of the things that we learned
-
early on in the study of mammalian
-
or other multicellular organism
transcription processes
-
is that despite the fact that this enzyme
-
is quite complex in its structure,
-
it turns out to be an
enzyme that nevertheless
-
needs a lot of help to do its job.
-
So, on its own, this RNA polymerase II
-
cannot tell the difference
between the non-coding regions
-
of the genome and places where
it's supposed to be coding
-
or reading to make the
appropriate messenger RNAs.
-
So, this sort of leads you
to think that there must be
-
a number of other factors
that somehow direct
-
RNA polymerase to the right
place at the right time
-
in the genome of every cell in your body
-
so that the right products get made
-
so each cell in your body
is functioning properly.
-
And this is where things
get really interesting.
-
Some 25 or 30 years ago,
-
a number of laboratories took on the job
-
of hunting for these elusive
and, as it turned out,
-
a specialized protein factors
-
that recognize these little
stretches of DNA sequences
-
that I've been telling you about
-
that make up the vast majority
-
of the non-coding part of the genome.
-
And how these proteins then can recognize
-
and ultimately physically interact
-
with these little bits
of genetic information
-
to then turn genes on or off.
-
Now, in this lecture,
-
I can't go into all the details
of the types of experiments
-
or the ranges of experiments
that many, many laboratories
-
have done over the last two decades
-
to finally work out this molecular puzzle
-
of how transcription works.
-
But I can tell you that
there are fundamentally
-
two major approaches that have been taken
-
over the last few decades
to kind of get a parts list
-
of the machinery that decodes the genome
-
and carries out the
process of transcription.
-
One is kind of the old style,
-
I'll call it bucket biochemistry
-
or take a live cell, crush it up,
-
spread out all of its parts
and then try to figure out
-
how to put it back together again,
-
that's what I call in vitro biochemistry.
-
And the other one is in vivo genetics
-
where you effectively use genetic tools,
-
mutagenesis to go in there
and selectively remove
-
or knock down or knock out
certain genes and gene products
-
and then ask what is the
consequence on that cell
-
or that organism?
-
Both of these technologies
are very powerful
-
and highly complementary,
and they continue to be used.
-
Today, I will focus primarily
-
on the in vitro biochemical techniques
-
which led us to the discovery
of the first few classes
-
of transcription factors.
-
And in subsequent lectures,
-
we'll go to more recent technologies
-
that allow us to sort of
speed up this whole process
-
of identifying key regulatory molecules
-
and how they work.
-
So, let's go back to
the basic unit
-
of gene expression, which is a gene,
-
here shown in the orange arrow,
-
and the non-coding
sequences surrounding it.
-
And you'll see that now I've
added a few more elements
-
to this purple DNA.
-
You see some symbols, a blue square,
-
a round circle that's pink,
and then a yellow triangle.
-
Those are just a way for
me to graphically represent
-
the little bits of DNA sequences
-
that I told you about that
are the regulatory sequences.
-
So, the little round one
happens to be very GC-rich,
-
the triangle one is a classical element
-
that's called a TATA box,
-
I'll tell you about a little bit later.
-
And the blue one is yet
another recognition element.
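To make the idea of short recognition elements concrete, here is a small Python sketch that scans one strand of a DNA string for two of them. The consensus patterns used (GGGCGG for the GC box, TATA followed by A or T and then A for the TATA box) are common textbook forms and are an assumption here, since the lecture does not spell them out. The example sequence is made up, and a real scan would also check the opposite strand.

```python
import re

# Assumed textbook consensus patterns; not taken from the lecture itself.
MOTIFS = {
    "GC box": "GGGCGG",
    "TATA box": "TATA[AT]A",
}

def find_motifs(dna: str):
    """Return (position, motif name, matched bases) for each hit on one strand."""
    dna = dna.upper()
    hits = []
    for name, pattern in MOTIFS.items():
        # The lookahead lets overlapping occurrences be reported too.
        for match in re.finditer(f"(?=({pattern}))", dna):
            hits.append((match.start(), name, match.group(1)))
    return sorted(hits)

# Hypothetical promoter-like snippet, for illustration only.
for hit in find_motifs("ACGTGGGCGGAATTCCTATAAAGGCATG"):
    print(hit)
```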
-
So, why are we so interested
in these little stretches
-
of nucleic acid sequence in the genome
-
when it's buried amongst
billions of other sequences?
-
Well, these individual little sequences
-
turn out to be very important
because of where they sit,
-
you'll notice they're sitting
near the top of the arrow,
-
and they are recognized
by very special proteins
-
which are the transcription factors.
-
So, now I'm showing you some symbols
-
with little cutouts which
fit into either the square,
-
the circle, or the triangle.
-
So, transcription factors,
-
at least one major family
of transcription factors,
-
are proteins whose
three-dimensional structure
-
is folded into a shape that
allows them to recognize
-
these short stretches
of double-stranded DNA.
-
In fact, largely through interactions
-
with the major groove of DNA,
-
and I'll show you a structure
of one in a little bit.
-
So, now it turns out
that there are probably
-
thousands of these transcription factors
-
because the number of genes
that we have to control,
-
as I showed you, is on the
order of 20,000 to 25,000 genes.
-
And so, it turns out that you
need a pretty large percentage
-
of the genome devoted to encoding
these regulatory proteins
-
in order for a complex organism
like ourselves to survive.
-
Then the other component of this,
-
let's call it the
transcriptional apparatus,
-
is, of course, the enzyme
that catalyzes RNA synthesis.
-
And I already told you
that this enzyme on its own
-
can't tell the difference
between random DNA sequence
-
and a gene or a promoter.
-
These other sequence-specific
DNA-binding proteins
-
are the ones that must recruit
-
or otherwise direct RNA polymerase
-
to essentially land on the right
place at the right time
-
in the genome to turn on
a certain subset of genes
-
that are specifically required
in a specialized cell type,
-
whatever cell you happen to be looking at.
-
So, that is kind of the
first level of complexity
-
of sort of informational interactions
-
between the transcription factors
-
and the more ubiquitous,
-
and I would call it promiscuous
RNA polymerase II enzyme.
-
Well, as it turns out,
-
it took several decades to work out
-
most if not all of the components
-
of this so-called
transcriptional machinery.
-
And it turns out in this
slide I'm showing you
-
things are already starting
to get more complicated.
-
So, not only do you have RNA polymerase,
-
but you have a bunch of other
proteins that go by names
-
like TFIIA, B,
-
you know, D, E, H, F, and so forth.
-
So, it looks like there
are going to be many,
-
many proteins that are necessary
-
to form the transcriptional apparatus.
-
And then on top of that
-
you need sequence-specific
DNA-binding proteins
-
which I already described to you
-
to further inform or
otherwise regulate the process
-
of when a particular
RNA polymerase molecule
-
should be binding to a particular gene.
-
So, that's the sort of overview,
-
now let me get into the specifics
-
and how we actually discovered
this family of proteins.
-
And it'll be interesting for you to see
-
how science in this field evolved.
-
Now, as is often the case
-
when you first try to tackle
a very complex problem,
-
and, of course, we didn't
really know how complex it was
-
when we began these studies,
-
but we assumed it might be complicated,
-
certainly would be more
complicated than systems
-
that we had already had some idea about,
-
for example, in bacteria
or in bacteriophages.
-
We took a lesson from our
studies of bacteriophages
-
and decided that to begin to dissect
-
the molecular complexities
-
of the transcription
process in animal cells,
-
we should start with viruses
-
because we knew that viruses
will enter these host cells,
-
these complex cells that
we ultimately want to study
-
and have to use the
same molecular machinery
-
to transcribe their genes
-
as the host mammalian cell would do.
-
So, this was kind of a trick
-
or a way to look at a molecular window
-
into a complex system
and try to simplify it.
-
And in our case,
-
the early studies of the
late '70s and early '80s
-
involved a very simple virus,
-
one of the simplest
double-stranded DNA viruses
-
called Simian virus 40.
-
And Simian virus 40, of
course, is a monkey virus,
-
which was nice because
it's very close to humans
-
and many things that we could learn
-
about the way this virus uses its host,
-
which are monkey cells, to replicate
-
and to express its RNAs and genes
-
would be applicable to our
studies of humans, as you'll see.
-
And this virus was one of the first
-
whose double-stranded
DNA of about 5,000 base pairs
-
was fully sequenced.
-
This was long before
rapid modern-day sequencing
-
was available, so this gave
us a very powerful tool.
-
It basically allowed us to
look at the entire genome
-
of this virus, which
was tiny by comparison,
-
only 5,243 base pairs.
-
But just that information
was already very important
-
'cause it very quickly allowed us,
-
for example, to map where the genes are.
-
And one of the genes encoded a protein
-
called a tumor antigen,
-
which turns out to be
a transcription factor.
-
This then allowed us
-
to do biochemistry and genetics
-
on the very first eukaryotic
transcription factor,
-
which in this case
happens to be a repressor.
-
That is a protein that
when it binds the DNA
-
just the same way as I showed
you for the model case,
-
it binds through specific
protein-DNA interactions.
-
But in this case, it actually
shuts transcription down
-
rather than turning it up.
-
In the process of studying
the way that this little virus
-
when it infects a mammalian cell
-
uses proteins like T-antigen
-
to regulate its gene expression,
-
it became clear that it had
to use the host machinery
-
to do the process.
-
And that meant that there
must be monkey proteins
-
that are also involved in activating
-
or repressing genes of this virus.
-
And this then led us to
the most important step,
-
which is to transfer the
technology we learned about viruses
-
and how to work with the
virus transcription factor
-
like T-antigen to the cellular ones.
-
And I'm gonna give you just one example
-
of how the simple jump into the host cell
-
allowed us to discover the first
human transcription factor.
-
So, the question that we then asked
-
back in the early 1980s
was what host molecule
-
is regulating the expression
or transcription of this virus
-
when the virus is in the host?
-
And we knew from the DNA
sequence of the virus
-
that there were these six
very GC-rich snippets of DNA
-
that were regulatory
'cause if we deleted them,
-
the virus no longer would
express the gene of interest.
-
So, we knew that something
was probably responsible
-
for recognizing these GC boxes,
-
and we knew that it wasn't
a virally encoded gene
-
because we had tested
all of the viral genes
-
of which there were
only six to begin with.
-
So, we knew it had to be a host gene
-
and that led us to a whole, I would say,
-
family of experiments
that led to the discovery
-
of sequence-specific mammalian
transcription factors.
-
And as I said, we could have
taken multiple approaches
-
to try to address this complicated issue.
-
I'll just give you one example
-
of using in vitro biochemistry
-
to finally get our hands
on this key
-
sequence-specific human transcription factor,
-
which, of course, has a
homologue in the monkey.
-
And the way we did it was very interesting
-
and simple in retrospect,
-
and that is recognizing the fact
-
that whatever this protein was,
-
it had to have the property of recognizing
-
those GC boxes that were sitting
next to the viral gene.
-
We assumed that it must
be a sequence-specific
-
DNA-binding protein, so all we had to do
-
was figure out a way to extract proteins
-
from human cells or monkey cells
-
and then try to fish
those specific proteins
-
out of the many thousands
of different proteins
-
that were in this gemisch
of cellular extract
-
that would be responsible
for discriminating
-
between random DNA sequences
and the specific GC box.
-
And I'll quickly run through
sort of the logic behind this.
-
So, what I'm showing you
here is a solid surface
-
with DNA coupled to it
that is highly enriched
-
for the recognition element, the GC box,
-
which should be the sequence
-
recognized by the protein of interest.
-
Now, we had no idea what this
protein was gonna look like,
-
how many proteins there
were gonna be, and so forth,
-
but we knew it had to
recognize the GC box.
-
So, we're gonna try to
fish this out of a pool
-
of many thousands of other proteins.
-
Now, the key trick here
-
relied on the fact that all cell extracts
-
contain not only one DNA-binding protein,
-
but, as I told you, thousands of different
-
DNA-binding proteins.
-
But most of them, or in fact in our case,
-
none of the other several
hundred to a thousand proteins
-
that could bind DNA actually
happened to recognize the GC box,
-
they just bind other DNA sequences.
-
So, to kind of favor our protein
-
being able to bind to our GC box
-
and not have to compete
with all the other proteins,
-
what we did was to add non-specific DNA
-
in vast stoichiometric excess
-
so that all the other proteins
that wouldn't recognize
-
the GC box would still have
some partner to hang onto.
-
And this trick worked very well.
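One rough way to see why the competitor DNA helps is a toy partition model, sketched below in Python. It assumes every protein ends up on some DNA site, that DNA sites are in excess over protein, and that a single "preference" factor captures how much more tightly a protein binds the GC-box sites on the resin than the bulk competitor sites. The function name, site counts, and preference values are all made-up illustration numbers, not measurements from these experiments.

```python
def fraction_on_resin(resin_sites: float, competitor_sites: float,
                      preference: float) -> float:
    """Fraction of a protein found on the resin in a simple partition model.

    preference = how many times more tightly the protein binds resin
    (GC-box) sites than competitor sites; 1.0 means no preference.
    """
    weighted_resin = preference * resin_sites
    return weighted_resin / (weighted_resin + competitor_sites)

# Hypothetical setup: 1000-fold excess of non-specific competitor DNA.
nonspecific = fraction_on_resin(1, 1000, preference=1.0)
gc_binder = fraction_on_resin(1, 1000, preference=1e5)
print(f"non-specific protein on resin: {nonspecific:.4f}")
print(f"GC-box binder on resin:        {gc_binder:.4f}")
```

In this toy picture the competitor soaks up almost all of the non-specific proteins, while a protein with a strong preference for the GC box still ends up mostly on the resin.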
-
So, having the specific
DNA on the solid resin
-
and the non-specific DNA
flowing all over the place,
-
we could capture selectively
the pink molecules here,
-
which are the GC box recognition ones,
-
and the blue-green molecules,
-
of course, predominantly
bind to non-specific DNA.
-
I show you one little
blue one on the column
-
because nothing works
perfectly in real science
-
and that tells you that we have
to go through this process
-
iteratively to actually
finally obtain a preparation
-
that's purely pink molecules
with no green-blue ones.
-
Well, that turned out
to work very, very well.
-
And that whole process of
biochemical fractionation
-
followed by a direct affinity
sequence-specific DNA resin
-
gave us the ability to perform
a biochemical purification
-
followed by molecular cloning
of the gene
-
that encodes the transcription factor SP1.
-
And then we carried out
a bunch of experiments,
-
which I'll tell you about next,
-
to show that this protein
-
actually does activate transcription.
-
And of course, we went back and
we proved that this protein,
-
which turned out to be a
rather large polypeptide,
-
can indeed recognize the GC box.
-
And it doesn't matter if it's
a GC box from the SV40 genome
-
or any other GC box that we
could find in the human genome,
-
it would find that sequence and bind to it
-
and then it would generally
activate transcription.
-
So, this led to the discovery of the first
-
of a very large family
-
of sequence-specific DNA-binding proteins.
-
Now, I told you that
the way these proteins
-
tend to recognize short DNA sequences
-
is to interact with DNA
through the major groove.
-
And here's a perfect example.
-
So, the thick blue model there
-
shows the actual three structures
-
that are called zinc fingers.
-
And the reason they're called zinc fingers
-
is because there are amino
acids that are organized
-
around a center that
contains a zinc ion
-
which holds the three-dimensional
shape of the polypeptide
-
in a position just right
-
for fitting into the
major groove of the DNA.
-
And the DNA here is shown in pink,
-
and you can see that that blue outline
-
fits right into the
major groove of the DNA,
-
but not to the minor groove.
-
And one of the most important findings
-
was not only the discovery
-
of the first human transcription factor,
-
but the realization that most
if not all sequence-specific
-
DNA-binding transcription factors
-
have a similar structural motif.
-
That is to say some structure
is built to recognize
-
sequences in the major groove of DNA.
-
And these three-dimensional motifs
-
are recognizable from the amino
acid sequences encoded in the genome.
-
So, we can now much more
quickly scan the entire sequence
-
of a genome and identify genes
-
that are likely to be DNA-binding proteins
-
as a result of understanding
the structure-function
-
relationships of these DNA-binding
motifs like zinc fingers.
-
So, what I'd like to show you now
-
is that I've only
introduced you to one class
-
of transcription factors,
-
which are the sequence-specific
DNA-binding proteins.
-
Well, I think I gave you a little taste
-
of the level of complexity
-
that's probably going to be needed
-
to be able to build the machine
-
that's ultimately going
to be able to allow you
-
to transcribe every gene in
every cell of a human body.
-
So, that turns out to be a
much more elaborate machine
-
than what I just showed you.
-
So, I wanna show you now
-
what is sort of our
state-of-the-art thinking
-
about what is actually
needed to build the machinery
-
at a gene to allow it to be
expressed and transcribed.
-
And the term I want to introduce you to
-
is the pre-initiation complex.
-
And it's pretty much what it says.
-
It's the complex of multiple subunits
-
that has to essentially land
on the promoter of a gene
-
which will be designated
for later expression.
-
And this is a process that
is probably quite orderly,
-
that is there's an order
of events that happens,
-
which we, by the way,
are not entirely sure
-
exactly what the order
is or even if the order
-
is the same from one gene to the next,
-
but we can kind of see where
it starts and where it ends up.
-
And the pathway in between,
-
I would say is still a little bit murky.
-
And the story here again starts
with a little snippet of DNA
-
called the TATA box,
-
which I already introduced you to briefly.
-
It's an AT-rich sequence which
sits at the five prime end
-
or the beginning of many
genes, but not all genes,
-
maybe 20% of the genes might
contain this AT-rich region.
-
And that AT sequence is the signal
-
or a landmark, if you like,
-
for a particular protein to bind to it.
-
And that protein is called,
-
not surprisingly, the TATA-binding protein
-
'cause it binds the TATA sequence.
-
And so, this represents a second class
-
of transcription factors.
-
These are not the type that
I just introduced you to,
-
which are gonna be
different for every gene,
-
the TATA sequence is present
-
in a very large number of genes,
-
so it can't be gene specific,
-
but it turns out to be very crucial
-
for our understanding of
how gene regulation works.
-
So, you start with
a TATA-binding protein
-
finding a TATA box.
-
We later found out that
the TATA-binding protein
-
rarely functions on its own
and has a bunch of friends
-
that we call TAFs or
TBP associated factors.
-
And now you're talking about an assembly
-
of a multi-subunit complex of
almost a million daltons.
-
There are somewhere
between 12 and 15 subunits
-
in addition to the TATA-binding protein
-
that make up this little
complex of proteins
-
that kind of travels around together.
-
And this is found in most cell types,
-
and later on I'll show you
in a subsequent lecture
-
that not every cell type
-
might have exactly the same
complement of these subunits,
-
but many of them have
this prototypic complex.
-
Is this enough for building
the pre-initiation complex?
-
Unfortunately not.
-
It turns out that there
are a host of other,
-
I'll call them ancillary factors
-
in addition to the multi-subunit
RNA polymerase itself
-
that are necessary for you
to build up an ensemble
-
that is necessary to form an active,
-
ready-to-activate transcriptional
pre-initiation complex
-
or the PIC.
-
And this is kind of the
picture we're getting to,
-
and even this picture
with many, many colors
-
and many, many different polypeptides,
-
you know, that adds up to probably greater
-
than 85 individual proteins
-
that all have to kind of fit
together like a jigsaw puzzle.
-
It's probably not even the whole story,
-
you'll notice I still have one
big red question mark there
-
because I think as we begin
to study specific cell types
-
and specific processes
like embryonic development
-
or germ layer formation,
-
additional components
that are not present here
-
in this prototypic pre-initiation complex
-
will come into play,
-
and that's a subject
of a subsequent lecture.
-
But already you can tell that
the transcriptional machinery
-
is anything but simple.
-
So, can we get a better
idea of what transcription
-
might actually look like
and what's happening
-
when a transcription process takes place?
-
So, let me first of all say
that I'm gonna finish my lecture
-
now with a little cartoon,
-
which is our attempt to imagine
-
the events that take place
-
when you form a pre-initiation complex,
-
you bring regulatory proteins
to the activated gene
-
and what happens during this process.
-
Now, keep in mind that
this is at this point
-
mostly a cartoon that
is in our imagination
-
and only parts, if any,
of this are probably real,
-
but it gives you a sense of the complexity
-
of the transactions
that have to take place
-
just for one gene to
transcribe and express itself.
-
So, let me show you the movie,
-
and then we'll finish
just by keeping in mind
-
that there's much to be learned.
-
And in my next lecture,
we'll go into the selectivity
-
of this process in specialized cell types.
-
So, now let's see what
this cartoon
-
of transcription looks like.
-
So, we start off with DNA
-
with some preassembled TFIID molecule,
-
and along comes this other green molecule,
-
which is actually a co-factor,
-
which then forms this very large complex
-
with RNA polymerase.
-
And then a distal
activator protein came in
-
and activated the process.
-
And this molecule, this bluish
molecule that's moved away
-
from the complex is
actually the RNA polymerase.
-
And that little yellow
sort of bead on a string
-
is actually the RNA product.
-
So, that gives you a sense that
things have to happen quickly
-
and yet it involves many, many molecules
-
having to assemble and then disassemble
-
for this reaction to happen.
-
And in my next lecture,
-
we'll go into more specific
aspects of this reaction,
-
and particularly during
embryonic development
-
and tissue-specific gene expression.