- My name is Bob Tjian,
I'm a Professor of MCB
at the University of
California at Berkeley,
and I'm also serving as the
President of the Howard Hughes Medical Institute.
I'm going to spend the
next 25 or 30 minutes
telling you about some fundamentals
of one of the most important
molecular processes
in living cells, which is
the expression of genes
through a process called transcription.
Now, first, to understand
what gene expression means,
you have to have a sense
of what we tend to refer to
in the field as a central
dogma of molecular biology.
Another way to think about this
is the flow of biological
information from DNA,
in other words, our chromosomes,
of which every cell has its complement,
to be transcribed into a
sister molecule called RNA.
So, this process of
converting DNA into RNA
is called transcription,
and that is the topic of this lecture.
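As a rough illustration of the DNA-to-RNA step just described, here is a minimal Python sketch that rewrites a DNA sequence as RNA. The function names and the example sequence are illustrative assumptions, not anything shown in the lecture, and the sketch deliberately ignores the polymerase, promoters, and regulation that the rest of the talk is about.

```python
# Minimal sketch: DNA -> RNA as a pure string operation.
# All names and the example sequence are hypothetical.

# Base-pairing used when copying a DNA template strand into RNA.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_coding(coding_5to3: str) -> str:
    """RNA has the same sequence as the coding strand, with U in place of T."""
    return coding_5to3.upper().replace("T", "U")


def transcribe_template(template_5to3: str) -> str:
    """Copy the template strand: read it 3'->5' and pair each base,
    which yields the RNA written 5'->3'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in reversed(template_5to3.upper()))


if __name__ == "__main__":
    coding = "ATGGCC"      # made-up coding strand, written 5'->3'
    template = "GGCCAT"    # its reverse complement, written 5'->3'
    print(transcribe_coding(coding))      # AUGGCC
    print(transcribe_template(template))  # AUGGCC
```

Both calls give the same RNA because the two DNA strands carry the same information in complementary form.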
This process is very complicated,
as you'll see by the
end of my two lectures,
and it is very important
for many, many fundamental
processes in biology.
So, what I'm gonna
spend today's lecture on
is the discovery of a large family
of transcription proteins.
These are factors, as we call
them, that are key molecules
that regulate the use
of genetic information
that has been encoded in the genome.
Now, transcription factors or proteins
are involved in many
fundamental aspects of biology,
including embryonic development,
cellular differentiation, and cell fate.
In other words, pretty much
what your cells are doing,
how a tissue works,
and how an organism
survives and reproduces
is dependent on the
process of gene expression.
And the first step in this
process is transcription.
Now, there are many other reasons
why a large group of people and scientists
are interested in transcription,
and another reason is that understanding
the fundamental molecular mechanisms
that control transcription in humans
or in any other organism
can inform us and teach
us about what happens
when something goes wrong,
for example, in diseases.
And I list here just a few diseases
that we could study as a result
of understanding the
structure and function
of these transcription factor proteins
that I'm going to be telling you about.
And of course, the hope
is that in understanding
the molecular underpinnings of
complex diseases like cancer,
diabetes, Parkinson's, and so forth,
that we will be able to develop and use
better, more specific therapeutic drugs
and also to develop more accurate
and rapid diagnostic tools.
So, those are a couple of the reasons
why many of us have spent,
in my case, over 30 years
studying this process of
transcriptional regulation.
Now, to get the whole thing started,
I have to give you a sense
of what the magnitude of the problem is.
So, imagine that one would
really like to understand
how this process of decoding
the genome happens in humans.
So, as you may know,
the human genome has
some 3 billion base pairs
or bits of genetic information,
and that encodes roughly 22,000 genes.
These are stretches of DNA sequence
that encode ultimately a product
that is a protein which actually
makes the cells function.
So, as I already explained to you,
there's this flow of
biological information
where you have to extract the
information buried in DNA,
convert it into RNA.
And what I'm not gonna
tell you about today
is the process of going
from RNA to protein,
which is a reaction called
a translational reaction.
I'm going to instead just
focus on the first step
of converting DNA into RNA,
which is the process of transcription.
Now, one of the most amazing results
that we got over the last decade or so
was when the human genome
was entirely sequenced,
the first few that were sequenced,
we realized that actually
the number of genes in humans
is not vastly different
from many other organisms,
even simple organisms like little worms
or fruit flies and so forth.
That is, roughly 22,000 to 25,000 genes
is about the number of genes
that all of these
different organisms have.
And yet, anybody looking at us
versus a little roundworm
in the soil or a fruit fly
can tell that we're a
much more complex organism
with a much bigger brain,
much more complex behavior, and so forth.
So, how does this happen?
Part of the answer to this
very interesting mystery
or paradox lies in the way
that genes are organized
and how they're regulated.
And one of the most striking results
of the genome sequencing
project was to realize
that a vast, vast majority
of the DNA in our chromosomes
is actually not coding for
specific gene products,
and that only roughly 3% of
the DNA is actually coding.
Let's say those little arrows
that I show you on this purple DNA
are the gene-coding regions.
So, you'll notice that there's
a lot of non-arrow sequences,
which I'll show you in
this next slide as green.
These are non-coding regions.
So, the vast majority, 97%
or greater is non-coding,
so what are these other sequences doing?
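To make the scale concrete, the round numbers quoted so far (about 3 billion base pairs, about 22,000 genes, roughly 3% coding) can be run through some back-of-the-envelope arithmetic. This is only a sketch using the lecture's approximate figures; the variable names are illustrative, and the "per gene" value is a crude average, not a measured gene size.

```python
# Back-of-the-envelope arithmetic with the lecture's round numbers.
# Variable names are illustrative; the figures are approximations.

GENOME_BP = 3_000_000_000   # ~3 billion base pairs
CODING_PERCENT = 3          # ~3% of the DNA is coding
GENE_COUNT = 22_000         # ~22,000 genes

coding_bp = GENOME_BP * CODING_PERCENT // 100
noncoding_bp = GENOME_BP - coding_bp
avg_coding_bp_per_gene = coding_bp // GENE_COUNT

print(f"coding:      {coding_bp:,} bp")        # 90,000,000 bp
print(f"non-coding:  {noncoding_bp:,} bp")     # 2,910,000,000 bp
print(f"coding/gene: ~{avg_coding_bp_per_gene:,} bp")  # ~4,090 bp
```

On these numbers, roughly 2.9 billion base pairs are left over as non-coding sequence.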
And of course, it turns
out that these sequences
carry very important
little fragments of DNA,
which we call regulatory sequences.
And these are the sequences
that actually control
whether a gene gets turned on or not.
And I'll be spending much
of the next 20 minutes
telling you about how
this process all works
and how these little
bits of DNA sequence
actually function to
control gene expression.
Now, the other thing that I
have to bring you up to date on
is this mysterious process
we're calling transcription,
which reads double-stranded DNA
and then makes a related molecule,
which is a single-stranded RNA molecule,
which is an informational molecule.
That reaction is catalyzed
by a very complex
multi-subunit enzyme
called RNA polymerase II.
Now, there's the roman
numeral II at the end of this
because there are actually
three enzymes in most mammals,
at least three enzymes that
carry out different processes
and different types of RNA production.
But I'm only gonna tell you
about the one that makes
the classical messenger RNA,
which then ultimately gets translated into proteins.
So, now one of the things that we learned
early on in the study of mammalian
or other multicellular organism
transcription processes
is that despite the fact that this enzyme
is quite complex in its structure,
it turns out to be an
enzyme that nevertheless
needs a lot of help to do its job.
So, on its own, this RNA polymerase II
cannot tell the difference
between the non-coding regions
of the genome and places where
it's supposed to be coding
or reading to make the
appropriate messenger RNAs.
So, this sort of leads you
to think that there must be
a number of other factors
that somehow direct
RNA polymerase to the right
place at the right time
in the genome of every cell in your body
so that the right products get made
so each cell in your body
is functioning properly.
And this is where things
get really interesting.
Some 25 or 30 years ago,
a number of laboratories took on the job
of hunting for these elusive
and, as it turned out,
specialized protein factors
that recognize these little
stretches of DNA sequences
that I've been telling you about
that lie within the vast
non-coding part of the genome.
And how these proteins then can recognize
and ultimately physically interact
with these little bits
of genetic information
to then turn genes on or off.
Now, in this lecture,
I can't go into all the details
of the types of experiments
or the ranges of experiments
that many, many laboratories
have done over the last two decades
to finally work out this molecular puzzle
of how transcription works.
But I can tell you that
there are fundamentally
two major approaches that have been taken
over the last few decades
to kind of get a parts list
of the machinery that decodes the genome
and carries out the
process of transcription.
One is kind of the old style,
I'll call it bucket biochemistry
where you take a live cell, crush it up,
spread out all of its parts
and then try to figure out
how to put it back together again,
that's what I call in vitro biochemistry.
And the other one is in vivo genetics
where you effectively use genetic tools,
mutagenesis to go in there
and selectively remove
or knock down or knock out
certain genes and gene products
and then ask what is the
consequence on that cell
or that organism?
Both of these technologies
are very powerful
and highly complementary,
and they continue to be used.
Today, I will focus primarily
on the in vitro biochemical techniques
which led us to the discovery
of the first few classes
of transcription factors.
And in subsequent lectures,
we'll go to more recent technologies
that allow us to sort of
speed up this whole process
of identifying key regulatory molecules
and how they work.
So, let's go back to
sort of the basic unit
of gene expression, which is a gene,
here shown in the orange arrow,
and the non-coding
sequences surrounding it.
And you'll see that now I've
added a few more elements
to this purple DNA.
You see some symbols, a blue square,
a round circle that's pink,
and then a yellow triangle.
Those are just a way for
me to graphically represent
the little bits of DNA sequences
that I told you about that
are the regulatory sequences.
So, the little round one
happens to be very GC-rich,
the triangle one is a classical element
that's called a TATA box,
which I'll tell you about a little bit later.
And the blue one is yet
another recognition element.
So, why are we so interested
in these little stretches
of nucleic acid sequence in the genome
when it's buried amongst
billions of other sequences?
Well, these individual little sequences
turn out to be very important
because of where they sit,
you'll notice they're sitting
near the top of the arrow,
and they are recognized
by very special proteins
which are the transcription factors.
So, now I'm showing you some symbols
with little cutouts which
fit into either the square,
the circle, or the triangle.
So, transcription factors,
at least one major family
of transcription factors,
are proteins whose
three-dimensional structure
is folded into a shape that
allows them to recognize
these short stretches
of double-stranded DNA.
In fact, largely through interactions
with the major groove of DNA,
and I'll show you a structure
of one in a little bit.
So, now it turns out
that there are probably
thousands of these transcription factors
because the number of genes
that we have to control,
as I showed you, is on the
order of 20,000 or 25,000 genes.
And so, it turns out that you
need a pretty large percentage
of the genome devoted to encoding
these regulatory proteins
in order for a complex organism
like ourselves to survive.
Then the other component of this,
let's call it the
transcriptional apparatus,
is, of course, the enzyme
that catalyzes RNA synthesis.
And I already told you
that this enzyme on its own
can't tell the difference
between random DNA sequence
and a gene or a promoter.
These other sequence-specific
DNA-binding proteins
are the ones that must recruit
or otherwise direct RNA polymerase
to essentially land on the right
place and at the right time
in the genome to turn on
a certain subset of genes
that are specifically required
in a specialized cell type,
whatever cell you happen to be looking at.
So, that is kind of the
first level of complexity
of sort of informational interactions
between the transcription factors
and the more ubiquitous,
and I would call it promiscuous
RNA polymerase II enzyme.
Well, as it turns out,
it took several decades to work out
most if not all of the components
of this so-called
transcriptional machinery.
And it turns out in this
slide I'm showing you
things are already starting
to get more complicated.
So, not only do you have RNA polymerase,
but you have a bunch of other
proteins that go by names
like TFIIA, B,
you know, D, E, H, F, and so forth.
So, it looks like there
are going to be many,
many proteins that are necessary
to form the transcriptional apparatus.
And then on top of that
you need sequence-specific
DNA-binding proteins
which I already described to you
to further inform or
otherwise regulate the process
of when a particular
RNA polymerase molecule
should be binding to a particular gene.
So, that's the sort of overview,
now let me get into the specifics
and how we actually discovered
this family of proteins.
And it'll be interesting for you to see
how science in this field evolved.
Now, as is often the case
when you first try to tackle
a very complex problem,
and, of course, we didn't
really know how complex it was
when we began these studies,
but we assumed it might be complicated,
certainly would be more
complicated than systems
that we had already had some idea about,
for example, in bacteria
or in bacteriophages.
We took a lesson from our
studies of bacteriophages
and decided that to begin to dissect
the molecular complexities
of the transcription
process in animal cells,
we should start with viruses
because we knew that viruses
will enter these host cells,
these complex cells that
we ultimately want to study
and have to use the
same molecular machinery
to transcribe their genes
as the host mammalian cell would do.
So, this was kind of a trick
or a way to look at a molecular window
into a complex system
and try to simplify it.
And in our case,
the early studies of the
late '70s and early '80s
involved a very simple virus,
one of the simplest
double-stranded DNA viruses
called Simian virus 40.
And Simian virus 40, of
course, is a monkey virus,
which was nice because
it's very close to humans
and many things that we could learn
about the way this virus uses its host,
which are monkey cells, to replicate
and to express their RNAs and genes
would be applicable to our
studies of humans, as you'll see.
And this virus was one of the first
whose DNA, its double-stranded
DNA of about 5,000 base pairs
was fully sequenced.
This was long before
rapid modern-day sequencing
was available, so this gave
us a very powerful tool.
It basically allowed us to
look at the entire genome
of this virus, which
was tiny by comparison,
only 5,243 base pairs.
But just that information
was already very important
'cause it very quickly allowed us,
for example, to map where the genes are.
And one of the genes encoded a protein
called a tumor antigen,
which turns out to be
a transcription factor.
This then allowed us
basically to do biochemistry and genetics
on the very first eukaryotic
transcription factor,
which in this case
happens to be a repressor.
That is a protein that
when it binds the DNA
just the same way as I showed
you for the model case,
it binds through specific
protein-DNA interactions.
But in this case, it actually
shuts transcription down
rather than turning it up.
In the process of studying
the way that this little virus
when it infects a mammalian cell
uses proteins like T-antigen
to regulate its gene expression,
it became clear that it had
to use the host machinery
to do the process.
And that meant that there
must be monkey proteins
that are also involved in activating
or repressing genes of this virus.
And this then led us to
the most important step,
which is to transfer the
technology we learned about viruses
and how to work with the
virus transcription factor
like T-antigen to the cellular ones.
And I'm gonna give you just one example
of how the simple jump into the host cell
allowed us to discover the first
human transcription factor.
So, the question that we then asked
back in the early 1980s
was what host molecule
is regulating the expression
or transcription of this virus
when the virus is in the host?
And we knew from the DNA
sequence of the virus
that there were these six
very GC-rich snippets of DNA
that were regulatory
'cause if we deleted them,
the virus no longer would
express the gene of interest.
So, we knew that something
was probably responsible
for recognizing these GC boxes,
and we knew that it wasn't
a virally encoded gene
because we had tested
all of the viral genes
of which there were
only six to begin with.
So, we knew it had to be a host gene
and that led us to a whole, I would say,
family of experiments
that led to the discovery
of sequence-specific mammalian
transcription factors.
And as I said, we could have
taken multiple approaches
to try to address this complicated issue.
I'll just give you one example
of using in vitro biochemistry
to finally get our hands
on this key sequence-specific
human transcription factor,
which, of course, has a
homologue in the monkey.
And the way we did it was very interesting
and simple in retrospect,
and that is recognizing the fact
that whatever this protein was,
it had to have the property of recognizing
those GC boxes that were sitting
next to the viral gene.
We assumed that it must
be a sequence-specific
DNA-binding protein, so all we had to do
was figure out a way to extract proteins
from human cells or monkey cells
and then try to fish out
those specific proteins
out of the many thousands
of different proteins
that were in this gemisch
of cellular extract
that would be responsible
for discriminating
between random DNA sequences
and the specific GC box.
And I'll quickly run through
sort of the logic behind this.
So, what I'm showing you
here is a solid surface
with DNA coupled to it
that is highly enriched
for the recognition element, the GC box,
which should be the sequence
recognized by the protein of interest.
Now, we had no idea what this
protein was gonna look like,
how many proteins there
were gonna be, and so forth,
but we knew it had to
recognize the GC box.
So, we're gonna try to
fish this out of a pool
of many thousands of other proteins.
Now, the key trick here
had to deal with the fact that all cell extracts
contain not only one DNA-binding protein,
but, as I told you, thousands of different
DNA-binding proteins.
But most of them, or in fact in our case,
none of the other several
hundred to a thousand proteins
that could bind DNA actually
happened to recognize the GC box,
they just bind other DNA sequences.
So, to kind of favor our protein
being able to bind to our GC box
and not have to compete
with all the other proteins,
what we did was to add non-specific DNA
in vast stoichiometric excess
so that all the other proteins
that wouldn't recognize
the GC box would still have
some partner to hang onto.
And this trick worked very well.
So, having the specific
DNA on the solid resin
and the non-specific DNA
flowing all over the place,
we could capture selectively
the pink molecules here,
which are the GC box recognition ones,
and the blue-green molecules,
of course, predominantly
bind to non-specific DNA.
I show you one little
blue one on the column
because nothing works
perfectly in real science
and that tells you that we have
to go through this process
iteratively to actually
finally obtain a preparation
that's purely pink molecules
with no green-blue ones.
Well, that turned out
to work very, very well.
And that whole process of
biochemical fractionation
followed by a direct sequence-specific
DNA affinity resin
gave us the ability to perform
a biochemical purification
followed by molecular cloning
of the gene
that encodes the transcription factor SP1.
And then we carried out
a bunch of experiments,
which I'll tell you about next,
to show that this protein
actually does activate transcription.
And of course, we went back and
we proved that this protein,
which turned out to be a
rather large polypeptide,
can indeed recognize the GC box.
And it doesn't matter if it's
a GC box from the SV40 genome
or any other GC box that we
could find in the human genome,
it would find that sequence and bind to it
and then it would generally
activate transcription.
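The idea that a GC box is recognized wherever it occurs can be sketched as a simple motif search. The lecture does not spell out the sequence, so the core motif used below, GGGCGG, is an assumption (it is the commonly cited GC-box core), and the function names and test sequence are made up for illustration. Real binding-site prediction uses position weight matrices and tolerates mismatches, which this sketch does not.

```python
import re

# Assumed GC-box core; the lecture does not give the exact sequence.
GC_BOX = "GGGCGG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_gc_boxes(seq: str) -> list[tuple[int, str]]:
    """Return (position, strand) for every exact GC-box match.

    A match on the '-' strand appears on the given strand as the
    reverse complement of the motif.
    """
    seq = seq.upper()
    hits = []
    for strand, motif in (("+", GC_BOX), ("-", reverse_complement(GC_BOX))):
        # A lookahead reports overlapping occurrences as well.
        for match in re.finditer(f"(?={motif})", seq):
            hits.append((match.start(), strand))
    return sorted(hits)


if __name__ == "__main__":
    toy_promoter = "ATATGGGCGGTTTTCCGCCCAT"  # made-up sequence
    print(find_gc_boxes(toy_promoter))  # [(4, '+'), (14, '-')]
```

The search checks both strands because double-stranded DNA presents the same site whichever strand is written down.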
So, this led to the discovery of the first
of a very large family
of sequence-specific DNA-binding proteins.
Now, I told you that
the way these proteins
tend to recognize short DNA sequences
is to interact with DNA
through the major groove.
And here's a perfect example.
So, the thick blue model there
shows the actual three structures
that are called zinc fingers.
And the reason they're called zinc fingers
is because there are amino
acids that are organized
around a center that
contains a zinc ion
which holds the three-dimensional
shape of the polypeptide
in a position just right
for fitting into the
major groove of the DNA.
And the DNA here is shown in pink,
and you can see that that blue outline
fits right into the
major groove of the DNA,
but not to the minor groove.
And one of the most important findings
was not only the discovery
of the first human transcription factor,
but the realization that most
if not all sequence-specific
DNA-binding transcription factors
have a similar structural motif.
That is to say some structure
is built to recognize
sequences in the major groove of DNA.
And these three-dimensional motifs
are recognizable from the amino
acid sequences encoded in the genome.
So, we can now much more
quickly scan the entire sequence
of a genome and identify genes
that are likely to be DNA-binding proteins
as a result of understanding
the structure-function
relationships of these DNA-binding
motifs like zinc fingers.
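Scanning predicted protein sequences for a DNA-binding motif can be sketched with a regular expression. The pattern below is a simplified spacing rule for the classical two-cysteine, two-histidine zinc finger (two cysteines, a 12-residue spacer, two histidines); it is an assumption chosen for illustration, looser than curated database patterns, and the example peptide is synthetic rather than taken from SP1 or any real protein.

```python
import re

# Simplified, assumed spacing rule for a C2H2-type zinc finger:
# Cys, 2-4 residues, Cys, 12 residues, His, 3-5 residues, His.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_zinc_fingers(protein: str) -> list[tuple[int, str]]:
    """Return (start, matched_text) for each non-overlapping candidate finger."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(protein.upper())]


if __name__ == "__main__":
    # Synthetic peptide with one finger-like stretch (not a real protein).
    peptide = "MAAA" + "YKCPECGKSFSRSDHLQRHQRTH" + "GEKP"
    for start, text in find_zinc_fingers(peptide):
        print(start, text)
```

A hit from a pattern like this is only a candidate; whether the protein really binds DNA still has to be tested at the bench.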
So, what I'd like to show you now
is that I've only
introduced you to one class
of transcription factors,
which are the sequence-specific
DNA-binding proteins.
Well, I think I gave you a little taste
of the level of complexity
that's probably going to be needed
to be able to build the machine
that's ultimately going
to be able to allow you
to transcribe every gene in
every cell of a human body.
So, that turns out to be a
much more elaborate machine
than what I just showed you.
So, I wanna show you now
what is sort of our
state-of-the-art thinking
about what is actually
needed to build the machinery
at a gene to allow it to be
expressed and transcribed.
And the term I want to introduce you to
is the pre-initiation complex.
And it's pretty much what it says.
It's the complex of multiple subunits
that has to essentially land
on the promoter of a gene
which will be designated
for later expression.
And this is a process that
is probably quite orderly,
that is there's an order
of events that happens,
although we, by the way,
are not entirely sure
exactly what the order
is or even if the order
is the same from one gene to the next,
but we can kind of see where
it starts and where it ends up.
And the pathway in between,
I would say is still a little bit murky.
And the story here again starts
with a little snippet of DNA
called the TATA box,
which I already introduced you to briefly.
It's an AT-rich sequence which
sits at the five prime end
or the beginning of many
genes, but not all genes,
maybe 20% of the genes might
contain this AT-rich region.
And that AT sequence is the signal
or a landmark, if you like,
for a particular protein to bind to it.
And that protein is called,
not surprisingly, the TATA-binding protein
'cause it's the TATA sequence.
And so, this represents a second class
of transcription factors.
These are not the type that
I just introduced you to,
which are gonna be
different for every gene,
the TATA sequence is present
in a very large number of genes,
so it can't be gene specific,
but it turns out to be very crucial
for our understanding of
how gene regulation works.
So, you start with
a TATA-binding protein
finding a TATA box.
We later found out that
the TATA-binding protein
rarely functions on its own
and has a bunch of friends
that we call TAFs or
TBP associated factors.
And now you're talking about an assembly
of a multi-subunit complex of
almost a million daltons.
There are somewhere
between 12 and 15 subunits
in addition to the TATA-binding protein
that make up this little
complex of proteins
that kind of travels around together.
And this is found in most cell types,
and later on I'll show you
in a subsequent lecture
that not every cell type
might have exactly the same
complement of these subunits,
but many of them have
this prototypic complex.
Is this enough for building
the pre-initiation complex?
Unfortunately not.
It turns out that there
are a host of other,
I'll call them ancillary factors
in addition to the multi-subunit
RNA polymerase itself
that are necessary for you
to build up an ensemble
that is necessary to form an active,
ready-to-activate transcriptional
pre-initiation complex,
or the PIC.
And this is kind of the
picture we're getting to,
and even this picture
with many, many colors
and many, many different polypeptides,
you know, that adds up to probably greater
than 85 individual proteins
that all have to kind of fit
together like a jigsaw puzzle.
It's probably not even the whole story,
you'll notice I still have one
big red question mark there
because I think as we begin
to study specific cell types
and specific processes
like embryonic development
or germ layer formation,
additional components
that are not present here
in this prototypic pre-initiation complex
will come into play,
and that's a subject
of a subsequent lecture.
But already you can tell that
the transcriptional machinery
is anything but simple.
So, can we get a better
idea of what transcription
might actually look like
and what's happening
when a transcription process takes place?
So, let me first of all say
that I'm gonna finish my lecture
now with a little cartoon,
which is our attempt to imagine
the events that take place
when you form a pre-initiation complex,
you bring regulatory proteins
to the activated gene
and what happens during this process.
Now, keep in mind that
this is at this point
mostly a cartoon that
is in our imagination
and only parts, if any,
of this are probably real,
but it gives you a sense of the complexity
of the transactions
that have to take place
just for one gene to
transcribe and express itself.
So, let me show you the movie,
and then we'll finish
just by keeping in mind
that there's much to be learned.
And in my next lecture,
we'll go into the selectivity
of this process in specialized cell types.
So, now let's see what
this sort of cartoon
of transcription looks like.
So, we start off with DNA
with some preassembled TFIID molecule,
and along comes this other green molecule,
which is actually a co-factor,
which then forms this very large complex
with RNA polymerase.
And then a distal
activator protein comes in
and activates the process.
And this molecule, this bluish
molecule that's moved away
from the complex is
actually the RNA polymerase.
And that little yellow
sort of bead on a string
is actually the RNA product.
So, that gives you a sense that
things have to happen quickly,
and yet it involves many, many molecules
having to assemble and then disassemble
for this reaction to happen.
And in my next lecture,
we'll go into more specific
aspects of this reaction,
and particularly during
embryonic development
and tissue-specific gene expression.