Hey everyone. Welcome to my Deep Q-Learning tutorial. Here's what to expect from this video. Since Deep Q-Learning is a little complicated to explain, I'm going to use the Frozen Lake reinforcement learning environment. It's a very simple environment, so I'll do a quick intro on how it works. We'll also quickly answer the question of why we need reinforcement learning on such a simple environment. Next, to navigate the environment, we need the epsilon-greedy algorithm; both Q-learning and Deep Q-Learning use the same algorithm. Once we know how to navigate the environment, we'll take a look at how the output of Q-learning differs from the output of Deep Q-Learning: in Q-learning we train a Q-table, while in Deep Q-Learning we train a deep Q-network. We'll work through how the Q-table is trained, then see how that differs from training a deep Q-network. Training a deep Q-network also requires a technique called experience replay. And after we have an idea of how Deep Q-Learning works, I'm going to walk through the code, run it, and demo it.
Just in case you're not familiar with how this environment works, here's a quick recap. The goal is to get the learning agent to figure out how to reach the goal in the bottom-right corner. The actions the agent can take are left, down, right, and up; internally, we represent left as 0, down as 1, right as 2, and up as 3. In this case, if the agent tries to go left or up, it just stays in place. In general, if it tries to move off the grid, it stays in the same spot. Internally, we also number each state: this one is 0, this one is 1, then 2, 3, 4, 5, 6, all the way to 14 and 15, so 16 states total. Each attempt at navigating the map is considered one episode. The episode ends if the agent falls into one of the holes or reaches the goal. When it reaches the goal, it receives a reward of one; all other states give no reward and no penalty. Okay, so that's how this environment works.
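If you want to poke at the environment yourself, here's a minimal sketch of setting it up with Gymnasium; the map name, render mode, and the hard-coded action are just illustrative choices, not this tutorial's code.

```python
import gymnasium as gym

# Minimal sketch: the 4x4 Frozen Lake map described above, slippery turned off for now.
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="human")

state, _ = env.reset()                                      # agent starts at state 0
new_state, reward, terminated, truncated, _ = env.step(2)   # action 2 = right
print(new_state, reward, terminated)
env.close()
```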
At this point, you might be wondering why we would need reinforcement learning to solve such a simple map. Why not just use a pathfinding algorithm? There is a twist to this environment: a flag called is_slippery. When it's set to true, the agent doesn't always execute the action it intends. For example, if the agent wants to go right, there's only a one-third chance that it actually moves right, and a two-thirds chance of it going in an adjacent direction. With slippery turned on, a pathfinding algorithm won't be able to solve this; in fact, I'm not even sure what algorithm could. That's where reinforcement learning comes in. If I don't know how to solve this, I can just give the agent some incentive and let it figure out how to solve it. So, that's where reinforcement learning shines.
In terms of how the agent navigates the map, both Deep Q-Learning and Q-learning use the epsilon-greedy algorithm. Basically, we start with a variable called epsilon set equal to one, and then we generate a random number. If the random number is less than epsilon, we pick a random action; otherwise, we pick the best action that we know of at the moment. At the end of each episode, we decrease epsilon a little bit at a time. So, essentially, we start off with 100% random exploration, and by the end of training we're almost always selecting the best action.
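Just to make that concrete, here's a tiny sketch of epsilon-greedy selection; the q_values lookup and the decay amount are placeholders, not the exact code from this tutorial.

```python
import random

def choose_action(state, q_values, epsilon, num_actions=4):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randint(0, num_actions - 1)                     # random action
    return max(range(num_actions), key=lambda a: q_values[state][a])  # best known action

# at the end of each episode (illustrative decay amount):
# epsilon = max(epsilon - 0.001, 0)
```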
Before we jump into the details of how the training works in Q-learning versus Deep Q-Learning, let's take a look at what the outcome looks like. For Q-learning, the output of training is a Q-table, which is nothing more than a two-dimensional array of 16 states by four actions. After training, the whole table is filled with Q-values. For example, it might look something like this. In this case, the agent is at state zero. We look at that row of the table and see where the maximum value is; here, it's the value for going right, so that is the prescribed action. Over in Deep Q-Learning, the output is a deep Q-network, which is actually nothing more than a regular feed-forward neural network. This is what it really looks like, but I think we should switch back to the simplified view. The input layer has 16 nodes and the output layer has 4 nodes. The way we send an input into the input layer is like this: if the agent is at state zero, we set the first node to one and everything else to zero. If the agent is at state one, then the first node is zero, the second node is one, and everything else is zero. This is called one-hot encoding. Just one more: if we want to put in state 15, everything else is all zeros and the node for state 15 is one. So, that's how the input works.
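As a quick illustration of that encoding (a sketch, not the tutorial's exact helper), a one-hot input for a 16-state map can be built like this:

```python
import torch

def one_hot_state(state: int, num_states: int = 16) -> torch.Tensor:
    """Return a vector of zeros with a 1 at the index of the current state."""
    encoded = torch.zeros(num_states)
    encoded[state] = 1.0
    return encoded

print(one_hot_state(0))    # 1 at index 0, zeros everywhere else
print(one_hot_state(15))   # zeros everywhere else, 1 at index 15
```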
With this input, the output that gets calculated is a set of Q-values. Now, these Q-values are not going to look exactly like what's in the Q-table, but they'll be something similar. Let me throw some numbers in here. Okay, so same idea: for this particular input, the best action is the one with the highest Q-value. When training a neural network, we're essentially training the weights associated with each of these lines in the network, plus the biases of the hidden layers. Now, how do we know how many hidden layers we need? In this case, one layer was enough, but you can certainly add more if necessary. And how many nodes do we need in the hidden layer? I tried 16 and it was able to solve the map, so I'm sticking with that, but you can certainly increase or decrease this number to see what happens. Okay, so that's the difference between the output of Q-learning and the output of Deep Q-Learning.
As the agent navigates the map, it uses the Q-learning formula to calculate the Q-values and update the Q-table. The formula might look a little scary, but it's actually not that bad. Let's work through some examples. Say our agent is in state 14 and it goes right to get to state 15. We're calculating Q(state 14, action 2), which is this cell. The current value of that cell is zero, because we initialized the whole Q-table to zero at the beginning. To that we add the learning rate, a hyperparameter that we can set (as an example, I'll use 0.01), multiplied by the quantity in the brackets: the reward (we get a reward of one because we reached the goal), plus the discount factor (another parameter that we set; I'll use 0.9) times the max Q-value of the new state, so the max Q-value of state 15, minus the current value of the cell we're updating. Since state 15 is a terminal state, its row will only ever contain zeros, so the max is zero and that term drops out, and the value we subtract is also zero. Work through the math and that's 0.01 × 1, so we get 0.01 in this cell.

Now, let's do another one really quickly. The agent is at state 13 and takes a right, so we're updating Q(state 13, action 2). The row for state 13 is still all zeros, so the current value is zero. Add the learning rate times the bracket: there is no reward, plus the discount factor times the max Q-value of the new state, state 14, which is now 0.01, minus zero again. That works out to 0.01 × (0.9 × 0.01) = 0.01 × 0.009 = 0.00009, a small positive number.

Okay, it's actually pretty straightforward. Now, how the heck does this formula help find a path? If we train enough, the theory is that the value next to the goal will end up really close to one, the states next to it will be 0.9-something, the states adjacent to those will be around 0.8-something, and so on. If we keep going, we can see that a path emerges; actually, two paths are possible. So, mathematically, this is how the path is found.
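For reference, the update being described is the standard Q-learning rule, Q(s, a) ← Q(s, a) + α·(r + γ·max Q(s', a') − Q(s, a)). Here's a tiny sketch that reproduces the two worked examples above (α = 0.01, γ = 0.9):

```python
import numpy as np

alpha, gamma = 0.01, 0.9        # learning rate and discount factor
q = np.zeros((16, 4))           # 16 states x 4 actions, initialized to zero

def q_update(state, action, reward, new_state):
    q[state, action] += alpha * (reward + gamma * q[new_state].max() - q[state, action])

q_update(14, 2, 1, 15)   # reaching the goal from state 14
print(q[14, 2])          # 0.01
q_update(13, 2, 0, 14)   # stepping from state 13 toward state 14, no reward
print(q[13, 2])          # 0.00009
```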
Over in Deep Q-Learning, the formula looks like this: we set Q equal to the reward if the new state is a terminal state. Otherwise, we set it to the reward plus the discount factor times the max Q-value of the new state, which is just this inner part of the Q-learning formula. Let's see how this formula is used in training.
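As a sketch, that target calculation can be written as a small helper; the function name and arguments here are illustrative, not the tutorial's code.

```python
def dql_target(reward: float, terminated: bool, gamma: float, max_next_q: float) -> float:
    """Target Q-value for one transition, per the Deep Q-Learning formula above."""
    if terminated:                      # new state is a hole or the goal
        return reward
    return reward + gamma * max_next_q  # otherwise: reward + discounted best next Q

print(dql_target(reward=1, terminated=True,  gamma=0.9, max_next_q=0.0))  # 1
print(dql_target(reward=0, terminated=False, gamma=0.9, max_next_q=0.5))  # 0.45
```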
For Deep Q-Learning, we actually need two neural networks. Let me walk through the steps. The network on the left is called the policy network; this is the network we actually train. The one on the right is called the target network; the target network is the one that makes use of the Deep Q-Learning formula.

Now, let's walk through the steps of training. Step one is the creation of the policy network. Step two: we make a copy of the policy network into the target network, basically copying the weights and biases over, so both networks are identical. Step three: the agent navigates the map as usual; let's say the agent is here in state 14 and moves into state 15. Step three is just navigation. Step four: we input state 14 into the policy network. Remember how we have to encode the input? The nodes for states 0, 1, 2, 3, and so on through 12 and 13 are zero, the node for state 14 is one, and state 15 is zero. As you know, when a neural network is created, it comes with a random set of weights and biases, so with this input we'll still get an output, even though the values are pretty much meaningless. We might get some numbers that look like this (I'm just putting in random values). As a reminder, the outputs correspond to the left, down, right, and up actions.

Step five: we do the same thing with the target network. We take the exact same input, state 14, and send it into the target network, which will calculate the exact same numbers, because at this point the target network is identical to the policy network. Step six: this is where we calculate the target Q-value for state 14, action 2. Since we're going into state 15, which is a terminal state, we set it equal to the reward, which is one. Step seven is to set the target. For input state 14, the output node for action 2 is this one; we take the value we calculated in step six and replace that Q-value in the output. Step eight: we take the target Q-values and use them to train the policy network. So, this value is the one that's really going to change. As you know, a neural network's output doesn't jump straight to one; it moves in that direction, so maybe it goes to 0.01, but if you repeat the training many, many times, it will approach one.

Step nine is to repeat the whole thing again. Of course, we're not going to recreate the policy network or make another copy of it, so what we're repeating is steps three through eight. Step ten: after a certain number of steps or episodes, we sync the policy network with the target network, which means we make them identical again by copying the weights and biases from the policy network over to the target network. After syncing the networks, we repeat step nine and then step ten until training is done. Okay, so that's generally how Deep Q-Learning works. That might have been a little bit complicated, so maybe rewind the video, watch it again, and make sure you understand what's happening.
To effectively train a neural network, we need to randomize the training data we send into it. However, if you remember from steps three and four, the agent takes one action and then we send that single step into the neural network as training data. So the question is, how do we randomize the order of a single sample? That's where experience replay comes in. We need a step 3a, where we memorize the agent's experience. As the agent navigates the map, we store the state it was in, the action it took, the new state it reached, whether there was a reward, and whether the new state is a terminal state. We take that and insert it into the memory, and the memory is nothing more than a Python deque. The way a deque works is that once it fills up past its maximum length, the oldest entries get purged as new ones come in. We also need a step 3b, the replay step. This is where we take, say, 30 random samples from the deque and pass them on to step four for training. So that's what experience replay is.
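As a quick illustration of that deque behavior (a toy example, not the tutorial's memory class):

```python
from collections import deque

memory = deque([], maxlen=3)            # keep at most 3 transitions
for transition in ["t1", "t2", "t3", "t4"]:
    memory.append(transition)

print(memory)   # deque(['t2', 't3', 't4'], maxlen=3) -- the oldest entry was purged
```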
Before we jump into the code: if you have problems installing Gymnasium, especially on Windows, I've got a video for that. I'm not going to walk through the Q-learning code, but if you're interested in that, jump to my Q-learning code walkthrough video after watching this one. Also, if you have zero experience with neural networks, you can check out this basic tutorial on neural networks. You'll also need PyTorch, so head over to PyTorch.org and get that installed.

The first thing we do is create a class to represent our deep Q-network; I'm calling it DQN. As mentioned before, a deep Q-network is nothing more than a feed-forward neural network, so there's really nothing special about it. Here, I'm using a pretty standard way of creating a neural network with PyTorch; if you look up any PyTorch tutorial, you'll probably see something just like this. Since this is not a PyTorch tutorial, I'm not going to spend too much time explaining how PyTorch works. This class inherits from the neural network Module, which requires us to implement two functions: the __init__ function and the forward function. In the __init__ function, I'm passing in the number of nodes in the input layer, hidden layer, and output layer. Back at our diagram, we have 16 input nodes; as mentioned before, I'm using 16 nodes in the hidden layer, which is something you can adjust yourself; and the output layer has four nodes. We declare the hidden layer, which is 16 going into 16, and then the output layer, 16 going into four. In the forward function, x is the training data, and we send it through the neural network. Again, this is pretty common PyTorch code.
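Here's a sketch of what a DQN class like the one described can look like; the layer sizes match the walkthrough, but the argument names are illustrative and the tutorial's actual class may differ slightly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    """A plain feed-forward network: 16 input nodes -> 16 hidden nodes -> 4 output nodes."""
    def __init__(self, in_states=16, h1_nodes=16, out_actions=4):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)    # hidden layer
        self.out = nn.Linear(h1_nodes, out_actions)  # output layer

    def forward(self, x):
        x = F.relu(self.fc1(x))   # send the input through the hidden layer
        return self.out(x)        # raw Q-values, one per action
```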
Next, we need a class to represent the replay memory, so it's this portion that we're implementing right now. In the __init__ function, we pass in a maximum length and create the Python deque. The append function appends a transition, where the transition is this tuple: state, action, new state, reward, and terminated. The sample function returns a random sample of whatever size we want from the memory, and the len function simply returns the length of the memory.
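A sketch of that replay memory class, assuming the transition tuple order described above:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores (state, action, new_state, reward, terminated) transitions in a deque."""
    def __init__(self, maxlen):
        self.memory = deque([], maxlen=maxlen)   # old transitions fall off when full

    def append(self, transition):
        self.memory.append(transition)

    def sample(self, sample_size):
        return random.sample(self.memory, sample_size)   # random batch for training

    def __len__(self):
        return len(self.memory)
```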
The FrozenLakeDQL class is where we do our training. We set the learning rate and the discount factor; those are part of the Q-learning formula, and they're values you can adjust and play around with. The network sync rate is the number of steps the agent takes before syncing the policy and target networks; that's the setting for step ten, where we sync the policy and the target network. We set the replay memory size to 1,000 and the replay memory sample size to 32; these are also numbers you can play around with. Next are the loss function and the optimizer, which are both PyTorch objects. For the loss function, I'm simply using mean squared error. For the optimizer, we'll initialize that at a later time. This actions list simply maps the action numbers to letters for printing.
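For orientation, here's a sketch of those settings; the values that aren't stated in the narration (the learning rate and the sync rate) are placeholders, not the tutorial's exact numbers.

```python
import torch.nn as nn

learning_rate = 0.001          # placeholder value; part of the Q-learning formula
discount_factor = 0.9          # gamma, as in the worked examples
network_sync_rate = 10         # placeholder: steps between policy/target syncs
replay_memory_size = 1000      # size of the replay deque
mini_batch_size = 32           # sample size pulled from memory for training

loss_fn = nn.MSELoss()         # mean squared error loss
optimizer = None               # created later, once the policy network exists
ACTIONS = ['L', 'D', 'R', 'U'] # map action numbers to letters for printing
```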
In the train function, we can specify how many episodes we want to train the agent, whether we want to render the map on screen, and whether we want to turn on the slippery flag. We instantiate the Frozen Lake environment, then create a variable to store the number of states (this is going to be 16) and one for the number of actions (this is going to be four). Here, we initialize epsilon to one and instantiate the replay memory with a size of 1,000. Now, we create the policy DQ-network; this is step one, creating the policy network. We also create a target network and copy the weights and biases from the policy network into the target network, so that is step number two, making the target and policy networks identical. Before training, we print out the policy network just so we can compare it to the end result. Next, we initialize the optimizer that we declared earlier; we're simply using the Adam optimizer, passing in the learning rate. We'll use this rewards-per-episode list to keep track of the rewards collected per episode, and this epsilon history list to keep track of epsilon decaying over time. We'll use step count to keep track of the number of steps taken; this is used to determine when to sync the policy and target networks again, so that is for step ten, the syncing.
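Continuing the sketch, that setup might look like this, reusing the DQN, ReplayMemory, and hyperparameter names sketched above; everything here is illustrative rather than the tutorial's code verbatim.

```python
import gymnasium as gym
import torch

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
num_states = env.observation_space.n     # 16
num_actions = env.action_space.n         # 4

epsilon = 1.0
memory = ReplayMemory(replay_memory_size)                # size 1,000

policy_dqn = DQN(num_states, 16, num_actions)            # step 1: create the policy network
target_dqn = DQN(num_states, 16, num_actions)
target_dqn.load_state_dict(policy_dqn.state_dict())      # step 2: make them identical

optimizer = torch.optim.Adam(policy_dqn.parameters(), lr=learning_rate)

rewards_per_episode = []
epsilon_history = []
step_count = 0        # used to decide when to sync the networks (step 10)
```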
Next, we loop through the number of episodes we specified when calling the train function. We reset the map, so the agent starts at state zero, and we initialize the terminated and truncated flags. Terminated means the agent fell into a hole or reached the goal; truncated means the agent took more than 200 actions. On such a small 4x4 map, that's probably never going to occur, but we'll keep it anyway. Now, we have a while loop checking the terminated and truncated flags, and we keep looping until one of those conditions is met. This part is basically the epsilon-greedy algorithm: if a random number is less than epsilon, we just pick a random action; otherwise, we use the policy network to calculate the set of Q-values, take the maximum of that, extract that item, and that's going to be the best action. We have a function here called state_to_dqn_input; let's jump down there. Remember that we have to take the state and encode it; that's what this function does, this type of encoding here. Okay, let's go back. With PyTorch, if we only want to perform a prediction, we should call torch.no_grad so that it doesn't calculate the gradients needed for training. After selecting either a random action or the best action, we call the step function to execute the action. When we take that action, it returns the new state, whether there was a reward, whether we reached a terminal state, and whether we got truncated. We take all that information and put it into our memory; that is step 3a. The epsilon-greedy navigation itself was step three.
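Here's a sketch of that inner-loop piece: epsilon-greedy selection under torch.no_grad, executing the action, and storing the transition (steps 3 and 3a). It continues the setup sketch above and assumes a one-hot state_to_dqn_input helper like the one shown earlier.

```python
import random
import torch

def state_to_dqn_input(state: int, num_states: int) -> torch.Tensor:
    encoded = torch.zeros(num_states)
    encoded[state] = 1.0
    return encoded

# inside the while loop for one episode (sketch):
if random.random() < epsilon:
    action = env.action_space.sample()            # explore: random action
else:
    with torch.no_grad():                         # prediction only, no gradients
        action = policy_dqn(state_to_dqn_input(state, num_states)).argmax().item()

new_state, reward, terminated, truncated, _ = env.step(action)    # execute the action
memory.append((state, action, new_state, reward, terminated))     # step 3a: remember it
state = new_state
step_count += 1
```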
After executing the action, we set the state equal to the new state and increment our step counter. If we received a reward, we put it on our list. Now, we check the memory to see if we have enough training data to do optimization on, and we also check whether we've collected at least one reward; if we haven't collected any rewards, there's really no point in optimizing the network. If those conditions are met, we call memory.sample and pass in the batch size, which was 32, and we get a batch of training data out of the memory. We pass that training data into the optimize function, along with the policy network and the target network.
Let's jump over to the optimize function. The first line just looks at the policy network and gets the number of input nodes, which we expect to be 16. Then there are the current_q_list and target_q_list: this is the current Q-list, and this is the target Q-list. What I'm about to describe now is step 3b, replaying the experience. To replay the experience, we loop through the training data inside the mini-batch. Let's jump down here for a moment. This is step number four: we take each state and pass it into the policy network to calculate the current list of Q-values; the output is that list of Q-values. Step number five: we pass the same thing into the target network and get its Q-values out here, which should match the Q-values from the policy network at first. Step number six is up here; as a reminder, step six uses this formula: if terminated, just return the reward; otherwise, use that second formula. Now that we have a target, we go to step seven. Step seven takes the output of step six and replaces the respective Q-value; that's what we're doing here, replacing the Q-value of that particular action with the target that was calculated up above. Step number eight is to take the target values and use them to train the current Q-values. That's what we have here: we call the loss function, passing in the current set of Q-values and the target set of Q-values, and then this is just standard PyTorch code to optimize the policy network.
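Putting steps four through eight together, an optimize function along these lines might look like the sketch below. It reuses the state_to_dqn_input helper sketched earlier; variable names and the exact tensor handling are illustrative, not the tutorial's code verbatim.

```python
import torch

def optimize(mini_batch, policy_dqn, target_dqn, loss_fn, optimizer,
             num_states=16, discount_factor=0.9):
    current_q_list, target_q_list = [], []

    for state, action, new_state, reward, terminated in mini_batch:
        # Step 6: target value from the Deep Q-Learning formula
        if terminated:
            target = torch.tensor(float(reward))
        else:
            with torch.no_grad():
                target = reward + discount_factor * \
                    target_dqn(state_to_dqn_input(new_state, num_states)).max()

        # Step 4: current Q-values from the policy network
        current_q_list.append(policy_dqn(state_to_dqn_input(state, num_states)))

        # Steps 5 and 7: take the target network's Q-values and overwrite
        # the value of the action that was actually taken
        with torch.no_grad():
            target_q = target_dqn(state_to_dqn_input(state, num_states))
        target_q[action] = target
        target_q_list.append(target_q)

    # Step 8: train the policy network toward the target Q-values
    loss = loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```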
Now, step nine: step nine is to repeat steps three through eight, from navigation through optimizing the network. Basically, that's just continuing this inner while loop of stepping through the states and this outer for loop of stepping through each episode. Step ten is where we sync the policy network and the target network, and we do that down here: if the number of steps taken is greater than the network sync rate that we set, we copy the policy network into the target network and reset the step counter. Also, after each episode, we decay the epsilon value. After syncing the networks, we repeat step nine again, which is basically the navigation and training again.
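A sketch of the step-ten sync and the per-episode epsilon decay; the decay schedule here (1/episodes per episode) is an assumption, not necessarily the tutorial's exact rule.

```python
def sync_target(policy_dqn, target_dqn):
    """Step 10: copy the policy network's weights and biases into the target network."""
    target_dqn.load_state_dict(policy_dqn.state_dict())

def decay_epsilon(epsilon, episodes):
    """Decay epsilon a little after each episode, never going below zero."""
    return max(epsilon - 1 / episodes, 0)
```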
So, it just goes back up here and does this all over again, all the way until we've finished the specified number of episodes. So that was training. After training, we close the environment. We can save the policy, meaning the weights and biases, into a file; I'm hardcoding the filename here, but you can definitely make this more dynamic. Then I create a new graph: I'm basically graphing the rewards collected per episode, and I'm also graphing the epsilon history here. After graphing, I save the graphs into an image.
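A sketch of saving the weights and the two plots, continuing the names from the earlier sketches; the filenames are illustrative.

```python
import torch
import matplotlib.pyplot as plt

torch.save(policy_dqn.state_dict(), "frozen_lake_dql.pt")   # weights and biases to a file

plt.figure(1)
plt.subplot(121)
plt.plot(rewards_per_episode)    # rewards collected per episode
plt.subplot(122)
plt.plot(epsilon_history)        # epsilon decaying over time
plt.savefig("frozen_lake_dql.png")
```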
Alright, so that was the train function, and we've already talked about the optimize function, so let me fold that. Now, the test function runs the Frozen Lake environment with the policy that we learned from the train function. We can also pass in the number of episodes and whether we want to turn slippery on or not. The rest of the code looks pretty similar to what we had in the train function: we instantiate the environment, get the number of states and actions (16 and 4 here), and declare the policy network, loaded from the file that we saved during training. This is just the PyTorch code to switch the policy network into evaluation (prediction) mode rather than training mode. We print the trained policy, and then we loop over the episodes. We set the agent up at the top; you've seen this before: keep looping until the agent gets terminated or truncated. Here, we select the best action from the policy network and execute it, and then we close the environment. Ideally, when we run this, we'll see the agent navigate the map and reach the goal. However, if slippery is on, there is no guarantee that the agent will solve it in one try; it might take a couple of tries for the agent to solve the map.
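A sketch of how that test setup can look in PyTorch, assuming the DQN class and filename from the earlier sketches:

```python
import torch

# Recreate the network and load the weights saved after training (filename is illustrative)
policy_dqn = DQN(in_states=16, h1_nodes=16, out_actions=4)
policy_dqn.load_state_dict(torch.load("frozen_lake_dql.pt"))
policy_dqn.eval()   # switch to evaluation mode instead of training mode
```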
Alright, down at the main function, we first create an instance of the FrozenLakeDQL class. First, we'll try non-slippery. We'll train it for 1,000 episodes, and we'll run the test four times, just because the map pops up and goes away as soon as the agent gets to the goal, so we want to run it a couple of times so we can actually see it on screen. Okay, I'm just hitting Ctrl+F5.

Alright, training is done, and it looks like the agent is able to solve the map. We also have a printout of the policy it learned; I print it in a way that matches up with the grid, and we'll jump back to it in a second. Let's go to our files. After training, we have a new graph: on the left side is the number of rewards collected over the 1,000 episodes. As you can see, it improves over time, and by the end it's collecting a lot of rewards. On the right side, we can see epsilon decaying, starting from one and slowly going down to zero. Another file that gets created is a binary file; we can't really display it, but it contains the weights and biases of the policy network. So, if we want to run the test again, we don't need to retrain: we can comment out the training line and just run this again. And we can see the agent solve the map.
We can look at the policy the agent learned; the way it's printed matches up with the map. At state zero, the best action was to go right, which leads to state one. At state one, the best action is to go right again. At state two, go down. At state six, go down again. At state ten, go down again. And at state 14, go right. So the path was right, right, down, down, down, right. Remember from earlier that there were two possible paths: the one the agent actually learned, plus the one going down first. Because epsilon-greedy has a randomness factor, whether the agent learns the top path or the bottom path is somewhat random; if we train again, maybe it'll go down the next time.
Now, let's turn the slippery flag on and uncomment the training line. With slippery on, we expect the agent to fail a lot more often, falling into the holes much more frequently, so let me triple the number of training episodes and run the test ten times. Okay, I'm running this again. Again, with slippery turned on, there's no guarantee that the trained agent will pass every episode. Let's bring the window up. You can see it trying to get to the bottom right, but because of the slipperiness it fails right there. But there you go, it was able to solve it. Let's look at the graph. Compared to the non-slippery surface, the results are significantly worse. Also, you want to be careful about getting stuck in spots like these: if I had been unlucky enough to set my training episodes to, say, 2,200, and training got stuck there, the agent would have learned a bad policy and wouldn't be able to solve the map. So, in your own training, if you find that the agent can't solve the map when slippery is turned on, it might be because of things like this.
Okay, that concludes our Deep Q-Learning tutorial. I'd love to hear feedback from you: was the explanation easy to understand? What can I improve on? And what other topics are you interested in?