Hey everyone. Welcome to my Deep Q-Learning tutorial. Here's what to expect from this video. Since Deep Q-Learning is a little bit complicated to explain, I'm going to use the Frozen Lake reinforcement learning environment. It's a very simple environment, so I'll do a quick intro on how it works. We'll also quickly answer the question of why we need reinforcement learning on such a simple environment. Next, to navigate the environment, we need the epsilon-greedy algorithm; both Q-learning and Deep Q-Learning use the same algorithm. Once we know how to navigate the environment, we'll look at how the output of Q-learning differs from that of Deep Q-Learning: in Q-learning we're training a Q-table, while in Deep Q-Learning we're training a deep Q-network. We'll work through how the Q-table is trained, and then see how that differs from training a deep Q-network. Training a deep Q-network also requires a technique called experience replay. And after we have an idea of how Deep Q-Learning works, I'm going to walk through the code, run it, and demo it.
Just in case you're not familiar with how this environment works, here's a quick recap. The goal is to get the learning agent to figure out how to reach the goal on the bottom right. The actions the agent can take are: left, which we'll represent internally as zero; down is one; right is two; and up is three. In this case, if the agent tries to go left or up, it just stays in place. In general, if it tries to go off the grid, it stays in the same spot. Internally, we represent each state as a number: this one is 0, this one is 1, then 2, 3, 4, 5, 6, all the way to 14 and 15, so 16 states total. Each attempt at navigating the map is considered one episode. The episode ends if the agent falls into one of the holes or reaches the goal. When it reaches the goal, it receives a reward of one. All other states have no reward or penalty. Okay, so that's how this environment works.
At this point, you might be wondering why we would need reinforcement learning to solve such a simple map. Why not just use a pathfinding algorithm? There is a twist to this environment: a flag called is_slippery. When it's set to true, the agent doesn't always execute the action it intends to. For example, if the agent wants to go right, there's only a one-third chance that it will actually execute this action, and a two-thirds chance of it going in an adjacent direction. With slippery turned on, a pathfinding algorithm won't be able to solve this; in fact, I'm not even sure what algorithm could. That's where reinforcement learning comes in: if I don't know how to solve this, I can just give the agent some incentive and let it figure out how to solve it. That's where reinforcement learning shines.
In terms of how the agent navigates the map, both Deep Q-Learning and Q-learning use the epsilon-greedy algorithm. Basically, we start with a variable called epsilon set equal to one, and then we generate a random number. If the random number is less than epsilon, we pick a random action; otherwise, we pick the best action that we know of at the moment. At the end of each episode, we decrease epsilon by a little bit. So, essentially, we start off with 100% random exploration, and by the end of training we're almost always selecting the best action to take.
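To make the idea concrete, here's a minimal sketch of epsilon-greedy action selection in Python. The names are purely illustrative; `best_action_fn` stands in for "the best action we know of", e.g. the argmax of a Q-table row or of the Q-network's output.

```python
import random

def epsilon_greedy(epsilon, num_actions, best_action_fn):
    """Pick a random action with probability epsilon, otherwise the best known action."""
    if random.random() < epsilon:
        return random.randint(0, num_actions - 1)   # explore: random action
    return best_action_fn()                         # exploit: best known action

# After each episode, epsilon is decreased a little, for example:
# epsilon = max(epsilon - 1 / num_episodes, 0)
```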
Before we jump into the details of how training works in Q-learning versus Deep Q-Learning, let's take a look at what the outcome looks like. For Q-learning, the output of the training is a Q-table, which is nothing more than a two-dimensional array of 16 states by four actions. After training, the whole table will be filled with Q-values; for example, it might look something like this. In this case, the agent is at state zero. We look at that row of the table and find the maximum value; in this case, it's to go right. So, that is the prescribed action.
Over at Deep Q-Learning, the output is a deep Q-network, which is actually nothing more than a regular feed-forward neural network. This is what it really looks like, but I think we should switch back to the simplified view. The input layer has 16 nodes and the output layer has 4 nodes. The way we send an input into the input layer is like this: if the agent is at state zero, we set the first node to one and everything else to zero. If the agent is at state one, then the first node is zero, the second node is one, and everything else is zero. This is called one-hot encoding. Just one more: if we want to put in state 15, then everything is zero except the node for state 15, which is one. So, that's how the input works. Given the input, the outputs that get calculated are Q-values. Now, these Q-values are not going to look exactly like what's in the Q-table, but they might be similar; let me throw some numbers in here. Okay, same thing: for this particular input, the best action is the one with the highest Q-value. When training a neural network, we're essentially training the weights associated with each of these lines in the network and the biases of the hidden layers. Now, how do we know how many hidden layers we need? In this case, one layer was enough, but you can certainly add more if necessary. And how many nodes do we need in the hidden layer? I tried 16 and it was able to solve the map, so I'm sticking with that, but you can certainly increase or decrease this number to see what happens. Okay, so that's the difference between the output of Q-learning versus Deep Q-Learning.
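As a quick illustration of the one-hot input encoding just described (this helper is only a sketch; the actual code uses its own encoding function, which we'll see later):

```python
import torch

def one_hot_state(state: int, num_states: int = 16) -> torch.Tensor:
    """Encode a state number as a one-hot vector: 1 at that state's position, 0 elsewhere."""
    encoded = torch.zeros(num_states)
    encoded[state] = 1.0
    return encoded

print(one_hot_state(0))    # 1 in position 0, zeros everywhere else
print(one_hot_state(15))   # 1 in position 15, zeros everywhere else
```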
As the agent navigates the map, it uses the Q-learning formula to calculate Q-values and update the Q-table. The formula might look a little bit scary, but it's actually not that bad: Q(s,a) = Q(s,a) + learning rate × (reward + discount factor × max Q(s') − Q(s,a)). Let's work through some examples. Say our agent is in state 14 and it goes right to reach state 15. We're calculating Q of state 14, action 2, which is this cell. The current value of that cell is zero, because we initialized the whole Q-table to zero at the beginning. To that we add the learning rate, a hyperparameter that we can set; as an example, I'll just use 0.01. That's multiplied by the reward, which is one because we reached the goal, plus the discount factor, another parameter that we set; I'll use 0.9, times the max Q-value of the new state, that is, the max Q-value of state 15. Since state 15 is a terminal state, its row will never hold anything other than zeros, so the max is zero and these two terms drop out. Then we subtract the same Q-value we started with, which is also zero. Work through the math and we get 0.01 here.

Now let's do another one quickly. The agent is here and takes a right: Q of state 13, action 2 (right). State 13's row is also all zeros; this is the cell we're updating. It starts at zero, plus the learning rate times (the reward, which is zero, plus the discount factor times the max Q-value of the new state, state 14, which is 0.01, minus the current value, which again is zero). That works out to 0.01 × 0.9 × 0.01 = 0.00009; a tiny value, but no longer zero. Okay, it's actually pretty straightforward.

Now, how the heck does this formula help find a path? If we train enough, the theory is that this cell's value will end up really close to one, the states next to it will be 0.9-something, the states adjacent to those will be around 0.8-something, and so on. If we keep going, we can see that a path emerges; actually, two paths are possible. So, mathematically, this is how the path is found.
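Here's the same arithmetic as a small Python sketch, using the standard Q-learning update Q(s,a) ← Q(s,a) + α·(reward + γ·max Q(s') − Q(s,a)) with the learning rate and discount factor from the examples above:

```python
learning_rate = 0.01   # alpha
discount = 0.9         # gamma

# 16 states x 4 actions, all initialized to zero
q = [[0.0] * 4 for _ in range(16)]

# Agent in state 14 goes right (action 2) into the goal, state 15, reward 1:
q[14][2] += learning_rate * (1 + discount * max(q[15]) - q[14][2])
print(q[14][2])   # 0.01

# Agent in state 13 goes right (action 2) into state 14, no reward:
q[13][2] += learning_rate * (0 + discount * max(q[14]) - q[13][2])
print(q[13][2])   # roughly 9e-05 (0.00009)
```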
Over at Deep Q-Learning, the formula looks like this: we set the target Q-value equal to the reward if the new state is a terminal state; otherwise, we set it to the reward plus the discount factor times the max Q-value of the new state, which is this part of the Q-learning formula. Let's see how this formula gets used in training.
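In code, that target could be computed roughly like this (a sketch only; `target_net` is whatever network plays the role of the target network described next, and the discount factor is the same 0.9 used earlier):

```python
import torch

def compute_target(reward, new_state_encoded, terminated, target_net, discount=0.9):
    """Target Q-value: just the reward for terminal states, otherwise
    reward + discount * max Q-value of the new state from the target network."""
    if terminated:
        return float(reward)
    with torch.no_grad():
        return reward + discount * target_net(new_state_encoded).max().item()
```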
For Deep Q-Learning, we actually need two neural networks. Let me walk through the steps. The network on the left is called the policy network; this is the network that we're going to do the training on. The one on the right is called the target network; the target network is the one that makes use of the Deep Q-Learning formula. Now, let's walk through the steps of training.

Step one is the creation of the policy network. Step two: we make a copy of the policy network into the target network. Basically, we're copying the weights and biases over, so both networks are identical. Step three: the agent navigates the map as usual. Let's say the agent is here in state 14 and it's going into state 15; step three is just navigation. Step four: we input state 14. Remember how we have to encode the input? The nodes run 0, 1, 2, 3, ... 12, 13, 14, 15, and only the node for state 14 is set to one. As you know, when a neural network is created it comes with a random set of weights and biases, so with this input we're actually going to get an output, even though the values are pretty much meaningless. So we might get something that looks like this; I'm just putting in some random numbers. These Q-values are meaningless, and as a reminder, the outputs correspond to the actions left, down, right, and up.

Step five: we do the same thing. We take the exact same input, state 14, and send it into the target network, which will calculate the exact same numbers, because the target network is currently identical to the policy network. Step six: this is where we calculate the target Q-value for state 14, taking action two. Since we're going into state 15, which is a terminal state, we set it equal to 1. Step seven is to set the target: for input state 14 and output action two, which is this node, we take the value calculated in step six and replace the Q-value in the target network's output with it. Step eight: we take the target Q-values and use them to train the policy network. So this value is the one that's really going to change. As you know, with neural networks it doesn't jump straight to one; it just moves in that direction, so maybe it goes to 0.01, but if you repeat the training many, many times, it will approach one.

Step nine is to repeat the whole thing again. Of course, we're not going to re-create the policy network or make another copy of it, so what we're repeating is steps three through eight. Step 10: after a certain number of steps or episodes, we sync the policy network with the target network, which means we make them identical again by copying the weights and biases from the policy network over to the target network. After syncing the networks, we repeat step nine again, and then step 10, until training is done.

Okay, so that's generally how Deep Q-Learning works. That might have been a little bit complicated, so maybe rewind the video, watch it again, and make sure you understand what's happening.
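To anchor steps one, two, and ten in code, here is a minimal sketch of creating the policy network, cloning it into the target network, and re-syncing them later. The 16-16-4 layout matches the Frozen Lake setup; using nn.Sequential here is just for illustration.

```python
from torch import nn

# Step 1: create the policy network (16 inputs -> 16 hidden -> 4 outputs).
policy_net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

# Step 2: create the target network and copy the weights and biases over,
# so both networks start out identical.
target_net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
target_net.load_state_dict(policy_net.state_dict())

# ... steps 3-8: navigate, compute targets with target_net, train policy_net ...

# Step 10: after a certain number of steps, sync them again by copying
# the policy network's weights and biases into the target network.
target_net.load_state_dict(policy_net.state_dict())
```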
To effectively train a neural network, we need to randomize the training data that we send into it. However, if you remember from steps three and four, the agent takes one action and then we send that single sample into the neural network. So the question is: how do we randomize the order of a single sample? That's where experience replay comes in. We need a step 3a, where we memorize the agent's experience as it navigates the map: we store the state it was in, what action it took, the new state it reached, whether there was a reward or not, and whether the new state is a terminal state or not. We take that and insert it into the memory, and the memory is nothing more than a Python deque. The way a deque works is that once it's full, the oldest entries get purged as new ones are appended. We also need a step 3b, the replay step. This is where we take, say, 30 random samples from the deque and pass them on to step four for training. So, that's what experience replay is.
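A tiny sketch of the replay memory's underlying data structure: a Python deque with a maxlen silently discards its oldest entries once it's full.

```python
from collections import deque

memory = deque(maxlen=3)
for transition in ["t1", "t2", "t3", "t4"]:
    memory.append(transition)

print(memory)   # deque(['t2', 't3', 't4'], maxlen=3) -- the oldest entry 't1' was dropped
```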
Before we jump into the code: if you have problems installing Gymnasium, especially on Windows, I've got a video for that. I'm not going to walk through the Q-learning code, but if you're interested in that, jump to my Q-learning code walkthrough video after watching this one. Also, if you have zero experience with neural networks, you can check out this basic tutorial on neural networks. You'll need PyTorch, so head over to pytorch.org and get that installed.
The first thing we do is create the class that represents our deep Q-network; I'm calling it DQN. As mentioned before, a deep Q-network is nothing more than a feed-forward neural network, so there's really nothing special about it. Here I'm using a pretty standard way of creating a neural network with PyTorch; if you look up any PyTorch tutorial, you'll probably see something just like this. Since this is not a PyTorch tutorial, I'm not going to spend too much time explaining how PyTorch works. For this class, we inherit from the neural network Module class, which requires us to implement two functions: the init function and the forward function. In the init function, I'm passing in the number of nodes in my input layer, hidden layer, and output layer. Back at our diagram: we're going to have 16 input nodes; as mentioned before, I'm using 16 in the hidden layer, which is something you can adjust yourself; and the output layer has four nodes. We declare the hidden layer, which maps 16 nodes to 16, and then the output layer, which maps 16 to four. In the forward function, x is the training data, and we're sending it through the neural network. Again, this is pretty common PyTorch code.
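For reference, a DQN module along these lines might look like the sketch below. The parameter names (in_states, h1_nodes, out_actions) are my labels for the three node counts described above, not necessarily the exact ones in the code.

```python
import torch
from torch import nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, in_states, h1_nodes, out_actions):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)     # hidden layer: 16 -> 16
        self.out = nn.Linear(h1_nodes, out_actions)   # output layer: 16 -> 4

    def forward(self, x):
        x = F.relu(self.fc1(x))   # pass the input through the hidden layer
        return self.out(x)        # one raw Q-value per action

policy_net = DQN(in_states=16, h1_nodes=16, out_actions=4)
print(policy_net(torch.zeros(16)))   # four (initially meaningless) Q-values
```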
Next, we need a class to represent the replay memory, so it's this portion that we're implementing right now. In the init function, we pass in a max length and create the Python deque. In the append function, we append a transition; the transition is this tuple here: state, action, new state, reward, and terminated. The sample function returns a random sample of whatever size we want from the memory, and the len function simply returns the length of the memory.
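A sketch of what that replay memory class might look like (each transition being the tuple state, action, new state, reward, terminated):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, maxlen):
        self.memory = deque([], maxlen=maxlen)   # old transitions fall off once full

    def append(self, transition):
        # transition = (state, action, new_state, reward, terminated)
        self.memory.append(transition)

    def sample(self, sample_size):
        return random.sample(self.memory, sample_size)   # random mini-batch

    def __len__(self):
        return len(self.memory)
```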
The FrozenLakeDQL class is where we're going to do our training. We set the learning rate and the discount factor; those are part of the Q-learning formula, and they are values you can adjust and play around with. The network sync rate is the number of steps the agent takes before syncing the policy and target networks; that's the setting for step 10, where we sync the two networks. We set the replay memory size to 1,000 and the replay memory sample size to 32; these are also numbers you can play around with. Next are the loss function and the optimizer; these two are PyTorch variables. For the loss function, I'm simply using mean squared error; the optimizer gets initialized at a later time. This actions list simply maps the action numbers into letters for printing.
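Roughly, the top of that class might look like this. The 0.9 discount, 1,000-entry memory, and batch size of 32 come from the walkthrough; the learning rate and sync rate values here are placeholders I chose for illustration, and the attribute names are assumptions.

```python
from torch import nn

class FrozenLakeDQL:
    learning_rate_a = 0.001         # learning rate (value assumed for illustration)
    discount_factor_g = 0.9         # discount factor from the Q-learning formula
    network_sync_rate = 10          # steps before syncing policy -> target (assumed value)
    replay_memory_size = 1000       # max transitions kept in replay memory
    mini_batch_size = 32            # sample size pulled from memory for training

    loss_fn = nn.MSELoss()          # mean squared error loss
    optimizer = None                # created later, once the policy network exists

    ACTIONS = ['L', 'D', 'R', 'U']  # map action numbers to letters for printing
```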
In the train function, we can specify how many episodes we want to train the agent for, whether we want to render the map on screen, and whether we want to turn on the slippery flag. We instantiate the Frozen Lake environment, then create a variable to store the number of states, which is going to be 16, and store the number of actions, which is going to be four. Here we initialize epsilon to one, and we instantiate the replay memory with a size of 1,000. Now we create the policy DQN network; this is step one, creating the policy network. We also create a target network and copy the weights and biases from the policy network into it; that is step number two, making the target and policy networks identical. Before training, we print out the policy network, just so we can compare it to the end result. Next, we initialize the optimizer that we declared earlier; I'm simply using the Adam optimizer, passing in the learning rate. We'll use this rewards-per-episode list to keep track of the rewards collected per episode, and this epsilon history list to keep track of epsilon decaying over time. We'll use a step count to keep track of the number of steps taken; this is used to determine when to sync the policy and target networks again, so that is for step 10, the syncing.
Next, we loop through the number of episodes specified when calling the train function. We initialize the map so the agent starts at state zero, and initialize the terminated and truncated flags. Terminated is when the agent falls into a hole or reaches the goal; truncated is when the agent takes more than 200 actions. On such a small 4x4 map, that's probably never going to occur, but we'll keep it anyway. Now we have a while loop that checks the terminated and truncated flags and keeps looping until one of those conditions is met. This part is basically the epsilon-greedy algorithm: if a random number is less than epsilon, we just pick a random action; otherwise, we use the policy network to calculate the set of Q-values, take the maximum, extract that item, and that's going to be the best action. We have a function here called state_to_dqn_input; let's jump down there. Remember that we have to take the state and encode it; that's what this function does, this type of encoding here. Okay, let's go back. With PyTorch, if we want to perform a prediction, we should call it under torch.no_grad() so that it doesn't calculate the stuff needed for training. After selecting either a random action or the best action, we call the step function to execute the action. When we take that action, it returns the new state, whether there is a reward or not, whether it's a terminal state, and whether we got truncated. We take all that information and put it into our memory, so that is step 3a; when we did the epsilon-greedy part, that was step three. After executing the action, we set the state equal to the new state and increment our step counter. If we received a reward, we put it on our list. Now we check the memory to see if we have enough training data to do optimization on; we also check whether we have collected at least one reward, because if we haven't collected any rewards, there's really no point in optimizing the network. If those conditions are met, we call memory.sample, pass in the batch size, which was 32, and get a batch of training data out of the memory. We pass that training data into the optimize function, along with the policy network and the target network.
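Putting that inner loop into a sketch (helper names such as state_to_dqn_input, memory, and optimize follow the walkthrough, but their exact signatures here are my assumptions):

```python
import random
import torch

def run_episode(env, policy_dqn, target_dqn, memory, epsilon, num_states,
                mini_batch_size, state_to_dqn_input, optimize):
    """One episode of epsilon-greedy navigation plus an optional replay/optimize pass."""
    state = env.reset()[0]
    terminated = truncated = False
    step_count = 0

    while not terminated and not truncated:
        if random.random() < epsilon:
            action = env.action_space.sample()             # explore: random action
        else:
            with torch.no_grad():                          # prediction only, no gradients
                q_values = policy_dqn(state_to_dqn_input(state, num_states))
            action = q_values.argmax().item()              # exploit: best known action

        new_state, reward, terminated, truncated, _ = env.step(action)
        memory.append((state, action, new_state, reward, terminated))   # step 3a: memorize
        state = new_state
        step_count += 1

    if len(memory) > mini_batch_size:
        mini_batch = memory.sample(mini_batch_size)        # step 3b: replay
        optimize(mini_batch, policy_dqn, target_dqn)       # steps 4-8

    return step_count
```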
Let's jump over to the optimize function. This first line just looks at the policy network and gets the number of input nodes; we expect this to be 16. Then we have the current Q list and the target Q list: this is the current set of Q-values, and this is the target set of Q-values. What I'm about to describe now is step 3b, replaying the experience. To replay the experience, we loop through the training data inside the mini-batch. Let's jump down here for a moment: this is step number four, where we take the states and pass them into the policy network to calculate the current list of Q-values; the output is that list of Q-values. Step number five: we pass the same thing into the target network and get the Q-values out here, which should be the same as the Q-values from the policy network. Step number six is up here; as a reminder, step six uses this formula: if terminated, just return the reward; otherwise, use that second formula. Now that we have a target, we go to step seven, which is taking the output of step six and replacing the respective Q-value; that's what we're doing here, replacing the Q-value of that particular action with the target calculated above. Step number eight is to take the target values and use them to train the current Q-values. That's what we have here: we call the loss function, passing in the current set of Q-values and the target set of Q-values, and then this is just standard PyTorch code to optimize the policy network.
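As a sketch, an optimize function along those lines could look like the following. The loss function, optimizer, and encoding helper are passed in here just to keep the example self-contained; in the walkthrough they live on the class.

```python
import torch

def optimize(mini_batch, policy_dqn, target_dqn, state_to_dqn_input,
             loss_fn, optimizer, num_states=16, discount_factor=0.9):
    current_q_list, target_q_list = [], []

    for state, action, new_state, reward, terminated in mini_batch:
        # Step 6: compute the target Q-value for this transition.
        if terminated:
            target = float(reward)
        else:
            with torch.no_grad():
                target = reward + discount_factor * \
                    target_dqn(state_to_dqn_input(new_state, num_states)).max().item()

        # Step 4: current Q-values from the policy network.
        current_q = policy_dqn(state_to_dqn_input(state, num_states))
        current_q_list.append(current_q)

        # Steps 5 and 7: Q-values from the target network, with the chosen
        # action's value replaced by the target computed above.
        with torch.no_grad():
            target_q = target_dqn(state_to_dqn_input(state, num_states))
        target_q[action] = target
        target_q_list.append(target_q)

    # Step 8: train the policy network toward the target Q-values.
    loss = loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```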
Now, step nine. Step nine is to repeat steps three through eight, from navigation through optimizing the network. Basically, that's just continuing this inner while loop of stepping through the states and this outer for loop of stepping through each episode. Step 10 is where we sync the policy network and the target network, and we do that down here: if the number of steps taken is greater than the network sync rate that we set, then we copy the policy network into the target network and reset the step counter. Also, after each episode we should be decaying the epsilon value. After syncing the networks, we repeat step nine again, which is basically repeating the navigation and training; it just goes back up here and does this all over again, all the way until we have finished the number of episodes. So, that was training.
After training, we close the environment. We can save the policy, that is, the weights and biases, into a file; I'm hard-coding the file name here, but you can definitely make it more dynamic. Here I'm creating a new graph: I'm basically graphing the rewards collected per episode, and I'm also graphing the epsilon history. After graphing, I save the graphs into an image. All right, so that was the train function. We've also talked about the optimize function, so let me fold that; we already looked at it.
Now, the test function will run the Frozen Lake environment with the policy that we learned from the train function. We can also pass in the number of episodes and whether we want to turn slippery on or not. The rest of the code looks pretty similar to what we had in the train function: we instantiate the environment and get the number of states and actions, 16 and 4. Here we declare the policy network and load it from the file that we saved during training. This is just PyTorch code to switch the policy network to prediction mode, or rather evaluation mode instead of training mode. We print the trained policy, and then we loop over the episodes. We set the agent up top; you've seen this before. We keep looping until the agent gets terminated or truncated. Here we select the best action out of the policy network and execute the action, and then we close the environment. Ideally, when we run this, we'll see the agent navigate the map and reach the goal; however, if slippery is on, there is no guarantee that the agent will solve it in one try. It might take a couple of tries for the agent to solve the map.
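A sketch of that test routine (the DQN class and one_hot_state helper are the ones sketched earlier; the saved file name here is an assumption):

```python
import gymnasium as gym
import torch

def test(episodes, is_slippery=False):
    """Run the learned policy greedily, rendering the map on screen."""
    env = gym.make('FrozenLake-v1', map_name='4x4', is_slippery=is_slippery,
                   render_mode='human')
    num_states = env.observation_space.n
    num_actions = env.action_space.n

    # DQN and one_hot_state come from the earlier sketches; the file name is assumed.
    policy_dqn = DQN(in_states=num_states, h1_nodes=num_states, out_actions=num_actions)
    policy_dqn.load_state_dict(torch.load("frozen_lake_dql.pt"))
    policy_dqn.eval()   # evaluation mode, not training mode

    for _ in range(episodes):
        state = env.reset()[0]
        terminated = truncated = False
        while not terminated and not truncated:
            with torch.no_grad():
                action = policy_dqn(one_hot_state(state, num_states)).argmax().item()
            state, reward, terminated, truncated, _ = env.step(action)

    env.close()
```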
All right, down at the main function, we create an instance of the FrozenLakeDQL class. First, we're going to try non-slippery. We'll train it for 1,000 episodes, and we'll run the test four times, just because the map pops up and goes away when the agent gets to the goal, so we want to run it a couple of times so we can actually see it on the screen.
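For reference, the main section might look roughly like this (class and method names are assumed to match the walkthrough):

```python
if __name__ == "__main__":
    frozen_lake = FrozenLakeDQL()
    is_slippery = False
    frozen_lake.train(1000, is_slippery=is_slippery)   # train for 1,000 episodes
    frozen_lake.test(4, is_slippery=is_slippery)       # run the test 4 times
```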
Okay, I'm just hitting Ctrl+F5.
All right, training is done, and it looks like the agent is able to solve the map. We also have a printout of the policy it learned; I print it in a way that matches up with the grid. Let's come back to this in a second and go to our files. After training, we have a new graph. On the left side is the number of rewards collected over the 1,000 episodes; as you can see, it improves over time, and by the end it's collecting a lot of rewards. On the right side, we can see epsilon decaying, starting from one and slowly going down to zero. Another file that gets created is a binary file; we can't really display it, but it contains the weights and biases of the policy network. So if we want to run the test again, we don't need to retrain: we can comment out the training line and just run it again, and we can see the agent solve the map.
We can look at the policy the agent learned. The way it's printed, it matches up with the map. At state zero, the best action was to go right, which leads to state one. At state one, the best action is to go right again. At state two, go down; at state six, go down again; at state 10, go down again; and at state 14, go to the right. So it was right, right, down, down, down, right. Remember from earlier that there were two possible paths: the one we actually learned, plus the one going down first. Because epsilon-greedy has a randomness factor, whether the agent learns the top path or the bottom path is somewhat random; if we train again, maybe it'll go down the other path next time.
Now, let's turn the slippery factor on and uncomment the training line. With slippery on, we expect the agent to fail a lot more often, that is, to fall into the holes a lot more often. Let me just triple the number of training episodes and do 10 episodes for testing. Okay, I'm running this again. Again, with slippery turned on, there's no guarantee that after training the agent will be able to pass every episode. Let's bring the window up: you can see it trying to get to the bottom right, but because of the slipperiness it just failed right there. But there you go, it was able to solve it. Let's look at the graph: the results compared to the non-slippery surface are significantly worse. Also, you want to be careful about getting stuck in spots like these, which means that if I had been unlucky enough to set my training episodes to maybe 2,200 and it got stuck here, it would have learned a bad policy and wouldn't be able to solve the map. So in your own training, if you're finding that the agent can't solve the map when slippery is turned on, it might be because of things like these.
Okay, that concludes our Deep Q-Learning tutorial. I'd love to hear feedback from you: was the explanation easy to understand, what can I improve on, and what other topics are you interested in?