Hey everyone, welcome to my deep Q learning tutorial. Here's what to expect from this video. Since deep Q learning is a little bit complicated to explain, I'm going to use the Frozen Lake reinforcement learning environment. It's a very simple environment, so I'll do a quick intro on how the environment works and also quickly answer the question of why we need reinforcement learning on such a simple environment. Next, to navigate the environment we need to use the epsilon-greedy algorithm; both Q learning and deep Q learning use the same algorithm. After we know how to navigate the environment, we're going to take a look at the difference in output between Q learning and deep Q learning: in Q learning we're training a Q table, while in deep Q learning we're training a deep Q network. We'll work through how the Q table is trained, and then we can see how that differs from training a deep Q network. For training a deep Q network we also need a technique called experience replay. After we have an idea of how deep Q learning works, I'm going to walk through the code, run it, and demo it.

Just in case you're not familiar with how this environment works, here's a quick recap.
The goal is to get the learning agent to figure out how to reach the goal on the bottom right. The actions the agent can take are left, which internally we're going to represent as zero, down as one, right as two, and up as three. In this case, if the agent tries to go left or up, it's just going to stay in place; in general, if it tries to go off the grid, it stays in the same spot. Internally, we represent each state with a number: this one is zero, this one is 1, then 2, 3, 4, 5, 6, all the way to 14 and 15, so 16 states total. Each attempt at navigating the map is considered one episode. The episode ends if the agent falls into one of these holes or it reaches the goal. When it reaches the goal, it receives a reward of one; all other states have no reward or penalty. Okay, so that's how this environment works.
At this point you might be wondering why we would need reinforcement learning to solve such a simple map. Why not just use a pathfinding algorithm? There is a twist to this environment: a flag called is_slippery. When it's set to true, the agent doesn't always execute the action it intends to. For example, if the agent wants to go right, there's only a one-third chance that it will execute that action, and a two-thirds chance of it going in an adjacent direction. With is_slippery turned on, a pathfinding algorithm will not be able to solve this; in fact, I'm not even sure what algorithm can. That's where reinforcement learning comes in: if I don't know how to solve this, I can just give the agent some incentive and let it figure out how to solve it. So that's where reinforcement learning shines.
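As a point of reference, here's a minimal sketch of how the Frozen Lake environment is typically created with gymnasium; the is_slippery flag is the one discussed above, and the other arguments shown here are just illustrative values, not necessarily what this video's code uses:

```python
import gymnasium as gym

# 4x4 Frozen Lake; is_slippery=True adds the 1/3 chance of slipping sideways
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True, render_mode="human")

state, _ = env.reset()                                      # agent starts at state 0
new_state, reward, terminated, truncated, _ = env.step(2)   # action 2 = right
```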
In terms of how the agent navigates the map, both deep Q learning and Q learning use the epsilon-greedy algorithm. Basically, we start with a variable called epsilon and set it equal to one. Then we generate a random number: if the random number is less than epsilon, we pick a random action; otherwise, we pick the best action that we know of at the moment. At the end of each episode we decrease epsilon a little bit at a time. So essentially we start off with 100% random exploration, and eventually, near the end of training, we're always selecting the best action to take.
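Here's a minimal sketch of epsilon-greedy action selection, assuming a gymnasium environment; best_action_fn is a placeholder for however we look up the best known action (a Q table lookup or a network forward pass, both covered below):

```python
import random

def select_action(state, epsilon, env, best_action_fn):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit
    if random.random() < epsilon:
        return env.action_space.sample()   # random action
    return best_action_fn(state)           # best action we know of so far

# At the end of each episode, decrease epsilon a little at a time, e.g.:
# epsilon = max(epsilon - 1 / num_episodes, 0)
```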
Before we jump into the details of how the training works for Q learning versus deep Q learning, let's take a look at what the output looks like. For Q learning, the output of the training is a Q table, which is nothing more than a two-dimensional array of 16 states by four actions. After training, the whole table is going to be filled with Q values; for example, it might look something like this. So in this case the agent is at state zero: we look at the table and see where the maximum value is, and in this case it's to go right, so that is the prescribed action. Over in deep Q learning, the output is a deep Q network, which is actually nothing more than a regular feedforward neural network. This is what it actually looks like, but I think we should switch back to the simplified view. The input layer is going to have 16 nodes and the output layer is going to have four nodes.
The way that we enter input into the input layer is like this: if the agent is at state zero, we set the first node to one and everything else to zero. If the agent is at state one, then the first node is zero, the second node is one, and everything else is zero. This is called one-hot encoding. Let's just do one more: if we want to put in state 15, then everything is all zeros and the node for state 15 is one. So that's how the input works.
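As a minimal sketch, that one-hot encoding could be done in PyTorch like this; the helper name state_to_dqn_input mirrors the function mentioned later in the code walkthrough, but the exact implementation here is illustrative:

```python
import torch

def state_to_dqn_input(state: int, num_states: int = 16) -> torch.Tensor:
    # One-hot encode the state: all zeros except a 1 at the state's index
    x = torch.zeros(num_states)
    x[state] = 1
    return x

# state_to_dqn_input(0)  -> tensor([1., 0., 0., ..., 0.])
# state_to_dqn_input(15) -> tensor([0., 0., 0., ..., 1.])
```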
With that input, the output that gets calculated is a set of Q values. Now, the Q values are not going to look exactly like what's in the Q table, but they might be something similar; let me throw some numbers in here. Okay, so it's the same idea: for this particular input, the best action is the one with the highest Q value. When training a neural network, essentially we're trying to train the weights associated with each one of these lines in the network, plus the biases for the hidden layers. Now, how do we know how many hidden layers we need? In this case one layer was enough, but you can certainly add more if necessary. And how many nodes do we need in the hidden layer? I tried 16 and it was able to solve the map, so I'm sticking with that, but you can certainly increase or decrease this number to see what happens. Okay, so that's the difference between the output of Q learning versus deep Q learning.
As the agent navigates the map, it uses the Q learning formula to calculate the Q values and update the Q table. Now, the formula might look a little bit scary, but it's actually not that bad.
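For reference, this is the standard Q learning update being described, with α as the learning rate and γ as the discount factor:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]$$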
Let's work through some examples. Let's say our agent is in state 14 and it goes right to get to state 15. We're calculating Q of state 14, action 2, which is this cell. The current value of that cell is zero, because we initialize the whole Q table to zero at the beginning. Plus the learning rate: the learning rate is a hyperparameter that we can set; as an example I'll just put in 0.01. Times the reward: we get a reward of one because we reach the goal. Plus the discount factor, another parameter that we set; I'll just use 0.9. Times the max Q value of the new state, so the max Q value of state 15. Since state 15 is a terminal state, it will never get anything other than zeros in this table, so its max is zero and essentially these two terms drop out. Then we subtract the same current value that we had here, which is also zero. Okay, so working through the math, this is just 0.01 × 1, so we get 0.01 here.
Now let's do another one really quickly. The agent is here and takes a right: Q of state 13, going right. State 13 is also all zeros, and this is the cell we're updating, so its current value is zero. Plus the learning rate times the reward; there is no reward here. Plus the discount factor times the max of the new state, the max of state 14, which is 0.01. Then we subtract the current value, which again is zero. This works out to 0.01 × (0 + 0.9 × 0.01 − 0) = 0.00009, a small positive value that keeps growing as training repeats. Okay, it's actually pretty straightforward.
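In code, a single Q table update for one transition might look like this minimal sketch, assuming a NumPy array q of shape 16×4 and the hyperparameter values used above (this mirrors the formula, not necessarily the video's exact Q learning code):

```python
import numpy as np

q = np.zeros((16, 4))        # 16 states x 4 actions, initialized to zero
learning_rate = 0.01
discount_factor = 0.9

def q_update(state, action, reward, new_state):
    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
    q[state, action] += learning_rate * (
        reward + discount_factor * np.max(q[new_state, :]) - q[state, action]
    )

q_update(14, 2, 1, 15)   # reaching the goal from state 14: Q[14, 2] becomes 0.01
q_update(13, 2, 0, 14)   # one step earlier: picks up a small fraction of that value
```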
Now, how does this formula actually help find the path? If we train enough, the theory is that this number is going to be really close to one, the states next to it are going to be 0.9-something, and the states adjacent to those are probably going to be 0.8-something. If we keep going, we can see that a path emerges; in fact, two paths are possible. So mathematically, this is how the path is found.
Over in deep Q learning, the formula is going to look like this: we set the target Q value equal to the reward if the new state is a terminal state; otherwise, we set it to this part of the Q learning formula.
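Written out, the target described here looks like this (reconstructed from the description above, with γ as the discount factor and the max taken over the Q values for the new state):

$$q_{\text{target}} = \begin{cases} r & \text{if the new state } s' \text{ is terminal} \\ r + \gamma \max_{a'} Q(s', a') & \text{otherwise} \end{cases}$$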
Let's see how the formula is going to be used in training. For deep Q learning we actually need two neural networks; let me walk through the steps. The network on the left is called the policy network: this is the network that we're going to do the training on. The one on the right is called the target network: the target network is the one that makes use of the deep Q learning formula. Now let's walk through the steps of training. Step one is the creation of this policy network. Step two, we make a copy of the policy network into the target network; basically we're copying the weights and the biases over, so both networks are identical. Step number three, the agent navigates the map as usual. Let's say the agent is here in state 14 and it's going into state 15; step three is just the navigation.
Now step four: we input state 14. Remember how we have to encode the input; counting the input nodes 0, 1, 2, 3, ... 12, 13, 14, 15, the input is going to look like this, with a one at the node for state 14. As you know, when a neural network is created it comes with a random set of weights and biases, so with this input we're actually going to get an output, even though the values are pretty much meaningless. So we might get some stuff that looks like this; I'm just putting in some random numbers. Okay, so these Q values are meaningless, and as a reminder, the outputs correspond to the left, down, right, and up actions. Step five, we do the same thing: we take the exact same input, state 14, and send it into the target network, which will also calculate the exact same numbers, because the target network is currently the same as the policy network.
Step six: this is where we calculate the Q value for state 14, taking action two. Since we're going into state 15, which is a terminal state, we set it equal to 1. Step seven is to set the target: for input state 14, output action two is this node, so we take the value that we calculated in step six and replace the Q value in the output with it. Step number eight, we take the target Q values and use them to train the policy network. So this value is the one that's really going to change. As you know, with neural networks it doesn't jump straight to one; it moves in that direction, so maybe it'll go to 0.01, but if you repeat the training many, many times it will approach one.
Step nine is to repeat the whole thing again. Of course, we're not going to create the policy network again or make another copy of it, so what we're repeating is steps three through eight. Step 10: after a certain number of steps or episodes, we sync the policy network with the target network, which means we make them identical by copying the weights and biases from the policy over to the target network. After syncing the networks, we repeat step nine again and then step 10 again, until training is done. Okay, so that's generally how deep Q learning works. That might have been a little bit complicated, so maybe rewind the video, watch it again, and make sure you understand what's happening.
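Tying steps 1 through 10 together, the training loop might be summarized like this rough sketch; the names here (DQN, select_action, best_action_fn, optimize, env, and the hyperparameters) are placeholders for pieces covered elsewhere in this video, and experience replay, covered next, changes how samples are fed into the optimization step:

```python
# Steps 1-2: create the policy network and copy it into the target network
policy_net = DQN(in_states=16, h1_nodes=16, out_actions=4)
target_net = DQN(in_states=16, h1_nodes=16, out_actions=4)
target_net.load_state_dict(policy_net.state_dict())

epsilon = 1.0    # start with 100% random exploration
step_count = 0
for episode in range(num_episodes):
    state, _ = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        # Step 3: navigate one step using epsilon-greedy
        action = select_action(state, epsilon, env, best_action_fn)
        new_state, reward, terminated, truncated, _ = env.step(action)

        # Steps 4-8: compute policy and target Q values, overwrite the taken
        # action's target with the formula above, and train the policy network
        optimize([(state, action, new_state, reward, terminated)], policy_net, target_net)

        state = new_state
        step_count += 1

        # Step 10: periodically sync the target network to the policy network
        if step_count > network_sync_rate:
            target_net.load_state_dict(policy_net.state_dict())
            step_count = 0

    # Step 9 is simply this loop continuing; decay epsilon after each episode
    epsilon = max(epsilon - 1 / num_episodes, 0)
```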
To effectively train a neural network, we need to randomize the training data that we send into it. However, if you remember from steps three and four, the agent takes an action and then we're sending that one piece of training data into the neural network. So the question is: how do we randomize the order of a single sample? That's where experience replay comes in. We need a step 3a, where we memorize the agent's experience as it navigates the map, storing the state it was in, what action it took, the new state it reached, whether there was a reward or not, and whether the new state is a terminal state or not. We take that and insert it into memory, and the memory is nothing more than a Python deque. The way a deque works is that as it gets full, the oldest entries get purged. Then we need a step 3b, the replay step: this is where we take, say, 30 random samples from the deque and pass them on to step four for training. So that's what experience replay is.
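Here's a minimal, self-contained sketch of that memory using a Python deque, with transitions stored as the (state, action, new state, reward, terminated) tuple described above; the numbers are just example values:

```python
from collections import deque
import random

memory = deque(maxlen=1000)   # once full, the oldest transitions get purged

# Step 3a: memorize transitions as the agent navigates the map
memory.append((14, 2, 15, 1.0, True))    # e.g. stepping right from state 14 into the goal
memory.append((13, 2, 14, 0.0, False))   # e.g. stepping right from state 13

# Step 3b: replay, i.e. draw a random mini-batch to pass on for training
if len(memory) >= 2:
    mini_batch = random.sample(memory, 2)
```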
Before we jump into the code: if you have problems installing gymnasium, especially on Windows, I've got a video for that. I'm not going to walk through the Q learning code, but if you are interested in that, jump to my Q learning code walkthrough video after watching this one. Also, if you have zero experience with neural networks, you can check out this basic tutorial on neural networks. You'll also need PyTorch, so head over to pytorch.org and get that installed.
The first thing that we do is create the class to represent our deep Q network; I'm calling it DQN. As mentioned before, a deep Q network is nothing more than a feedforward neural network, so there's really nothing special about it. Here I'm using a pretty standard way of creating a neural network with PyTorch; if you look up any PyTorch tutorial, you'll probably see something just like this. Since this is not a PyTorch tutorial, I'm not going to spend too much time explaining how PyTorch works. For this class we have to inherit the neural network Module class, which requires us to implement two functions: the init function and the forward function. In the init function I'm passing in the number of nodes in my input state, hidden layer, and output layer. Back at our diagram, we're going to have 16 input states; as mentioned before, I'm using 16 in the hidden layer, which is something you can adjust yourself, and four nodes in the output layer. We declare the hidden layer, which has 16 going into 16, and then the output layer, 16 going into four. Then in the forward function, x is the training data, and we're sending that training data through the neural network. Again, this is pretty common PyTorch code.
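Since the code itself isn't visible in a transcript, here's a sketch of what a DQN class along these lines typically looks like in PyTorch; the layer names, parameter names, and the ReLU activation are assumptions based on the description, not necessarily the video's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, in_states=16, h1_nodes=16, out_actions=4):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)    # hidden layer: 16 going into 16
        self.out = nn.Linear(h1_nodes, out_actions)  # output layer: 16 going into 4

    def forward(self, x):
        # Send the (one-hot encoded) input through the network
        x = F.relu(self.fc1(x))
        return self.out(x)
```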
Next, we need a class to represent the replay memory, so it's this portion that we're implementing right now. In the init function we pass in a max length and then create the Python deque. In the append function we append the transition; the transition is this tuple here: state, action, new state, reward, and terminated. The sample function returns a random sample of whatever size we want from the memory, and the len function simply returns the length of the memory.
The Frozen Lake DQL class is where we're going to do our training. We set the learning rate and the discount factor; those are part of the Q learning formula, and these are values that you can adjust and play around with. The network sync rate is the number of steps the agent takes before syncing the policy and target network; that's the setting for step 10, where we sync the policy and the target network. We set the replay memory size to 1,000 and the replay memory sample size to 32; these are also numbers that you can play around with. Next is the loss function and the optimizer; these two are PyTorch variables. For the loss function I'm simply using mean squared error, and the optimizer we'll initialize at a later time. This actions list is simply to map the action numbers into letters for printing.
In the train function we can specify how many episodes we want to train the agent, whether we want to render the map on screen, and whether we want to turn on the slippery flag. We instantiate the Frozen Lake environment, then create a variable to store the number of states, which is going to be 16, and store the number of actions, which is going to be four. Here we initialize epsilon to one, and we instantiate the replay memory, which is going to have a size of 1,000. Now we create the policy deep Q network; this is step one, creating the policy network. We also create a target network and copy the weights and biases from the policy network into the target network, so that is step number two, making the target and policy network identical. Before training, we do a printout of the policy network just so we can compare it to the end result. Next we initialize the optimizer that we declared earlier; we're simply using the Adam optimizer, passing in the learning rate. We'll use this rewards-per-episode list to keep track of the rewards collected per episode, and this epsilon history list to keep track of epsilon decaying over time. We'll use a step count to keep track of the number of steps taken; this is used to determine when to sync the policy and target network again, so that is for step 10, the syncing.
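For reference, the loss function and optimizer described here would typically be set up like this in PyTorch (a small sketch, assuming the policy network and learning rate variables from the description above):

```python
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()   # mean squared error loss
optimizer = torch.optim.Adam(policy_net.parameters(), lr=learning_rate)   # Adam optimizer
```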
Next, we loop through the number of episodes that were specified when calling the train function. We initialize the map so the agent starts at state zero, and we initialize the terminated and truncated flags. Terminated is when the agent falls into a hole or reaches the goal; truncated is when the agent takes more than 200 actions. On such a small 4x4 map this is probably never going to occur, but we'll keep it anyway. Now we have a while loop checking the terminated and truncated flags, and we keep looping until one of those conditions is met. This part is basically the epsilon-greedy algorithm: if a random number is less than epsilon, we just pick a random action; otherwise, we use the policy network to calculate the set of Q values, take the maximum of that, extract the item, and that's going to be the best action. We have a function here called state_to_dqn_input; let's jump down there. Remember that we have to take the state and encode it; that's what this function does, this type of encoding here. Okay, let's go back.
With PyTorch, if we want to perform a prediction, we should call it inside torch.no_grad() so that it doesn't calculate the extra stuff needed for training. After selecting either a random action or the best action, we call the step function to execute the action. When we take that action, it returns the new state, whether there is a reward or not, whether we reached a terminal state, and whether we got truncated. We take all that information and put it into our memory, so that is step 3a; when we did the epsilon-greedy selection, that was step three.
After executing the action, we set the state equal to the new state and increment our step counter, and if we received a reward, we put it on our list. Now we check the memory to see if we have enough training data to do optimization on; we also want to check that we have collected at least one reward, because if we haven't collected any rewards, there's really no point in optimizing the network. If those conditions are met, we use memory.sample and pass in the batch size, which was 32, and we get a batch of training data out of the memory. We pass that training data into the optimize function along with the policy network and the target network.
Let's jump over to the optimize function. This first line is just looking at the policy network and getting the number of input nodes, which we expect to be 16. Then we have the current Q list and the target Q list: this is the list of current Q values, and this is the list of target Q values. What I'm about to describe now is step 3b, replaying the experience. To replay the experience, we loop through the training data inside the mini-batch. Let's jump down here for a moment: this is step number four, where we take the state and pass it into the policy network to calculate the current list of Q values. So for step four, the output is the list of Q values. Step number five, we pass the same thing into the target network; for step five we get the Q values out here, which should be the same as the Q values from the policy network.
Step number six is up here; as a reminder of what it looks like, step six is using this formula here. So that's what we have: if terminated, just return the reward; otherwise use that second formula. Now that we have a target, we go to step seven. Step seven is taking the output of step six and replacing the respective Q value, and that's what we're doing here: we're replacing the Q value of that particular action with the target that was calculated up above. Step number eight is to take the target values and use them to train the current Q values. That's what we have here: we're using the loss function, passing in the current set of Q values and the target set of Q values, and then this is just standard PyTorch code to optimize the policy network.
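Based on that description, an optimize function along these lines might look roughly like this sketch; it assumes loss_fn (MSE), optimizer (Adam), discount_factor, and the state_to_dqn_input helper are defined as described elsewhere, and the exact variable names and tensor shapes in the video's code may differ:

```python
import torch

def optimize(mini_batch, policy_net, target_net):
    current_q_list = []
    target_q_list = []

    for state, action, new_state, reward, terminated in mini_batch:
        # Step 6: the target value from the deep Q learning formula
        if terminated:
            target = torch.tensor(reward, dtype=torch.float)
        else:
            with torch.no_grad():
                target = reward + discount_factor * target_net(state_to_dqn_input(new_state)).max()

        # Step 4: current Q values come from the policy network (gradients flow here)
        current_q_list.append(policy_net(state_to_dqn_input(state)))

        # Steps 5 and 7: target Q values from the target network, with the
        # taken action's Q value replaced by the target calculated above
        with torch.no_grad():
            target_q = target_net(state_to_dqn_input(state))
        target_q[action] = target
        target_q_list.append(target_q)

    # Step 8: train the policy network toward the target Q values
    loss = loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```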
Now, step nine: step nine is to repeat steps three through eight, from navigation to optimizing the network. Basically, that's just continuing this inner while loop of stepping through the states and this outer for loop of stepping through each episode. Step 10 is where we sync the policy network and the target network, and we do that down here: if the number of steps taken is greater than the network sync rate that we set, then we copy the policy network into the target network and reset the step counter. Also, after each episode we should be decaying the epsilon value. After syncing the networks, we repeat step nine again, which is basically the navigation and training again; it just goes back up here and does this all over again, all the way until we have finished the number of episodes.
So that was training. After training, we close the environment, and we can save the policy, or rather the weights and biases, into a file. I'm hardcoding it here; you can definitely make this more dynamic. Here I'm creating a new graph: I'm basically graphing the rewards collected per episode, and I'm also graphing the epsilon history. After graphing, I'm saving the graphs into an image. All right, so that was the train function. We already talked about the optimize function, so let me fold that; we already looked at it. Now, the test function is going to run the Frozen Lake environment with the policy that we learned from the train function. We can also pass in the number of episodes and whether we want to turn on slippery or not. The rest of the code is going to look pretty similar to what we had in the train function: we instantiate the environment and get the number of states and actions, 16 and 4.
Here we declare the policy network and load it from the file that we saved during training. This is just PyTorch code to switch the policy network to prediction mode, or rather evaluation mode, instead of training mode. We print the trained policy and then loop over the episodes. We set the agent up at the top; you've seen this before: keep looping until the agent gets terminated or truncated. Here we're selecting the best action out of the policy network and executing the action, and then we close the environment. So ideally, when we run this, we're going to see the agent navigate the map and reach the goal. However, if slippery is on, there is no guarantee that the agent is going to solve it in one try; it might take a couple of tries for the agent to solve the map.
All right, down at the main function we create an instance of the Frozen Lake DQL class. First we're going to try non-slippery: we'll train it for 1,000 episodes, and we'll run the test four times, just because the map pops up and goes away when the agent gets to the goal, so we want to run it a couple of times so we can actually see it on the screen.
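As a rough illustration of what that main block might look like (the actual class, method, and parameter names in the video's code may differ):

```python
if __name__ == "__main__":
    frozen_lake = FrozenLakeDQL()

    # Non-slippery first: train for 1,000 episodes, then watch it a few times
    frozen_lake.train(1000, is_slippery=False)
    frozen_lake.test(4, is_slippery=False)
```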
Okay, I'm just hitting Ctrl+F5. All right, training is done, and it looks like the agent is able to solve the map. We also have a printout of the policy that it learned; I print it in a way that matches up with the grid. Let's jump back to this in a second and go to our files. After training we have a new graph: on the left side is the number of rewards collected over the 1,000 episodes, and as you can see, over time it's improving, and finally at the end it's getting a lot of rewards. On the right side we can see epsilon decaying, starting from one and slowly going down to zero. Also, another file gets created; it's a binary file, so we can't really display it, but it contains the weights and biases of the policy network. So if we want to run the test again, we don't need the training: we can comment out the training line and just run this again, and we can see the agent solve the map.
We can look at what policy the agent learned. The way this is printed, it matches up with the map. So at state zero the best action was to go right, which takes us to state one; at state one the best action is to go right again; at state two, go down; at state 6, go down again; at state 10, go down again; and at state 14, go to the right. Okay, so it was right, right, down, down, down, right. Remember that earlier there were two possible paths: the one that we actually learned, plus the one going down. Because epsilon-greedy has a randomness factor, whether the agent learns the top path or the bottom path is just somewhat random; if we train again, maybe it'll go down the next time.
Now let's turn the slippery flag on and uncomment the training line. With slippery on, we expect the agent to fail a lot more often, or to fall into the holes a lot more often, so let me just triple the number of training episodes and do 10 episodes for testing. Okay, I'm running this again. Again, with slippery turned on, there's no guarantee that after training the agent is going to be able to pass every episode. Let's bring it up: you see it trying to go to the bottom right, but because of the slipperiness it just failed right there. But there you go, it was able to solve it. Let's look at the graph: the results compared to the non-slippery surface are significantly worse.
Also, you want to be careful about getting stuck in spots like these. If I was unlucky enough to set my training episodes to maybe 2,200 and it got stuck here, that means it learned a bad policy and it won't be able to solve the map. So in your training, if you're finding that the agent is not able to solve the map when slippery is turned on, it might be because of things like this. Okay, that concludes our deep Q learning tutorial. I'd love to hear feedback from you: was the explanation easy to understand, what can I improve on, and what other topics are you interested in?