Deep Q-Learning/Deep Q-Network (DQN) Explained | Python Pytorch Deep Reinforcement Learning

  • 0:00 - 0:02
    hey everyone welcome to my deep Q
  • 0:02 - 0:04
    learning tutorial here's what to expect
  • 0:04 - 0:07
    from this video since deep Q learning is
  • 0:07 - 0:09
    a little bit complicated to explain I'm
  • 0:09 - 0:11
    going to use the Frozen Lake reinforcement
  • 0:11 - 0:14
    learning environment it's a very simple
  • 0:14 - 0:17
    environment so I'm going to do a quick
  • 0:17 - 0:19
intro on how the environment works I'll
  • 0:19 - 0:22
    also quickly answer the question of why
  • 0:22 - 0:24
    we need reinforcement learning on such a
  • 0:24 - 0:27
    simple environment next to navigate the
  • 0:27 - 0:29
    environment we need to use the Epsilon
  • 0:29 - 0:32
    greedy algorithm both Q learning and
  • 0:32 - 0:34
deep Q learning use the same
  • 0:34 - 0:37
    algorithm after we know how to navigate
  • 0:37 - 0:38
    the environment we're going to take a
  • 0:38 - 0:41
    look at the differences of the output
  • 0:41 - 0:43
    between Q learning and deep Q learning
  • 0:43 - 0:45
    in Q learning we're training a q table
  • 0:45 - 0:48
    in deep Q learning we're training a deep
  • 0:48 - 0:50
    Q Network we'll work through how the Q
  • 0:50 - 0:53
    table is trained then we can see how it
  • 0:53 - 0:55
    is different from training a deep Q
  • 0:55 - 0:58
    Network in training a deep Q Network we
  • 0:58 - 1:00
also need a technique called
  • 1:00 - 1:02
    experience replay and after we have an
  • 1:02 - 1:05
    idea of how deep Q learning works I'm
  • 1:05 - 1:07
    going to walk through the code and also
  • 1:07 - 1:10
    run it and demo
  • 1:10 - 1:12
    it just in case you're not familiar with
  • 1:12 - 1:14
    how this environment Works quick recap
  • 1:14 - 1:17
    so the goal is to get the learning agent
  • 1:17 - 1:19
    to figure out how to get to the goal on
  • 1:19 - 1:22
    the bottom right the actions that the
  • 1:22 - 1:24
agent can take are left internally we're
  • 1:24 - 1:28
    going to represent that as zero down is
  • 1:28 - 1:31
    one right is two up is three in this
  • 1:31 - 1:34
    case if the agent tries to go left or up
  • 1:34 - 1:37
    it's just going to stay in place so in
  • 1:37 - 1:39
    general if it tries to go off the grid
  • 1:39 - 1:40
    it's going to stay in the same spot
  • 1:40 - 1:42
    internally we're going to represent each
  • 1:42 - 1:45
    state as this one is zero this one is 1
  • 1:45 - 1:46
    2
  • 1:46 - 1:51
3 4 5 6 all the way to
  • 1:51 - 1:55
    14 15 so 16 States total each attempt at
  • 1:55 - 1:57
    navigating the map is considered one
  • 1:57 - 2:00
episode the episode ends if the
  • 2:00 - 2:04
    agent falls into one of these holes or
  • 2:04 - 2:06
    it reaches the goal when it reaches the
  • 2:06 - 2:10
    goal it will receive a reward of one all
  • 2:10 - 2:13
    other states have no reward or penalty
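For anyone who wants to poke at the environment directly, here is a minimal sketch using the gymnasium API for Frozen Lake described above; the is_slippery flag is covered a little further down.

```python
import gymnasium as gym

# Minimal sketch of the Frozen Lake setup described above.
# Actions: 0 = left, 1 = down, 2 = right, 3 = up; states are numbered 0-15.
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="human")

state = env.reset()[0]                                      # agent starts at state 0
new_state, reward, terminated, truncated, _ = env.step(2)   # try to go right
# reward is 1 only when the goal (state 15) is reached; falling into a hole or
# reaching the goal sets terminated=True, which ends the episode
env.close()
```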
  • 2:13 - 2:17
    okay so that's how this environment
  • 2:17 - 2:19
    works at this point you might be
  • 2:19 - 2:20
    wondering why we would need
  • 2:20 - 2:23
    reinforcement learning to solve such a
  • 2:23 - 2:25
    simple map why not just use a
  • 2:25 - 2:27
    pathfinding algorithm there is a Twist
  • 2:27 - 2:29
    to this environment there is a flag
  • 2:29 - 2:32
called is_slippery
  • 2:32 - 2:34
when set to
  • 2:34 - 2:37
    true the agent doesn't always execute
  • 2:37 - 2:40
    the action that it intends to for
  • 2:40 - 2:42
    example if the agent wants to go right
  • 2:42 - 2:44
there's only a one-third chance that it will
  • 2:44 - 2:47
execute this action and there's a two-thirds
  • 2:47 - 2:50
    chance of it going in an adjacent
  • 2:50 - 2:53
direction now with slippery turned on a
  • 2:53 - 2:55
pathfinding algorithm will not be able
  • 2:55 - 2:58
    to solve this in fact I'm not even sure
  • 2:58 - 3:00
    what algorithm can that's where
  • 3:00 - 3:02
    reinforcement learning comes in if I
  • 3:02 - 3:04
    don't know how to solve this I can just
  • 3:04 - 3:08
    give the agent some incentive and let it
  • 3:08 - 3:10
    figure out how to solve this so that's
  • 3:10 - 3:13
    where reinforcement learning
  • 3:13 - 3:16
    shines in terms of how the agent
  • 3:16 - 3:18
    navigates the map both deep Q learning
  • 3:18 - 3:20
and Q learning use the Epsilon greedy
  • 3:20 - 3:23
    algorithm basically we start with a
  • 3:23 - 3:26
variable called Epsilon and set it
  • 3:26 - 3:28
    equal to one and then we generate a
  • 3:28 - 3:31
    random number if the random number is
  • 3:31 - 3:33
    less than Epsilon we pick a random
  • 3:33 - 3:36
    action otherwise we pick the best action
  • 3:36 - 3:39
    that we know of at the moment and at the
  • 3:39 - 3:41
    end of each episode we'll decrease
  • 3:41 - 3:44
    Epsilon by a little bit at a time so
  • 3:44 - 3:47
    essentially we start off with 100%
  • 3:47 - 3:50
    random exploration and then eventually
  • 3:50 - 3:52
    near the end of training we're going to
  • 3:52 - 3:56
be always selecting the best action to take
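A minimal sketch of the epsilon greedy idea just described; best_known_action is a placeholder, not a real function from the video, standing in for whatever "best action" means in your setup (the argmax of a Q table row, or of the network's output).

```python
import random

epsilon = 1.0        # start with 100% random exploration
num_actions = 4      # left, down, right, up

def best_known_action(state):
    # placeholder: argmax over the Q table row (Q learning)
    # or over the policy network's output (deep Q learning)
    return 0

def choose_action(state):
    if random.random() < epsilon:
        return random.randint(0, num_actions - 1)   # explore: pick a random action
    return best_known_action(state)                 # exploit: best action we know of

# at the end of each episode, decrease epsilon a little at a time,
# e.g. epsilon = max(epsilon - 1 / total_episodes, 0)
```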
  • 3:57 - 4:00
before we jump into the details of
  • 4:00 - 4:02
how the training works between Q
  • 4:02 - 4:04
    learning versus deep Q learning let's
  • 4:04 - 4:06
    take a look at what the outcome looks
  • 4:06 - 4:09
like for Q learning the
  • 4:09 - 4:11
    output of the training is a q table
  • 4:11 - 4:13
    which is nothing more than a
  • 4:13 - 4:17
    two-dimensional array consisting of 16
  • 4:17 - 4:20
    states by four actions after training
  • 4:20 - 4:22
    the whole table is going to be filled
  • 4:22 - 4:26
    with Q values for example it might look
  • 4:26 - 4:28
    something like
  • 4:28 - 4:32
    this so in this case the agent is at
  • 4:32 - 4:34
    state zero we look at the table and see
  • 4:34 - 4:37
    what the maximum value is in this case
  • 4:37 - 4:40
    it's to go right so this is the
  • 4:40 - 4:43
    prescribed action over at Deep Q
  • 4:43 - 4:46
    learning the output is a deep Q Network
  • 4:46 - 4:48
    which is actually nothing more than a
  • 4:48 - 4:52
regular feedforward neural network this is
  • 4:52 - 4:54
    actually what it looks like but I think
  • 4:54 - 4:56
    we should switch back to the simplified
  • 4:56 - 5:00
    view the input layer is going to have 16
  • 5:00 - 5:04
    nodes the output layer is going to have
  • 5:04 - 5:05
    four
  • 5:05 - 5:09
nodes the way that we enter input into
  • 5:09 - 5:12
    the input layer is like this if the
  • 5:12 - 5:14
    agent is at state zero we're going to
  • 5:14 - 5:18
set the first node to one and everything
  • 5:18 - 5:20
    else
  • 5:20 - 5:25
    zero if the agent is at State one then
  • 5:25 - 5:27
the first node is zero and the second node
  • 5:27 - 5:31
    is one and everything else is zero so
  • 5:31 - 5:33
    this is called one hot encoding just do
  • 5:33 - 5:36
    one more if it's at the
  • 5:36 - 5:41
last state if we want to put in state 15 then
  • 5:41 - 5:44
everything is all zeros and the node for state 15 is one
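A quick sketch of this one hot encoding as a PyTorch tensor (the helper name is just illustrative):

```python
import torch

def one_hot(state, num_states=16):
    # a 16-element input vector with a single 1 at the agent's state
    x = torch.zeros(num_states)
    x[state] = 1.0
    return x

print(one_hot(0))    # tensor([1., 0., 0., ..., 0.])  -> agent at state 0
print(one_hot(15))   # tensor([0., 0., ..., 0., 1.])  -> agent at state 15
```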
  • 5:44 - 5:47
    so that's how the input works with the
  • 5:47 - 5:50
    input the output that gets calculated
  • 5:50 - 5:53
    are Q values now the Q values are not
  • 5:53 - 5:55
    going to look like what's in the Q table
  • 5:55 - 5:58
    but it might be something similar let me
  • 5:58 - 6:01
    throw some numbers in here
  • 6:02 - 6:05
    okay so same thing for this particular
  • 6:05 - 6:08
    input the best action is the highest Q
  • 6:08 - 6:11
    value when training a neural network
  • 6:11 - 6:14
    essentially we're trying to train the
  • 6:14 - 6:16
    weights associated with each one of
  • 6:16 - 6:20
    these lines in the neural network and
  • 6:20 - 6:24
    the bias for all the hidden layers now
  • 6:24 - 6:26
    how do we know how many hidden layers we
  • 6:26 - 6:29
    need in this case one layer was enough
  • 6:29 - 6:31
    but you can certainly add more if
  • 6:31 - 6:34
    necessary and how many nodes do we need
  • 6:34 - 6:37
in the hidden layer I tried 16 and it's
  • 6:37 - 6:40
    able to solve the map so I'm sticking
  • 6:40 - 6:42
    with that but you can certainly increase
  • 6:42 - 6:44
    or decrease this number to see what
  • 6:44 - 6:47
    happens okay so that's the differences
  • 6:47 - 6:49
    between the output of Q learning versus
  • 6:49 - 6:51
    deep Q
  • 6:52 - 6:55
    learning as the agent navigates the map
  • 6:55 - 6:58
    it is using the Q learning formula to
  • 6:58 - 7:01
    calculate the Q values and update the Q
  • 7:01 - 7:03
    table now the formula might look a
  • 7:03 - 7:04
    little bit scary but it's actually not
  • 7:04 - 7:07
that bad let's work through some examples
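For reference, the formula being walked through here is the standard Q learning update, with alpha the learning rate, gamma the discount factor, and s' the new state:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)
```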
  • 7:07 - 7:10
let's say our agent is in state
  • 7:10 - 7:14
    14 and it's going right to get to State
  • 7:14 - 7:19
    15 we're calculating Q of State 14
  • 7:19 - 7:24
    Action 2 which is this cell the current
  • 7:24 - 7:27
    value of that cell is zero because we
  • 7:27 - 7:30
    initialize the whole Q table to zero at
  • 7:30 - 7:33
    the beginning plus the learning rate the
  • 7:33 - 7:35
    learning rate is a hyperparameter that
  • 7:35 - 7:38
    we can set as an example I'll just put
  • 7:38 - 7:39
    in
  • 7:39 - 7:40
    0.01
  • 7:40 - 7:45
    times the reward we get a reward of one
  • 7:45 - 7:47
    because we reach the goal plus the
  • 7:47 - 7:50
    discount Factor another parameter that
  • 7:50 - 7:55
    we set I'll just use 0.9 times the max Q
  • 7:55 - 7:58
    value of the new state so the max Q
  • 7:58 - 8:01
    value of State 15
  • 8:01 - 8:04
    since State 15 is a terminal state it
  • 8:04 - 8:06
    will never get anything other than zeros
  • 8:06 - 8:07
    in this
  • 8:07 - 8:11
    table Max is zero essentially these two
  • 8:11 - 8:15
    are gone subtracted by the same thing
  • 8:15 - 8:17
    that we had
  • 8:17 - 8:20
    here okay so work through the math this
  • 8:20 - 8:23
    is just
  • 8:23 - 8:26
    0.01 so we get 0.01
  • 8:26 - 8:29
    here now let's do another one really
  • 8:29 - 8:36
    quickly agent is here take a right Q of
  • 8:36 - 8:42
    13 Going to the right 13 is also all
  • 8:42 - 8:44
    zeros this is the one we're
  • 8:44 - 8:48
    updating here is zero plus the learning
  • 8:48 - 8:53
    rate times a reward there is no reward
  • 8:53 - 8:57
    plus the discount Factor times Max of
  • 8:57 - 9:01
    the new state Max of State
  • 9:01 - 9:05
    14 Max of State 14 is
  • 9:06 - 9:11
    0.01 subtracted by again it's
  • 9:11 - 9:15
    zero this is equal to
  • 9:15 - 9:18
0.9 okay it's actually pretty straightforward
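Here is a small sketch of those two updates in code, assuming the Q table starts at all zeros with a learning rate of 0.01 and a discount factor of 0.9 as in the example:

```python
import numpy as np

q = np.zeros((16, 4))      # 16 states x 4 actions (left, down, right, up)
lr, gamma = 0.01, 0.9

def q_update(state, action, reward, new_state):
    q[state, action] += lr * (reward + gamma * np.max(q[new_state]) - q[state, action])

# first example: state 14, go right (2), reward 1, new state 15 (terminal, all zeros)
q_update(14, 2, 1, 15)
print(q[14, 2])            # 0.01

# second example: state 13, go right (2), no reward, new state 14
q_update(13, 2, 0, 14)
print(q[13, 2])            # lr * gamma * q[14, 2], a tiny positive value
```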
  • 9:18 - 9:20
now how the heck does
  • 9:20 - 9:23
    this formula help find the PATH so if we
  • 9:23 - 9:28
    train enough the theory is that this
  • 9:28 - 9:32
    number is going to be really close to
  • 9:32 - 9:35
    one and then the states next to
  • 9:35 - 9:40
    it it's going to be 0.9
  • 9:40 - 9:43
    something and then the states adjacent
  • 9:43 - 9:47
    to that it's probably going to be 0.8
  • 9:50 - 9:53
    something and if we keep
  • 9:56 - 10:01
    going we can see that a path is
  • 10:01 - 10:04
actually two
  • 10:04 - 10:07
    paths are possible so mathematically
  • 10:07 - 10:11
    this is how the path is
  • 10:12 - 10:15
    found over at Deep Q learning the
  • 10:15 - 10:17
    formula is going to look like this we
  • 10:17 - 10:20
    set Q equal to the reward if the new
  • 10:20 - 10:24
    state is a terminal State otherwise we
  • 10:24 - 10:25
    set it
  • 10:25 - 10:28
    to this
  • 10:28 - 10:30
    part
  • 10:30 - 10:32
of the Q learning formula
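Written out, the target value being described is:

```latex
q_{\text{target}} =
\begin{cases}
r & \text{if } s' \text{ is a terminal state} \\
r + \gamma \max_{a'} Q_{\text{target}}(s', a') & \text{otherwise}
\end{cases}
```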
  • 10:32 - 10:34
let's see how the formula is
  • 10:34 - 10:36
going to be used in
  • 10:36 - 10:39
training for deep Q learning we actually
  • 10:39 - 10:42
    need two neural networks let me walk
  • 10:42 - 10:45
    through the steps the network on the
  • 10:45 - 10:48
    left is called the policy Network this
  • 10:48 - 10:51
    is the network that we're going to do
  • 10:51 - 10:53
    the training on the one on the right is
  • 10:53 - 10:56
    called the target Network the target
  • 10:56 - 10:58
    network is the one that makes use of the
  • 10:58 - 11:00
deep Q learning formula now let's walk
  • 11:00 - 11:03
    through the steps of training step one
  • 11:03 - 11:07
    is going to be creation of this policy
  • 11:07 - 11:11
    Network step two we make a copy of the
  • 11:11 - 11:15
    policy Network into the target Network
  • 11:15 - 11:17
    so basically we're copying this the
  • 11:17 - 11:19
    weights and the bias over here so both
  • 11:19 - 11:22
    networks are identical step number three
  • 11:22 - 11:25
    the agent navigates the map as usual
  • 11:25 - 11:30
    let's say the agent is here in state 14
  • 11:30 - 11:34
    and it's going into State 15 step three
  • 11:34 - 11:37
    is just navigation now step four we
  • 11:37 - 11:40
    input State 14 remember how we have to
  • 11:40 - 11:42
    encode the input it's going to look like
  • 11:42 - 11:48
    this 0 1 2
  • 11:48 - 11:50
3 all the way to 12
  • 11:50 - 11:55
    13 State 14 15 the input is going to
  • 11:55 - 11:57
    look like this as you know neural
  • 11:57 - 12:00
    networks when it's created it comes with
  • 12:00 - 12:04
    a random set of weights and bias so with
  • 12:04 - 12:05
    this input we're actually going to get
  • 12:05 - 12:09
    an output even though these values are
  • 12:09 - 12:10
    pretty much
  • 12:10 - 12:13
    meaningless so we might get some stuff
  • 12:13 - 12:15
    that looks
  • 12:15 - 12:18
    like I'm just putting in some random
  • 12:18 - 12:21
    numbers okay so these Q values are
  • 12:21 - 12:24
    meaningless and as a reminder this is
  • 12:24 - 12:28
    the left action down right and up okay
  • 12:28 - 12:30
    step five
  • 12:30 - 12:32
    we do the same thing we take the exact
  • 12:32 - 12:38
    same input State 14 and send it into the
  • 12:38 - 12:41
    target Network which will also
  • 12:41 - 12:44
    calculate the exact same numbers
  • 12:44 - 12:47
    because the target network is the same
  • 12:47 - 12:50
    as the policy Network
  • 12:50 - 12:53
    currently step six this is where we
  • 12:53 - 12:58
    calculate the Q value for State 14 we're
  • 12:58 - 13:00
    taking the action of
  • 13:00 - 13:04
    two is equal to since we're going into
  • 13:04 - 13:08
    State 15 it is a terminal
  • 13:08 - 13:10
    State we set it equal
  • 13:10 - 13:12
    to
  • 13:12 - 13:17
    1 Step seven step seven is to set the
  • 13:17 - 13:22
    target input State 14 output action two
  • 13:22 - 13:23
    is this
  • 13:23 - 13:25
    node we take the value that we
  • 13:25 - 13:29
calculated up in step six and we
  • 13:29 - 13:33
replace the Q value in the output step
  • 13:33 - 13:36
    number eight we take the target Q
  • 13:36 - 13:39
    values we use it to train the policy
  • 13:39 - 13:42
    Network so this value is the one that's
  • 13:42 - 13:44
    really going to change as you know with
  • 13:44 - 13:46
    neural networks it doesn't go straight
  • 13:46 - 13:49
    to one it's going to go toward that
  • 13:49 - 13:51
direction so maybe it'll go to
  • 13:51 - 13:54
    0.01 but if you repeat the training many
  • 13:54 - 13:58
    many times it will approach One Step
  • 13:58 - 14:02
    nine is to repeat the whole thing
  • 14:02 - 14:05
    again of course we're not going to
  • 14:05 - 14:07
create the policy Network we're not going
  • 14:07 - 14:09
    to make a copy of it so what we're
  • 14:09 - 14:14
    repeating is steps three through 8 step
  • 14:14 - 14:17
    10 after a certain number of steps or
  • 14:17 - 14:20
    episodes we're going to sync the policy
  • 14:20 - 14:22
    network with the target Network which
  • 14:22 - 14:24
    means we're going to make them identical
  • 14:24 - 14:26
by copying the weights and biases from
  • 14:26 - 14:29
    the policy over to the Target Network
  • 14:29 - 14:31
    after syncing the networks we're going
  • 14:31 - 14:36
    to repeat nine again and then repeat 10
  • 14:36 - 14:39
    until training is done okay so that's
  • 14:39 - 14:42
generally how deep Q learning works
  • 14:42 - 14:43
    that might have been a little bit
  • 14:43 - 14:45
    complicated so maybe rewind the video
  • 14:45 - 14:47
    watch it again and make sure you
  • 14:47 - 14:49
    understand what's
  • 14:49 - 14:52
    happening to effectively train a neural
  • 14:52 - 14:54
    network we need to randomize the
  • 14:54 - 14:57
    training data that we send into the
  • 14:57 - 14:59
    neural network however if you remember
  • 14:59 - 15:02
    from steps three and four the agent
  • 15:02 - 15:05
    takes an action and then we're sending
  • 15:05 - 15:07
that training data into the neural network
  • 15:07 - 15:09
so the question is how do we
  • 15:09 - 15:12
    randomize the order of a single sample
  • 15:12 - 15:14
    but that's where experience replay comes
  • 15:14 - 15:19
    in we need a step 3A where we memorize
  • 15:19 - 15:21
    the agent's experience as the agent
  • 15:21 - 15:24
navigates the map we're storing the state
  • 15:24 - 15:26
    that it was in what action it took the
  • 15:26 - 15:28
    new state that it reached if there was a
  • 15:28 - 15:30
reward or not and if the new state
  • 15:30 - 15:33
    is a terminal state or not we take that
  • 15:33 - 15:36
    and insert it into the memory and the
  • 15:36 - 15:39
    memory is nothing more than a python
  • 15:39 - 15:41
deque how a deque works is that as the
  • 15:41 - 15:46
deque gets full whatever is at the oldest end gets
  • 15:46 - 15:50
    purged we need a step 3B this is the
  • 15:50 - 15:54
    replay step this is where we take say 30
  • 15:54 - 15:57
random samples from the deque and then
  • 15:57 - 16:00
    pass it on to step four for training so
  • 16:00 - 16:03
that's what experience replay is
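A tiny sketch of the memorize and replay steps using Python's deque (when the deque is full, appending a new item pushes the oldest one out the other end):

```python
from collections import deque
import random

memory = deque(maxlen=5)                 # step 3A: the replay memory
for transition in range(8):
    memory.append(transition)            # oldest entries get purged once full

print(list(memory))                      # [3, 4, 5, 6, 7]
print(random.sample(memory, 3))          # step 3B: a random mini batch for training
```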
  • 16:04 - 16:07
before we jump into the code if you
  • 16:07 - 16:09
    have problems installing gymnasium
  • 16:09 - 16:12
    especially on Windows I got a video for
  • 16:12 - 16:15
    that I'm not going to walk through the Q
  • 16:15 - 16:17
    learning code but if you are interested
  • 16:17 - 16:19
    in that jump to my Q learning Code
  • 16:19 - 16:21
    walkthrough video after watching this
  • 16:21 - 16:24
    one also if you have zero experience
  • 16:24 - 16:26
    with neural networks you can check out
  • 16:26 - 16:28
this basic tutorial on neural networks
  • 16:28 - 16:31
you'll need PyTorch so head over to
  • 16:31 - 16:34
pytorch.org and get that installed the first
  • 16:34 - 16:36
    thing that we do is create the class to
  • 16:36 - 16:39
    represent our deep Q Network I'm calling
  • 16:39 - 16:42
    it dqn as mentioned before a deep Q
  • 16:42 - 16:45
    network is nothing more than a feed
  • 16:45 - 16:47
forward neural network so there's really
  • 16:47 - 16:49
    nothing special about it here I'm using
  • 16:49 - 16:51
a pretty standard way of creating a neural
  • 16:51 - 16:55
network using PyTorch if you look up any
  • 16:55 - 16:58
    pytorch tutorial you'll probably see
  • 16:58 - 17:00
something just like this so since
  • 17:00 - 17:03
this is not a PyTorch tutorial I'm not
  • 17:03 - 17:05
    going to spend too much time explaining
  • 17:05 - 17:07
how PyTorch works for this class we
  • 17:07 - 17:09
have to inherit the neural network
  • 17:09 - 17:12
    module which requires us to implement
  • 17:12 - 17:15
two functions the init function and the
  • 17:15 - 17:17
    forward function in the init function
  • 17:17 - 17:20
    I'm passing in the number of nodes in my
  • 17:20 - 17:24
    input State hidden layer and output
  • 17:24 - 17:26
    State back at our diagram we're going to
  • 17:26 - 17:30
    have 16 input States as mentioned before
  • 17:30 - 17:32
    I'm using 16 in the hidden layer that's
  • 17:32 - 17:34
    something that you can adjust yourself
  • 17:34 - 17:36
    and in the output layer four
  • 17:36 - 17:40
nodes we declare the hidden layer which
  • 17:40 - 17:43
    has 16 going into 16 and then the output
  • 17:43 - 17:47
    layer 16 going into four and then the
  • 17:47 - 17:49
    forward function X is the training data
  • 17:49 - 17:51
    set and we're sending the training data
  • 17:51 - 17:53
through the neural network again this is
  • 17:53 - 17:57
pretty common PyTorch code
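A sketch of the kind of class being walked through here (parameter names like in_states and h1_nodes are illustrative, not necessarily the exact ones on screen):

```python
from torch import nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, in_states, h1_nodes, out_actions):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)    # hidden layer (16 -> 16 here)
        self.out = nn.Linear(h1_nodes, out_actions)  # output layer (16 -> 4 here)

    def forward(self, x):
        x = F.relu(self.fc1(x))   # send the input through the hidden layer
        return self.out(x)        # one Q value per action
```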
  • 17:57 - 18:00
next we need a class to represent
  • 18:00 - 18:02
    the replay
  • 18:02 - 18:05
    memory so it's this portion that we're
  • 18:05 - 18:08
    implementing right
  • 18:08 - 18:12
    now in the init function we'll pass in a
  • 18:12 - 18:14
    max length and then create the python
  • 18:14 - 18:17
deque in the append function we're going
  • 18:17 - 18:21
    to append the transition the transition
  • 18:21 - 18:24
is this tuple here State action new state
  • 18:24 - 18:27
    reward and
  • 18:27 - 18:30
terminated the sample function will
  • 18:30 - 18:33
    return a random sample of whatever size
  • 18:33 - 18:35
    we want from the memory and then the
  • 18:35 - 18:37
len function simply returns the length
  • 18:37 - 18:39
of the memory
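A sketch of that replay memory class (the tuple layout follows the transition described above):

```python
from collections import deque
import random

class ReplayMemory:
    def __init__(self, maxlen):
        self.memory = deque([], maxlen=maxlen)

    def append(self, transition):
        # transition = (state, action, new_state, reward, terminated)
        self.memory.append(transition)

    def sample(self, sample_size):
        return random.sample(self.memory, sample_size)

    def __len__(self):
        return len(self.memory)
```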
  • 18:39 - 18:42
the FrozenLakeDQL class is where
  • 18:42 - 18:44
    we're going to do our
  • 18:44 - 18:47
    training we set the learning rate and
  • 18:47 - 18:50
the discount Factor those are part of
  • 18:50 - 18:53
    the Q learning formula and these are
  • 18:53 - 18:55
    values that you can adjust and play
  • 18:55 - 18:56
    around
  • 18:56 - 18:59
    with the network sync rate it's the
  • 18:59 - 19:01
    number of steps the agent takes before
  • 19:01 - 19:05
    syncing the policy and Target
  • 19:05 - 19:08
    Network that's the setting for step 10
  • 19:08 - 19:13
    Where We sync the policy and the target
  • 19:13 - 19:17
    Network we set the replay memory size to
  • 19:17 - 19:22
    1,000 and the replay memory sample size
  • 19:22 - 19:24
    to 32 these are also numbers that you
  • 19:24 - 19:28
can play around with next is the loss
  • 19:28 - 19:31
    function and the optimizer these two are
  • 19:31 - 19:34
PyTorch variables for the loss function
  • 19:34 - 19:36
    I'm simply using the mean square error
  • 19:36 - 19:38
function for the optimizer we'll
  • 19:38 - 19:41
    initialize that at a later time this
  • 19:41 - 19:44
    actions list is to Simply map the action
  • 19:44 - 19:49
    numbers into letters for
  • 19:49 - 19:52
    printing in the train function we can
  • 19:52 - 19:54
    specify how many episodes we want to
  • 19:54 - 19:57
    train the agent whether we want to
  • 19:57 - 20:00
render the map on screen and whether we
  • 20:00 - 20:02
    want to turn on the slippery
  • 20:02 - 20:06
flag we'll instantiate the Frozen Lake
  • 20:06 - 20:09
    environment and then create a variable
  • 20:09 - 20:12
    to store the number of states this is
  • 20:12 - 20:15
going to be 16 and store the number of
  • 20:15 - 20:18
    actions this is going to be four here we
  • 20:18 - 20:21
    initialize Epsilon to one and we
  • 20:21 - 20:24
    instantiate the replay memory this is
  • 20:24 - 20:27
going to be of size 1,000 now we
  • 20:27 - 20:31
create the policy deep Q
  • 20:31 - 20:34
    Network this is step one create the
  • 20:34 - 20:36
    policy
  • 20:36 - 20:40
    Network we also create a Target Network
  • 20:40 - 20:43
    and copy the weights and bias from the
  • 20:43 - 20:46
    policy Network into the target Network
  • 20:46 - 20:48
    so that
  • 20:48 - 20:52
    is Step number two making the Target and
  • 20:52 - 20:53
    policy Network
  • 20:53 - 20:56
    identical before training we'll do a
  • 20:56 - 20:58
    print out of the policy Network just so
  • 20:58 - 21:02
    we can compare it to the end result next
  • 21:02 - 21:04
    we initialize the optimizer that we
  • 21:04 - 21:05
    declared
  • 21:05 - 21:09
earlier and we're simply using the Adam
  • 21:09 - 21:12
    Optimizer passing in the learning rate
  • 21:12 - 21:15
    we'll use this rewards per episode list
  • 21:15 - 21:18
    to keep track of the rewards collected
  • 21:18 - 21:18
    per
  • 21:18 - 21:21
    episode we'll also use this Epsilon
  • 21:21 - 21:24
    history list to keep track of Epsilon
  • 21:24 - 21:26
    decaying over
  • 21:26 - 21:29
    time we'll use step count to keep track
  • 21:29 - 21:31
    of the number of steps taken this is
  • 21:31 - 21:35
    used to determine when to sync the
  • 21:35 - 21:39
    policy and Target Network again so that
  • 21:39 - 21:41
    is for step 10 the
  • 21:41 - 21:45
    syncing next we will Loop through the
  • 21:45 - 21:47
    number of episodes that were specified
  • 21:47 - 21:49
    when calling the train function we'll
  • 21:49 - 21:52
    initialize the map so the agent starts
  • 21:52 - 21:55
    at state zero we'll initialize the
  • 21:55 - 21:58
    terminated and truncated flag terminated
  • 21:58 - 21:59
is when the agent falls into the
  • 21:59 - 22:02
    hole or reaches the goal truncated is
  • 22:02 - 22:05
    when the agent takes more than 200
  • 22:05 - 22:08
actions on such a small 4x4 map this
  • 22:08 - 22:10
    is probably never going to occur but
  • 22:10 - 22:12
    we'll keep this anyway now we have a
  • 22:12 - 22:14
    while loop checking for the terminated
  • 22:14 - 22:18
    and truncated flag keep looping until
  • 22:18 - 22:20
    these conditions are met this part is
  • 22:20 - 22:23
    basically the Epsilon greedy algorithm
  • 22:23 - 22:27
    if a random number is less than Epsilon
  • 22:27 - 22:29
    we'll just pick a random action
  • 22:29 - 22:31
    otherwise we'll use the policy Network
  • 22:31 - 22:34
    to calculate the set of Q values take
  • 22:34 - 22:37
    the maximum of that extract that item
  • 22:37 - 22:39
    and that's going to be the best action
  • 22:39 - 22:42
so we have a function here called
  • 22:42 - 22:46
state_to_dqn_input let's jump down there
  • 22:46 - 22:48
    remember that we have to take the state
  • 22:48 - 22:52
    and encode it that's what this function
  • 22:52 - 22:56
    does this type of encoding
  • 22:56 - 23:00
    here okay let's go back
  • 23:00 - 23:02
with PyTorch if we want to perform a
  • 23:02 - 23:05
prediction we should call this torch.no_grad()
  • 23:05 - 23:07
so that it doesn't calculate the
  • 23:07 - 23:09
    stuff needed for training after
  • 23:09 - 23:12
selecting either a random action or
  • 23:12 - 23:14
    the best action we'll call the step
  • 23:14 - 23:17
    function to execute the action when we
  • 23:17 - 23:20
    take that action it's going to return
  • 23:20 - 23:22
    the new state whether there is a reward
  • 23:22 - 23:25
    or not whether it's a terminal state or
  • 23:25 - 23:27
    we got truncated we take all that
  • 23:27 - 23:29
    information
  • 23:29 - 23:33
    and put it into our memory so that is
  • 23:33 - 23:38
    Step 3A when we did the Epsilon greedy
  • 23:38 - 23:41
that was step
  • 23:41 - 23:45
    three after executing the action we're
  • 23:45 - 23:47
    resetting the state equal to the new
  • 23:47 - 23:50
    state we increment our step counter if
  • 23:50 - 23:54
we received a reward put it on our list
  • 23:54 - 23:56
    now we'll check the memory to see if we
  • 23:56 - 23:58
have enough training data to do
  • 23:58 - 24:02
    optimization on also we want to check if
  • 24:02 - 24:05
    we have collected at least one reward if
  • 24:05 - 24:07
    we haven't collected any rewards there's
  • 24:07 - 24:09
    really no point in optimizing the
  • 24:09 - 24:12
    network if those conditions are met we
  • 24:12 - 24:15
use memory.sample and we pass in the
  • 24:15 - 24:18
    batch size which was 32 and we get a
  • 24:18 - 24:21
    batch of training data out of the memory
  • 24:21 - 24:23
    we'll pass that training data into the
  • 24:23 - 24:26
    optimized function along with the policy
  • 24:26 - 24:29
    Network and the target Network
  • 24:29 - 24:32
    let's jump over to the optimize
  • 24:32 - 24:35
    function this first line is just looking
  • 24:35 - 24:37
    at the policy Network and getting the
  • 24:37 - 24:39
number of input nodes we expect this to
  • 24:39 - 24:44
be 16 the current Q list and target
  • 24:44 - 24:47
Q list so this is the current
  • 24:47 - 24:51
Q values this is the target
  • 24:51 - 24:55
Q values what I'm about to describe now is
  • 24:55 - 25:00
Step 3B replaying the experience
  • 25:00 - 25:02
    so to replay the experience We're
  • 25:02 - 25:05
    looping through the training data inside
  • 25:05 - 25:08
    the mini batch let's jump down here for
  • 25:08 - 25:12
    a moment so this is Step number four
  • 25:12 - 25:14
    we're taking the states and passing it
  • 25:14 - 25:17
    into the policy Network to calculate the
  • 25:17 - 25:19
    current list of Q
  • 25:19 - 25:23
    values step number four the output is
  • 25:23 - 25:25
    the list of Q
  • 25:25 - 25:28
    values step number five we pass in the
  • 25:28 - 25:31
    same thing to the Target
  • 25:31 - 25:35
    Network step five we get the Q values
  • 25:35 - 25:38
    out here which should be the same as the
  • 25:38 - 25:40
    Q values from the policy
  • 25:40 - 25:44
    Network step number six is up here so
  • 25:44 - 25:47
    reminder of what it looks like step six
  • 25:47 - 25:50
    is using this formula
  • 25:50 - 25:53
    here so that's what we have here if
  • 25:53 - 25:56
    terminated if terminated just return the
  • 25:56 - 25:58
reward otherwise use that second
  • 25:58 - 26:01
    formula so now that we have a Target
  • 26:01 - 26:04
    we'll go to step seven step seven is
  • 26:04 - 26:06
    taking the output of Step six and
  • 26:06 - 26:09
    replacing the respective Q
  • 26:09 - 26:12
    value and that's what we're doing here
  • 26:12 - 26:15
    we're replacing the Q value of that
  • 26:15 - 26:17
    particular action with the Target that
  • 26:17 - 26:19
    was calculated up
  • 26:19 - 26:23
    above step number eight step eight is to
  • 26:23 - 26:25
    take the target values and use that to
  • 26:25 - 26:29
    train the current Q values
  • 26:29 - 26:32
    so that's what we have here we're using
  • 26:32 - 26:35
    the loss function pass in the current
  • 26:35 - 26:38
    set of Q values plus the target set of Q
  • 26:38 - 26:41
values and then this is just standard
  • 26:41 - 26:44
PyTorch code to optimize the policy network
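Putting steps four through eight together, a sketch of what an optimize function like this can look like (helper names and exact tensor handling are assumptions, not a copy of the code on screen):

```python
import torch

def optimize(mini_batch, policy_dqn, target_dqn, optimizer,
             loss_fn=torch.nn.MSELoss(), gamma=0.9, num_states=16):
    def one_hot(state):                        # same encoding as state_to_dqn_input
        x = torch.zeros(num_states)
        x[state] = 1.0
        return x

    current_q_list, target_q_list = [], []
    for state, action, new_state, reward, terminated in mini_batch:
        # step 4: current Q values from the policy network
        current_q_list.append(policy_dqn(one_hot(state)))

        with torch.no_grad():
            # step 6: the deep Q learning target
            if terminated:
                target = torch.tensor(float(reward))
            else:
                target = reward + gamma * target_dqn(one_hot(new_state)).max()
            # steps 5 and 7: target network output with the taken action's value replaced
            target_q = target_dqn(one_hot(state))
            target_q[action] = target
        target_q_list.append(target_q)

    # step 8: train the policy network toward the target Q values
    loss = loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```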
  • 26:44 - 26:49
now step nine step nine is to
  • 26:49 - 26:52
    repeat steps 3 to 8 starting from
  • 26:52 - 26:56
    navigation to optimizing the
  • 26:57 - 26:59
    network
  • 26:59 - 27:01
    basically that's just continuing this
  • 27:01 - 27:04
    inner while loop of stepping through the
  • 27:04 - 27:06
    states and this outer for Loop of
  • 27:06 - 27:09
    stepping through each
  • 27:09 - 27:12
    episode step 10 is where we sync the
  • 27:12 - 27:16
    policy Network and the target
  • 27:17 - 27:20
    Network and we do that down here if the
  • 27:20 - 27:23
    number of steps taken is greater than
  • 27:23 - 27:25
    the network sync rate that we set then
  • 27:25 - 27:28
    we copy the policy Network into the
  • 27:28 - 27:30
    target Network and then we reset the
  • 27:30 - 27:33
    step counter also after each episode we
  • 27:33 - 27:37
    should be decaying the Epsilon value
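In code, the syncing and decay being described can be as small as this fragment inside the per-episode training loop (variable names like network_sync_rate follow the settings above; the decay schedule is one common choice, not necessarily the exact one used):

```python
# decay epsilon a little after each episode
epsilon = max(epsilon - 1 / episodes, 0)
epsilon_history.append(epsilon)

# step 10: after enough steps, copy the policy network's weights into the target network
if step_count > network_sync_rate:
    target_dqn.load_state_dict(policy_dqn.state_dict())
    step_count = 0
```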
  • 27:37 - 27:39
    after syncing the
  • 27:39 - 27:43
network we repeat step nine again which is
  • 27:43 - 27:45
    repeating basically the navigation and
  • 27:45 - 27:47
    training
  • 27:47 - 27:50
    again so it's just going to go back up
  • 27:50 - 27:53
    here and do this all over again all the
  • 27:53 - 27:56
    way until we have finished the number of
  • 27:56 - 27:58
    episodes so that was training after
  • 27:58 - 28:01
    training we close the environment we can
  • 28:01 - 28:04
save the policy or the weights and biases
  • 28:04 - 28:07
    into a file I'm hardcoding it here you
  • 28:07 - 28:10
    can definitely make this more Dynamic
  • 28:10 - 28:13
    here I'm creating a new graph and I'm
  • 28:13 - 28:16
    basically graphing the rewards collected
  • 28:16 - 28:18
    per episode also I'm graphing the
  • 28:18 - 28:21
    Epsilon history here after graphing I'm
  • 28:21 - 28:24
    saving the graphs into an image all
  • 28:24 - 28:27
    right so that was the train function we
  • 28:27 - 28:29
already talked about the optimize
  • 28:29 - 28:31
    function let me fold
  • 28:31 - 28:35
    that we already looked at that now the
  • 28:35 - 28:38
test function is going to run the Frozen
  • 28:38 - 28:40
    Lake environment with the policy that we
  • 28:40 - 28:43
    learned from the train function we can
  • 28:43 - 28:46
    also pass in the number of episodes and
  • 28:46 - 28:47
    whether we want to turn on slippery or
  • 28:47 - 28:50
    not so the rest of the code is going to
  • 28:50 - 28:52
    look pretty similar to what we had in
  • 28:52 - 28:55
    the train function we instantiate the
  • 28:55 - 28:57
    environment get the number of states and
  • 28:57 - 29:01
actions 16 and 4 here we declare the
  • 29:01 - 29:04
    policy Network load from the file that
  • 29:04 - 29:07
we saved from training this is just
  • 29:07 - 29:09
PyTorch code to switch the policy Network
  • 29:09 - 29:12
    to prediction mode or evaluation mode
  • 29:12 - 29:14
    rather than training mode we'll print
  • 29:14 - 29:17
the trained policy and then we'll loop
  • 29:17 - 29:20
    over the episodes we set the agent up
  • 29:20 - 29:22
    top you've seen this before keep looping
  • 29:22 - 29:25
    until the agent gets terminated or
  • 29:25 - 29:28
    truncated here we're selecting the best
  • 29:28 - 29:31
    action out of the policy Network and
  • 29:31 - 29:33
executing the action and then close the environment
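A sketch of what a test function like this can look like (the file name and network sizes are assumptions; it reuses the DQN class sketched earlier):

```python
import gymnasium as gym
import torch

def test(episodes, is_slippery=False, model_path="frozen_lake_dql.pt"):
    env = gym.make("FrozenLake-v1", map_name="4x4",
                   is_slippery=is_slippery, render_mode="human")

    policy_dqn = DQN(in_states=16, h1_nodes=16, out_actions=4)
    policy_dqn.load_state_dict(torch.load(model_path))
    policy_dqn.eval()                       # evaluation mode, not training mode

    for _ in range(episodes):
        state = env.reset()[0]
        terminated = truncated = False
        while not (terminated or truncated):
            with torch.no_grad():
                x = torch.zeros(16)
                x[state] = 1.0              # one hot encode the current state
                action = policy_dqn(x).argmax().item()
            state, _, terminated, truncated, _ = env.step(action)
    env.close()
```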
  • 29:33 - 29:36
so ideally when we run this
  • 29:36 - 29:38
    we're going to see the agent navigate
  • 29:38 - 29:40
    the map and reach the goal however if
  • 29:40 - 29:43
    slippery is on there is no guarantee
  • 29:43 - 29:45
    that the agent is going to solve it in
  • 29:45 - 29:47
    one try it might take a couple of tries
  • 29:47 - 29:50
    for the agent to solve the
  • 29:50 - 29:53
    map all right down at the main function
  • 29:53 - 29:56
we create an instance of the Frozen Lake
  • 29:56 - 29:58
DQL class first we're going to try
  • 29:58 - 30:01
non-slippery we'll train it for a thousand
  • 30:01 - 30:04
times or 1,000 episodes we'll run the
  • 30:04 - 30:06
    test four times just because the map
  • 30:06 - 30:08
    pops up and goes away when the agent
  • 30:08 - 30:11
gets to the goal so we want to run it
  • 30:11 - 30:12
a couple times so we actually can see it on the screen
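The main block being described boils down to something like this (assuming the class exposes train and test as discussed):

```python
if __name__ == "__main__":
    frozen_lake = FrozenLakeDQL()
    frozen_lake.train(1000, is_slippery=False)   # train for 1,000 episodes
    frozen_lake.test(4, is_slippery=False)       # watch the learned policy a few times
```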
  • 30:12 - 30:14
okay I'm just hitting
  • 30:14 - 30:17
Ctrl
  • 30:19 - 30:22
    F5 all right training is done and looks
  • 30:22 - 30:26
    like the agent is able to solve the map
  • 30:26 - 30:28
    we also have a print out of the policy
  • 30:28 - 30:30
    that it's learned I print it in a way
  • 30:30 - 30:33
    that matches up with the grid let's jump
  • 30:33 - 30:36
    back to this in a second let's go to our
  • 30:36 - 30:41
    files so after training we have a new
  • 30:41 - 30:45
    graph on the left side that's the number
  • 30:45 - 30:48
    of rewards collected over the 1,000
  • 30:48 - 30:51
    episodes as you can see over time it's
  • 30:51 - 30:54
    improving and finally at the end it's
  • 30:54 - 30:56
    getting a lot of rewards on the right
  • 30:56 - 30:59
    side we can see Epsilon decaying
  • 30:59 - 31:03
    starting from one slowly slowly down to
  • 31:03 - 31:06
    zero also another file that gets created
  • 31:06 - 31:09
is a binary file so we can't really
  • 31:09 - 31:11
    display it but it contains the weights
  • 31:11 - 31:14
    and biases of the policy
  • 31:14 - 31:17
    Network so if we want to do the test
  • 31:17 - 31:19
    again we don't need the training we can
  • 31:19 - 31:22
    comment out the training line and just
  • 31:22 - 31:24
    run this
  • 31:24 - 31:26
    again and we can see the agent solve the
  • 31:26 - 31:29
    map
  • 31:32 - 31:34
    we can look at what policy the agent
  • 31:34 - 31:37
    learned the way this is printed it
  • 31:37 - 31:41
    matches up with the map so at state zero
  • 31:41 - 31:44
    the best action was to go right which is
  • 31:44 - 31:47
    State one at State one the best action
  • 31:47 - 31:51
is to go right again at state two
  • 31:51 - 31:55
go down at state six go down again at state 10
  • 31:55 - 31:58
    go down again at State 14
  • 31:58 - 32:00
    go to the right
  • 32:00 - 32:05
    okay so it was right right down down
  • 32:05 - 32:08
    down right so remember earlier that
  • 32:08 - 32:10
there were two possible paths the one
  • 32:10 - 32:13
    that we actually learned plus the one
  • 32:13 - 32:16
    going down because Epsilon greedy has a
  • 32:16 - 32:18
    Randomness Factor whether the agent
  • 32:18 - 32:21
    learns the top path or the bottom path
  • 32:21 - 32:24
is just somewhat random if we train
  • 32:24 - 32:29
    again maybe it'll go down the next time
  • 32:30 - 32:34
    now let's turn the slippery Factor
  • 32:34 - 32:37
    on uncomment the training
  • 32:37 - 32:41
    line now with slippery on we expect the
  • 32:41 - 32:44
    agent to fail a lot more often or to
  • 32:44 - 32:46
    fall in the holes a lot more often let
  • 32:46 - 32:50
me just triple the number of
  • 32:50 - 32:55
training episodes and let me do 10 times for
  • 32:55 - 32:58
    testing okay I'm running this again
  • 32:58 - 33:00
again with slippery turned on there's no
  • 33:00 - 33:02
    guarantee that after training the agent
  • 33:02 - 33:05
    is going to be able to pass every
  • 33:05 - 33:08
    episode let's make it come
  • 33:08 - 33:11
    up you see it trying to go to the bottom
  • 33:11 - 33:13
    right but because of slippery it just
  • 33:13 - 33:16
    failed right there but there you go it
  • 33:16 - 33:19
    was able to solve it let's look at the
  • 33:19 - 33:22
    graph the results compared to the
  • 33:22 - 33:24
    non-slippery surface is significantly
  • 33:24 - 33:27
    worse also you want to be careful
  • 33:27 - 33:30
    getting stuck in spots like these which
  • 33:30 - 33:34
    means if I was unlucky enough to set my
  • 33:34 - 33:37
    training episodes to maybe 2200 and it
  • 33:37 - 33:40
    gets stuck here which means it learned a
  • 33:40 - 33:42
    bad policy it won't be able to solve the
  • 33:42 - 33:44
    map so in your training if you're
  • 33:44 - 33:46
    finding that the agent is not able to
  • 33:46 - 33:49
    solve the map when slippery is turned on
  • 33:49 - 33:51
    it might be because of things like these
  • 33:51 - 33:54
    okay that concludes our deep Q learning
  • 33:54 - 33:57
tutorial I'd love to hear feedback from
  • 33:57 - 34:00
    you was the explanation easy to
  • 34:00 - 34:02
    understand what can I improve on and
  • 34:02 - 34:06
    what other topics are you interested in