
Deep Q-Learning/Deep Q-Network (DQN) Explained | Python Pytorch Deep Reinforcement Learning

  • 0:00 - 0:03
    Hey everyone. Welcome to my Deep Q-Learning
  • 0:03 - 0:04
    tutorial. Here's what to expect
  • 0:04 - 0:07
    from this video. Since Deep Q-Learning is
  • 0:07 - 0:09
    a little bit complicated to explain, I'm
  • 0:09 - 0:11
    going to use the Frozen Lake reinforcement
  • 0:11 - 0:14
    learning environment. It's a very simple
  • 0:14 - 0:17
    environment. So, I'm going to do a quick
  • 0:17 - 0:19
    intro on how the environment works. We'll
  • 0:19 - 0:22
    also quickly answer the question of why
  • 0:22 - 0:24
    we need reinforcement learning on such a
  • 0:24 - 0:27
    simple environment. Next, to navigate the
  • 0:27 - 0:30
    environment, we need to use the epsilon-greedy
  • 0:30 - 0:32
    algorithm. Both Q-learning and
  • 0:32 - 0:34
    Deep Q-Learning use the same
  • 0:34 - 0:37
    algorithm. After we know how to navigate
  • 0:37 - 0:38
    the environment, we're going to take a
  • 0:38 - 0:41
    look at the differences of the output
  • 0:41 - 0:43
    between Q-learning and Deep Q-Learning.
  • 0:43 - 0:45
    In Q-learning, we're training a Q-table.
  • 0:45 - 0:48
    In Deep Q-Learning, we're training a Deep
  • 0:48 - 0:51
    Q-network. We'll work through how the Q-table
  • 0:51 - 0:53
    is trained, then we can see how it
  • 0:53 - 0:55
    is different from training a Deep Q-network.
  • 0:55 - 0:58
    In training a Deep Q-network, we
  • 0:58 - 1:00
    also need a technique called
  • 1:00 - 1:02
    experience replay. And after we have an
  • 1:02 - 1:05
    idea of how Deep Q-Learning works, I'm
  • 1:05 - 1:07
    going to walk through the code and also
  • 1:07 - 1:08
    run it and demo it.
  • 1:11 - 1:12
    Just in case you're not familiar with
  • 1:12 - 1:14
    how this environment works, quick recap.
  • 1:14 - 1:17
    So, the goal is to get the learning agent
  • 1:17 - 1:19
    to figure out how to get to the goal on
  • 1:19 - 1:22
    the bottom right. The actions that the
  • 1:22 - 1:24
    agent can take are left. Internally, we're
  • 1:24 - 1:28
    going to represent that as zero, down is
  • 1:28 - 1:31
    one, right is two, up is three. In this
  • 1:31 - 1:34
    case, if the agent tries to go left or up,
  • 1:34 - 1:37
    it's just going to stay in place. So, in
  • 1:37 - 1:39
    general, if it tries to go off the grid,
  • 1:39 - 1:40
    it's going to stay in the same spot.
  • 1:40 - 1:42
    Internally, we're going to represent each
  • 1:42 - 1:45
    state as: this one is 0, this one is 1,
  • 1:45 - 1:46
    2, 3,
  • 1:46 - 1:51
    4, 5, 6, all the way to
  • 1:51 - 1:55
    14, 15. So, 16 states total. Each attempt at
  • 1:55 - 1:57
    navigating the map is considered one
  • 1:57 - 2:00
    episode. The episode ends if the
  • 2:00 - 2:04
    agent falls into one of these holes or
  • 2:04 - 2:06
    it reaches the goal. When it reaches the
  • 2:06 - 2:10
    goal, it will receive a reward of one. All
  • 2:10 - 2:13
    other states have no reward or penalty.
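
A minimal sketch of setting this environment up in code, assuming Gymnasium's FrozenLake-v1 (the is_slippery flag shown here is the twist discussed a little further down):

    import gymnasium as gym

    # 4x4 Frozen Lake; is_slippery is covered below
    env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)

    state, _ = env.reset()                                     # agent starts at state 0
    new_state, reward, terminated, truncated, _ = env.step(2)  # action 2 = go right
    env.close()
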
  • 2:13 - 2:17
    Okay, so that's how this environment
  • 2:17 - 2:19
    works. At this point, you might be
  • 2:19 - 2:20
    wondering why we would need
  • 2:20 - 2:23
    reinforcement learning to solve such a
  • 2:23 - 2:25
    simple map. Why not just use a
  • 2:25 - 2:27
    pathfinding algorithm? There is a twist
  • 2:27 - 2:29
    to this environment. There is a flag
  • 2:29 - 2:32
    called is_slippery.
  • 2:32 - 2:34
    When set to
  • 2:34 - 2:37
    true, the agent doesn't always execute
  • 2:37 - 2:40
    the action that it intends to. For
  • 2:40 - 2:42
    example, if the agent wants to go right,
  • 2:42 - 2:44
    there's only a one-third chance that it will
  • 2:44 - 2:47
    execute this action, and there's a two-thirds
  • 2:47 - 2:50
    chance of it going in an adjacent
  • 2:50 - 2:53
    direction. Now, with slippery turned on,
  • 2:53 - 2:55
    a pathfinding algorithm will not be able
  • 2:55 - 2:58
    to solve this. In fact, I'm not even sure
  • 2:58 - 3:00
    what algorithm could. That's where
  • 3:00 - 3:02
    reinforcement learning comes in. If I
  • 3:02 - 3:04
    don't know how to solve this, I can just
  • 3:04 - 3:08
    give the agent some incentive and let it
  • 3:08 - 3:10
    figure out how to solve this. So, that's
  • 3:10 - 3:13
    where reinforcement learning
  • 3:13 - 3:16
    shines. In terms of how the agent
  • 3:16 - 3:18
    navigates the map, both Deep Q-Learning
  • 3:18 - 3:20
    and Q-learning use the epsilon-greedy
  • 3:20 - 3:23
    algorithm. Basically, we start with a
  • 3:23 - 3:26
    variable called epsilon set equal
  • 3:26 - 3:28
    to one, and then we generate a
  • 3:28 - 3:31
    random number. If the random number is
  • 3:31 - 3:33
    less than epsilon, we pick a random
  • 3:33 - 3:36
    action. Otherwise, we pick the best action
  • 3:36 - 3:39
    that we know of at the moment. And at the
  • 3:39 - 3:41
    end of each episode, we'll decrease
  • 3:41 - 3:44
    epsilon by a little bit at a time. So,
  • 3:44 - 3:47
    essentially, we start off with 100%
  • 3:47 - 3:50
    random exploration, and then, eventually,
  • 3:50 - 3:52
    near the end of training, we're going to
  • 3:52 - 3:56
    be always selecting the best action to
  • 3:57 - 4:00
    take. Before we jump into the details of
  • 4:00 - 4:02
    how the training works between Q-learning
  • 4:02 - 4:04
    versus Deep Q-Learning, let's
  • 4:04 - 4:06
    take a look at what the outcome looks
  • 4:06 - 4:09
    like. For Q-learning, the output
  • 4:09 - 4:11
    of the training is a Q-table,
  • 4:11 - 4:13
    which is nothing more than a
  • 4:13 - 4:17
    two-dimensional array consisting of 16
  • 4:17 - 4:20
    states by four actions. After training,
  • 4:20 - 4:22
    the whole table is going to be filled
  • 4:22 - 4:26
    with Q-values. For example, it might look
  • 4:26 - 4:28
    something like this.
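
As a rough sketch of that structure (the numbers here are made up for illustration, not taken from the video):

    import numpy as np

    q_table = np.zeros((16, 4))               # 16 states x 4 actions, initialized to zero
    q_table[0] = [0.50, 0.52, 0.59, 0.50]     # made-up Q-values for state 0: left, down, right, up
    best_action = int(np.argmax(q_table[0]))  # 2, i.e. go right
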
  • 4:28 - 4:32
    So, in this case, the agent is at
  • 4:32 - 4:34
    state zero. We look at the table and see
  • 4:34 - 4:37
    what the maximum value is. In this case,
  • 4:37 - 4:40
    it's to go right. So, this is the
  • 4:40 - 4:43
    prescribed action. Over at Deep Q-Learning,
  • 4:43 - 4:46
    the output is a Deep Q-network,
  • 4:46 - 4:48
    which is actually nothing more than a
  • 4:48 - 4:52
    regular feed-forward neural network. This is
  • 4:52 - 4:54
    actually what it looks like, but I think
  • 4:54 - 4:56
    we should switch back to the simplified
  • 4:56 - 5:00
    view. The input layer is going to have 16
  • 5:00 - 5:04
    nodes. The output layer is going to have
  • 5:04 - 5:05
    4
  • 5:05 - 5:09
    nodes. The way that we send an input into
  • 5:09 - 5:12
    the input layer is like this: if the
  • 5:12 - 5:14
    agent is at state zero, we're going to
  • 5:14 - 5:18
    set the first node to one and everything
  • 5:18 - 5:20
    else to
  • 5:20 - 5:25
    zero. If the agent is at state one, then
  • 5:25 - 5:27
    the first node is zero and the second node
  • 5:27 - 5:31
    is one, and everything else is zero. So,
  • 5:31 - 5:33
    this is called one-hot encoding. Just
  • 5:33 - 5:36
    one more--if it's at the
  • 5:36 - 5:41
    last... If we want to put in state 15, then
  • 5:41 - 5:44
    everything is all zeros, and state 15 is one.
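
A minimal sketch of that one-hot encoding in PyTorch (the helper name is just illustrative):

    import torch

    def one_hot_state(state, num_states=16):
        # All zeros except a single one at the agent's current state.
        encoded = torch.zeros(num_states)
        encoded[state] = 1.0
        return encoded

    one_hot_state(0)   # tensor([1., 0., 0., ..., 0.])
    one_hot_state(15)  # tensor([0., 0., ..., 0., 1.])
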
  • 5:44 - 5:47
    So, that's how the input works. With the
  • 5:47 - 5:50
    input, the output that gets calculated
  • 5:50 - 5:53
    are Q-values. Now, the Q-values are not
  • 5:53 - 5:55
    going to look like what's in the Q-table,
  • 5:55 - 5:58
    but it might be something similar. Let me
  • 5:58 - 6:00
    throw some numbers in here.
  • 6:02 - 6:05
    Okay, so same thing--for this particular
  • 6:05 - 6:08
    input, the best action is the highest Q-value.
  • 6:08 - 6:11
    When training a neural network,
  • 6:11 - 6:14
    essentially, we're trying to train the
  • 6:14 - 6:16
    weights associated with each one of
  • 6:16 - 6:20
    these lines in the neural network and
  • 6:20 - 6:24
    the bias for all the hidden layers. Now,
  • 6:24 - 6:26
    how do we know how many hidden layers we
  • 6:26 - 6:29
    need? In this case, one layer was enough,
  • 6:29 - 6:31
    but you can certainly add more if
  • 6:31 - 6:34
    necessary. And how many nodes do we need
  • 6:34 - 6:37
    in the hidden layer? I tried 16, and it's
  • 6:37 - 6:40
    able to solve the map, so I'm sticking
  • 6:40 - 6:42
    with that. But you can certainly increase
  • 6:42 - 6:44
    or decrease this number to see what
  • 6:44 - 6:47
    happens. Okay, so that's the difference
  • 6:47 - 6:49
    between the output of Q-learning versus
  • 6:49 - 6:50
    Deep Q-Learning.
  • 6:53 - 6:55
    As the agent navigates the map,
  • 6:55 - 6:58
    it is using the Q-learning formula to
  • 6:58 - 7:01
    calculate the Q-values and update the Q-table.
  • 7:01 - 7:03
    Now, the formula might look a
  • 7:03 - 7:04
    little bit scary, but it's actually not
  • 7:04 - 7:07
    that bad. Let's work through some
  • 7:07 - 7:10
    examples. Let's say our agent is in state
  • 7:10 - 7:14
    14 and it's going right to get to state
  • 7:14 - 7:19
    15. We're calculating Q of state 14,
  • 7:19 - 7:24
    action 2, which is this cell. The current
  • 7:24 - 7:27
    value of that cell is zero because we
  • 7:27 - 7:30
    initialized the whole Q-table to zero at
  • 7:30 - 7:33
    the beginning, plus the learning rate. The
  • 7:33 - 7:35
    learning rate is a hyperparameter that
  • 7:35 - 7:38
    we can set. As an example, I'll just put
  • 7:38 - 7:40
    in 0.01
  • 7:40 - 7:45
    times the reward. We get a reward of one
  • 7:45 - 7:47
    because we reached the goal, plus the
  • 7:47 - 7:50
    discount factor, another parameter that
  • 7:50 - 7:55
    we set. I'll just use 0.9 times the max Q-value
  • 7:55 - 7:58
    of the new state. So, the max Q-value
  • 7:58 - 8:00
    of state 15.
  • 8:01 - 8:04
    Since state 15 is a terminal state, it
  • 8:04 - 8:06
    will never get anything other than zeros
  • 8:06 - 8:07
    in this table.
  • 8:07 - 8:11
    The max is zero. Essentially, these two
  • 8:11 - 8:15
    are gone, subtracted by the same thing
  • 8:15 - 8:17
    that we had here.
  • 8:18 - 8:20
    Okay, so work through the math. This
  • 8:20 - 8:23
    is just 0.01,
  • 8:23 - 8:26
    so we get 0.01 here.
  • 8:26 - 8:29
    Now, let's do another one really
  • 8:29 - 8:36
    quickly. The agent is here, takes a right, Q of
  • 8:36 - 8:42
    13, going to the right. 13 is also all
  • 8:42 - 8:44
    zeros. This is the one we're
  • 8:44 - 8:48
    updating. Here is zero, plus the learning
  • 8:48 - 8:53
    rate times the reward. There is no reward,
  • 8:53 - 8:57
    plus the discount factor times max of
  • 8:57 - 9:01
    the new state, max of state
  • 9:01 - 9:05
    14. Max of state 14 is
  • 9:06 - 9:11
    0.01, subtracted by, again, zero.
  • 9:11 - 9:16
    This is equal to 0.00009.
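
The two updates worked out above, written as plain arithmetic with the example learning rate of 0.01 and discount factor of 0.9:

    learning_rate = 0.01
    discount_factor = 0.9

    # Q(14, right): reward of 1, and the max Q-value of terminal state 15 is 0
    q_14_right = 0 + learning_rate * (1 + discount_factor * 0 - 0)     # = 0.01

    # Q(13, right): no reward, and the max Q-value of state 14 is now 0.01
    q_13_right = 0 + learning_rate * (0 + discount_factor * 0.01 - 0)  # = 0.00009
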
  • 9:16 - 9:18
    Okay, it's actually pretty
  • 9:18 - 9:20
    straightforward. Now, how the heck does
  • 9:20 - 9:23
    this formula help find a path? So, if we
  • 9:23 - 9:28
    train enough, the theory is that this
  • 9:28 - 9:32
    number is going to be really close to
  • 9:32 - 9:35
    one, and then the states next to it
  • 9:35 - 9:40
    are going to be 0.9 something,
  • 9:41 - 9:43
    and then the states adjacent
  • 9:43 - 9:48
    to that are probably going to be 0.8 something.
  • 9:51 - 9:53
    And if we keep going,
  • 9:57 - 10:01
    we can see that a path, or
  • 10:01 - 10:04
    actually two paths,
  • 10:04 - 10:07
    are possible. So, mathematically,
  • 10:07 - 10:10
    this is how the path is found.
  • 10:13 - 10:15
    Over at Deep Q-Learning, the
  • 10:15 - 10:17
    formula is going to look like this: We
  • 10:17 - 10:20
    set Q equal to the reward if the new
  • 10:20 - 10:24
    state is a terminal state. Otherwise, we
  • 10:24 - 10:25
    set it to
  • 10:25 - 10:28
    this part
  • 10:30 - 10:32
    of the Q-learning formula.
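
A small sketch of that rule (the function name and arguments are illustrative; new_state_q_values would be the target network's output for the new state):

    def dqn_target(reward, new_state_q_values, terminated, discount_factor=0.9):
        # Terminal new state: the target is just the reward.
        if terminated:
            return float(reward)
        # Otherwise: reward plus the discounted max Q-value of the new state.
        return reward + discount_factor * new_state_q_values.max().item()
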
  • 10:32 - 10:34
    Let's see how the formula is
  • 10:34 - 10:36
    going to be used in training.
  • 10:36 - 10:39
    For Deep Q-Learning, we actually
  • 10:39 - 10:42
    need two neural networks. Let me walk
  • 10:42 - 10:45
    through the steps. The network on the
  • 10:45 - 10:48
    left is called the policy network. This
  • 10:48 - 10:51
    is the network that we're going to do
  • 10:51 - 10:53
    the training on. The one on the right is
  • 10:53 - 10:56
    called the target network. The target
  • 10:56 - 10:58
    network is the one that makes use of the
  • 10:58 - 11:00
    Deep Q-Learning formula. Now, let's walk
  • 11:00 - 11:03
    through the steps of training. Step one
  • 11:03 - 11:07
    is going to be creation of this policy
  • 11:07 - 11:11
    network. Step two, we make a copy of the
  • 11:11 - 11:15
    policy network into the target network.
  • 11:15 - 11:17
    So, basically, we're copying the
  • 11:17 - 11:19
    weights and the bias over here. So both
  • 11:19 - 11:22
    networks are identical. Step number three:
  • 11:22 - 11:25
    the agent navigates the map as usual.
  • 11:25 - 11:29
    Let's say the agent is here in state 14
  • 11:30 - 11:34
    and it's going into state 15. Step three
  • 11:34 - 11:37
    is just navigation. Now, step four, we
  • 11:37 - 11:40
    input state 14. Remember how we have to
  • 11:40 - 11:42
    encode the input? It's going to look like
  • 11:42 - 11:48
    this: 0, 1, 2, 3,
  • 11:48 - 11:50
    12, 13,
  • 11:50 - 11:55
    state 14, 15. The input is going to
  • 11:55 - 11:57
    look like this. As you know, when a neural
  • 11:57 - 12:00
    network is created, it comes with
  • 12:00 - 12:04
    a random set of weights and biases. So, with
  • 12:04 - 12:05
    this input, we're actually going to get
  • 12:05 - 12:09
    an output, even though these values are
  • 12:09 - 12:10
    pretty much meaningless.
  • 12:10 - 12:13
    So, we might get some stuff
  • 12:13 - 12:15
    that looks like this.
  • 12:15 - 12:18
    I'm just putting in some random numbers.
  • 12:18 - 12:21
    Okay, so these Q-values are
  • 12:21 - 12:24
    meaningless, and as a reminder, this is
  • 12:24 - 12:28
    the left action, down, right, and up. Okay.
  • 12:28 - 12:30
    Step five:
  • 12:30 - 12:32
    we do the same thing. We take the exact
  • 12:32 - 12:38
    same input, state 14, and send it into the
  • 12:38 - 12:41
    target network, which will also
  • 12:41 - 12:44
    calculate the exact same numbers
  • 12:44 - 12:47
    because the target network is the same
  • 12:47 - 12:50
    as the policy network currently.
  • 12:50 - 12:53
    Step six: this is where we
  • 12:53 - 12:58
    calculate the Q-value for state 14.
  • 12:58 - 13:00
    We're taking the action of two.
  • 13:00 - 13:04
    It's equal to, since we're going into
  • 13:04 - 13:08
    state 15, it is a terminal state,
  • 13:08 - 13:10
    we set it equal to
  • 13:10 - 13:12
    one.
  • 13:12 - 13:17
    Step seven: Step seven is to set the target.
  • 13:17 - 13:22
    Input state 14, output action two
  • 13:22 - 13:23
    is this node.
  • 13:23 - 13:25
    We take the value that we
  • 13:25 - 13:29
    calculated in step six, and we replace
  • 13:29 - 13:33
    the Q-value in the output.
  • 13:33 - 13:36
    Step number eight: we take the target Q-values,
  • 13:36 - 13:39
    and we'll use it to train the policy network.
  • 13:39 - 13:42
    So, this value is the one that's
  • 13:42 - 13:44
    really going to change. As you know, with
  • 13:44 - 13:46
    neural networks, it doesn't go straight
  • 13:46 - 13:49
    to one. It's going to go toward that
  • 13:49 - 13:51
    direction, so maybe it'll go to
  • 13:51 - 13:54
    0.01, but if you repeat the training many,
  • 13:54 - 13:57
    many times, it will approach one.
  • 13:58 - 14:01
    Step nine is to repeat the whole thing again.
  • 14:03 - 14:05
    Of course, we're not going to
  • 14:05 - 14:07
    create the policy network again, nor are we going
  • 14:07 - 14:09
    to make a copy of it. So what we're
  • 14:09 - 14:13
    repeating is steps three through eight.
  • 14:14 - 14:17
    Step ten: after a certain number of steps or
  • 14:17 - 14:20
    episodes, we're going to sync the policy
  • 14:20 - 14:22
    network with the target network, which
  • 14:22 - 14:24
    means we're going to make them identical
  • 14:24 - 14:26
    by copying the weights and biases from
  • 14:26 - 14:29
    the policy over to the target network.
  • 14:29 - 14:31
    After syncing the networks, we're going
  • 14:31 - 14:36
    to repeat step nine again and then repeat step ten
  • 14:36 - 14:39
    until training is done. Okay, so that's
  • 14:39 - 14:42
    generally how Deep Q-Learning works.
  • 14:42 - 14:43
    That might have been a little bit
  • 14:43 - 14:45
    complicated, so maybe rewind the video,
  • 14:45 - 14:47
    watch it again, and make sure you
  • 14:47 - 14:49
    understand what's happening.
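
As a minimal sketch of steps one and two, using a plain feed-forward network with the layer sizes described earlier (an illustration, not the video's exact code; the same load_state_dict call is also how step ten's syncing can be done):

    import torch.nn as nn

    # step one: create the policy network (16 inputs -> 16 hidden -> 4 outputs)
    policy_net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

    # step two: create the target network and copy the weights and biases over
    target_net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
    target_net.load_state_dict(policy_net.state_dict())
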
  • 14:50 - 14:52
    To effectively train a neural
  • 14:52 - 14:54
    network, we need to randomize the
  • 14:54 - 14:57
    training data that we send into the
  • 14:57 - 14:59
    neural network. However, if you remember
  • 14:59 - 15:02
    from steps three and four, the agent
  • 15:02 - 15:05
    takes an action, and then we're sending
  • 15:05 - 15:07
    that training data into the neural network.
  • 15:07 - 15:09
    So the question is, how do we
  • 15:09 - 15:12
    randomize the order of a single sample?
  • 15:12 - 15:14
    That's where experience replay comes
  • 15:14 - 15:19
    in. We need a step 3a where we memorize
  • 15:19 - 15:21
    the agent's experience. As the agent
  • 15:21 - 15:24
    navigates the map, we store the state
  • 15:24 - 15:26
    that it was in, what action it took, the
  • 15:26 - 15:28
    new state that it reached, if there was a
  • 15:28 - 15:30
    reward or not, and if the new state
  • 15:30 - 15:33
    is a terminal state or not. We take that
  • 15:33 - 15:36
    and insert it into the memory, and the
  • 15:36 - 15:39
    memory is nothing more than a Python
  • 15:39 - 15:41
    deque. How a deque works is that as the
  • 15:41 - 15:46
    deque gets full, whatever is oldest gets
  • 15:46 - 15:50
    purged. We need a step 3b. This is the
  • 15:50 - 15:54
    replay step. This is where we take, say, 30
  • 15:54 - 15:57
    random samples from the deque and then
  • 15:57 - 16:00
    pass it on to step four for training. So
  • 16:00 - 16:02
    that's what experience replay is.
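
A tiny demo of the deque behavior and the random sampling described above (the transition strings are placeholders):

    import random
    from collections import deque

    memory = deque(maxlen=3)                  # tiny maxlen just for the demo
    for transition in ["t1", "t2", "t3", "t4"]:
        memory.append(transition)
    print(memory)                             # deque(['t2', 't3', 't4'], maxlen=3): 't1' was purged
    batch = random.sample(memory, 2)          # a random mini-batch for replay
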
  • 16:05 - 16:07
    Before we jump into the code, if you
  • 16:07 - 16:09
    have problems installing Gymnasium,
  • 16:09 - 16:12
    especially on Windows, I've got a video for
  • 16:12 - 16:15
    that. I'm not going to walk through the Q-learning
  • 16:15 - 16:17
    code, but if you are interested
  • 16:17 - 16:19
    in that, jump to my Q-learning code
  • 16:19 - 16:21
    walkthrough video after watching this
  • 16:21 - 16:24
    one. Also, if you have zero experience
  • 16:24 - 16:26
    with neural networks, you can check out
  • 16:26 - 16:28
    this basic tutorial on neural networks.
  • 16:28 - 16:32
    You also need PyTorch, so head over to PyTorch.org
  • 16:32 - 16:34
    and get that installed. The first
  • 16:34 - 16:36
    thing that we do is create a class to
  • 16:36 - 16:39
    represent our Deep Q-network. I'm calling
  • 16:39 - 16:43
    it DQN. As mentioned before, a Deep Q-network
  • 16:43 - 16:45
    is nothing more than a feed-forward
  • 16:45 - 16:47
    neural network, so there's really
  • 16:47 - 16:49
    nothing special about it. Here, I'm using
  • 16:49 - 16:51
    a pretty standard way of creating a neural
  • 16:51 - 16:55
    network using PyTorch. If you look up any
  • 16:55 - 16:58
    PyTorch tutorial, you'll probably see
  • 16:58 - 17:00
    something just like this. So since
  • 17:00 - 17:03
    this is not a PyTorch tutorial, I'm not
  • 17:03 - 17:05
    going to spend too much time explaining
  • 17:05 - 17:07
    how PyTorch works. For this class, we
  • 17:07 - 17:09
    have to inherit the neural network
  • 17:09 - 17:12
    module, which requires us to implement
  • 17:12 - 17:15
    two functions: the __init__ function and the
  • 17:15 - 17:17
    forward function. In the __init__ function,
  • 17:17 - 17:20
    I'm passing in the number of nodes in my
  • 17:20 - 17:24
    input state, hidden layer, and output
  • 17:24 - 17:26
    state. Back at our diagram, we're going to
  • 17:26 - 17:30
    have 16 input states. As mentioned before,
  • 17:30 - 17:32
    I'm using 16 in the hidden layer. That's
  • 17:32 - 17:34
    something that you can adjust yourself.
  • 17:34 - 17:36
    And in the output layer, four nodes.
  • 17:36 - 17:40
    We declare the hidden layer, which
  • 17:40 - 17:43
    has 16 going into 16, and then the output
  • 17:43 - 17:47
    layer, 16 going into four. And then the
  • 17:47 - 17:50
    forward function, x is the training dataset,
  • 17:50 - 17:51
    and we're sending the training data
  • 17:51 - 17:53
    through the neural network. Again, this is
  • 17:53 - 17:56
    pretty common PyTorch code.
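
A sketch of such a class along the lines just described (parameter names are illustrative, not quoted from the video):

    import torch.nn as nn
    import torch.nn.functional as F

    class DQN(nn.Module):
        def __init__(self, in_states=16, h1_nodes=16, out_actions=4):
            super().__init__()
            self.fc1 = nn.Linear(in_states, h1_nodes)    # hidden layer: 16 in, 16 out
            self.out = nn.Linear(h1_nodes, out_actions)  # output layer: 16 in, 4 out

        def forward(self, x):
            x = F.relu(self.fc1(x))  # send the input through the hidden layer with ReLU
            return self.out(x)       # raw Q-values, one per action
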
  • 17:58 - 18:00
    Next, we need a class to represent
  • 18:00 - 18:02
    the replay memory.
  • 18:02 - 18:05
    So, it's this portion that we're
  • 18:05 - 18:07
    implementing right now.
  • 18:09 - 18:12
    In the __init__ function, we'll pass in a
  • 18:12 - 18:14
    max length and then create the Python
  • 18:14 - 18:17
    deque. In the append function, we're going
  • 18:17 - 18:21
    to append the transition. The transition
  • 18:21 - 18:24
    is this tuple here: state, action, new state,
  • 18:24 - 18:27
    reward, and terminated.
  • 18:27 - 18:30
    The sample function will
  • 18:30 - 18:33
    return a random sample of whatever size
  • 18:33 - 18:35
    we want from the memory, and then the
  • 18:35 - 18:37
    len function simply returns the length
  • 18:37 - 18:39
    of the memory.
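
A sketch of a replay memory along those lines (an approximation of what's described, not a verbatim copy):

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, maxlen):
            self.memory = deque(maxlen=maxlen)

        def append(self, transition):
            # transition = (state, action, new_state, reward, terminated)
            self.memory.append(transition)

        def sample(self, sample_size):
            # Return sample_size random transitions from the memory.
            return random.sample(self.memory, sample_size)

        def __len__(self):
            return len(self.memory)
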
  • 18:40 - 18:42
    The Frozen Lake DQL class is
  • 18:42 - 18:44
    where we're going to do our training.
  • 18:44 - 18:47
    We set the learning rate and
  • 18:47 - 18:50
    the discount factor. Those are part of
  • 18:50 - 18:53
    the Q-learning formula, and these are
  • 18:53 - 18:55
    values that you can adjust and play
  • 18:55 - 18:56
    around with.
  • 18:56 - 18:59
    The network sync rate is the
  • 18:59 - 19:01
    number of steps the agent takes before
  • 19:01 - 19:05
    syncing the policy and target networks.
  • 19:05 - 19:08
    That's the setting for step ten,
  • 19:08 - 19:13
    where we sync the policy and the target network.
  • 19:13 - 19:17
    We set the replay memory size to
  • 19:17 - 19:22
    1,000 and the replay memory sample size
  • 19:22 - 19:24
    to 32. These are also numbers that you
  • 19:24 - 19:28
    can play around with. Next is the loss
  • 19:28 - 19:31
    function and the optimizer. These two are
  • 19:31 - 19:34
    PyTorch variables. For the loss function,
  • 19:34 - 19:36
    I'm simply using the mean squared error
  • 19:36 - 19:38
    function. For the optimizer, we'll
  • 19:38 - 19:41
    initialize that at a later time. This
  • 19:41 - 19:44
    actions list is to simply map the action
  • 19:44 - 19:49
    numbers into letters for printing.
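
One way those settings might be declared as class attributes; the learning rate, discount factor, and sync rate values below are purely illustrative, while the memory size of 1,000 and sample size of 32 come from the walkthrough:

    import torch.nn as nn

    class FrozenLakeDQL:
        learning_rate = 0.001           # illustrative; adjust and experiment
        discount_factor = 0.9           # gamma in the Q-learning formula
        network_sync_rate = 10          # steps between syncing policy and target networks (illustrative)
        replay_memory_size = 1000       # size of the replay memory
        mini_batch_size = 32            # size of each training sample drawn from memory
        loss_fn = nn.MSELoss()          # mean squared error loss
        optimizer = None                # initialized later, once the policy network exists
        ACTIONS = ['L', 'D', 'R', 'U']  # map action numbers to letters for printing
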
  • 19:50 - 19:52
    In the train function, we can
  • 19:52 - 19:54
    specify how many episodes we want to
  • 19:54 - 19:57
    train the agent. Whether we want to
  • 19:57 - 20:00
    render the map on screen, and whether we
  • 20:00 - 20:02
    want to turn on the slippery flag.
  • 20:02 - 20:06
    We'll instantiate the Frozen Lake
  • 20:06 - 20:09
    environment, and then create a variable
  • 20:09 - 20:12
    to store the number of states. This is
  • 20:12 - 20:15
    going to be 16 and store the number of
  • 20:15 - 20:18
    actions. This is going to be four. Here, we
  • 20:18 - 20:21
    initialize epsilon to one and we
  • 20:21 - 20:24
    instantiate the replay memory. This is
  • 20:24 - 20:27
    going to be size of 1,000. Now, we
  • 20:27 - 20:30
    create the policy DQ network.
  • 20:32 - 20:34
    This is step one: create the
  • 20:34 - 20:36
    policy network.
  • 20:36 - 20:40
    We also create a target network
  • 20:40 - 20:43
    and copy the weights and bias from the
  • 20:43 - 20:46
    policy network into the target network
  • 20:46 - 20:48
    so that
  • 20:48 - 20:52
    is step number two: making the target and
  • 20:52 - 20:53
    policy network identical.
  • 20:55 - 20:56
    Before training, we'll do a
  • 20:56 - 20:58
    print out of the policy network just so
  • 20:58 - 21:02
    we can compare it to the end result. Next,
  • 21:02 - 21:04
    we initialize the optimizer that we
  • 21:04 - 21:05
    declared earlier,
  • 21:05 - 21:09
    and we're simply using the Adam
  • 21:09 - 21:12
    optimizer, passing in the learning rate.
  • 21:12 - 21:15
    We'll use this rewards per episode list
  • 21:15 - 21:18
    to keep track of the rewards collected
  • 21:18 - 21:18
    per episode.
  • 21:18 - 21:21
    We'll also use this epsilon
  • 21:21 - 21:24
    history list to keep track of epsilon
  • 21:24 - 21:25
    decaying over time.
  • 21:27 - 21:29
    We'll use step count to keep track
  • 21:29 - 21:31
    of the number of steps taken. This is
  • 21:31 - 21:35
    used to determine when to sync the
  • 21:35 - 21:39
    policy and target network again. So that
  • 21:39 - 21:41
    is for step ten: the syncing.
  • 21:41 - 21:45
    Next, we will loop through the
  • 21:45 - 21:47
    number of episodes that we've specified
  • 21:47 - 21:49
    when calling the train function. We'll
  • 21:49 - 21:52
    initialize the map, so the agent starts
  • 21:52 - 21:55
    at state zero. We'll initialize the
  • 21:55 - 21:58
    terminated and truncated flag. Terminated
  • 21:58 - 21:59
    is when the agent falls into the
  • 21:59 - 22:02
    hole or reaches the goal. Truncated is
  • 22:02 - 22:05
    when the agent takes more than 200
  • 22:05 - 22:08
    actions. On such a small 4x4 map, this
  • 22:08 - 22:10
    is probably never going to occur, but
  • 22:10 - 22:12
    we'll keep this anyway. Now, we have a
  • 22:12 - 22:14
    while loop checking for the terminated
  • 22:14 - 22:18
    and truncated flag. Keep looping until
  • 22:18 - 22:20
    these conditions are met. This part is
  • 22:20 - 22:23
    basically the epsilon-greedy algorithm.
  • 22:23 - 22:27
    If a random number is less than epsilon,
  • 22:27 - 22:29
    we'll just pick a random action.
  • 22:29 - 22:31
    Otherwise, we'll use the policy network
  • 22:31 - 22:34
    to calculate the set of Q-values, take
  • 22:34 - 22:37
    the maximum of that, extract that item,
  • 22:37 - 22:39
    and that's going to be the best action.
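
A sketch of that selection logic, pulled out into a hypothetical helper (policy_dqn and env are assumed to come from the surrounding training loop; the encoding step inside it is what the state_to_dqn_input function mentioned next handles in the actual code):

    import random
    import torch

    def select_action(state, epsilon, policy_dqn, env, num_states=16):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit the policy network.
        if random.random() < epsilon:
            return env.action_space.sample()      # random action
        with torch.no_grad():                     # prediction only, no gradient tracking
            encoded = torch.zeros(num_states)
            encoded[state] = 1.0                  # one-hot encode the state
            q_values = policy_dqn(encoded)
        return int(q_values.argmax().item())      # best known action
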
  • 22:39 - 22:41
    So, we have a function here called
  • 22:41 - 22:46
    state_to_dqn_input. Let's jump down there.
  • 22:46 - 22:48
    Remember that we have to take the state
  • 22:48 - 22:52
    and encode it. That's what this function does.
  • 22:52 - 22:55
    This type of encoding here.
  • 22:57 - 23:00
    Okay, let's go back.
  • 23:00 - 23:02
    With PyTorch, if we want to perform a
  • 23:02 - 23:05
    prediction, we should call this torch, no_
  • 23:05 - 23:07
    grad, so that it doesn't calculate the
  • 23:07 - 23:09
    stuff needed for training. After
  • 23:09 - 23:12
    selecting the either a random action or
  • 23:12 - 23:14
    the best action, we'll call the step
  • 23:14 - 23:17
    function to execute the action. When we
  • 23:17 - 23:20
    take that action, it's going to return
  • 23:20 - 23:22
    the new state, whether there is a reward
  • 23:22 - 23:25
    or not, whether it's a terminal state, or
  • 23:25 - 23:27
    whether we got truncated. We take all that
  • 23:27 - 23:29
    information
  • 23:29 - 23:33
    and put it into our memory. So that is step
  • 23:33 - 23:38
    3a. When we did the epsilon-greedy,
  • 23:38 - 23:40
    that was step three.
  • 23:43 - 23:45
    After executing the action, we're
  • 23:45 - 23:47
    resetting the state equal to the new
  • 23:47 - 23:50
    state. We increment our step counter. If
  • 23:50 - 23:54
    we received a reward, we put it on our list.
  • 23:54 - 23:56
    Now, we'll check the memory to see if we
  • 23:56 - 23:58
    have enough training data to do
  • 23:58 - 24:02
    optimization on. Also, we want to check if
  • 24:02 - 24:05
    we have collected at least one reward. If
  • 24:05 - 24:07
    we haven't collected any rewards, there's
  • 24:07 - 24:09
    really no point in optimizing the
  • 24:09 - 24:12
    network. If those conditions are met, we
  • 24:12 - 24:15
    use memory.sample and we pass in the
  • 24:15 - 24:18
    batch size, which was 32. And we get a
  • 24:18 - 24:21
    batch of training data out of the memory.
  • 24:21 - 24:23
    We'll pass that training data into the
  • 24:23 - 24:26
    optimized function, along with the policy
  • 24:26 - 24:29
    network and the target network.
  • 24:29 - 24:32
    Let's jump over to the optimize function.
  • 24:32 - 24:35
    This first line is just looking
  • 24:35 - 24:37
    at the policy network and getting the
  • 24:37 - 24:39
    number of input nodes. We expect this to
  • 24:39 - 24:44
    be 16. Next come the current_q_list and the target_q_list.
  • 24:46 - 24:47
    So this is the current Q-list.
  • 24:47 - 24:50
    This is the target Q-list.
  • 24:52 - 24:55
    What I'm about to describe now is
  • 24:55 - 24:58
    step 3b: replaying the experience.
  • 25:00 - 25:02
    So, to replay the experience, we're
  • 25:02 - 25:05
    looping through the training data inside
  • 25:05 - 25:08
    the mini batch. Let's jump down here for
  • 25:08 - 25:12
    a moment. So this is step number four.
  • 25:12 - 25:14
    We're taking the states and passing it
  • 25:14 - 25:17
    into the policy network to calculate the
  • 25:17 - 25:19
    current list of Q-values.
  • 25:19 - 25:23
    Step number four: the output is
  • 25:23 - 25:25
    the list of Q-values.
  • 25:25 - 25:28
    Step number five: we pass in the
  • 25:28 - 25:31
    same thing to the target network.
  • 25:31 - 25:35
    Step five: we get the Q-values
  • 25:35 - 25:38
    out here, which should be the same as the
  • 25:38 - 25:40
    Q-values from the policy network.
  • 25:40 - 25:44
    Step number six is up here. So, a
  • 25:44 - 25:47
    reminder of what it looks like. Step six
  • 25:47 - 25:50
    is using this formula here.
  • 25:50 - 25:53
    So that's what we have here: if
  • 25:53 - 25:56
    terminated, just return the
  • 25:56 - 25:58
    reward. Otherwise, use that second
  • 25:58 - 26:01
    formula. So, now that we have a target,
  • 26:01 - 26:04
    we'll go to step seven. Step seven is
  • 26:04 - 26:06
    taking the output of step six and
  • 26:06 - 26:09
    replacing the respective Q-value.
  • 26:09 - 26:12
    And that's what we're doing here.
  • 26:12 - 26:15
    We're replacing the Q-value of that
  • 26:15 - 26:17
    particular action with the target that
  • 26:17 - 26:19
    was calculated up and above.
  • 26:19 - 26:23
    Step number eight: step eight is to
  • 26:23 - 26:25
    take the target values and use that to
  • 26:25 - 26:27
    train the current Q-values.
  • 26:30 - 26:32
    So that's what we have here. We're using
  • 26:32 - 26:35
    the loss function, pass in the current
  • 26:35 - 26:39
    set of Q-values and the target set of Q-values,
  • 26:39 - 26:42
    and then this is just standard PyTorch
  • 26:42 - 26:44
    code to optimize the policy
  • 26:44 - 26:49
    network. Now, step nine: step nine is to
  • 26:49 - 26:52
    repeat steps three to eight, starting from
  • 26:52 - 26:55
    navigation to optimizing the network.
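
A sketch of an optimize step along the lines of steps four through eight (an approximation with illustrative helper names, not the video's exact code):

    import torch

    def one_hot(state, num_states):
        # Encode a state index as the one-hot input vector the networks expect.
        encoded = torch.zeros(num_states)
        encoded[state] = 1.0
        return encoded

    def optimize(mini_batch, policy_dqn, target_dqn, loss_fn, optimizer,
                 num_states=16, discount_factor=0.9):
        current_q_list = []
        target_q_list = []

        for state, action, new_state, reward, terminated in mini_batch:
            # Step six: the Deep Q-Learning target -- just the reward for a terminal
            # new state, otherwise reward + discount * max Q-value of the new state.
            if terminated:
                target = float(reward)
            else:
                with torch.no_grad():
                    target = reward + discount_factor * \
                        target_dqn(one_hot(new_state, num_states)).max().item()

            # Step four: current Q-values from the policy network.
            current_q_list.append(policy_dqn(one_hot(state, num_states)))

            # Steps five and seven: Q-values from the target network, with the taken
            # action's value replaced by the target computed above.
            with torch.no_grad():
                target_q = target_dqn(one_hot(state, num_states))
            target_q[action] = target
            target_q_list.append(target_q)

        # Step eight: train the policy network toward the target Q-values.
        loss = loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
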
  • 26:59 - 27:01
    Basically, that's just continuing this
  • 27:01 - 27:04
    inner while loop of stepping through the
  • 27:04 - 27:06
    states, and this outer for loop of
  • 27:06 - 27:09
    stepping through each episode.
  • 27:09 - 27:12
    Step ten is where we sync the
  • 27:12 - 27:15
    policy network and the target network.
  • 27:18 - 27:20
    And we do that down here if the
  • 27:20 - 27:23
    number of steps taken is greater than
  • 27:23 - 27:25
    the network sync rate that we set, then
  • 27:25 - 27:28
    we copy the policy network into the
  • 27:28 - 27:30
    target network, and then we reset the
  • 27:30 - 27:33
    step counter. Also, after each episode, we
  • 27:33 - 27:37
    should be decaying the epsilon value.
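
A sketch of that sync check and the epsilon decay, assuming step_count, network_sync_rate, epsilon, and episodes come from the surrounding training loop (the linear decay schedule here is one simple choice, not necessarily the video's exact one):

    # step ten: sync the target network to the policy network every so often
    if step_count > network_sync_rate:
        target_dqn.load_state_dict(policy_dqn.state_dict())
        step_count = 0

    # decay epsilon a little after each episode, from 1 down toward 0
    epsilon = max(epsilon - 1 / episodes, 0)
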
  • 27:37 - 27:39
    After syncing the networks,
  • 27:39 - 27:43
    we repeat step nine again, which is
  • 27:43 - 27:45
    repeating, basically, the navigation and
  • 27:45 - 27:47
    training again.
  • 27:47 - 27:50
    So, it's just going to go back up
  • 27:50 - 27:53
    here and do this all over again, all the
  • 27:53 - 27:56
    way until we have finished the number of
  • 27:56 - 27:58
    episodes. So that was training. After
  • 27:58 - 28:01
    training, we close the environment. We can
  • 28:01 - 28:04
    save the policy or the weights and bias
  • 28:04 - 28:07
    into a file. I'm hardcoding it here. You
  • 28:07 - 28:10
    can definitely make this more dynamic.
  • 28:10 - 28:13
    Here, I'm creating a new graph and I'm
  • 28:13 - 28:16
    basically graphing the rewards collected
  • 28:16 - 28:18
    per episode. Also, I'm graphing the
  • 28:18 - 28:21
    epsilon history here. After graphing, I'm
  • 28:21 - 28:24
    saving the graphs into an image.
  • 28:24 - 28:27
    Alright, so that was the train function. We
  • 28:27 - 28:29
    have talked about the optimize
  • 28:29 - 28:31
    function. Let me fold that.
  • 28:31 - 28:35
    We already looked at that. Now, the
  • 28:35 - 28:38
    test function is going to run the Frozen
  • 28:38 - 28:40
    Lake environment with the policy that we
  • 28:40 - 28:43
    learned from the train function. We can
  • 28:43 - 28:46
    also pass in the number of episodes and
  • 28:46 - 28:47
    whether we want to turn on slippery or
  • 28:47 - 28:50
    not. So the rest of the code is going to
  • 28:50 - 28:52
    look pretty similar to what we had in
  • 28:52 - 28:55
    the train function. We instantiate the
  • 28:55 - 28:57
    environment, get the number of states and
  • 28:57 - 29:01
    actions, 16 and 4 here. We declare the
  • 29:01 - 29:04
    policy network and load it from the file that
  • 29:04 - 29:06
    we saved during training. This is just
  • 29:06 - 29:09
    PyTorch code to switch the policy network
  • 29:09 - 29:12
    to prediction or evaluation mode,
  • 29:12 - 29:14
    rather than training mode. We'll print
  • 29:14 - 29:17
    the trained policy, and then we'll loop
  • 29:17 - 29:20
    over the episodes. We set the agent up
  • 29:20 - 29:22
    top, you've seen this before, keep looping
  • 29:22 - 29:25
    until the agent gets terminated or
  • 29:25 - 29:28
    truncated. Here, we're selecting the best
  • 29:28 - 29:31
    action out of the policy network and
  • 29:31 - 29:33
    executing the action and then close the
  • 29:33 - 29:36
    environment. So, ideally, when we run this,
  • 29:36 - 29:38
    we're going to see the agent navigate
  • 29:38 - 29:40
    the map and reach the goal. However, if
  • 29:40 - 29:43
    slippery is on, there is no guarantee
  • 29:43 - 29:45
    that the agent is going to solve it in
  • 29:45 - 29:47
    one try. It might take a couple of tries
  • 29:47 - 29:50
    for the agent to solve the map.
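
A sketch of one test episode along those lines, assuming the DQN class sketched earlier and an illustrative file name for the saved weights:

    import gymnasium as gym
    import torch

    env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="human")

    policy_dqn = DQN(in_states=16, h1_nodes=16, out_actions=4)
    policy_dqn.load_state_dict(torch.load("frozen_lake_dql.pt"))  # weights saved after training
    policy_dqn.eval()                                             # evaluation mode, not training mode

    state, _ = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        with torch.no_grad():
            encoded = torch.zeros(16)
            encoded[state] = 1.0
            action = int(policy_dqn(encoded).argmax().item())     # always take the best action
        state, reward, terminated, truncated, _ = env.step(action)
    env.close()
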
  • 29:52 - 29:53
    Alright, down at the main function,
  • 29:53 - 29:56
    we create an instance of the Frozen Lake
  • 29:56 - 29:58
    DQL class. First, we're going to try
  • 29:58 - 30:01
    non-slippery. We'll train it for 1,000
  • 30:01 - 30:04
    times or 1,000 episodes. We'll run the
  • 30:04 - 30:06
    test four times just because the map
  • 30:06 - 30:08
    pops up and goes away when the agent
  • 30:08 - 30:11
    gets to the goal, so we want to run it
  • 30:11 - 30:12
    a couple of times so we can actually see it
  • 30:12 - 30:14
    on the screen. Okay? I'm just hitting
  • 30:14 - 30:16
    Ctrl+F5.
  • 30:20 - 30:22
    Alright, training is done, and it looks
  • 30:22 - 30:26
    like the agent is able to solve the map.
  • 30:26 - 30:28
    We also have a print out of the policy
  • 30:28 - 30:30
    that it's learned. I print it in a way
  • 30:30 - 30:33
    that matches up with the grid. Let's jump
  • 30:33 - 30:36
    back to this in a second. Let's go to our
  • 30:36 - 30:41
    files. So, after training, we have a new graph
  • 30:41 - 30:45
    on the left side. That's the number
  • 30:45 - 30:48
    of rewards collected over the 1,000
  • 30:48 - 30:51
    episodes. As you can see, over time it's
  • 30:51 - 30:54
    improving, and finally, at the end it's
  • 30:54 - 30:56
    getting a lot of rewards. On the right
  • 30:56 - 30:59
    side, we can see epsilon decaying,
  • 30:59 - 31:03
    starting from one slowly, slowly down to
  • 31:03 - 31:06
    zero. Also, another file that gets created
  • 31:06 - 31:09
    is a binary file. So we can't really
  • 31:09 - 31:11
    display it, but it contains the weights
  • 31:11 - 31:14
    and biases of the policy network.
  • 31:14 - 31:17
    So, if we want to do the test
  • 31:17 - 31:19
    again, we don't need the training. We can
  • 31:19 - 31:22
    comment out the training line and just
  • 31:22 - 31:24
    run this again.
  • 31:24 - 31:26
    And we can see the agent solve the
  • 31:26 - 31:27
    map.
  • 31:32 - 31:34
    We can look at what policy the agent
  • 31:34 - 31:37
    learned, the way this is printed, it
  • 31:37 - 31:41
    matches up with the map. So at state zero,
  • 31:41 - 31:44
    the best action was to go right, which leads to
  • 31:44 - 31:47
    state one. At state one, the best action
  • 31:47 - 31:51
    is to go right again. At state two,
  • 31:51 - 31:55
    go down. At state six, go down again. At state ten,
  • 31:55 - 31:58
    go down again. At state 14,
  • 31:58 - 31:59
    go to the right.
  • 31:59 - 32:00
    Okay.
  • 32:01 - 32:05
    So it was right, right, down, down,
  • 32:05 - 32:08
    down, right. So remember, earlier that
  • 32:08 - 32:10
    there were two possible paths, the one
  • 32:10 - 32:13
    that we actually learned plus the one
  • 32:13 - 32:16
    going down. Because epsilon-greedy has a
  • 32:16 - 32:18
    randomness factor, whether the agent
  • 32:18 - 32:21
    learns the top path or the bottom path
  • 32:21 - 32:24
    is just somewhat random. If we train
  • 32:24 - 32:27
    again, maybe it'll go down the next time.
  • 32:30 - 32:34
    Now, let's turn the slippery factor on,
  • 32:34 - 32:37
    and uncomment the training line.
  • 32:37 - 32:41
    Now with slippery on, we expect the
  • 32:41 - 32:44
    agent to fail a lot more often or to
  • 32:44 - 32:46
    fall in the holes a lot more often.
  • 32:46 - 32:50
    Let me just triple the number of training
  • 32:50 - 32:54
    episodes and run the test ten times.
  • 32:56 - 32:58
    Okay, I'm running this again.
  • 32:58 - 33:00
    Again, with slippery turned on, there's no
  • 33:00 - 33:02
    guarantee that after training, the agent
  • 33:02 - 33:05
    is going to be able to pass every episode.
  • 33:05 - 33:07
    Let's make it come up.
  • 33:09 - 33:11
    You see it trying to go to the bottom
  • 33:11 - 33:13
    right, but because of slippery, it just
  • 33:13 - 33:16
    failed right there. But there you go.
  • 33:16 - 33:19
    It was able to solve it. Let's look at the
  • 33:19 - 33:22
    graph. The results compared to the
  • 33:22 - 33:24
    non-slippery surface are significantly
  • 33:24 - 33:27
    worse. Also, you want to be careful about
  • 33:27 - 33:30
    getting stuck in spots like these. If
  • 33:30 - 33:34
    I was unlucky enough to set my
  • 33:34 - 33:37
    training episodes to maybe 2,200 and it
  • 33:37 - 33:40
    got stuck here, it would have learned a
  • 33:40 - 33:42
    bad policy and wouldn't be able to solve the
  • 33:42 - 33:44
    map. So, in your training, if you're
  • 33:44 - 33:46
    finding that the agent is not able to
  • 33:46 - 33:49
    solve the map when slippery is turned on,
  • 33:49 - 33:51
    it might be because of things like these.
  • 33:51 - 33:54
    Okay, that concludes our Deep Q-Learning
  • 33:54 - 33:57
    tutorial. I'd love to hear feedback from
  • 33:57 - 34:00
    you. Was the explanation easy to
  • 34:00 - 34:02
    understand? What can I improve on? And
  • 34:02 - 34:04
    what other topics are you interested in?