
Deep Q-Learning/Deep Q-Network (DQN) Explained | Python Pytorch Deep Reinforcement Learning

  • 0:00 - 0:03
    Hey everyone. Welcome to my Deep Q-Learning
  • 0:03 - 0:04
    tutorial. Here's what to expect
  • 0:04 - 0:07
    from this video. Since Deep Q-Learning is
  • 0:07 - 0:09
    a little bit complicated to explain, I'm
  • 0:09 - 0:11
    going to use the Frozen Lake reinforcement
  • 0:11 - 0:14
    learning environment. It's a very simple
  • 0:14 - 0:17
    environment. So, I'm going to do a quick
  • 0:17 - 0:19
    intro on how the environment works. We'll
  • 0:19 - 0:22
    also quickly answer the question of why
  • 0:22 - 0:24
    we need reinforcement learning on such a
  • 0:24 - 0:27
    simple environment. Next, to navigate the
  • 0:27 - 0:30
    environment, we need to use the epsilon-greedy
  • 0:30 - 0:32
    algorithm. Both Q-learning and
  • 0:32 - 0:34
    Deep Q-Learning use the same
  • 0:34 - 0:37
    algorithm. After we know how to navigate
  • 0:37 - 0:38
    the environment, we're going to take a
  • 0:38 - 0:41
    look at the differences of the output
  • 0:41 - 0:43
    between Q-learning and Deep Q-Learning.
  • 0:43 - 0:45
    In Q-learning, we're training a Q-table.
  • 0:45 - 0:48
    In Deep Q-Learning, we're training a Deep
  • 0:48 - 0:51
    Q-network. We'll work through how the Q-table
  • 0:51 - 0:53
    is trained, then we can see how it
  • 0:53 - 0:55
    is different from training a Deep Q-network.
  • 0:55 - 0:58
    In training a Deep Q-network, we
  • 0:58 - 1:00
    also need a technique called
  • 1:00 - 1:02
    experience replay. And after we have an
  • 1:02 - 1:05
    idea of how Deep Q-Learning works, I'm
  • 1:05 - 1:07
    going to walk through the code and also
  • 1:07 - 1:08
    run it and demo it.
  • 1:11 - 1:12
    Just in case you're not familiar with
  • 1:12 - 1:14
    how this environment works, quick recap.
  • 1:14 - 1:17
    So, the goal is to get the learning agent
  • 1:17 - 1:19
    to figure out how to get to the goal on
  • 1:19 - 1:22
    the bottom right. The actions that the
  • 1:22 - 1:24
    agent can take are left. Internally, we're
  • 1:24 - 1:28
    going to represent that as zero, down is
  • 1:28 - 1:31
    one, right is two, up is three. In this
  • 1:31 - 1:34
    case, if the agent tries to go left or up,
  • 1:34 - 1:37
    it's just going to stay in place. So, in
  • 1:37 - 1:39
    general, if it tries to go off the grid,
  • 1:39 - 1:40
    it's going to stay in the same spot.
  • 1:40 - 1:42
    Internally, we're going to represent each
  • 1:42 - 1:45
    state as: this one is 0, this one is 1,
  • 1:45 - 1:46
    2, 3,
  • 1:46 - 1:51
    4, 5, 6, all the way to
  • 1:51 - 1:55
    14, 15. So, 16 states total. Each attempt at
  • 1:55 - 1:57
    navigating the map is considered one
  • 1:57 - 2:00
    episode. The episode ends if the
  • 2:00 - 2:04
    agent falls into one of these holes or
  • 2:04 - 2:06
    it reaches the goal. When it reaches the
  • 2:06 - 2:10
    goal, it will receive a reward of one. All
  • 2:10 - 2:13
    other states have no reward or penalty.
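
A minimal sketch of setting this environment up in code, assuming Gymnasium's FrozenLake-v1 (the is_slippery flag shown here is the twist discussed a little further down):

    import gymnasium as gym

    # 4x4 Frozen Lake; is_slippery is covered below
    env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)

    state, _ = env.reset()                                     # agent starts at state 0
    new_state, reward, terminated, truncated, _ = env.step(2)  # action 2 = go right
    env.close()
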
  • 2:13 - 2:17
    Okay, so that's how this environment
  • 2:17 - 2:19
    works. At this point, you might be
  • 2:19 - 2:20
    wondering why we would need
  • 2:20 - 2:23
    reinforcement learning to solve such a
  • 2:23 - 2:25
    simple map. Why not just use a
  • 2:25 - 2:27
    pathfinding algorithm? There is a twist
  • 2:27 - 2:29
    to this environment. There is a flag
  • 2:29 - 2:32
    called is_slippery.
  • 2:32 - 2:34
    When set to
  • 2:34 - 2:37
    true, the agent doesn't always execute
  • 2:37 - 2:40
    the action that it intends to. For
  • 2:40 - 2:42
    example, if the agent wants to go right,
  • 2:42 - 2:44
    there's only a one-third chance that it will
  • 2:44 - 2:47
    execute this action, and there's a two-thirds
  • 2:47 - 2:50
    chance of it going in an adjacent
  • 2:50 - 2:53
    direction. Now, with slippery turned on,
  • 2:53 - 2:55
    a pathfinding algorithm will not be able
  • 2:55 - 2:58
    to solve this. In fact, I'm not even sure
  • 2:58 - 3:00
    what algorithm could. That's where
  • 3:00 - 3:02
    reinforcement learning comes in. If I
  • 3:02 - 3:04
    don't know how to solve this, I can just
  • 3:04 - 3:08
    give the agent some incentive and let it
  • 3:08 - 3:10
    figure out how to solve this. So, that's
  • 3:10 - 3:13
    where reinforcement learning
  • 3:13 - 3:16
    shines. In terms of how the agent
  • 3:16 - 3:18
    navigates the map, both Deep Q-Learning
  • 3:18 - 3:20
    and Q-learning use the epsilon-greedy
  • 3:20 - 3:23
    algorithm. Basically, we start with a
  • 3:23 - 3:26
    variable called epsilon set equal
  • 3:26 - 3:28
    to one, and then we generate a
  • 3:28 - 3:31
    random number. If the random number is
  • 3:31 - 3:33
    less than epsilon, we pick a random
  • 3:33 - 3:36
    action. Otherwise, we pick the best action
  • 3:36 - 3:39
    that we know of at the moment. And at the
  • 3:39 - 3:41
    end of each episode, we'll decrease
  • 3:41 - 3:44
    epsilon by a little bit at a time. So,
  • 3:44 - 3:47
    essentially, we start off with 100%
  • 3:47 - 3:50
    random exploration, and then, eventually,
  • 3:50 - 3:52
    near the end of training, we're going to
  • 3:52 - 3:56
    be always selecting the best action to
  • 3:57 - 4:00
    take. Before we jump into the details of
  • 4:00 - 4:02
    how the training works between Q-learning
  • 4:02 - 4:04
    versus Deep Q-Learning, let's
  • 4:04 - 4:06
    take a look at what the outcome looks
  • 4:06 - 4:09
    like. For Q-learning, the output
  • 4:09 - 4:11
    of the training is a Q-table,
  • 4:11 - 4:13
    which is nothing more than a
  • 4:13 - 4:17
    two-dimensional array consisting of 16
  • 4:17 - 4:20
    states by four actions. After training,
  • 4:20 - 4:22
    the whole table is going to be filled
  • 4:22 - 4:26
    with Q-values. For example, it might look
  • 4:26 - 4:28
    something like this.
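
As a rough sketch of that structure (the numbers here are made up for illustration, not taken from the video):

    import numpy as np

    q_table = np.zeros((16, 4))               # 16 states x 4 actions, initialized to zero
    q_table[0] = [0.50, 0.52, 0.59, 0.50]     # made-up Q-values for state 0: left, down, right, up
    best_action = int(np.argmax(q_table[0]))  # 2, i.e. go right
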
  • 4:28 - 4:32
    So, in this case, the agent is at
  • 4:32 - 4:34
    state zero. We look at the table and see
  • 4:34 - 4:37
    what the maximum value is. In this case,
  • 4:37 - 4:40
    it's to go right. So, this is the
  • 4:40 - 4:43
    prescribed action. Over at Deep Q-Learning,
  • 4:43 - 4:46
    the output is a Deep Q-network,
  • 4:46 - 4:48
    which is actually nothing more than a
  • 4:48 - 4:52
    regular feed-forward neural network. This is
  • 4:52 - 4:54
    actually what it looks like, but I think
  • 4:54 - 4:56
    we should switch back to the simplified
  • 4:56 - 5:00
    view. The input layer is going to have 16
  • 5:00 - 5:04
    nodes. The output layer is going to have
  • 5:04 - 5:05
    4
  • 5:05 - 5:09
    nodes. The way that we send an input into
  • 5:09 - 5:12
    the input layer is like this: if the
  • 5:12 - 5:14
    agent is at state zero, we're going to
  • 5:14 - 5:18
    set the first node to one and everything
  • 5:18 - 5:20
    else to
  • 5:20 - 5:25
    zero. If the agent is at state one, then
  • 5:25 - 5:27
    the first node is zero and the second node
  • 5:27 - 5:31
    is one, and everything else is zero. So,
  • 5:31 - 5:33
    this is called one-hot encoding. Just
  • 5:33 - 5:36
    one more--if it's at the
  • 5:36 - 5:41
    last... If we want to put in state 15, then
  • 5:41 - 5:44
    everything is all zeros, and state 15 is one.
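
A minimal sketch of that one-hot encoding in PyTorch (the helper name is just illustrative):

    import torch

    def one_hot_state(state, num_states=16):
        # All zeros except a single one at the agent's current state.
        encoded = torch.zeros(num_states)
        encoded[state] = 1.0
        return encoded

    one_hot_state(0)   # tensor([1., 0., 0., ..., 0.])
    one_hot_state(15)  # tensor([0., 0., ..., 0., 1.])
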
  • 5:44 - 5:47
    So, that's how the input works. With the
  • 5:47 - 5:50
    input, the output that gets calculated
  • 5:50 - 5:53
    are Q-values. Now, the Q-values are not
  • 5:53 - 5:55
    going to look like what's in the Q-table,
  • 5:55 - 5:58
    but it might be something similar. Let me
  • 5:58 - 6:00
    throw some numbers in here.
  • 6:02 - 6:05
    Okay, so same thing--for this particular
  • 6:05 - 6:08
    input, the best action is the highest Q-value.
  • 6:08 - 6:11
    When training a neural network,
  • 6:11 - 6:14
    essentially, we're trying to train the
  • 6:14 - 6:16
    weights associated with each one of
  • 6:16 - 6:20
    these lines in the neural network and
  • 6:20 - 6:24
    the bias for all the hidden layers. Now,
  • 6:24 - 6:26
    how do we know how many hidden layers we
  • 6:26 - 6:29
    need? In this case, one layer was enough,
  • 6:29 - 6:31
    but you can certainly add more if
  • 6:31 - 6:34
    necessary. And how many nodes do we need
  • 6:34 - 6:37
    in the hidden layer? I tried 16, and it's
  • 6:37 - 6:40
    able to solve the map, so I'm sticking
  • 6:40 - 6:42
    with that. But you can certainly increase
  • 6:42 - 6:44
    or decrease this number to see what
  • 6:44 - 6:47
    happens. Okay, so that's the difference
  • 6:47 - 6:49
    between the output of Q-learning versus
  • 6:49 - 6:50
    Deep Q-Learning.
  • 6:53 - 6:55
    As the agent navigates the map,
  • 6:55 - 6:58
    it is using the Q-learning formula to
  • 6:58 - 7:01
    calculate the Q-values and update the Q-table.
  • 7:01 - 7:03
    Now, the formula might look a
  • 7:03 - 7:04
    little bit scary, but it's actually not
  • 7:04 - 7:07
    that bad. Let's work through some
  • 7:07 - 7:10
    examples. Let's say our agent is in state
  • 7:10 - 7:14
    14 and it's going right to get to state
  • 7:14 - 7:19
    15. We're calculating Q of state 14,
  • 7:19 - 7:24
    action 2, which is this cell. The current
  • 7:24 - 7:27
    value of that cell is zero because we
  • 7:27 - 7:30
    initialized the whole Q-table to zero at
  • 7:30 - 7:33
    the beginning, plus the learning rate. The
  • 7:33 - 7:35
    learning rate is a hyperparameter that
  • 7:35 - 7:38
    we can set. As an example, I'll just put
  • 7:38 - 7:40
    in 0.01
  • 7:40 - 7:45
    times the reward. We get a reward of one
  • 7:45 - 7:47
    because we reached the goal, plus the
  • 7:47 - 7:50
    discount factor, another parameter that
  • 7:50 - 7:55
    we set. I'll just use 0.9 times the max Q-value
  • 7:55 - 7:58
    of the new state. So, the max Q-value
  • 7:58 - 8:00
    of state 15.
  • 8:01 - 8:04
    Since state 15 is a terminal state, it
  • 8:04 - 8:06
    will never get anything other than zeros
  • 8:06 - 8:07
    in this table.
  • 8:07 - 8:11
    The max is zero. Essentially, these two
  • 8:11 - 8:15
    are gone, subtracted by the same thing
  • 8:15 - 8:17
    that we had here.
  • 8:18 - 8:20
    Okay, so work through the math. This
  • 8:20 - 8:23
    is just 0.01,
  • 8:23 - 8:26
    so we get 0.01 here.
  • 8:26 - 8:29
    Now, let's do another one really
  • 8:29 - 8:36
    quickly. The agent is here, takes a right, Q of
  • 8:36 - 8:42
    13, going to the right. 13 is also all
  • 8:42 - 8:44
    zeros. This is the one we're
  • 8:44 - 8:48
    updating. Here is zero, plus the learning
  • 8:48 - 8:53
    rate times the reward. There is no reward,
  • 8:53 - 8:57
    plus the discount factor times max of
  • 8:57 - 9:01
    the new state, max of state
  • 9:01 - 9:05
    14. Max of state 14 is
  • 9:06 - 9:11
    0.01, subtracted by, again, zero.
  • 9:11 - 9:16
    This is equal to 0.00009.
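
The two updates worked out above, written as plain arithmetic with the example learning rate of 0.01 and discount factor of 0.9:

    learning_rate = 0.01
    discount_factor = 0.9

    # Q(14, right): reward of 1, and the max Q-value of terminal state 15 is 0
    q_14_right = 0 + learning_rate * (1 + discount_factor * 0 - 0)     # = 0.01

    # Q(13, right): no reward, and the max Q-value of state 14 is now 0.01
    q_13_right = 0 + learning_rate * (0 + discount_factor * 0.01 - 0)  # = 0.00009
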
  • 9:16 - 9:18
    Okay, it's actually pretty
  • 9:18 - 9:20
    straightforward. Now, how the heck does
  • 9:20 - 9:23
    this formula help find a path? So, if we
  • 9:23 - 9:28
    train enough, the theory is that this
  • 9:28 - 9:32
    number is going to be really close to
  • 9:32 - 9:35
    one, and then the states next to it
  • 9:35 - 9:40
    are going to be 0.9 something,
  • 9:41 - 9:43
    and then the states adjacent
  • 9:43 - 9:48
    to that are probably going to be 0.8 something.
  • 9:51 - 9:53
    And if we keep going,
  • 9:57 - 10:01
    we can see that a path, or
  • 10:01 - 10:04
    actually two paths,
  • 10:04 - 10:07
    are possible. So, mathematically,
  • 10:07 - 10:10
    this is how the path is found.
  • 10:13 - 10:15
    Over at Deep Q-Learning, the
  • 10:15 - 10:17
    formula is going to look like this: We
  • 10:17 - 10:20
    set Q equal to the reward if the new
  • 10:20 - 10:24
    state is a terminal state. Otherwise, we
  • 10:24 - 10:25
    set it to
  • 10:25 - 10:28
    this part
  • 10:30 - 10:32
    of the Q-learning formula.
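
A small sketch of that rule (the function name and arguments are illustrative; new_state_q_values would be the target network's output for the new state):

    def dqn_target(reward, new_state_q_values, terminated, discount_factor=0.9):
        # Terminal new state: the target is just the reward.
        if terminated:
            return float(reward)
        # Otherwise: reward plus the discounted max Q-value of the new state.
        return reward + discount_factor * new_state_q_values.max().item()
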
  • 10:32 - 10:34
    Let's see how the formula is
  • 10:34 - 10:36
    going to be used in training.
  • 10:36 - 10:39
    For Deep Q-Learning, we actually
  • 10:39 - 10:42
    need two neural networks. Let me walk
  • 10:42 - 10:45
    through the steps. The network on the
  • 10:45 - 10:48
    left is called the policy network. This
  • 10:48 - 10:51
    is the network that we're going to do
  • 10:51 - 10:53
    the training on. The one on the right is
  • 10:53 - 10:56
    called the target network. The target
  • 10:56 - 10:58
    network is the one that makes use of the
  • 10:58 - 11:00
    Deep Q-Learning formula. Now, let's walk
  • 11:00 - 11:03
    through the steps of training. Step one
  • 11:03 - 11:07
    is going to be creation of this policy
  • 11:07 - 11:11
    network. Step two, we make a copy of the
  • 11:11 - 11:15
    policy network into the target network.
  • 11:15 - 11:17
    So, basically, we're copying the
  • 11:17 - 11:19
    weights and the bias over here. So both
  • 11:19 - 11:22
    networks are identical. Step number three:
  • 11:22 - 11:25
    the agent navigates the map as usual.
  • 11:25 - 11:29
    Let's say the agent is here in state 14
  • 11:30 - 11:34
    and it's going into state 15. Step three
  • 11:34 - 11:37
    is just navigation. Now, step four, we
  • 11:37 - 11:40
    input state 14. Remember how we have to
  • 11:40 - 11:42
    encode the input? It's going to look like
  • 11:42 - 11:48
    this: 0, 1, 2, 3,
  • 11:48 - 11:50
    12, 13,
  • 11:50 - 11:55
    state 14, 15. The input is going to
  • 11:55 - 11:57
    look like this. As you know, when a neural
  • 11:57 - 12:00
    network is created, it comes with
  • 12:00 - 12:04
    a random set of weights and biases. So, with
  • 12:04 - 12:05
    this input, we're actually going to get
  • 12:05 - 12:09
    an output, even though these values are
  • 12:09 - 12:10
    pretty much meaningless.
  • 12:10 - 12:13
    So, we might get some stuff
  • 12:13 - 12:15
    that looks like this.
  • 12:15 - 12:18
    I'm just putting in some random numbers.
  • 12:18 - 12:21
    Okay, so these Q-values are
  • 12:21 - 12:24
    meaningless, and as a reminder, this is
  • 12:24 - 12:28
    the left action, down, right, and up. Okay.
  • 12:28 - 12:30
    Step five:
  • 12:30 - 12:32
    we do the same thing. We take the exact
  • 12:32 - 12:38
    same input, state 14, and send it into the
  • 12:38 - 12:41
    target network, which will also
  • 12:41 - 12:44
    calculate the exact same numbers
  • 12:44 - 12:47
    because the target network is the same
  • 12:47 - 12:50
    as the policy network currently.
  • 12:50 - 12:53
    Step six: this is where we
  • 12:53 - 12:58
    calculate the Q-value for state 14.
  • 12:58 - 13:00
    We're taking the action of two.
  • 13:00 - 13:04
    It's equal to, since we're going into
  • 13:04 - 13:08
    state 15, it is a terminal state,
  • 13:08 - 13:10
    we set it equal to
  • 13:10 - 13:12
    one.
  • 13:12 - 13:17
    Step seven: Step seven is to set the target.
  • 13:17 - 13:22
    Input state 14, output action two
  • 13:22 - 13:23
    is this node.
  • 13:23 - 13:25
    We take the value that we
  • 13:25 - 13:29
    calculated in step six, and we replace
  • 13:29 - 13:33
    the Q-value in the output.
  • 13:33 - 13:36
    Step number eight: we take the target Q-values,
  • 13:36 - 13:39
    and we'll use it to train the policy network.
  • 13:39 - 13:42
    So, this value is the one that's
  • 13:42 - 13:44
    really going to change. As you know, with
  • 13:44 - 13:46
    neural networks, it doesn't go straight
  • 13:46 - 13:49
    to one. It's going to go toward that
  • 13:49 - 13:51
    direction, so maybe it'll go to
  • 13:51 - 13:54
    0.01, but if you repeat the training many,
  • 13:54 - 13:57
    many times, it will approach one.
  • 13:58 - 14:01
    Step nine is to repeat the whole thing again.
  • 14:03 - 14:05
    Of course, we're not going to
  • 14:05 - 14:07
    create the policy network again, nor are we going
  • 14:07 - 14:09
    to make a copy of it. So what we're
  • 14:09 - 14:13
    repeating is steps three through eight.
  • 14:14 - 14:17
    Step ten: after a certain number of steps or
  • 14:17 - 14:20
    episodes, we're going to sync the policy
  • 14:20 - 14:22
    network with the target network, which
  • 14:22 - 14:24
    means we're going to make them identical
  • 14:24 - 14:26
    by copying the weights and biases from
  • 14:26 - 14:29
    the policy over to the target network.
  • 14:29 - 14:31
    After syncing the networks, we're going
  • 14:31 - 14:36
    to repeat step nine again and then repeat step ten
  • 14:36 - 14:39
    until training is done. Okay, so that's
  • 14:39 - 14:42
    generally how Deep Q-Learning works.
  • 14:42 - 14:43
    That might have been a little bit
  • 14:43 - 14:45
    complicated, so maybe rewind the video,
  • 14:45 - 14:47
    watch it again, and make sure you
  • 14:47 - 14:49
    understand what's happening.
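
As a minimal sketch of steps one and two, using a plain feed-forward network with the layer sizes described earlier (an illustration, not the video's exact code; the same load_state_dict call is also how step ten's syncing can be done):

    import torch.nn as nn

    # step one: create the policy network (16 inputs -> 16 hidden -> 4 outputs)
    policy_net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

    # step two: create the target network and copy the weights and biases over
    target_net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
    target_net.load_state_dict(policy_net.state_dict())
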
  • 14:50 - 14:52
    To effectively train a neural
  • 14:52 - 14:54
    network, we need to randomize the
  • 14:54 - 14:57
    training data that we send into the
  • 14:57 - 14:59
    neural network. However, if you remember
  • 14:59 - 15:02
    from steps three and four, the agent
  • 15:02 - 15:05
    takes an action, and then we're sending
  • 15:05 - 15:07
    that training data into the neural network.
  • 15:07 - 15:09
    So the question is, how do we
  • 15:09 - 15:12
    randomize the order of a single sample?
  • 15:12 - 15:14
    That's where experience replay comes
  • 15:14 - 15:19
    in. We need a step 3a where we memorize
  • 15:19 - 15:21
    the agent's experience. As the agent
  • 15:21 - 15:24
    navigates the map, we store the state
  • 15:24 - 15:26
    that it was in, what action it took, the
  • 15:26 - 15:28
    new state that it reached, if there was a
  • 15:28 - 15:30
    reward or not, and if the new state
  • 15:30 - 15:33
    is a terminal state or not. We take that
  • 15:33 - 15:36
    and insert it into the memory, and the
  • 15:36 - 15:39
    memory is nothing more than a Python
  • 15:39 - 15:41
    deque. How a deque works is that as the
  • 15:41 - 15:46
    deque gets full, whatever is oldest gets
  • 15:46 - 15:50
    purged. We need a step 3b. This is the
  • 15:50 - 15:54
    replay step. This is where we take, say, 30
  • 15:54 - 15:57
    random samples from the deque and then
  • 15:57 - 16:00
    pass it on to step four for training. So
  • 16:00 - 16:02
    that's what experience replay is.
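
A tiny demo of the deque behavior and the random sampling described above (the transition strings are placeholders):

    import random
    from collections import deque

    memory = deque(maxlen=3)                  # tiny maxlen just for the demo
    for transition in ["t1", "t2", "t3", "t4"]:
        memory.append(transition)
    print(memory)                             # deque(['t2', 't3', 't4'], maxlen=3): 't1' was purged
    batch = random.sample(memory, 2)          # a random mini-batch for replay
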
  • 16:05 - 16:07
    Before we jump into the code, if you
  • 16:07 - 16:09
    have problems installing Gymnasium,
  • 16:09 - 16:12
    especially on Windows, I've got a video for
  • 16:12 - 16:15
    that. I'm not going to walk through the Q-learning
  • 16:15 - 16:17
    code, but if you are interested
  • 16:17 - 16:19
    in that, jump to my Q-learning code
  • 16:19 - 16:21
    walkthrough video after watching this
  • 16:21 - 16:24
    one. Also, if you have zero experience
  • 16:24 - 16:26
    with neural networks, you can check out
  • 16:26 - 16:28
    this basic tutorial on neural networks.
  • 16:28 - 16:32
    You also need PyTorch, so head over to PyTorch.org
  • 16:32 - 16:34
    and get that installed. The first
  • 16:34 - 16:36
    thing that we do is create a class to
  • 16:36 - 16:39
    represent our Deep Q-network. I'm calling
  • 16:39 - 16:43
    it DQN. As mentioned before, a Deep Q-network
  • 16:43 - 16:45
    is nothing more than a feed-forward
  • 16:45 - 16:47
    neural network, so there's really
  • 16:47 - 16:49
    nothing special about it. Here, I'm using
  • 16:49 - 16:51
    a pretty standard way of creating a neural
  • 16:51 - 16:55
    network using PyTorch. If you look up any
  • 16:55 - 16:58
    PyTorch tutorial, you'll probably see
  • 16:58 - 17:00
    something just like this. So since
  • 17:00 - 17:03
    this is not a PyTorch tutorial, I'm not
  • 17:03 - 17:05
    going to spend too much time explaining
  • 17:05 - 17:07
    how PyTorch works. For this class, we
  • 17:07 - 17:09
    have to inherit the neural network
  • 17:09 - 17:12
    module, which requires us to implement
  • 17:12 - 17:15
    two functions: the __init__ function and the
  • 17:15 - 17:17
    forward function. In the __init__ function,
  • 17:17 - 17:20
    I'm passing in the number of nodes in my
  • 17:20 - 17:24
    input state, hidden layer, and output
  • 17:24 - 17:26
    state. Back at our diagram, we're going to
  • 17:26 - 17:30
    have 16 input states. As mentioned before,
  • 17:30 - 17:32
    I'm using 16 in the hidden layer. That's
  • 17:32 - 17:34
    something that you can adjust yourself.
  • 17:34 - 17:36
    And in the output layer, four nodes.
  • 17:36 - 17:40
    We declare the hidden layer, which
  • 17:40 - 17:43
    has 16 going into 16, and then the output
  • 17:43 - 17:47
    layer, 16 going into four. And then the
  • 17:47 - 17:50
    forward function, x is the training dataset,
  • 17:50 - 17:51
    and we're sending the training data
  • 17:51 - 17:53
    through the neural network. Again, this is
  • 17:53 - 17:56
    pretty common PyTorch code.
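
A sketch of such a class along the lines just described (parameter names are illustrative, not quoted from the video):

    import torch.nn as nn
    import torch.nn.functional as F

    class DQN(nn.Module):
        def __init__(self, in_states=16, h1_nodes=16, out_actions=4):
            super().__init__()
            self.fc1 = nn.Linear(in_states, h1_nodes)    # hidden layer: 16 in, 16 out
            self.out = nn.Linear(h1_nodes, out_actions)  # output layer: 16 in, 4 out

        def forward(self, x):
            x = F.relu(self.fc1(x))  # send the input through the hidden layer with ReLU
            return self.out(x)       # raw Q-values, one per action
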
  • 17:58 - 18:00
    Next, we need a class to represent
  • 18:00 - 18:02
    the replay memory.
  • 18:02 - 18:05
    So, it's this portion that we're
  • 18:05 - 18:07
    implementing right now.
  • 18:09 - 18:12
    In the __init__ function, we'll pass in a
  • 18:12 - 18:14
    max length and then create the Python
  • 18:14 - 18:17
    deque. In the append function, we're going
  • 18:17 - 18:21
    to append the transition. The transition
  • 18:21 - 18:24
    is this tuple here: state, action, new state,
  • 18:24 - 18:27
    reward, and terminated.
  • 18:27 - 18:30
    The sample function will
  • 18:30 - 18:33
    return a random sample of whatever size
  • 18:33 - 18:35
    we want from the memory, and then the
  • 18:35 - 18:37
    len function simply returns the length
  • 18:37 - 18:39
    of the memory.
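
A sketch of a replay memory along those lines (an approximation of what's described, not a verbatim copy):

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, maxlen):
            self.memory = deque(maxlen=maxlen)

        def append(self, transition):
            # transition = (state, action, new_state, reward, terminated)
            self.memory.append(transition)

        def sample(self, sample_size):
            # Return sample_size random transitions from the memory.
            return random.sample(self.memory, sample_size)

        def __len__(self):
            return len(self.memory)
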
  • 18:40 - 18:42
    The Frozen Lake DQL class is
  • 18:42 - 18:44
    where we're going to do our training.
  • 18:44 - 18:47
    We set the learning rate and
  • 18:47 - 18:50
    the discount factor. Those are part of
  • 18:50 - 18:53
    the Q-learning formula, and these are
  • 18:53 - 18:55
    values that you can adjust and play
  • 18:55 - 18:56
    around with.
  • 18:56 - 18:59
    The network sync rate is the
  • 18:59 - 19:01
    number of steps the agent takes before
  • 19:01 - 19:05
    syncing the policy and target networks.
  • 19:05 - 19:08
    That's the setting for step ten,
  • 19:08 - 19:13
    where we sync the policy and the target network.
  • 19:13 - 19:17
    We set the replay memory size to
  • 19:17 - 19:22
    1,000 and the replay memory sample size
  • 19:22 - 19:24
    to 32. These are also numbers that you
  • 19:24 - 19:28
    can play around with. Next is the loss
  • 19:28 - 19:31
    function and the optimizer. These two are
  • 19:31 - 19:34
    PyTorch variables. For the loss function,
  • 19:34 - 19:36
    I'm simply using the mean squared error
  • 19:36 - 19:38
    function. For the optimizer, we'll
  • 19:38 - 19:41
    initialize that at a later time. This
  • 19:41 - 19:44
    actions list is to simply map the action
  • 19:44 - 19:49
    numbers into letters for printing.
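
One way those settings might be declared as class attributes; the learning rate, discount factor, and sync rate values below are purely illustrative, while the memory size of 1,000 and sample size of 32 come from the walkthrough:

    import torch.nn as nn

    class FrozenLakeDQL:
        learning_rate = 0.001           # illustrative; adjust and experiment
        discount_factor = 0.9           # gamma in the Q-learning formula
        network_sync_rate = 10          # steps between syncing policy and target networks (illustrative)
        replay_memory_size = 1000       # size of the replay memory
        mini_batch_size = 32            # size of each training sample drawn from memory
        loss_fn = nn.MSELoss()          # mean squared error loss
        optimizer = None                # initialized later, once the policy network exists
        ACTIONS = ['L', 'D', 'R', 'U']  # map action numbers to letters for printing
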
  • 19:50 - 19:52
    In the train function, we can
  • 19:52 - 19:54
    specify how many episodes we want to
  • 19:54 - 19:57
    train the agent. Whether we want to
  • 19:57 - 20:00
    render the map on screen, and whether we
  • 20:00 - 20:02
    want to turn on the slippery flag.
  • 20:02 - 20:06
    We'll instantiate the Frozen Lake
  • 20:06 - 20:09
    environment, and then create a variable
  • 20:09 - 20:12
    to store the number of states. This is
  • 20:12 - 20:15
    going to be 16 and store the number of
  • 20:15 - 20:18
    actions. This is going to be four. Here, we
  • 20:18 - 20:21
    initialize epsilon to one and we
  • 20:21 - 20:24
    instantiate the replay memory. This is
  • 20:24 - 20:27
    going to be size of 1,000. Now, we
  • 20:27 - 20:30
    create the policy DQ network.
  • 20:32 - 20:34
    This is step one: create the
  • 20:34 - 20:36
    policy network.
  • 20:36 - 20:40
    We also create a target network
  • 20:40 - 20:43
    and copy the weights and bias from the
  • 20:43 - 20:46
    policy network into the target network
  • 20:46 - 20:48
    so that
  • 20:48 - 20:52
    is step number two: making the target and
  • 20:52 - 20:53
    policy network identical.
  • 20:55 - 20:56
    Before training, we'll do a
  • 20:56 - 20:58
    print out of the policy network just so
  • 20:58 - 21:02
    we can compare it to the end result. Next,
  • 21:02 - 21:04
    we initialize the optimizer that we
  • 21:04 - 21:05
    declared earlier,
  • 21:05 - 21:09
    and we're simply using the Adam
  • 21:09 - 21:12
    optimizer, passing in the learning rate.
  • 21:12 - 21:15
    We'll use this rewards per episode list
  • 21:15 - 21:18
    to keep track of the rewards collected
  • 21:18 - 21:18
    per episode.
  • 21:18 - 21:21
    We'll also use this epsilon
  • 21:21 - 21:24
    history list to keep track of epsilon
  • 21:24 - 21:25
    decaying over time.
  • 21:27 - 21:29
    We'll use step count to keep track
  • 21:29 - 21:31
    of the number of steps taken. This is
  • 21:31 - 21:35
    used to determine when to sync the
  • 21:35 - 21:39
    policy and target network again. So that
  • 21:39 - 21:41
    is for step ten: the syncing.
  • 21:41 - 21:45
    Next, we will loop through the
  • 21:45 - 21:47
    number of episodes that we've specified
  • 21:47 - 21:49
    when calling the train function. We'll
  • 21:49 - 21:52
    initialize the map, so the agent starts
  • 21:52 - 21:55
    at state zero. We'll initialize the
  • 21:55 - 21:58
    terminated and truncated flag. Terminated
  • 21:58 - 21:59
    is when the agent falls into the
  • 21:59 - 22:02
    hole or reaches the goal. Truncated is
  • 22:02 - 22:05
    when the agent takes more than 200
  • 22:05 - 22:08
    actions. On such a small 4x4 map, this
  • 22:08 - 22:10
    is probably never going to occur, but
  • 22:10 - 22:12
    we'll keep this anyway. Now, we have a
  • 22:12 - 22:14
    while loop checking for the terminated
  • 22:14 - 22:18
    and truncated flag. Keep looping until
  • 22:18 - 22:20
    these conditions are met. This part is
  • 22:20 - 22:23
    basically the epsilon-greedy algorithm.
  • 22:23 - 22:27
    If a random number is less than epsilon,
  • 22:27 - 22:29
    we'll just pick a random action.
  • 22:29 - 22:31
    Otherwise, we'll use the policy network
  • 22:31 - 22:34
    to calculate the set of Q-values, take
  • 22:34 - 22:37
    the maximum of that, extract that item,
  • 22:37 - 22:39
    and that's going to be the best action.
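
A sketch of that selection logic, pulled out into a hypothetical helper (policy_dqn and env are assumed to come from the surrounding training loop; the encoding step inside it is what the state_to_dqn_input function mentioned next handles in the actual code):

    import random
    import torch

    def select_action(state, epsilon, policy_dqn, env, num_states=16):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit the policy network.
        if random.random() < epsilon:
            return env.action_space.sample()      # random action
        with torch.no_grad():                     # prediction only, no gradient tracking
            encoded = torch.zeros(num_states)
            encoded[state] = 1.0                  # one-hot encode the state
            q_values = policy_dqn(encoded)
        return int(q_values.argmax().item())      # best known action
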
  • 22:39 - 22:41
    So, we have a function here called
  • 22:41 - 22:46
    state_to_dqn_input. Let's jump down there.
  • 22:46 - 22:48
    Remember that we have to take the state
  • 22:48 - 22:52
    and encode it. That's what this function does.
  • 22:52 - 22:55
    This type of encoding here.
  • 22:57 - 23:00
    Okay, let's go back.
  • 23:00 - 23:02
    With PyTorch, if we want to perform a
  • 23:02 - 23:05
    prediction, we should call this torch, no_
  • 23:05 - 23:07
    grad, so that it doesn't calculate the
  • 23:07 - 23:09
    stuff needed for training. After
  • 23:09 - 23:12
    selecting the either a random action or
  • 23:12 - 23:14
    the best action, we'll call the step
  • 23:14 - 23:17
    function to execute the action. When we
  • 23:17 - 23:20
    take that action, it's going to return
  • 23:20 - 23:22
    the new state, whether there is a reward
  • 23:22 - 23:25
    or not, whether it's a terminal state, or
  • 23:25 - 23:27
    whether we got truncated. We take all that
  • 23:27 - 23:29
    information
  • 23:29 - 23:33
    and put it into our memory. So that is step
  • 23:33 - 23:38
    3a. When we did the epsilon-greedy,
  • 23:38 - 23:40
    that was step three.
  • 23:43 - 23:45
    After executing the action, we're
  • 23:45 - 23:47
    resetting the state equal to the new
  • 23:47 - 23:50
    state. We increment our step counter. If
  • 23:50 - 23:54
    we received a reward, we put it on our list.
  • 23:54 - 23:56
    Now, we'll check the memory to see if we
  • 23:56 - 23:58
    have enough training data to do
  • 23:58 - 24:02
    optimization on. Also, we want to check if
  • 24:02 - 24:05
    we have collected at least one reward. If
  • 24:05 - 24:07
    we haven't collected any rewards, there's
  • 24:07 - 24:09
    really no point in optimizing the
  • 24:09 - 24:12
    network. If those conditions are met, we
  • 24:12 - 24:15
    use memory.sample and we pass in the
  • 24:15 - 24:18
    batch size, which was 32. And we get a
  • 24:18 - 24:21
    batch of training data out of the memory.
  • 24:21 - 24:23
    We'll pass that training data into the
  • 24:23 - 24:26
    optimized function, along with the policy
  • 24:26 - 24:29
    network and the target network.
  • 24:29 - 24:32
    Let's jump over to the optimize function.
  • 24:32 - 24:35
    This first line is just looking
  • 24:35 - 24:37
    at the policy network and getting the
  • 24:37 - 24:39
    number of input nodes. We expect this to
  • 24:39 - 24:44
    be 16. Next come the current_q_list and the target_q_list.
  • 24:46 - 24:47
    So this is the current Q-list.
  • 24:47 - 24:50
    This is the target Q-list.
  • 24:52 - 24:55
    What I'm about to describe now is
  • 24:55 - 24:58
    step 3b: replaying the experience.
  • 25:00 - 25:02
    So, to replay the experience, we're
  • 25:02 - 25:05
    looping through the training data inside
  • 25:05 - 25:08
    the mini batch. Let's jump down here for
  • 25:08 - 25:12
    a moment. So this is step number four.
  • 25:12 - 25:14
    We're taking the states and passing it
  • 25:14 - 25:17
    into the policy network to calculate the
  • 25:17 - 25:19
    current list of Q-values.
  • 25:19 - 25:23
    Step number four: the output is
  • 25:23 - 25:25
    the list of Q-values.
  • 25:25 - 25:28
    Step number five: we pass in the
  • 25:28 - 25:31
    same thing to the target network.
  • 25:31 - 25:35
    Step five: we get the Q-values
  • 25:35 - 25:38
    out here, which should be the same as the
  • 25:38 - 25:40
    Q-values from the policy network.
  • 25:40 - 25:44
    Step number six is up here. So, a
  • 25:44 - 25:47
    reminder of what it looks like. Step six
  • 25:47 - 25:50
    is using this formula here.
  • 25:50 - 25:53
    So that's what we have here: if
  • 25:53 - 25:56
    terminated, just return the
  • 25:56 - 25:58
    reward. Otherwise, use that second
  • 25:58 - 26:01
    formula. So, now that we have a target,
  • 26:01 - 26:04
    we'll go to step seven. Step seven is
  • 26:04 - 26:06
    taking the output of step six and
  • 26:06 - 26:09
    replacing the respective Q-value.
  • 26:09 - 26:12
    And that's what we're doing here.
  • 26:12 - 26:15
    We're replacing the Q-value of that
  • 26:15 - 26:17
    particular action with the target that
  • 26:17 - 26:19
    was calculated up and above.
  • 26:19 - 26:23
    Step number eight: step eight is to
  • 26:23 - 26:25
    take the target values and use that to
  • 26:25 - 26:27
    train the current Q-values.
  • 26:30 - 26:32
    So that's what we have here. We're using
  • 26:32 - 26:35
    the loss function, pass in the current
  • 26:35 - 26:39
    set of Q-values and the target set of Q-values,
  • 26:39 - 26:42
    and then this is just standard PyTorch
  • 26:42 - 26:44
    code to optimize the policy
  • 26:44 - 26:49
    network. Now, step nine: step nine is to
  • 26:49 - 26:52
    repeat steps three to eight, starting from
  • 26:52 - 26:55
    navigation to optimizing the network.
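
A sketch of an optimize step along the lines of steps four through eight (an approximation with illustrative helper names, not the video's exact code):

    import torch

    def one_hot(state, num_states):
        # Encode a state index as the one-hot input vector the networks expect.
        encoded = torch.zeros(num_states)
        encoded[state] = 1.0
        return encoded

    def optimize(mini_batch, policy_dqn, target_dqn, loss_fn, optimizer,
                 num_states=16, discount_factor=0.9):
        current_q_list = []
        target_q_list = []

        for state, action, new_state, reward, terminated in mini_batch:
            # Step six: the Deep Q-Learning target -- just the reward for a terminal
            # new state, otherwise reward + discount * max Q-value of the new state.
            if terminated:
                target = float(reward)
            else:
                with torch.no_grad():
                    target = reward + discount_factor * \
                        target_dqn(one_hot(new_state, num_states)).max().item()

            # Step four: current Q-values from the policy network.
            current_q_list.append(policy_dqn(one_hot(state, num_states)))

            # Steps five and seven: Q-values from the target network, with the taken
            # action's value replaced by the target computed above.
            with torch.no_grad():
                target_q = target_dqn(one_hot(state, num_states))
            target_q[action] = target
            target_q_list.append(target_q)

        # Step eight: train the policy network toward the target Q-values.
        loss = loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
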
  • 26:59 - 27:01
    Basically, that's just continuing this
  • 27:01 - 27:04
    inner while loop of stepping through the
  • 27:04 - 27:06
    states, and this outer for loop of
  • 27:06 - 27:09
    stepping through each episode.
  • 27:09 - 27:12
    Step ten is where we sync the
  • 27:12 - 27:15
    policy network and the target network.
  • 27:18 - 27:20
    And we do that down here if the
  • 27:20 - 27:23
    number of steps taken is greater than
  • 27:23 - 27:25
    the network sync rate that we set, then
  • 27:25 - 27:28
    we copy the policy network into the
  • 27:28 - 27:30
    target network, and then we reset the
  • 27:30 - 27:33
    step counter. Also, after each episode, we
  • 27:33 - 27:37
    should be decaying the epsilon value.
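
A sketch of that sync check and the epsilon decay, assuming step_count, network_sync_rate, epsilon, and episodes come from the surrounding training loop (the linear decay schedule here is one simple choice, not necessarily the video's exact one):

    # step ten: sync the target network to the policy network every so often
    if step_count > network_sync_rate:
        target_dqn.load_state_dict(policy_dqn.state_dict())
        step_count = 0

    # decay epsilon a little after each episode, from 1 down toward 0
    epsilon = max(epsilon - 1 / episodes, 0)
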
  • 27:37 - 27:39
    After syncing the networks,
  • 27:39 - 27:43
    we repeat step nine again, which is
  • 27:43 - 27:45
    repeating, basically, the navigation and
  • 27:45 - 27:47
    training again.
  • 27:47 - 27:50
    So, it's just going to go back up
  • 27:50 - 27:53
    here and do this all over again, all the
  • 27:53 - 27:56
    way until we have finished the number of
  • 27:56 - 27:58
    episodes. So that was training. After
  • 27:58 - 28:01
    training, we close the environment. We can
  • 28:01 - 28:04
    save the policy or the weights and bias
  • 28:04 - 28:07
    into a file. I'm hardcoding it here. You
  • 28:07 - 28:10
    can definitely make this more dynamic.
  • 28:10 - 28:13
    Here, I'm creating a new graph and I'm
  • 28:13 - 28:16
    basically graphing the rewards collected
  • 28:16 - 28:18
    per episode. Also, I'm graphing the
  • 28:18 - 28:21
    epsilon history here. After graphing, I'm
  • 28:21 - 28:24
    saving the graphs into an image.
  • 28:24 - 28:27
    Alright, so that was the train function. We
  • 28:27 - 28:29
    have talked about the optimize
  • 28:29 - 28:31
    function. Let me fold that.
  • 28:31 - 28:35
    We already looked at that. Now, the
  • 28:35 - 28:38
    test function is going to run the Frozen
  • 28:38 - 28:40
    Lake environment with the policy that we
  • 28:40 - 28:43
    learned from the train function. We can
  • 28:43 - 28:46
    also pass in the number of episodes and
  • 28:46 - 28:47
    whether we want to turn on slippery or
  • 28:47 - 28:50
    not. So the rest of the code is going to
  • 28:50 - 28:52
    look pretty similar to what we had in
  • 28:52 - 28:55
    the train function. We instantiate the
  • 28:55 - 28:57
    environment, get the number of states and
  • 28:57 - 29:01
    actions, 16 and 4 here. We declare the
  • 29:01 - 29:04
    policy network and load it from the file that
  • 29:04 - 29:06
    we saved during training. This is just
  • 29:06 - 29:09
    PyTorch code to switch the policy network
  • 29:09 - 29:12
    to prediction or evaluation mode,
  • 29:12 - 29:14
    rather than training mode. We'll print
  • 29:14 - 29:17
    the trained policy, and then we'll loop
  • 29:17 - 29:20
    over the episodes. We set the agent up
  • 29:20 - 29:22
    top, you've seen this before, keep looping
  • 29:22 - 29:25
    until the agent gets terminated or
  • 29:25 - 29:28
    truncated. Here, we're selecting the best
  • 29:28 - 29:31
    action out of the policy network and
  • 29:31 - 29:33
    executing the action and then close the
  • 29:33 - 29:36
    environment. So, ideally, when we run this,
  • 29:36 - 29:38
    we're going to see the agent navigate
  • 29:38 - 29:40
    the map and reach the goal. However, if
  • 29:40 - 29:43
    slippery is on, there is no guarantee
  • 29:43 - 29:45
    that the agent is going to solve it in
  • 29:45 - 29:47
    one try. It might take a couple of tries
  • 29:47 - 29:50
    for the agent to solve the map.
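
A sketch of one test episode along those lines, assuming the DQN class sketched earlier and an illustrative file name for the saved weights:

    import gymnasium as gym
    import torch

    env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="human")

    policy_dqn = DQN(in_states=16, h1_nodes=16, out_actions=4)
    policy_dqn.load_state_dict(torch.load("frozen_lake_dql.pt"))  # weights saved after training
    policy_dqn.eval()                                             # evaluation mode, not training mode

    state, _ = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        with torch.no_grad():
            encoded = torch.zeros(16)
            encoded[state] = 1.0
            action = int(policy_dqn(encoded).argmax().item())     # always take the best action
        state, reward, terminated, truncated, _ = env.step(action)
    env.close()
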
  • 29:52 - 29:53
    Alright, down at the main function,
  • 29:53 - 29:56
    we create an instance of the Frozen Lake
  • 29:56 - 29:58
    DQL class. First, we're going to try
  • 29:58 - 30:01
    non-slippery. We'll train it for 1,000
  • 30:01 - 30:04
    times or 1,000 episodes. We'll run the
  • 30:04 - 30:06
    test four times just because the map
  • 30:06 - 30:08
    pops up and goes away when the agent
  • 30:08 - 30:11
    gets to the goal, so we want to run it
  • 30:11 - 30:12
    a couple of times so we can actually see it
  • 30:12 - 30:14
    on the screen. Okay? I'm just hitting
  • 30:14 - 30:16
    Ctrl+F5.
  • 30:20 - 30:22
    Alright, training is done, and it looks
  • 30:22 - 30:26
    like the agent is able to solve the map.
  • 30:26 - 30:28
    We also have a print out of the policy
  • 30:28 - 30:30
    that it's learned. I print it in a way
  • 30:30 - 30:33
    that matches up with the grid. Let's jump
  • 30:33 - 30:36
    back to this in a second. Let's go to our
  • 30:36 - 30:41
    files. So, after training, we have a new graph
  • 30:41 - 30:45
    on the left side. That's the number
  • 30:45 - 30:48
    of rewards collected over the 1,000
  • 30:48 - 30:51
    episodes. As you can see, over time it's
  • 30:51 - 30:54
    improving, and finally, at the end it's
  • 30:54 - 30:56
    getting a lot of rewards. On the right
  • 30:56 - 30:59
    side, we can see epsilon decaying,
  • 30:59 - 31:03
    starting from one slowly, slowly down to
  • 31:03 - 31:06
    zero. Also, another file that gets created
  • 31:06 - 31:09
    is a binary file. So we can't really
  • 31:09 - 31:11
    display it, but it contains the weights
  • 31:11 - 31:14
    and biases of the policy network.
  • 31:14 - 31:17
    So, if we want to do the test
  • 31:17 - 31:19
    again, we don't need the training. We can
  • 31:19 - 31:22
    comment out the training line and just
  • 31:22 - 31:24
    run this again.
  • 31:24 - 31:26
    And we can see the agent solve the
  • 31:26 - 31:27
    map.
  • 31:32 - 31:34
    We can look at what policy the agent
  • 31:34 - 31:37
    learned, the way this is printed, it
  • 31:37 - 31:41
    matches up with the map. So at state zero,
  • 31:41 - 31:44
    the best action was to go right, which leads to
  • 31:44 - 31:47
    state one. At state one, the best action
  • 31:47 - 31:51
    is to go right again. At state two,
  • 31:51 - 31:55
    go down. At state six, go down again. At state ten,
  • 31:55 - 31:58
    go down again. At state 14,
  • 31:58 - 31:59
    go to the right.
  • 31:59 - 32:00
    Okay.
  • 32:01 - 32:05
    So it was right, right, down, down,
  • 32:05 - 32:08
    down, right. So remember, earlier that
  • 32:08 - 32:10
    there were two possible paths, the one
  • 32:10 - 32:13
    that we actually learned plus the one
  • 32:13 - 32:16
    going down. Because epsilon-greedy has a
  • 32:16 - 32:18
    randomness factor, whether the agent
  • 32:18 - 32:21
    learns the top path or the bottom path
  • 32:21 - 32:24
    is just somewhat random. If we train
  • 32:24 - 32:27
    again, maybe it'll go down the next time.
  • 32:30 - 32:34
    Now, let's turn the slippery factor on,
  • 32:34 - 32:37
    and uncomment the training line.
  • 32:37 - 32:41
    Now with slippery on, we expect the
  • 32:41 - 32:44
    agent to fail a lot more often or to
  • 32:44 - 32:46
    fall in the holes a lot more often.
  • 32:46 - 32:50
    Let me just triple the number of training
  • 32:50 - 32:54
    episodes and run the test ten times.
  • 32:56 - 32:58
    Okay, I'm running this again.
  • 32:58 - 33:00
    Again, with slippery turned on, there's no
  • 33:00 - 33:02
    guarantee that after training, the agent
  • 33:02 - 33:05
    is going to be able to pass every episode.
  • 33:05 - 33:07
    Let's make it come up.
  • 33:09 - 33:11
    You see it trying to go to the bottom
  • 33:11 - 33:13
    right, but because of slippery, it just
  • 33:13 - 33:16
    failed right there. But there you go.
  • 33:16 - 33:19
    It was able to solve it. Let's look at the
  • 33:19 - 33:22
    graph. The results compared to the
  • 33:22 - 33:24
    non-slippery surface are significantly
  • 33:24 - 33:27
    worse. Also, you want to be careful about
  • 33:27 - 33:30
    getting stuck in spots like these. If
  • 33:30 - 33:34
    I was unlucky enough to set my
  • 33:34 - 33:37
    training episodes to maybe 2,200 and it
  • 33:37 - 33:40
    got stuck here, it would have learned a
  • 33:40 - 33:42
    bad policy and wouldn't be able to solve the
  • 33:42 - 33:44
    map. So, in your training, if you're
  • 33:44 - 33:46
    finding that the agent is not able to
  • 33:46 - 33:49
    solve the map when slippery is turned on,
  • 33:49 - 33:51
    it might be because of things like these.
  • 33:51 - 33:54
    Okay, that concludes our Deep Q-Learning
  • 33:54 - 33:57
    tutorial. I'd love to hear feedback from
  • 33:57 - 34:00
    you. Was the explanation easy to
  • 34:00 - 34:02
    understand? What can I improve on? And
  • 34:02 - 34:04
    what other topics are you interested in?