Hey everyone. Welcome to my Deep Q-Learning tutorial. Here's what to expect from this video. Since Deep Q-Learning is a little bit complicated to explain, I'm going to use the Frozen Lake reinforcement learning environment. It's a very simple environment, so I'll do a quick intro on how it works. We'll also quickly answer the question of why we need reinforcement learning on such a simple environment. Next, to navigate the environment, we need the epsilon-greedy algorithm; both Q-learning and Deep Q-Learning use the same algorithm. Once we know how to navigate the environment, we'll look at how the output of Q-learning differs from that of Deep Q-Learning: in Q-learning we're training a Q-table, while in Deep Q-Learning we're training a deep Q-network. We'll work through how the Q-table is trained, and then see how that differs from training a deep Q-network. Training a deep Q-network also requires a technique called experience replay. And after we have an idea of how Deep Q-Learning works, I'm going to walk through the code, run it, and demo it.
Just in case you're not familiar with how this environment works, here's a quick recap. The goal is to get the learning agent to figure out how to reach the goal on the bottom right. The actions the agent can take are: left, which we'll represent internally as zero; down is one; right is two; and up is three. In this case, if the agent tries to go left or up, it just stays in place. In general, if it tries to go off the grid, it stays in the same spot. Internally, we represent each state as a number: this one is 0, this one is 1, then 2, 3, 4, 5, 6, all the way to 14 and 15, so 16 states total. Each attempt at navigating the map is considered one episode. The episode ends if the agent falls into one of the holes or reaches the goal. When it reaches the goal, it receives a reward of one. All other states have no reward or penalty. Okay, so that's how this environment works.
At this point, you might be wondering why we would need reinforcement learning to solve such a simple map. Why not just use a pathfinding algorithm? There is a twist to this environment: a flag called is_slippery. When it's set to true, the agent doesn't always execute the action it intends to. For example, if the agent wants to go right, there's only a one-third chance that it will actually execute this action, and a two-thirds chance of it going in an adjacent direction. With slippery turned on, a pathfinding algorithm won't be able to solve this; in fact, I'm not even sure what algorithm could. That's where reinforcement learning comes in: if I don't know how to solve this, I can just give the agent some incentive and let it figure out how to solve it. That's where reinforcement learning shines.
In terms of how the agent navigates the map, both Deep Q-Learning and Q-learning use the epsilon-greedy algorithm. Basically, we start with a variable called epsilon set equal to one, and then we generate a random number. If the random number is less than epsilon, we pick a random action; otherwise, we pick the best action that we know of at the moment. At the end of each episode, we decrease epsilon by a little bit. So, essentially, we start off with 100% random exploration, and by the end of training we're almost always selecting the best action to take.
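To make the idea concrete, here's a minimal sketch of epsilon-greedy action selection in Python. The names are purely illustrative; `best_action_fn` stands in for "the best action we know of", e.g. the argmax of a Q-table row or of the Q-network's output.

```python
import random

def epsilon_greedy(epsilon, num_actions, best_action_fn):
    """Pick a random action with probability epsilon, otherwise the best known action."""
    if random.random() < epsilon:
        return random.randint(0, num_actions - 1)   # explore: random action
    return best_action_fn()                         # exploit: best known action

# After each episode, epsilon is decreased a little, for example:
# epsilon = max(epsilon - 1 / num_episodes, 0)
```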
Before we jump into the details of how training works in Q-learning versus Deep Q-Learning, let's take a look at what the outcome looks like. For Q-learning, the output of the training is a Q-table, which is nothing more than a two-dimensional array of 16 states by four actions. After training, the whole table will be filled with Q-values; for example, it might look something like this. In this case, the agent is at state zero. We look at that row of the table and find the maximum value; in this case, it's to go right. So, that is the prescribed action.
Over at Deep Q-Learning, the output is a deep Q-network, which is actually nothing more than a regular feed-forward neural network. This is what it really looks like, but I think we should switch back to the simplified view. The input layer has 16 nodes and the output layer has 4 nodes. The way we send an input into the input layer is like this: if the agent is at state zero, we set the first node to one and everything else to zero. If the agent is at state one, then the first node is zero, the second node is one, and everything else is zero. This is called one-hot encoding. Just one more: if we want to put in state 15, then everything is zero except the node for state 15, which is one. So, that's how the input works. Given the input, the outputs that get calculated are Q-values. Now, these Q-values are not going to look exactly like what's in the Q-table, but they might be similar; let me throw some numbers in here. Okay, same thing: for this particular input, the best action is the one with the highest Q-value. When training a neural network, we're essentially training the weights associated with each of these lines in the network and the biases of the hidden layers. Now, how do we know how many hidden layers we need? In this case, one layer was enough, but you can certainly add more if necessary. And how many nodes do we need in the hidden layer? I tried 16 and it was able to solve the map, so I'm sticking with that, but you can certainly increase or decrease this number to see what happens. Okay, so that's the difference between the output of Q-learning versus Deep Q-Learning.
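As a quick illustration of the one-hot input encoding just described (this helper is only a sketch; the actual code uses its own encoding function, which we'll see later):

```python
import torch

def one_hot_state(state: int, num_states: int = 16) -> torch.Tensor:
    """Encode a state number as a one-hot vector: 1 at that state's position, 0 elsewhere."""
    encoded = torch.zeros(num_states)
    encoded[state] = 1.0
    return encoded

print(one_hot_state(0))    # 1 in position 0, zeros everywhere else
print(one_hot_state(15))   # 1 in position 15, zeros everywhere else
```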
As the agent navigates the map, it uses the Q-learning formula to calculate Q-values and update the Q-table. The formula might look a little bit scary, but it's actually not that bad: Q(s,a) = Q(s,a) + learning rate × (reward + discount factor × max Q(s') − Q(s,a)). Let's work through some examples. Say our agent is in state 14 and it goes right to reach state 15. We're calculating Q of state 14, action 2, which is this cell. The current value of that cell is zero, because we initialized the whole Q-table to zero at the beginning. To that we add the learning rate, a hyperparameter that we can set; as an example, I'll just use 0.01. That's multiplied by the reward, which is one because we reached the goal, plus the discount factor, another parameter that we set; I'll use 0.9, times the max Q-value of the new state, that is, the max Q-value of state 15. Since state 15 is a terminal state, its row will never hold anything other than zeros, so the max is zero and these two terms drop out. Then we subtract the same Q-value we started with, which is also zero. Work through the math and we get 0.01 here.

Now let's do another one quickly. The agent is here and takes a right: Q of state 13, action 2 (right). State 13's row is also all zeros; this is the cell we're updating. It starts at zero, plus the learning rate times (the reward, which is zero, plus the discount factor times the max Q-value of the new state, state 14, which is 0.01, minus the current value, which again is zero). That works out to 0.01 × 0.9 × 0.01 = 0.00009; a tiny value, but no longer zero. Okay, it's actually pretty straightforward.

Now, how the heck does this formula help find a path? If we train enough, the theory is that this cell's value will end up really close to one, the states next to it will be 0.9-something, the states adjacent to those will be around 0.8-something, and so on. If we keep going, we can see that a path emerges; actually, two paths are possible. So, mathematically, this is how the path is found.
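Here's the same arithmetic as a small Python sketch, using the standard Q-learning update Q(s,a) ← Q(s,a) + α·(reward + γ·max Q(s') − Q(s,a)) with the learning rate and discount factor from the examples above:

```python
learning_rate = 0.01   # alpha
discount = 0.9         # gamma

# 16 states x 4 actions, all initialized to zero
q = [[0.0] * 4 for _ in range(16)]

# Agent in state 14 goes right (action 2) into the goal, state 15, reward 1:
q[14][2] += learning_rate * (1 + discount * max(q[15]) - q[14][2])
print(q[14][2])   # 0.01

# Agent in state 13 goes right (action 2) into state 14, no reward:
q[13][2] += learning_rate * (0 + discount * max(q[14]) - q[13][2])
print(q[13][2])   # roughly 9e-05 (0.00009)
```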
Over at Deep Q-Learning, the formula looks like this: we set the target Q-value equal to the reward if the new state is a terminal state; otherwise, we set it to the reward plus the discount factor times the max Q-value of the new state, which is this part of the Q-learning formula. Let's see how this formula gets used in training.
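In code, that target could be computed roughly like this (a sketch only; `target_net` is whatever network plays the role of the target network described next, and the discount factor is the same 0.9 used earlier):

```python
import torch

def compute_target(reward, new_state_encoded, terminated, target_net, discount=0.9):
    """Target Q-value: just the reward for terminal states, otherwise
    reward + discount * max Q-value of the new state from the target network."""
    if terminated:
        return float(reward)
    with torch.no_grad():
        return reward + discount * target_net(new_state_encoded).max().item()
```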
For Deep Q-Learning, we actually need two neural networks. Let me walk through the steps. The network on the left is called the policy network; this is the network that we're going to do the training on. The one on the right is called the target network; the target network is the one that makes use of the Deep Q-Learning formula. Now, let's walk through the steps of training.

Step one is the creation of the policy network. Step two: we make a copy of the policy network into the target network. Basically, we're copying the weights and biases over, so both networks are identical. Step three: the agent navigates the map as usual. Let's say the agent is here in state 14 and it's going into state 15; step three is just navigation. Step four: we input state 14. Remember how we have to encode the input? The nodes run 0, 1, 2, 3, ... 12, 13, 14, 15, and only the node for state 14 is set to one. As you know, when a neural network is created it comes with a random set of weights and biases, so with this input we're actually going to get an output, even though the values are pretty much meaningless. So we might get something that looks like this; I'm just putting in some random numbers. These Q-values are meaningless, and as a reminder, the outputs correspond to the actions left, down, right, and up.

Step five: we do the same thing. We take the exact same input, state 14, and send it into the target network, which will calculate the exact same numbers, because the target network is currently identical to the policy network. Step six: this is where we calculate the target Q-value for state 14, taking action two. Since we're going into state 15, which is a terminal state, we set it equal to 1. Step seven is to set the target: for input state 14 and output action two, which is this node, we take the value calculated in step six and replace the Q-value in the target network's output with it. Step eight: we take the target Q-values and use them to train the policy network. So this value is the one that's really going to change. As you know, with neural networks it doesn't jump straight to one; it just moves in that direction, so maybe it goes to 0.01, but if you repeat the training many, many times, it will approach one.

Step nine is to repeat the whole thing again. Of course, we're not going to re-create the policy network or make another copy of it, so what we're repeating is steps three through eight. Step 10: after a certain number of steps or episodes, we sync the policy network with the target network, which means we make them identical again by copying the weights and biases from the policy network over to the target network. After syncing the networks, we repeat step nine again, and then step 10, until training is done.

Okay, so that's generally how Deep Q-Learning works. That might have been a little bit complicated, so maybe rewind the video, watch it again, and make sure you understand what's happening.
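To anchor steps one, two, and ten in code, here is a minimal sketch of creating the policy network, cloning it into the target network, and re-syncing them later. The 16-16-4 layout matches the Frozen Lake setup; using nn.Sequential here is just for illustration.

```python
from torch import nn

# Step 1: create the policy network (16 inputs -> 16 hidden -> 4 outputs).
policy_net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))

# Step 2: create the target network and copy the weights and biases over,
# so both networks start out identical.
target_net = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))
target_net.load_state_dict(policy_net.state_dict())

# ... steps 3-8: navigate, compute targets with target_net, train policy_net ...

# Step 10: after a certain number of steps, sync them again by copying
# the policy network's weights and biases into the target network.
target_net.load_state_dict(policy_net.state_dict())
```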
To effectively train a neural network, we need to randomize the training data that we send into it. However, if you remember from steps three and four, the agent takes one action and then we send that single sample into the neural network. So the question is: how do we randomize the order of a single sample? That's where experience replay comes in. We need a step 3a, where we memorize the agent's experience as it navigates the map: we store the state it was in, what action it took, the new state it reached, whether there was a reward or not, and whether the new state is a terminal state or not. We take that and insert it into the memory, and the memory is nothing more than a Python deque. The way a deque works is that once it's full, the oldest entries get purged as new ones are appended. We also need a step 3b, the replay step. This is where we take, say, 30 random samples from the deque and pass them on to step four for training. So, that's what experience replay is.
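A tiny sketch of the replay memory's underlying data structure: a Python deque with a maxlen silently discards its oldest entries once it's full.

```python
from collections import deque

memory = deque(maxlen=3)
for transition in ["t1", "t2", "t3", "t4"]:
    memory.append(transition)

print(memory)   # deque(['t2', 't3', 't4'], maxlen=3) -- the oldest entry 't1' was dropped
```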
Before we jump into the code: if you have problems installing Gymnasium, especially on Windows, I've got a video for that. I'm not going to walk through the Q-learning code, but if you're interested in that, jump to my Q-learning code walkthrough video after watching this one. Also, if you have zero experience with neural networks, you can check out this basic tutorial on neural networks. You'll need PyTorch, so head over to pytorch.org and get that installed.
The first thing we do is create the class that represents our deep Q-network; I'm calling it DQN. As mentioned before, a deep Q-network is nothing more than a feed-forward neural network, so there's really nothing special about it. Here I'm using a pretty standard way of creating a neural network with PyTorch; if you look up any PyTorch tutorial, you'll probably see something just like this. Since this is not a PyTorch tutorial, I'm not going to spend too much time explaining how PyTorch works. For this class, we inherit from the neural network Module class, which requires us to implement two functions: the init function and the forward function. In the init function, I'm passing in the number of nodes in my input layer, hidden layer, and output layer. Back at our diagram: we're going to have 16 input nodes; as mentioned before, I'm using 16 in the hidden layer, which is something you can adjust yourself; and the output layer has four nodes. We declare the hidden layer, which maps 16 nodes to 16, and then the output layer, which maps 16 to four. In the forward function, x is the training data, and we're sending it through the neural network. Again, this is pretty common PyTorch code.
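For reference, a DQN module along these lines might look like the sketch below. The parameter names (in_states, h1_nodes, out_actions) are my labels for the three node counts described above, not necessarily the exact ones in the code.

```python
import torch
from torch import nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, in_states, h1_nodes, out_actions):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)     # hidden layer: 16 -> 16
        self.out = nn.Linear(h1_nodes, out_actions)   # output layer: 16 -> 4

    def forward(self, x):
        x = F.relu(self.fc1(x))   # pass the input through the hidden layer
        return self.out(x)        # one raw Q-value per action

policy_net = DQN(in_states=16, h1_nodes=16, out_actions=4)
print(policy_net(torch.zeros(16)))   # four (initially meaningless) Q-values
```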
Next, we need a class to represent the replay memory, so it's this portion that we're implementing right now. In the init function, we pass in a max length and create the Python deque. In the append function, we append a transition; the transition is this tuple here: state, action, new state, reward, and terminated. The sample function returns a random sample of whatever size we want from the memory, and the len function simply returns the length of the memory.
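A sketch of what that replay memory class might look like (each transition being the tuple state, action, new state, reward, terminated):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, maxlen):
        self.memory = deque([], maxlen=maxlen)   # old transitions fall off once full

    def append(self, transition):
        # transition = (state, action, new_state, reward, terminated)
        self.memory.append(transition)

    def sample(self, sample_size):
        return random.sample(self.memory, sample_size)   # random mini-batch

    def __len__(self):
        return len(self.memory)
```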
The FrozenLakeDQL class is where we're going to do our training. We set the learning rate and the discount factor; those are part of the Q-learning formula, and they are values you can adjust and play around with. The network sync rate is the number of steps the agent takes before syncing the policy and target networks; that's the setting for step 10, where we sync the two networks. We set the replay memory size to 1,000 and the replay memory sample size to 32; these are also numbers you can play around with. Next are the loss function and the optimizer; these two are PyTorch variables. For the loss function, I'm simply using mean squared error; the optimizer gets initialized at a later time. This actions list simply maps the action numbers into letters for printing.
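Roughly, the top of that class might look like this. The 0.9 discount, 1,000-entry memory, and batch size of 32 come from the walkthrough; the learning rate and sync rate values here are placeholders I chose for illustration, and the attribute names are assumptions.

```python
from torch import nn

class FrozenLakeDQL:
    learning_rate_a = 0.001         # learning rate (value assumed for illustration)
    discount_factor_g = 0.9         # discount factor from the Q-learning formula
    network_sync_rate = 10          # steps before syncing policy -> target (assumed value)
    replay_memory_size = 1000       # max transitions kept in replay memory
    mini_batch_size = 32            # sample size pulled from memory for training

    loss_fn = nn.MSELoss()          # mean squared error loss
    optimizer = None                # created later, once the policy network exists

    ACTIONS = ['L', 'D', 'R', 'U']  # map action numbers to letters for printing
```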
In the train function, we can specify how many episodes we want to train the agent for, whether we want to render the map on screen, and whether we want to turn on the slippery flag. We instantiate the Frozen Lake environment, then create a variable to store the number of states, which is going to be 16, and store the number of actions, which is going to be four. Here we initialize epsilon to one, and we instantiate the replay memory with a size of 1,000. Now we create the policy DQN network; this is step one, creating the policy network. We also create a target network and copy the weights and biases from the policy network into it; that is step number two, making the target and policy networks identical. Before training, we print out the policy network, just so we can compare it to the end result. Next, we initialize the optimizer that we declared earlier; I'm simply using the Adam optimizer, passing in the learning rate. We'll use this rewards-per-episode list to keep track of the rewards collected per episode, and this epsilon history list to keep track of epsilon decaying over time. We'll use a step count to keep track of the number of steps taken; this is used to determine when to sync the policy and target networks again, so that is for step 10, the syncing.
Next, we loop through the number of episodes specified when calling the train function. We initialize the map so the agent starts at state zero, and initialize the terminated and truncated flags. Terminated is when the agent falls into a hole or reaches the goal; truncated is when the agent takes more than 200 actions. On such a small 4x4 map, that's probably never going to occur, but we'll keep it anyway. Now we have a while loop that checks the terminated and truncated flags and keeps looping until one of those conditions is met. This part is basically the epsilon-greedy algorithm: if a random number is less than epsilon, we just pick a random action; otherwise, we use the policy network to calculate the set of Q-values, take the maximum, extract that item, and that's going to be the best action. We have a function here called state_to_dqn_input; let's jump down there. Remember that we have to take the state and encode it; that's what this function does, this type of encoding here. Okay, let's go back. With PyTorch, if we want to perform a prediction, we should call it under torch.no_grad() so that it doesn't calculate the stuff needed for training. After selecting either a random action or the best action, we call the step function to execute the action. When we take that action, it returns the new state, whether there is a reward or not, whether it's a terminal state, and whether we got truncated. We take all that information and put it into our memory, so that is step 3a; when we did the epsilon-greedy part, that was step three. After executing the action, we set the state equal to the new state and increment our step counter. If we received a reward, we put it on our list. Now we check the memory to see if we have enough training data to do optimization on; we also check whether we have collected at least one reward, because if we haven't collected any rewards, there's really no point in optimizing the network. If those conditions are met, we call memory.sample, pass in the batch size, which was 32, and get a batch of training data out of the memory. We pass that training data into the optimize function, along with the policy network and the target network.
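Putting that inner loop into a sketch (helper names such as state_to_dqn_input, memory, and optimize follow the walkthrough, but their exact signatures here are my assumptions):

```python
import random
import torch

def run_episode(env, policy_dqn, target_dqn, memory, epsilon, num_states,
                mini_batch_size, state_to_dqn_input, optimize):
    """One episode of epsilon-greedy navigation plus an optional replay/optimize pass."""
    state = env.reset()[0]
    terminated = truncated = False
    step_count = 0

    while not terminated and not truncated:
        if random.random() < epsilon:
            action = env.action_space.sample()             # explore: random action
        else:
            with torch.no_grad():                          # prediction only, no gradients
                q_values = policy_dqn(state_to_dqn_input(state, num_states))
            action = q_values.argmax().item()              # exploit: best known action

        new_state, reward, terminated, truncated, _ = env.step(action)
        memory.append((state, action, new_state, reward, terminated))   # step 3a: memorize
        state = new_state
        step_count += 1

    if len(memory) > mini_batch_size:
        mini_batch = memory.sample(mini_batch_size)        # step 3b: replay
        optimize(mini_batch, policy_dqn, target_dqn)       # steps 4-8

    return step_count
```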
Let's jump over to the optimize function. This first line just looks at the policy network and gets the number of input nodes; we expect this to be 16. Then we have the current Q list and the target Q list: this is the current set of Q-values, and this is the target set of Q-values. What I'm about to describe now is step 3b, replaying the experience. To replay the experience, we loop through the training data inside the mini-batch. Let's jump down here for a moment: this is step number four, where we take the states and pass them into the policy network to calculate the current list of Q-values; the output is that list of Q-values. Step number five: we pass the same thing into the target network and get the Q-values out here, which should be the same as the Q-values from the policy network. Step number six is up here; as a reminder, step six uses this formula: if terminated, just return the reward; otherwise, use that second formula. Now that we have a target, we go to step seven, which is taking the output of step six and replacing the respective Q-value; that's what we're doing here, replacing the Q-value of that particular action with the target calculated above. Step number eight is to take the target values and use them to train the current Q-values. That's what we have here: we call the loss function, passing in the current set of Q-values and the target set of Q-values, and then this is just standard PyTorch code to optimize the policy network.
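As a sketch, an optimize function along those lines could look like the following. The loss function, optimizer, and encoding helper are passed in here just to keep the example self-contained; in the walkthrough they live on the class.

```python
import torch

def optimize(mini_batch, policy_dqn, target_dqn, state_to_dqn_input,
             loss_fn, optimizer, num_states=16, discount_factor=0.9):
    current_q_list, target_q_list = [], []

    for state, action, new_state, reward, terminated in mini_batch:
        # Step 6: compute the target Q-value for this transition.
        if terminated:
            target = float(reward)
        else:
            with torch.no_grad():
                target = reward + discount_factor * \
                    target_dqn(state_to_dqn_input(new_state, num_states)).max().item()

        # Step 4: current Q-values from the policy network.
        current_q = policy_dqn(state_to_dqn_input(state, num_states))
        current_q_list.append(current_q)

        # Steps 5 and 7: Q-values from the target network, with the chosen
        # action's value replaced by the target computed above.
        with torch.no_grad():
            target_q = target_dqn(state_to_dqn_input(state, num_states))
        target_q[action] = target
        target_q_list.append(target_q)

    # Step 8: train the policy network toward the target Q-values.
    loss = loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```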
Now, step nine. Step nine is to repeat steps three through eight, from navigation through optimizing the network. Basically, that's just continuing this inner while loop of stepping through the states and this outer for loop of stepping through each episode. Step 10 is where we sync the policy network and the target network, and we do that down here: if the number of steps taken is greater than the network sync rate that we set, then we copy the policy network into the target network and reset the step counter. Also, after each episode we should be decaying the epsilon value. After syncing the networks, we repeat step nine again, which is basically repeating the navigation and training; it just goes back up here and does this all over again, all the way until we have finished the number of episodes. So, that was training.
After training, we close the environment. We can save the policy, that is, the weights and biases, into a file; I'm hard-coding the file name here, but you can definitely make it more dynamic. Here I'm creating a new graph: I'm basically graphing the rewards collected per episode, and I'm also graphing the epsilon history. After graphing, I save the graphs into an image. All right, so that was the train function. We've also talked about the optimize function, so let me fold that; we already looked at it.
Now, the test function will run the Frozen Lake environment with the policy that we learned from the train function. We can also pass in the number of episodes and whether we want to turn slippery on or not. The rest of the code looks pretty similar to what we had in the train function: we instantiate the environment and get the number of states and actions, 16 and 4. Here we declare the policy network and load it from the file that we saved during training. This is just PyTorch code to switch the policy network to prediction mode, or rather evaluation mode instead of training mode. We print the trained policy, and then we loop over the episodes. We set the agent up top; you've seen this before. We keep looping until the agent gets terminated or truncated. Here we select the best action out of the policy network and execute the action, and then we close the environment. Ideally, when we run this, we'll see the agent navigate the map and reach the goal; however, if slippery is on, there is no guarantee that the agent will solve it in one try. It might take a couple of tries for the agent to solve the map.
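A sketch of that test routine (the DQN class and one_hot_state helper are the ones sketched earlier; the saved file name here is an assumption):

```python
import gymnasium as gym
import torch

def test(episodes, is_slippery=False):
    """Run the learned policy greedily, rendering the map on screen."""
    env = gym.make('FrozenLake-v1', map_name='4x4', is_slippery=is_slippery,
                   render_mode='human')
    num_states = env.observation_space.n
    num_actions = env.action_space.n

    # DQN and one_hot_state come from the earlier sketches; the file name is assumed.
    policy_dqn = DQN(in_states=num_states, h1_nodes=num_states, out_actions=num_actions)
    policy_dqn.load_state_dict(torch.load("frozen_lake_dql.pt"))
    policy_dqn.eval()   # evaluation mode, not training mode

    for _ in range(episodes):
        state = env.reset()[0]
        terminated = truncated = False
        while not terminated and not truncated:
            with torch.no_grad():
                action = policy_dqn(one_hot_state(state, num_states)).argmax().item()
            state, reward, terminated, truncated, _ = env.step(action)

    env.close()
```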
All right, down at the main function, we create an instance of the FrozenLakeDQL class. First, we're going to try non-slippery. We'll train it for 1,000 episodes, and we'll run the test four times, just because the map pops up and goes away when the agent gets to the goal, so we want to run it a couple of times so we can actually see it on the screen.
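For reference, the main section might look roughly like this (class and method names are assumed to match the walkthrough):

```python
if __name__ == "__main__":
    frozen_lake = FrozenLakeDQL()
    is_slippery = False
    frozen_lake.train(1000, is_slippery=is_slippery)   # train for 1,000 episodes
    frozen_lake.test(4, is_slippery=is_slippery)       # run the test 4 times
```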
Okay, I'm just hitting Ctrl+F5.
All right, training is done, and it looks like the agent is able to solve the map. We also have a printout of the policy it learned; I print it in a way that matches up with the grid. Let's come back to this in a second and go to our files. After training, we have a new graph. On the left side is the number of rewards collected over the 1,000 episodes; as you can see, it improves over time, and by the end it's collecting a lot of rewards. On the right side, we can see epsilon decaying, starting from one and slowly going down to zero. Another file that gets created is a binary file; we can't really display it, but it contains the weights and biases of the policy network. So if we want to run the test again, we don't need to retrain: we can comment out the training line and just run it again, and we can see the agent solve the map.
We can look at the policy the agent learned. The way it's printed, it matches up with the map. At state zero, the best action was to go right, which leads to state one. At state one, the best action is to go right again. At state two, go down; at state six, go down again; at state 10, go down again; and at state 14, go to the right. So it was right, right, down, down, down, right. Remember from earlier that there were two possible paths: the one we actually learned, plus the one going down first. Because epsilon-greedy has a randomness factor, whether the agent learns the top path or the bottom path is somewhat random; if we train again, maybe it'll go down the other path next time.
Now, let's turn the slippery factor on and uncomment the training line. With slippery on, we expect the agent to fail a lot more often, that is, to fall into the holes a lot more often. Let me just triple the number of training episodes and do 10 episodes for testing. Okay, I'm running this again. Again, with slippery turned on, there's no guarantee that after training the agent will be able to pass every episode. Let's bring the window up: you can see it trying to get to the bottom right, but because of the slipperiness it just failed right there. But there you go, it was able to solve it. Let's look at the graph: the results compared to the non-slippery surface are significantly worse. Also, you want to be careful about getting stuck in spots like these, which means that if I had been unlucky enough to set my training episodes to maybe 2,200 and it got stuck here, it would have learned a bad policy and wouldn't be able to solve the map. So in your own training, if you're finding that the agent can't solve the map when slippery is turned on, it might be because of things like these.
Okay, that concludes our Deep Q-Learning tutorial. I'd love to hear feedback from you: was the explanation easy to understand, what can I improve on, and what other topics are you interested in?