Deep Q-Learning/Deep Q-Network (DQN) Explained | Python Pytorch Deep Reinforcement Learning

  • 0:00 - 0:02
    hey everyone welcome to my deep Q
  • 0:02 - 0:04
    learning tutorial here's what to expect
  • 0:04 - 0:07
    from this video since deep Q learning is
  • 0:07 - 0:09
    a little bit complicated to explain I'm
  • 0:09 - 0:11
    going to use the Frozen Lake reinforcement
  • 0:11 - 0:14
    learning environment it's a very simple
  • 0:14 - 0:17
    environment so I'm going to do a quick
  • 0:17 - 0:19
intro on how the environment works I'll
  • 0:19 - 0:22
    also quickly answer the question of why
  • 0:22 - 0:24
    we need reinforcement learning on such a
  • 0:24 - 0:27
    simple environment next to navigate the
  • 0:27 - 0:29
    environment we need to use the Epsilon
  • 0:29 - 0:32
    greedy algorithm both Q learning and
  • 0:32 - 0:34
deep Q learning use the same
  • 0:34 - 0:37
    algorithm after we know how to navigate
  • 0:37 - 0:38
    the environment we're going to take a
  • 0:38 - 0:41
    look at the differences of the output
  • 0:41 - 0:43
    between Q learning and deep Q learning
  • 0:43 - 0:45
    in Q learning we're training a q table
  • 0:45 - 0:48
    in deep Q learning we're training a deep
  • 0:48 - 0:50
    Q Network we'll work through how the Q
  • 0:50 - 0:53
    table is trained then we can see how it
  • 0:53 - 0:55
    is different from training a deep Q
  • 0:55 - 0:58
    Network in training a deep Q Network we
  • 0:58 - 1:00
also need a technique called
  • 1:00 - 1:02
    experience replay and after we have an
  • 1:02 - 1:05
    idea of how deep Q learning works I'm
  • 1:05 - 1:07
    going to walk through the code and also
  • 1:07 - 1:10
    run it and demo
  • 1:10 - 1:12
    it just in case you're not familiar with
  • 1:12 - 1:14
    how this environment Works quick recap
  • 1:14 - 1:17
    so the goal is to get the learning agent
  • 1:17 - 1:19
    to figure out how to get to the goal on
  • 1:19 - 1:22
    the bottom right the actions that the
  • 1:22 - 1:24
agent can take are left internally we're
  • 1:24 - 1:28
    going to represent that as zero down is
  • 1:28 - 1:31
    one right is two up is three in this
  • 1:31 - 1:34
    case if the agent tries to go left or up
  • 1:34 - 1:37
    it's just going to stay in place so in
  • 1:37 - 1:39
    general if it tries to go off the grid
  • 1:39 - 1:40
    it's going to stay in the same spot
  • 1:40 - 1:42
    internally we're going to represent each
  • 1:42 - 1:45
    state as this one is zero this one is 1
  • 1:45 - 1:46
    2
  • 1:46 - 1:51
3 4 5 6 all the way to
  • 1:51 - 1:55
    14 15 so 16 States total each attempt at
  • 1:55 - 1:57
    navigating the map is considered one
  • 1:57 - 2:00
episode the episode ends if the
  • 2:00 - 2:04
    agent falls into one of these holes or
  • 2:04 - 2:06
    it reaches the goal when it reaches the
  • 2:06 - 2:10
    goal it will receive a reward of one all
  • 2:10 - 2:13
    other states have no reward or penalty
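For anyone who wants to poke at the environment directly, here is a minimal sketch using the gymnasium API for Frozen Lake described above; the is_slippery flag is covered a little further down.

```python
import gymnasium as gym

# Minimal sketch of the Frozen Lake setup described above.
# Actions: 0 = left, 1 = down, 2 = right, 3 = up; states are numbered 0-15.
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="human")

state = env.reset()[0]                                      # agent starts at state 0
new_state, reward, terminated, truncated, _ = env.step(2)   # try to go right
# reward is 1 only when the goal (state 15) is reached; falling into a hole or
# reaching the goal sets terminated=True, which ends the episode
env.close()
```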
  • 2:13 - 2:17
    okay so that's how this environment
  • 2:17 - 2:19
    works at this point you might be
  • 2:19 - 2:20
    wondering why we would need
  • 2:20 - 2:23
    reinforcement learning to solve such a
  • 2:23 - 2:25
    simple map why not just use a
  • 2:25 - 2:27
    pathfinding algorithm there is a Twist
  • 2:27 - 2:29
    to this environment there is a flag
  • 2:29 - 2:32
called is_slippery
  • 2:32 - 2:34
when set to
  • 2:34 - 2:37
    true the agent doesn't always execute
  • 2:37 - 2:40
    the action that it intends to for
  • 2:40 - 2:42
    example if the agent wants to go right
  • 2:42 - 2:44
there's only a one-third chance that it will
  • 2:44 - 2:47
execute this action and there's a two-thirds
  • 2:47 - 2:50
    chance of it going in an adjacent
  • 2:50 - 2:53
direction now with slippery turned on a
  • 2:53 - 2:55
pathfinding algorithm will not be able
  • 2:55 - 2:58
    to solve this in fact I'm not even sure
  • 2:58 - 3:00
    what algorithm can that's where
  • 3:00 - 3:02
    reinforcement learning comes in if I
  • 3:02 - 3:04
    don't know how to solve this I can just
  • 3:04 - 3:08
    give the agent some incentive and let it
  • 3:08 - 3:10
    figure out how to solve this so that's
  • 3:10 - 3:13
    where reinforcement learning
  • 3:13 - 3:16
    shines in terms of how the agent
  • 3:16 - 3:18
    navigates the map both deep Q learning
  • 3:18 - 3:20
and Q learning use the Epsilon greedy
  • 3:20 - 3:23
    algorithm basically we start with a
  • 3:23 - 3:26
variable called Epsilon and set it
  • 3:26 - 3:28
    equal to one and then we generate a
  • 3:28 - 3:31
    random number if the random number is
  • 3:31 - 3:33
    less than Epsilon we pick a random
  • 3:33 - 3:36
    action otherwise we pick the best action
  • 3:36 - 3:39
    that we know of at the moment and at the
  • 3:39 - 3:41
    end of each episode we'll decrease
  • 3:41 - 3:44
    Epsilon by a little bit at a time so
  • 3:44 - 3:47
    essentially we start off with 100%
  • 3:47 - 3:50
    random exploration and then eventually
  • 3:50 - 3:52
    near the end of training we're going to
  • 3:52 - 3:56
be always selecting the best action to take
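A minimal sketch of the epsilon greedy idea just described; best_known_action is a placeholder, not a real function from the video, standing in for whatever "best action" means in your setup (the argmax of a Q table row, or of the network's output).

```python
import random

epsilon = 1.0        # start with 100% random exploration
num_actions = 4      # left, down, right, up

def best_known_action(state):
    # placeholder: argmax over the Q table row (Q learning)
    # or over the policy network's output (deep Q learning)
    return 0

def choose_action(state):
    if random.random() < epsilon:
        return random.randint(0, num_actions - 1)   # explore: pick a random action
    return best_known_action(state)                 # exploit: best action we know of

# at the end of each episode, decrease epsilon a little at a time,
# e.g. epsilon = max(epsilon - 1 / total_episodes, 0)
```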
  • 3:57 - 4:00
before we jump into the details of
  • 4:00 - 4:02
how the training works between Q
  • 4:02 - 4:04
    learning versus deep Q learning let's
  • 4:04 - 4:06
    take a look at what the outcome looks
  • 4:06 - 4:09
like for Q learning the
  • 4:09 - 4:11
    output of the training is a q table
  • 4:11 - 4:13
    which is nothing more than a
  • 4:13 - 4:17
    two-dimensional array consisting of 16
  • 4:17 - 4:20
    states by four actions after training
  • 4:20 - 4:22
    the whole table is going to be filled
  • 4:22 - 4:26
    with Q values for example it might look
  • 4:26 - 4:28
    something like
  • 4:28 - 4:32
    this so in this case the agent is at
  • 4:32 - 4:34
    state zero we look at the table and see
  • 4:34 - 4:37
    what the maximum value is in this case
  • 4:37 - 4:40
    it's to go right so this is the
  • 4:40 - 4:43
    prescribed action over at Deep Q
  • 4:43 - 4:46
    learning the output is a deep Q Network
  • 4:46 - 4:48
    which is actually nothing more than a
  • 4:48 - 4:52
regular feedforward neural network this is
  • 4:52 - 4:54
    actually what it looks like but I think
  • 4:54 - 4:56
    we should switch back to the simplified
  • 4:56 - 5:00
    view the input layer is going to have 16
  • 5:00 - 5:04
    nodes the output layer is going to have
  • 5:04 - 5:05
    four
  • 5:05 - 5:09
nodes the way that we enter input into
  • 5:09 - 5:12
    the input layer is like this if the
  • 5:12 - 5:14
    agent is at state zero we're going to
  • 5:14 - 5:18
set the first node to one and everything
  • 5:18 - 5:20
    else
  • 5:20 - 5:25
    zero if the agent is at State one then
  • 5:25 - 5:27
the first node is zero and the second node
  • 5:27 - 5:31
    is one and everything else is zero so
  • 5:31 - 5:33
    this is called one hot encoding just do
  • 5:33 - 5:36
    one more if it's at the
  • 5:36 - 5:41
last state if we want to put in state 15 then
  • 5:41 - 5:44
everything is all zeros and the node for state 15 is one
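A quick sketch of this one hot encoding as a PyTorch tensor (the helper name is just illustrative):

```python
import torch

def one_hot(state, num_states=16):
    # a 16-element input vector with a single 1 at the agent's state
    x = torch.zeros(num_states)
    x[state] = 1.0
    return x

print(one_hot(0))    # tensor([1., 0., 0., ..., 0.])  -> agent at state 0
print(one_hot(15))   # tensor([0., 0., ..., 0., 1.])  -> agent at state 15
```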
  • 5:44 - 5:47
    so that's how the input works with the
  • 5:47 - 5:50
    input the output that gets calculated
  • 5:50 - 5:53
    are Q values now the Q values are not
  • 5:53 - 5:55
    going to look like what's in the Q table
  • 5:55 - 5:58
    but it might be something similar let me
  • 5:58 - 6:01
    throw some numbers in here
  • 6:02 - 6:05
    okay so same thing for this particular
  • 6:05 - 6:08
    input the best action is the highest Q
  • 6:08 - 6:11
    value when training a neural network
  • 6:11 - 6:14
    essentially we're trying to train the
  • 6:14 - 6:16
    weights associated with each one of
  • 6:16 - 6:20
    these lines in the neural network and
  • 6:20 - 6:24
    the bias for all the hidden layers now
  • 6:24 - 6:26
    how do we know how many hidden layers we
  • 6:26 - 6:29
    need in this case one layer was enough
  • 6:29 - 6:31
    but you can certainly add more if
  • 6:31 - 6:34
    necessary and how many nodes do we need
  • 6:34 - 6:37
in the hidden layer I tried 16 and it's
  • 6:37 - 6:40
    able to solve the map so I'm sticking
  • 6:40 - 6:42
    with that but you can certainly increase
  • 6:42 - 6:44
    or decrease this number to see what
  • 6:44 - 6:47
    happens okay so that's the differences
  • 6:47 - 6:49
    between the output of Q learning versus
  • 6:49 - 6:51
    deep Q
  • 6:52 - 6:55
    learning as the agent navigates the map
  • 6:55 - 6:58
    it is using the Q learning formula to
  • 6:58 - 7:01
    calculate the Q values and update the Q
  • 7:01 - 7:03
    table now the formula might look a
  • 7:03 - 7:04
    little bit scary but it's actually not
  • 7:04 - 7:07
that bad let's work through some examples
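For reference, the formula being walked through here is the standard Q learning update, with alpha the learning rate, gamma the discount factor, and s' the new state:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)
```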
  • 7:07 - 7:10
let's say our agent is in state
  • 7:10 - 7:14
    14 and it's going right to get to State
  • 7:14 - 7:19
    15 we're calculating Q of State 14
  • 7:19 - 7:24
    Action 2 which is this cell the current
  • 7:24 - 7:27
    value of that cell is zero because we
  • 7:27 - 7:30
    initialize the whole Q table to zero at
  • 7:30 - 7:33
    the beginning plus the learning rate the
  • 7:33 - 7:35
    learning rate is a hyperparameter that
  • 7:35 - 7:38
    we can set as an example I'll just put
  • 7:38 - 7:39
    in
  • 7:39 - 7:40
    0.01
  • 7:40 - 7:45
    times the reward we get a reward of one
  • 7:45 - 7:47
    because we reach the goal plus the
  • 7:47 - 7:50
    discount Factor another parameter that
  • 7:50 - 7:55
    we set I'll just use 0.9 times the max Q
  • 7:55 - 7:58
    value of the new state so the max Q
  • 7:58 - 8:01
    value of State 15
  • 8:01 - 8:04
    since State 15 is a terminal state it
  • 8:04 - 8:06
    will never get anything other than zeros
  • 8:06 - 8:07
    in this
  • 8:07 - 8:11
    table Max is zero essentially these two
  • 8:11 - 8:15
    are gone subtracted by the same thing
  • 8:15 - 8:17
    that we had
  • 8:17 - 8:20
    here okay so work through the math this
  • 8:20 - 8:23
    is just
  • 8:23 - 8:26
    0.01 so we get 0.01
  • 8:26 - 8:29
    here now let's do another one really
  • 8:29 - 8:36
    quickly agent is here take a right Q of
  • 8:36 - 8:42
    13 Going to the right 13 is also all
  • 8:42 - 8:44
    zeros this is the one we're
  • 8:44 - 8:48
    updating here is zero plus the learning
  • 8:48 - 8:53
    rate times a reward there is no reward
  • 8:53 - 8:57
    plus the discount Factor times Max of
  • 8:57 - 9:01
    the new state Max of State
  • 9:01 - 9:05
    14 Max of State 14 is
  • 9:06 - 9:11
    0.01 subtracted by again it's
  • 9:11 - 9:15
    zero this is equal to
  • 9:15 - 9:18
0.9 okay it's actually pretty straightforward
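Here is a small sketch of those two updates in code, assuming the Q table starts at all zeros with a learning rate of 0.01 and a discount factor of 0.9 as in the example:

```python
import numpy as np

q = np.zeros((16, 4))      # 16 states x 4 actions (left, down, right, up)
lr, gamma = 0.01, 0.9

def q_update(state, action, reward, new_state):
    q[state, action] += lr * (reward + gamma * np.max(q[new_state]) - q[state, action])

# first example: state 14, go right (2), reward 1, new state 15 (terminal, all zeros)
q_update(14, 2, 1, 15)
print(q[14, 2])            # 0.01

# second example: state 13, go right (2), no reward, new state 14
q_update(13, 2, 0, 14)
print(q[13, 2])            # lr * gamma * q[14, 2], a tiny positive value
```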
  • 9:18 - 9:20
now how the heck does
  • 9:20 - 9:23
    this formula help find the PATH so if we
  • 9:23 - 9:28
    train enough the theory is that this
  • 9:28 - 9:32
    number is going to be really close to
  • 9:32 - 9:35
    one and then the states next to
  • 9:35 - 9:40
    it it's going to be 0.9
  • 9:40 - 9:43
    something and then the states adjacent
  • 9:43 - 9:47
    to that it's probably going to be 0.8
  • 9:50 - 9:53
    something and if we keep
  • 9:56 - 10:01
    going we can see that a path is
  • 10:01 - 10:04
actually two
  • 10:04 - 10:07
    paths are possible so mathematically
  • 10:07 - 10:11
    this is how the path is
  • 10:12 - 10:15
    found over at Deep Q learning the
  • 10:15 - 10:17
    formula is going to look like this we
  • 10:17 - 10:20
    set Q equal to the reward if the new
  • 10:20 - 10:24
    state is a terminal State otherwise we
  • 10:24 - 10:25
    set it
  • 10:25 - 10:28
    to this
  • 10:28 - 10:30
    part
  • 10:30 - 10:32
of the Q learning formula
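Written out, the target value being described is:

```latex
q_{\text{target}} =
\begin{cases}
r & \text{if } s' \text{ is a terminal state} \\
r + \gamma \max_{a'} Q_{\text{target}}(s', a') & \text{otherwise}
\end{cases}
```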
  • 10:32 - 10:34
let's see how the formula is
  • 10:34 - 10:36
going to be used in
  • 10:36 - 10:39
training for deep Q learning we actually
  • 10:39 - 10:42
    need two neural networks let me walk
  • 10:42 - 10:45
    through the steps the network on the
  • 10:45 - 10:48
    left is called the policy Network this
  • 10:48 - 10:51
    is the network that we're going to do
  • 10:51 - 10:53
    the training on the one on the right is
  • 10:53 - 10:56
    called the target Network the target
  • 10:56 - 10:58
    network is the one that makes use of the
  • 10:58 - 11:00
deep Q learning formula now let's walk
  • 11:00 - 11:03
    through the steps of training step one
  • 11:03 - 11:07
    is going to be creation of this policy
  • 11:07 - 11:11
    Network step two we make a copy of the
  • 11:11 - 11:15
    policy Network into the target Network
  • 11:15 - 11:17
    so basically we're copying this the
  • 11:17 - 11:19
    weights and the bias over here so both
  • 11:19 - 11:22
    networks are identical step number three
  • 11:22 - 11:25
    the agent navigates the map as usual
  • 11:25 - 11:30
    let's say the agent is here in state 14
  • 11:30 - 11:34
    and it's going into State 15 step three
  • 11:34 - 11:37
    is just navigation now step four we
  • 11:37 - 11:40
    input State 14 remember how we have to
  • 11:40 - 11:42
    encode the input it's going to look like
  • 11:42 - 11:48
    this 0 1 2
  • 11:48 - 11:50
3 all the way to 12
  • 11:50 - 11:55
    13 State 14 15 the input is going to
  • 11:55 - 11:57
    look like this as you know neural
  • 11:57 - 12:00
    networks when it's created it comes with
  • 12:00 - 12:04
    a random set of weights and bias so with
  • 12:04 - 12:05
    this input we're actually going to get
  • 12:05 - 12:09
    an output even though these values are
  • 12:09 - 12:10
    pretty much
  • 12:10 - 12:13
    meaningless so we might get some stuff
  • 12:13 - 12:15
    that looks
  • 12:15 - 12:18
    like I'm just putting in some random
  • 12:18 - 12:21
    numbers okay so these Q values are
  • 12:21 - 12:24
    meaningless and as a reminder this is
  • 12:24 - 12:28
    the left action down right and up okay
  • 12:28 - 12:30
    step five
  • 12:30 - 12:32
    we do the same thing we take the exact
  • 12:32 - 12:38
    same input State 14 and send it into the
  • 12:38 - 12:41
    target Network which will also
  • 12:41 - 12:44
    calculate the exact same numbers
  • 12:44 - 12:47
    because the target network is the same
  • 12:47 - 12:50
    as the policy Network
  • 12:50 - 12:53
    currently step six this is where we
  • 12:53 - 12:58
    calculate the Q value for State 14 we're
  • 12:58 - 13:00
    taking the action of
  • 13:00 - 13:04
    two is equal to since we're going into
  • 13:04 - 13:08
    State 15 it is a terminal
  • 13:08 - 13:10
    State we set it equal
  • 13:10 - 13:12
    to
  • 13:12 - 13:17
    1 Step seven step seven is to set the
  • 13:17 - 13:22
    target input State 14 output action two
  • 13:22 - 13:23
    is this
  • 13:23 - 13:25
    node we take the value that we
  • 13:25 - 13:29
calculated up in step six and we
  • 13:29 - 13:33
replace the Q value in the output step
  • 13:33 - 13:36
    number eight we take the target Q
  • 13:36 - 13:39
    values we use it to train the policy
  • 13:39 - 13:42
    Network so this value is the one that's
  • 13:42 - 13:44
    really going to change as you know with
  • 13:44 - 13:46
    neural networks it doesn't go straight
  • 13:46 - 13:49
    to one it's going to go toward that
  • 13:49 - 13:51
direction so maybe it'll go to
  • 13:51 - 13:54
    0.01 but if you repeat the training many
  • 13:54 - 13:58
    many times it will approach One Step
  • 13:58 - 14:02
    nine is to repeat the whole thing
  • 14:02 - 14:05
    again of course we're not going to
  • 14:05 - 14:07
create the policy Network we're not going
  • 14:07 - 14:09
    to make a copy of it so what we're
  • 14:09 - 14:14
    repeating is steps three through 8 step
  • 14:14 - 14:17
    10 after a certain number of steps or
  • 14:17 - 14:20
    episodes we're going to sync the policy
  • 14:20 - 14:22
    network with the target Network which
  • 14:22 - 14:24
    means we're going to make them identical
  • 14:24 - 14:26
by copying the weights and biases from
  • 14:26 - 14:29
    the policy over to the Target Network
  • 14:29 - 14:31
    after syncing the networks we're going
  • 14:31 - 14:36
    to repeat nine again and then repeat 10
  • 14:36 - 14:39
    until training is done okay so that's
  • 14:39 - 14:42
generally how deep Q learning works
  • 14:42 - 14:43
    that might have been a little bit
  • 14:43 - 14:45
    complicated so maybe rewind the video
  • 14:45 - 14:47
    watch it again and make sure you
  • 14:47 - 14:49
    understand what's
  • 14:49 - 14:52
    happening to effectively train a neural
  • 14:52 - 14:54
    network we need to randomize the
  • 14:54 - 14:57
    training data that we send into the
  • 14:57 - 14:59
    neural network however if you remember
  • 14:59 - 15:02
    from steps three and four the agent
  • 15:02 - 15:05
    takes an action and then we're sending
  • 15:05 - 15:07
that training data into the neural network
  • 15:07 - 15:09
so the question is how do we
  • 15:09 - 15:12
    randomize the order of a single sample
  • 15:12 - 15:14
    but that's where experience replay comes
  • 15:14 - 15:19
    in we need a step 3A where we memorize
  • 15:19 - 15:21
    the agent's experience as the agent
  • 15:21 - 15:24
navigates the map we're storing the state
  • 15:24 - 15:26
    that it was in what action it took the
  • 15:26 - 15:28
    new state that it reached if there was a
  • 15:28 - 15:30
reward or not and if the new state
  • 15:30 - 15:33
    is a terminal state or not we take that
  • 15:33 - 15:36
    and insert it into the memory and the
  • 15:36 - 15:39
    memory is nothing more than a python
  • 15:39 - 15:41
deque how a deque works is that as the
  • 15:41 - 15:46
deque gets full whatever is at the oldest end gets
  • 15:46 - 15:50
    purged we need a step 3B this is the
  • 15:50 - 15:54
    replay step this is where we take say 30
  • 15:54 - 15:57
random samples from the deque and then
  • 15:57 - 16:00
    pass it on to step four for training so
  • 16:00 - 16:03
that's what experience replay is
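A tiny sketch of the memorize and replay steps using Python's deque (when the deque is full, appending a new item pushes the oldest one out the other end):

```python
from collections import deque
import random

memory = deque(maxlen=5)                 # step 3A: the replay memory
for transition in range(8):
    memory.append(transition)            # oldest entries get purged once full

print(list(memory))                      # [3, 4, 5, 6, 7]
print(random.sample(memory, 3))          # step 3B: a random mini batch for training
```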
  • 16:04 - 16:07
before we jump into the code if you
  • 16:07 - 16:09
    have problems installing gymnasium
  • 16:09 - 16:12
    especially on Windows I got a video for
  • 16:12 - 16:15
    that I'm not going to walk through the Q
  • 16:15 - 16:17
    learning code but if you are interested
  • 16:17 - 16:19
    in that jump to my Q learning Code
  • 16:19 - 16:21
    walkthrough video after watching this
  • 16:21 - 16:24
    one also if you have zero experience
  • 16:24 - 16:26
    with neural networks you can check out
  • 16:26 - 16:28
this basic tutorial on neural networks
  • 16:28 - 16:31
you'll need PyTorch so head over to
  • 16:31 - 16:34
pytorch.org and get that installed the first
  • 16:34 - 16:36
    thing that we do is create the class to
  • 16:36 - 16:39
    represent our deep Q Network I'm calling
  • 16:39 - 16:42
    it dqn as mentioned before a deep Q
  • 16:42 - 16:45
    network is nothing more than a feed
  • 16:45 - 16:47
forward neural network so there's really
  • 16:47 - 16:49
    nothing special about it here I'm using
  • 16:49 - 16:51
a pretty standard way of creating a neural
  • 16:51 - 16:55
network using PyTorch if you look up any
  • 16:55 - 16:58
    pytorch tutorial you'll probably see
  • 16:58 - 17:00
something just like this so since
  • 17:00 - 17:03
this is not a PyTorch tutorial I'm not
  • 17:03 - 17:05
    going to spend too much time explaining
  • 17:05 - 17:07
how PyTorch works for this class we
  • 17:07 - 17:09
have to inherit the neural network
  • 17:09 - 17:12
    module which requires us to implement
  • 17:12 - 17:15
two functions the init function and the
  • 17:15 - 17:17
    forward function in the init function
  • 17:17 - 17:20
    I'm passing in the number of nodes in my
  • 17:20 - 17:24
    input State hidden layer and output
  • 17:24 - 17:26
    State back at our diagram we're going to
  • 17:26 - 17:30
    have 16 input States as mentioned before
  • 17:30 - 17:32
    I'm using 16 in the hidden layer that's
  • 17:32 - 17:34
    something that you can adjust yourself
  • 17:34 - 17:36
    and in the output layer four
  • 17:36 - 17:40
nodes we declare the hidden layer which
  • 17:40 - 17:43
    has 16 going into 16 and then the output
  • 17:43 - 17:47
    layer 16 going into four and then the
  • 17:47 - 17:49
    forward function X is the training data
  • 17:49 - 17:51
    set and we're sending the training data
  • 17:51 - 17:53
through the neural network again this is
  • 17:53 - 17:57
pretty common PyTorch code
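A sketch of the kind of class being walked through here (parameter names like in_states and h1_nodes are illustrative, not necessarily the exact ones on screen):

```python
from torch import nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, in_states, h1_nodes, out_actions):
        super().__init__()
        self.fc1 = nn.Linear(in_states, h1_nodes)    # hidden layer (16 -> 16 here)
        self.out = nn.Linear(h1_nodes, out_actions)  # output layer (16 -> 4 here)

    def forward(self, x):
        x = F.relu(self.fc1(x))   # send the input through the hidden layer
        return self.out(x)        # one Q value per action
```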
  • 17:57 - 18:00
next we need a class to represent
  • 18:00 - 18:02
    the replay
  • 18:02 - 18:05
    memory so it's this portion that we're
  • 18:05 - 18:08
    implementing right
  • 18:08 - 18:12
    now in the init function we'll pass in a
  • 18:12 - 18:14
    max length and then create the python
  • 18:14 - 18:17
deque in the append function we're going
  • 18:17 - 18:21
    to append the transition the transition
  • 18:21 - 18:24
is this tuple here State action new state
  • 18:24 - 18:27
    reward and
  • 18:27 - 18:30
terminated the sample function will
  • 18:30 - 18:33
    return a random sample of whatever size
  • 18:33 - 18:35
    we want from the memory and then the
  • 18:35 - 18:37
len function simply returns the length
  • 18:37 - 18:39
of the memory
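A sketch of that replay memory class (the tuple layout follows the transition described above):

```python
from collections import deque
import random

class ReplayMemory:
    def __init__(self, maxlen):
        self.memory = deque([], maxlen=maxlen)

    def append(self, transition):
        # transition = (state, action, new_state, reward, terminated)
        self.memory.append(transition)

    def sample(self, sample_size):
        return random.sample(self.memory, sample_size)

    def __len__(self):
        return len(self.memory)
```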
  • 18:39 - 18:42
the FrozenLakeDQL class is where
  • 18:42 - 18:44
    we're going to do our
  • 18:44 - 18:47
    training we set the learning rate and
  • 18:47 - 18:50
the discount Factor those are part of
  • 18:50 - 18:53
    the Q learning formula and these are
  • 18:53 - 18:55
    values that you can adjust and play
  • 18:55 - 18:56
    around
  • 18:56 - 18:59
    with the network sync rate it's the
  • 18:59 - 19:01
    number of steps the agent takes before
  • 19:01 - 19:05
    syncing the policy and Target
  • 19:05 - 19:08
    Network that's the setting for step 10
  • 19:08 - 19:13
    Where We sync the policy and the target
  • 19:13 - 19:17
    Network we set the replay memory size to
  • 19:17 - 19:22
    1,000 and the replay memory sample size
  • 19:22 - 19:24
    to 32 these are also numbers that you
  • 19:24 - 19:28
can play around with next is the loss
  • 19:28 - 19:31
    function and the optimizer these two are
  • 19:31 - 19:34
PyTorch variables for the loss function
  • 19:34 - 19:36
    I'm simply using the mean square error
  • 19:36 - 19:38
function for the optimizer we'll
  • 19:38 - 19:41
    initialize that at a later time this
  • 19:41 - 19:44
    actions list is to Simply map the action
  • 19:44 - 19:49
    numbers into letters for
  • 19:49 - 19:52
    printing in the train function we can
  • 19:52 - 19:54
    specify how many episodes we want to
  • 19:54 - 19:57
    train the agent whether we want to
  • 19:57 - 20:00
render the map on screen and whether we
  • 20:00 - 20:02
    want to turn on the slippery
  • 20:02 - 20:06
flag we'll instantiate the Frozen Lake
  • 20:06 - 20:09
    environment and then create a variable
  • 20:09 - 20:12
    to store the number of states this is
  • 20:12 - 20:15
going to be 16 and store the number of
  • 20:15 - 20:18
    actions this is going to be four here we
  • 20:18 - 20:21
    initialize Epsilon to one and we
  • 20:21 - 20:24
    instantiate the replay memory this is
  • 20:24 - 20:27
going to be of size 1,000 now we
  • 20:27 - 20:31
create the policy deep Q
  • 20:31 - 20:34
    Network this is step one create the
  • 20:34 - 20:36
    policy
  • 20:36 - 20:40
    Network we also create a Target Network
  • 20:40 - 20:43
    and copy the weights and bias from the
  • 20:43 - 20:46
    policy Network into the target Network
  • 20:46 - 20:48
    so that
  • 20:48 - 20:52
    is Step number two making the Target and
  • 20:52 - 20:53
    policy Network
  • 20:53 - 20:56
    identical before training we'll do a
  • 20:56 - 20:58
    print out of the policy Network just so
  • 20:58 - 21:02
    we can compare it to the end result next
  • 21:02 - 21:04
    we initialize the optimizer that we
  • 21:04 - 21:05
    declared
  • 21:05 - 21:09
earlier and we're simply using the Adam
  • 21:09 - 21:12
    Optimizer passing in the learning rate
  • 21:12 - 21:15
    we'll use this rewards per episode list
  • 21:15 - 21:18
    to keep track of the rewards collected
  • 21:18 - 21:18
    per
  • 21:18 - 21:21
    episode we'll also use this Epsilon
  • 21:21 - 21:24
    history list to keep track of Epsilon
  • 21:24 - 21:26
    decaying over
  • 21:26 - 21:29
    time we'll use step count to keep track
  • 21:29 - 21:31
    of the number of steps taken this is
  • 21:31 - 21:35
    used to determine when to sync the
  • 21:35 - 21:39
    policy and Target Network again so that
  • 21:39 - 21:41
    is for step 10 the
  • 21:41 - 21:45
    syncing next we will Loop through the
  • 21:45 - 21:47
    number of episodes that were specified
  • 21:47 - 21:49
    when calling the train function we'll
  • 21:49 - 21:52
    initialize the map so the agent starts
  • 21:52 - 21:55
    at state zero we'll initialize the
  • 21:55 - 21:58
    terminated and truncated flag terminated
  • 21:58 - 21:59
is when the agent falls into the
  • 21:59 - 22:02
    hole or reaches the goal truncated is
  • 22:02 - 22:05
    when the agent takes more than 200
  • 22:05 - 22:08
actions on such a small 4x4 map this
  • 22:08 - 22:10
    is probably never going to occur but
  • 22:10 - 22:12
    we'll keep this anyway now we have a
  • 22:12 - 22:14
    while loop checking for the terminated
  • 22:14 - 22:18
    and truncated flag keep looping until
  • 22:18 - 22:20
    these conditions are met this part is
  • 22:20 - 22:23
    basically the Epsilon greedy algorithm
  • 22:23 - 22:27
    if a random number is less than Epsilon
  • 22:27 - 22:29
    we'll just pick a random action
  • 22:29 - 22:31
    otherwise we'll use the policy Network
  • 22:31 - 22:34
    to calculate the set of Q values take
  • 22:34 - 22:37
    the maximum of that extract that item
  • 22:37 - 22:39
    and that's going to be the best action
  • 22:39 - 22:42
so we have a function here called
  • 22:42 - 22:46
state_to_dqn_input let's jump down there
  • 22:46 - 22:48
    remember that we have to take the state
  • 22:48 - 22:52
    and encode it that's what this function
  • 22:52 - 22:56
    does this type of encoding
  • 22:56 - 23:00
    here okay let's go back
  • 23:00 - 23:02
with PyTorch if we want to perform a
  • 23:02 - 23:05
prediction we should call this torch.no_grad()
  • 23:05 - 23:07
so that it doesn't calculate the
  • 23:07 - 23:09
    stuff needed for training after
  • 23:09 - 23:12
selecting either a random action or
  • 23:12 - 23:14
    the best action we'll call the step
  • 23:14 - 23:17
    function to execute the action when we
  • 23:17 - 23:20
    take that action it's going to return
  • 23:20 - 23:22
    the new state whether there is a reward
  • 23:22 - 23:25
    or not whether it's a terminal state or
  • 23:25 - 23:27
    we got truncated we take all that
  • 23:27 - 23:29
    information
  • 23:29 - 23:33
    and put it into our memory so that is
  • 23:33 - 23:38
    Step 3A when we did the Epsilon greedy
  • 23:38 - 23:41
that was step
  • 23:41 - 23:45
    three after executing the action we're
  • 23:45 - 23:47
    resetting the state equal to the new
  • 23:47 - 23:50
    state we increment our step counter if
  • 23:50 - 23:54
we received a reward put it on our list
  • 23:54 - 23:56
    now we'll check the memory to see if we
  • 23:56 - 23:58
have enough training data to do
  • 23:58 - 24:02
    optimization on also we want to check if
  • 24:02 - 24:05
    we have collected at least one reward if
  • 24:05 - 24:07
    we haven't collected any rewards there's
  • 24:07 - 24:09
    really no point in optimizing the
  • 24:09 - 24:12
    network if those conditions are met we
  • 24:12 - 24:15
use memory.sample and we pass in the
  • 24:15 - 24:18
    batch size which was 32 and we get a
  • 24:18 - 24:21
    batch of training data out of the memory
  • 24:21 - 24:23
    we'll pass that training data into the
  • 24:23 - 24:26
    optimized function along with the policy
  • 24:26 - 24:29
    Network and the target Network
  • 24:29 - 24:32
    let's jump over to the optimize
  • 24:32 - 24:35
    function this first line is just looking
  • 24:35 - 24:37
    at the policy Network and getting the
  • 24:37 - 24:39
number of input nodes we expect this to
  • 24:39 - 24:44
be 16 the current Q list and target
  • 24:44 - 24:47
Q list so this is the current
  • 24:47 - 24:51
Q values this is the target
  • 24:51 - 24:55
Q values what I'm about to describe now is
  • 24:55 - 25:00
Step 3B replaying the experience
  • 25:00 - 25:02
    so to replay the experience We're
  • 25:02 - 25:05
    looping through the training data inside
  • 25:05 - 25:08
    the mini batch let's jump down here for
  • 25:08 - 25:12
    a moment so this is Step number four
  • 25:12 - 25:14
    we're taking the states and passing it
  • 25:14 - 25:17
    into the policy Network to calculate the
  • 25:17 - 25:19
    current list of Q
  • 25:19 - 25:23
    values step number four the output is
  • 25:23 - 25:25
    the list of Q
  • 25:25 - 25:28
    values step number five we pass in the
  • 25:28 - 25:31
    same thing to the Target
  • 25:31 - 25:35
    Network step five we get the Q values
  • 25:35 - 25:38
    out here which should be the same as the
  • 25:38 - 25:40
    Q values from the policy
  • 25:40 - 25:44
    Network step number six is up here so
  • 25:44 - 25:47
    reminder of what it looks like step six
  • 25:47 - 25:50
    is using this formula
  • 25:50 - 25:53
    here so that's what we have here if
  • 25:53 - 25:56
    terminated if terminated just return the
  • 25:56 - 25:58
reward otherwise use that second
  • 25:58 - 26:01
    formula so now that we have a Target
  • 26:01 - 26:04
    we'll go to step seven step seven is
  • 26:04 - 26:06
    taking the output of Step six and
  • 26:06 - 26:09
    replacing the respective Q
  • 26:09 - 26:12
    value and that's what we're doing here
  • 26:12 - 26:15
    we're replacing the Q value of that
  • 26:15 - 26:17
    particular action with the Target that
  • 26:17 - 26:19
    was calculated up
  • 26:19 - 26:23
    above step number eight step eight is to
  • 26:23 - 26:25
    take the target values and use that to
  • 26:25 - 26:29
    train the current Q values
  • 26:29 - 26:32
    so that's what we have here we're using
  • 26:32 - 26:35
    the loss function pass in the current
  • 26:35 - 26:38
    set of Q values plus the target set of Q
  • 26:38 - 26:41
values and then this is just standard
  • 26:41 - 26:44
PyTorch code to optimize the policy network
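Putting steps four through eight together, a sketch of what an optimize function like this can look like (helper names and exact tensor handling are assumptions, not a copy of the code on screen):

```python
import torch

def optimize(mini_batch, policy_dqn, target_dqn, optimizer,
             loss_fn=torch.nn.MSELoss(), gamma=0.9, num_states=16):
    def one_hot(state):                        # same encoding as state_to_dqn_input
        x = torch.zeros(num_states)
        x[state] = 1.0
        return x

    current_q_list, target_q_list = [], []
    for state, action, new_state, reward, terminated in mini_batch:
        # step 4: current Q values from the policy network
        current_q_list.append(policy_dqn(one_hot(state)))

        with torch.no_grad():
            # step 6: the deep Q learning target
            if terminated:
                target = torch.tensor(float(reward))
            else:
                target = reward + gamma * target_dqn(one_hot(new_state)).max()
            # steps 5 and 7: target network output with the taken action's value replaced
            target_q = target_dqn(one_hot(state))
            target_q[action] = target
        target_q_list.append(target_q)

    # step 8: train the policy network toward the target Q values
    loss = loss_fn(torch.stack(current_q_list), torch.stack(target_q_list))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```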
  • 26:44 - 26:49
now step nine step nine is to
  • 26:49 - 26:52
    repeat steps 3 to 8 starting from
  • 26:52 - 26:56
    navigation to optimizing the
  • 26:57 - 26:59
    network
  • 26:59 - 27:01
    basically that's just continuing this
  • 27:01 - 27:04
    inner while loop of stepping through the
  • 27:04 - 27:06
    states and this outer for Loop of
  • 27:06 - 27:09
    stepping through each
  • 27:09 - 27:12
    episode step 10 is where we sync the
  • 27:12 - 27:16
    policy Network and the target
  • 27:17 - 27:20
    Network and we do that down here if the
  • 27:20 - 27:23
    number of steps taken is greater than
  • 27:23 - 27:25
    the network sync rate that we set then
  • 27:25 - 27:28
    we copy the policy Network into the
  • 27:28 - 27:30
    target Network and then we reset the
  • 27:30 - 27:33
    step counter also after each episode we
  • 27:33 - 27:37
    should be decaying the Epsilon value
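In code, the syncing and decay being described can be as small as this fragment inside the per-episode training loop (variable names like network_sync_rate follow the settings above; the decay schedule is one common choice, not necessarily the exact one used):

```python
# decay epsilon a little after each episode
epsilon = max(epsilon - 1 / episodes, 0)
epsilon_history.append(epsilon)

# step 10: after enough steps, copy the policy network's weights into the target network
if step_count > network_sync_rate:
    target_dqn.load_state_dict(policy_dqn.state_dict())
    step_count = 0
```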
  • 27:37 - 27:39
    after syncing the
  • 27:39 - 27:43
network we repeat step nine again which is
  • 27:43 - 27:45
    repeating basically the navigation and
  • 27:45 - 27:47
    training
  • 27:47 - 27:50
    again so it's just going to go back up
  • 27:50 - 27:53
    here and do this all over again all the
  • 27:53 - 27:56
    way until we have finished the number of
  • 27:56 - 27:58
    episodes so that was training after
  • 27:58 - 28:01
    training we close the environment we can
  • 28:01 - 28:04
save the policy or the weights and biases
  • 28:04 - 28:07
    into a file I'm hardcoding it here you
  • 28:07 - 28:10
    can definitely make this more Dynamic
  • 28:10 - 28:13
    here I'm creating a new graph and I'm
  • 28:13 - 28:16
    basically graphing the rewards collected
  • 28:16 - 28:18
    per episode also I'm graphing the
  • 28:18 - 28:21
    Epsilon history here after graphing I'm
  • 28:21 - 28:24
    saving the graphs into an image all
  • 28:24 - 28:27
    right so that was the train function we
  • 28:27 - 28:29
already talked about the optimize
  • 28:29 - 28:31
    function let me fold
  • 28:31 - 28:35
    that we already looked at that now the
  • 28:35 - 28:38
test function is going to run the Frozen
  • 28:38 - 28:40
    Lake environment with the policy that we
  • 28:40 - 28:43
    learned from the train function we can
  • 28:43 - 28:46
    also pass in the number of episodes and
  • 28:46 - 28:47
    whether we want to turn on slippery or
  • 28:47 - 28:50
    not so the rest of the code is going to
  • 28:50 - 28:52
    look pretty similar to what we had in
  • 28:52 - 28:55
    the train function we instantiate the
  • 28:55 - 28:57
    environment get the number of states and
  • 28:57 - 29:01
actions 16 and 4 here we declare the
  • 29:01 - 29:04
    policy Network load from the file that
  • 29:04 - 29:07
we saved from training this is just
  • 29:07 - 29:09
PyTorch code to switch the policy Network
  • 29:09 - 29:12
    to prediction mode or evaluation mode
  • 29:12 - 29:14
    rather than training mode we'll print
  • 29:14 - 29:17
the trained policy and then we'll loop
  • 29:17 - 29:20
    over the episodes we set the agent up
  • 29:20 - 29:22
    top you've seen this before keep looping
  • 29:22 - 29:25
    until the agent gets terminated or
  • 29:25 - 29:28
    truncated here we're selecting the best
  • 29:28 - 29:31
    action out of the policy Network and
  • 29:31 - 29:33
executing the action and then close the environment
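A sketch of what a test function like this can look like (the file name and network sizes are assumptions; it reuses the DQN class sketched earlier):

```python
import gymnasium as gym
import torch

def test(episodes, is_slippery=False, model_path="frozen_lake_dql.pt"):
    env = gym.make("FrozenLake-v1", map_name="4x4",
                   is_slippery=is_slippery, render_mode="human")

    policy_dqn = DQN(in_states=16, h1_nodes=16, out_actions=4)
    policy_dqn.load_state_dict(torch.load(model_path))
    policy_dqn.eval()                       # evaluation mode, not training mode

    for _ in range(episodes):
        state = env.reset()[0]
        terminated = truncated = False
        while not (terminated or truncated):
            with torch.no_grad():
                x = torch.zeros(16)
                x[state] = 1.0              # one hot encode the current state
                action = policy_dqn(x).argmax().item()
            state, _, terminated, truncated, _ = env.step(action)
    env.close()
```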
  • 29:33 - 29:36
so ideally when we run this
  • 29:36 - 29:38
    we're going to see the agent navigate
  • 29:38 - 29:40
    the map and reach the goal however if
  • 29:40 - 29:43
    slippery is on there is no guarantee
  • 29:43 - 29:45
    that the agent is going to solve it in
  • 29:45 - 29:47
    one try it might take a couple of tries
  • 29:47 - 29:50
    for the agent to solve the
  • 29:50 - 29:53
    map all right down at the main function
  • 29:53 - 29:56
we create an instance of the Frozen Lake
  • 29:56 - 29:58
DQL class first we're going to try
  • 29:58 - 30:01
non-slippery we'll train it for a thousand
  • 30:01 - 30:04
times or 1,000 episodes we'll run the
  • 30:04 - 30:06
    test four times just because the map
  • 30:06 - 30:08
    pops up and goes away when the agent
  • 30:08 - 30:11
gets to the goal so we want to run it
  • 30:11 - 30:12
a couple times so we actually can see it on the screen
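The main block being described boils down to something like this (assuming the class exposes train and test as discussed):

```python
if __name__ == "__main__":
    frozen_lake = FrozenLakeDQL()
    frozen_lake.train(1000, is_slippery=False)   # train for 1,000 episodes
    frozen_lake.test(4, is_slippery=False)       # watch the learned policy a few times
```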
  • 30:12 - 30:14
okay I'm just hitting
  • 30:14 - 30:17
Ctrl
  • 30:19 - 30:22
    F5 all right training is done and looks
  • 30:22 - 30:26
    like the agent is able to solve the map
  • 30:26 - 30:28
    we also have a print out of the policy
  • 30:28 - 30:30
    that it's learned I print it in a way
  • 30:30 - 30:33
    that matches up with the grid let's jump
  • 30:33 - 30:36
    back to this in a second let's go to our
  • 30:36 - 30:41
    files so after training we have a new
  • 30:41 - 30:45
    graph on the left side that's the number
  • 30:45 - 30:48
    of rewards collected over the 1,000
  • 30:48 - 30:51
    episodes as you can see over time it's
  • 30:51 - 30:54
    improving and finally at the end it's
  • 30:54 - 30:56
    getting a lot of rewards on the right
  • 30:56 - 30:59
    side we can see Epsilon decaying
  • 30:59 - 31:03
    starting from one slowly slowly down to
  • 31:03 - 31:06
    zero also another file that gets created
  • 31:06 - 31:09
is a binary file so we can't really
  • 31:09 - 31:11
    display it but it contains the weights
  • 31:11 - 31:14
    and biases of the policy
  • 31:14 - 31:17
    Network so if we want to do the test
  • 31:17 - 31:19
    again we don't need the training we can
  • 31:19 - 31:22
    comment out the training line and just
  • 31:22 - 31:24
    run this
  • 31:24 - 31:26
    again and we can see the agent solve the
  • 31:26 - 31:29
    map
  • 31:32 - 31:34
    we can look at what policy the agent
  • 31:34 - 31:37
    learned the way this is printed it
  • 31:37 - 31:41
    matches up with the map so at state zero
  • 31:41 - 31:44
    the best action was to go right which is
  • 31:44 - 31:47
    State one at State one the best action
  • 31:47 - 31:51
is to go right again at state two
  • 31:51 - 31:55
go down at state six go down again at state 10
  • 31:55 - 31:58
    go down again at State 14
  • 31:58 - 32:00
    go to the right
  • 32:00 - 32:05
    okay so it was right right down down
  • 32:05 - 32:08
    down right so remember earlier that
  • 32:08 - 32:10
there were two possible paths the one
  • 32:10 - 32:13
    that we actually learned plus the one
  • 32:13 - 32:16
    going down because Epsilon greedy has a
  • 32:16 - 32:18
    Randomness Factor whether the agent
  • 32:18 - 32:21
    learns the top path or the bottom path
  • 32:21 - 32:24
is just somewhat random if we train
  • 32:24 - 32:29
    again maybe it'll go down the next time
  • 32:30 - 32:34
    now let's turn the slippery Factor
  • 32:34 - 32:37
    on uncomment the training
  • 32:37 - 32:41
    line now with slippery on we expect the
  • 32:41 - 32:44
    agent to fail a lot more often or to
  • 32:44 - 32:46
    fall in the holes a lot more often let
  • 32:46 - 32:50
me just triple the number of
  • 32:50 - 32:55
training episodes and let me do 10 times for
  • 32:55 - 32:58
    testing okay I'm running this again
  • 32:58 - 33:00
again with slippery turned on there's no
  • 33:00 - 33:02
    guarantee that after training the agent
  • 33:02 - 33:05
    is going to be able to pass every
  • 33:05 - 33:08
    episode let's make it come
  • 33:08 - 33:11
    up you see it trying to go to the bottom
  • 33:11 - 33:13
    right but because of slippery it just
  • 33:13 - 33:16
    failed right there but there you go it
  • 33:16 - 33:19
    was able to solve it let's look at the
  • 33:19 - 33:22
    graph the results compared to the
  • 33:22 - 33:24
    non-slippery surface is significantly
  • 33:24 - 33:27
    worse also you want to be careful
  • 33:27 - 33:30
    getting stuck in spots like these which
  • 33:30 - 33:34
    means if I was unlucky enough to set my
  • 33:34 - 33:37
    training episodes to maybe 2200 and it
  • 33:37 - 33:40
    gets stuck here which means it learned a
  • 33:40 - 33:42
    bad policy it won't be able to solve the
  • 33:42 - 33:44
    map so in your training if you're
  • 33:44 - 33:46
    finding that the agent is not able to
  • 33:46 - 33:49
    solve the map when slippery is turned on
  • 33:49 - 33:51
    it might be because of things like these
  • 33:51 - 33:54
    okay that concludes our deep Q learning
  • 33:54 - 33:57
tutorial I'd love to hear feedback from
  • 33:57 - 34:00
    you was the explanation easy to
  • 34:00 - 34:02
    understand what can I improve on and
  • 34:02 - 34:06
    what other topics are you interested in