-
-
TEACHER: One of the most
deceptively obvious questions
-
in machine learning is: are more
models better than fewer models?
-
The science that
answers this question
-
is called model ensembling.
-
Model ensembling asks how
to construct aggregations
-
of models that improve test
accuracy while reducing
-
the costs of storing,
training, and running
-
inference on multiple models.
-
We'll explore a popular
ensembling method
-
applied to decision trees.
-
Random forests.
-
In order to illustrate random
forests, let's take an example.
-
Imagine we're trying to predict
what caused a wildfire given
-
its size, location, and date.
-
The basic building blocks
of the random forest model
-
are decision trees.
-
So if you want to
learn how they work,
-
I recommend checking out our
previous video linked here.
-
As a quick refresher,
decision trees
-
perform the task of
classification or regression
-
by recursively asking simple
true-or-false questions that
-
split the data into the
purest possible subgroups.
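-
To make that concrete, here's a minimal Python sketch of a single
decision tree; the wildfire rows and cause labels below are made up
purely for illustration.
    from sklearn.tree import DecisionTreeClassifier

    # Tiny, hypothetical wildfire table.
    # Columns: [size_in_acres, encoded_location, day_of_year].
    X = [[120.0, 3, 192],
         [0.5, 1, 45],
         [800.0, 3, 210],
         [2.0, 2, 60]]
    y = ["lightning", "campfire", "lightning", "debris burning"]

    tree = DecisionTreeClassifier(max_depth=2)   # a few true/false splits
    tree.fit(X, y)
    print(tree.predict([[300.0, 3, 200]]))       # predicted cause for a new fire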
-
Now, back to random forests.
-
In this method of
ensembling, we train
-
a bunch of decision trees,
hence the name forest,
-
and then take a vote among
the different trees--
-
one tree, one vote.
-
In the case of
classification, each tree
-
spits out a class
prediction, and then
-
the class with the most
votes becomes the output
-
of the random forest.
-
In the case of regression,
a simple average
-
of each individual
tree's prediction
-
becomes the output
of the random forest.
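-
Here's a rough Python sketch of that aggregation step; the tree
predictions below are invented just to show the vote and the average.
    from collections import Counter

    # Classification: one tree, one vote; the majority class wins.
    votes = ["lightning", "campfire", "lightning", "lightning", "arson"]
    print(Counter(votes).most_common(1)[0][0])    # "lightning"

    # Regression: the forest's output is the mean of the tree outputs.
    tree_outputs = [12.1, 9.8, 11.4, 10.0, 10.7]
    print(sum(tree_outputs) / len(tree_outputs))  # 10.8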
-
The key idea behind
random forests
-
is that there's
wisdom in crowds.
-
Insight drawn from a
large group of models
-
is likely to be more accurate
than the prediction from any one
-
model alone.
-
Sounds simple enough, right?
-
Sure.
-
But why does this even work?
-
What if all of our models
learn the exact same thing
-
and vote for the same answer?
-
Isn't that equivalent
to just having
-
one model make the prediction?
-
Yes, but there's
a way to fix that.
-
First, we need to define a
word that will help explain:
-
uncorrelatedness.
-
We need our decision trees to
be different from each other.
-
We want them to disagree
on what the splits are
-
and what the predictions are.
-
Uncorrelatedness is
important for random forests.
-
A large group of
uncorrelated trees,
-
working together in an
ensemble, will outperform
-
any of the constituent trees.
-
In other words, the forest
is shielded from the errors
-
of individual trees.
-
So how do we ensure our
trees are uncorrelated?
-
There are a few different
methods to do this.
-
As you learn these
methods, see if you can
-
figure out what makes
a random forest random.
-
The first method to
ensure uncorrelatedness
-
is called bootstrapping.
-
Bootstrapping means creating
smaller data sets
-
out of our training
data set through sampling.
-
Now, with normal
decision trees, we
-
feed the entire training
data set to the tree
-
and allow it to
learn its splits.
-
However, with bootstrapping,
we allow each tree
-
to randomly sample a
subset of the training
-
data with replacement,
resulting in different trees.
-
When we allow replacement,
some observations
-
may be repeated in the sample.
-
In our data set, we have
1.88 million wildfires,
-
but we're only going to show,
say, a random 25% subset
-
of those to each of our trees.
-
As a result, these two trees
sampled from the same data
-
set, but ended up with two
very different training sets.
-
Using bootstrapping to
create uncorrelated models
-
and then aggregating
their results
-
is called bootstrap aggregating,
or bagging for short.
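-
As a rough Python sketch of bootstrapping, assuming the 1.88 million
rows from our wildfire example and a 25% sample per tree:
    import numpy as np

    rng = np.random.default_rng(seed=0)
    n_rows = 1_880_000          # full wildfire training set
    sample_size = n_rows // 4   # roughly 25% shown to each tree

    def bootstrap_indices():
        # Sampling WITH replacement: some rows appear more than once,
        # others not at all, so every tree sees a different training set.
        return rng.integers(0, n_rows, size=sample_size)

    tree_1_rows = bootstrap_indices()
    tree_2_rows = bootstrap_indices()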
-
The second way to introduce
variation in our trees
-
is by randomizing which features
each tree can split on.
-
This method is called
feature randomness.
-
Remember, with basic
decision trees,
-
when it's time to split
the data at a node,
-
the tree considers
each possible feature
-
and picks the one that leads
to the purest subgroups.
-
However, with random forests,
we limit the number of features
-
that each tree can even
consider splitting on.
-
For example, consider
the two trees shown here.
-
The first one only sees
the location and size
-
of the wildfire, while
the second one only
-
sees size and date.
-
As a result, the two trees
learn very different splits.
-
Feature randomness
encourages diverse trees.
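-
Here's a small Python sketch of that idea, with the size, location,
and date features standing in as hypothetical column names.
    import random

    features = ["size", "location", "date"]

    def features_for_tree(k=2):
        # Each tree may only consider splitting on k randomly chosen features.
        return random.sample(features, k)

    print(features_for_tree())   # e.g. ['location', 'size']
    print(features_for_tree())   # e.g. ['size', 'date']
In libraries like scikit-learn, the same idea is usually applied at
every split rather than once per tree, via the max_features parameter.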
-
Because the individual
trees are very simple,
-
and they're only trained on
a subset of the training data
-
and feature set, training
time is very low,
-
so we can afford to
train thousands of trees.
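-
Putting the pieces together, here's a rough sketch using scikit-learn's
RandomForestClassifier on stand-in data; a real run would use the
encoded size, location, and date of each fire and its recorded cause.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(seed=0)
    X = rng.random((1000, 3))            # stand-in for [size, location, date]
    y = rng.integers(0, 4, size=1000)    # stand-in for four cause labels

    forest = RandomForestClassifier(
        n_estimators=500,       # lots of cheap trees
        bootstrap=True,         # bagging: sample rows with replacement
        max_samples=0.25,       # roughly 25% of the rows per tree
        max_features="sqrt",    # feature randomness at every split
    )
    forest.fit(X, y)
    print(forest.predict(X[:5]))         # majority vote across the 500 trees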
-
Random forests are widely
used in academia and industry.
-
Now that you
understand the concept,
-
you're almost ready to
implement a random forest model
-
to use with your own projects.
-
Stay tuned to Econoscent for the
random forest coding tutorial
-
and for a new video on yet
another ensembling method,
-
gradient boosted trees.
-