-
NARRATOR: Hi! In this video,
-
we are going to be talking about
graphical data summaries
-
for numerical data.
-
We are going to be
-
using the iris dataset
-
for this.
-
And the first graphical data summary
we're going to talk about is the histogram.
-
So, if you remember from the iris dataset,
-
there were four different
numerical variables you could choose from.
-
Sepal length, sepal width,
petal length and petal width.
-
And I'm going to use
the petal length variable today.
-
So, to create a histogram
of a numeric variable,
-
you will use the function "hist()".
-
And then, you can just go ahead and put in
-
whatever data you want
to produce a histogram for.
-
Since this is already
a numerical variable,
-
it's good to go!
-
All you need is just this.
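-
A minimal sketch of that call, assuming the data is R's built-in `iris` data frame and that petal length lives in its `Petal.Length` column (the narration never spells out the exact code):

```r
# Assumption: the built-in iris data frame, with petal length in Petal.Length.
# hist() draws the plot and also returns the bin details invisibly.
h <- hist(iris$Petal.Length)

h$breaks  # bin edges R picked on its own
h$counts  # how many flowers fall in each bin
```

Saving the result in `h` is optional; calling `hist(iris$Petal.Length)` on its own is enough to draw the plot.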
-
And here you can see a histogram,
-
of all the various petal lengths.
-
There's a lot of them that are
in between one and one and a half,
-
and, y'know, a fair amount over here.
-
It's definitely not
normally distributed,
-
not like a bell curve.
-
It's kind of got a couple of peaks.
-
So, if you want to customize
your histogram,
-
here are some additional arguments
that you can put in to make it
-
prettier, I guess.
[chuckles]
-
So the first one is
similar to categorical data,
-
when we were doing the graphical summaries.
-
You can create a title.
-
So there is a default title here,
"Histogram of..."
-
But if you don't really want
this R syntax to be showing,
-
you can totally change it.
-
So...
-
Let's do...
-
"Histogram of Iris Petal Lengths."
-
There we go, much better.
-
We can also add
-
a label for the x-axis,
-
if we don't want
the R syntax showing there as well.
-
You can also change the y-label,
-
but I'm going to keep it just like that.
-
And, you know,
these are all available to you,
-
but you don't have to change all of them.
-
Just change whatever you feel like.
-
I'm going to call this "Petal Length."
-
There we go, looks much better down here.
-
You can change the limits
of your x-axis and your y-axis.
-
So obviously, with the y-axis,
-
it's not quite reaching
all the way up to the top of the tallest bar.
-
So I'm going to change that.
-
And then, I like to always have
a little buffer on the sides, too,
-
so maybe we could have it
go from zero to eight or something.
-
So for the x-axis, I'm going to change
my limits to go from 0 to 8.
-
And my y-axis, I'm going to have
the axis go from 0 to 60.
-
Now,
-
you just kind of play around
and find out which one looks good,
-
until you've figured out
what you like the best.
-
There really is no right or wrong
-
magic number.
-
Making it look nice
is all we're really doing.
-
Alright, so now you can see here.
-
Maybe 60 was a little too much, but,
-
eh, it works.
-
But now you can see there are
buffers on the sides.
-
You can easily see
where the data starts and ends.
-
So, I probably actually might take
that in a little more
-
and squish that down
a little more, but, y'know.
-
For our purposes, it's probably just fine.
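-
A sketch of the customizations described so far. Only `ylim` is named aloud in the narration; `main`, `xlab`, and `xlim` are assumed here because they are the standard `hist()` arguments for the title, x-axis label, and x-axis limits. The `iris$Petal.Length` column name is likewise an assumption.

```r
# Assumption: built-in iris data; argument names are the standard hist() ones.
h <- hist(iris$Petal.Length,
          main = "Histogram of Iris Petal Lengths",  # replaces the default title
          xlab = "Petal Length",                      # x-axis label
          xlim = c(0, 8),                             # buffer on both sides of the data
          ylim = c(0, 60))                            # y-axis reaches above the tallest bar
```

The limits only change what part of the plot is visible; the bins and counts underneath stay the same.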
-
There is another argument you can use
that's new, called "breaks,"
-
where "breaks" specifies
a suggested number of bins.
-
It can't always match it perfectly.
-
And by bins,
I mean how many
-
boxes, or bars, it's using
to create the histogram.
-
So if I were to say like, eight,
-
it's going to try and do
its very best to create it
-
so that there's only eight boxes.
-
So there's 1, 2,
-
there would technically be a 3 here,
4, 5, 6, 7, 8, 9, 10, 11, 12, right now.
-
Let's see how well it can do with eight.
-
If you specify seven,
it might do six or eight,
-
just kind of depending on how the data is.
-
It does its very best to match
whatever number you put there.
-
So you see here,
we got one, two, three, four, five, six.
-
So yeah,
-
it definitely made the shape
of the graphic look different.
-
The bars are wider.
-
But it didn't quite give us
exactly eight.
-
So with that one,
-
you just kind of
have to do your very best.
-
If you want to learn more
about how to be more specific
-
and set the actual breakpoints,
-
you should take
Statistical Visualization.
-
Or, you can research how to do that.
-
Or you can use the "ggplot2" package,
and that's pretty great as well.
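-
A sketch of both ideas in base R, assuming the built-in `iris` data. A single number for `breaks` is only a suggestion, while a vector of bin edges is followed exactly. The particular edges below (`seq(1, 7, by = 0.75)`) are an illustrative choice, not something taken from the video.

```r
# Assumption: built-in iris data frame.

# A single number is a suggestion; R rounds to "pretty" bin edges.
h_suggested <- hist(iris$Petal.Length, breaks = 8)
length(h_suggested$counts)  # number of bins R actually used

# A vector of edges is used as given: 9 edges make exactly 8 bins.
# The edges must cover the whole range of the data.
edges <- seq(1, 7, by = 0.75)
h_exact <- hist(iris$Petal.Length, breaks = edges)
length(h_exact$counts)      # 8
```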
-
Another thing you can do is
-
say "labels" is equal to TRUE.
-
And what this will do is
put the count
-
for each bin on top of its bar.
-
So you can see that there were
50 in between one and two,
-
1 between two and three,
15 between three and four, etcetera.
-
And then the last argument,
-
oh, actually, the second-to-last argument,
is that you can always change the color.
-
And I'm going to pick
a color called sky blue.
-
Doesn't that look pretty?
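-
A sketch of the labeled, colored version, assuming the built-in `iris` data. `col` is assumed to be the color argument used, and `"skyblue"` is R's spelling of that color name.

```r
# Assumption: built-in iris data; col is the standard fill-color argument.
h <- hist(iris$Petal.Length,
          breaks = 8,          # suggested number of bins
          labels = TRUE,       # print each bin's count above its bar
          col    = "skyblue",  # fill color for the bars
          ylim   = c(0, 60))   # leave room above the bars for the labels

h$counts  # the same numbers that appear on top of the bars
```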
-
Alright!
-
The one other thing we will talk about
-
is that sometimes you could have
a frequency histogram,
-
where you just count how many were
in between one and two, two and three,
-
and put them in there.
-
Or you can make one
-
based off of
probability densities.
-
Meaning, this shows
-
what proportion of the data lies in here.
-
So instead of having these
be frequency numbers,
-
they would be, like,
proportion of the whole.
-
So, y'know, this has got quite a bit,
-
so this could be like, y'know,
almost a third of the data.
-
Or a quarter of the data
could be all in just this section.
-
And so you can also make
a histogram be like that as well.
-
So I'm going to...
-
copy this.
-
And make it be very similar,
-
but I'm going to kind of just
say it's a density function now.
-
I'm going to temporarily take out
the "ylim" argument.
-
And then we'll just kind of see...
-
how that works right now.
-
And then the only other argument
you would need to change
-
is "freq," for frequency, F-R-E-Q.
-
Set it equal to FALSE, meaning,
do not make it a frequency histogram.
-
And see what that does.
-
So now you can see
that this is about a third of the data.
-
Oh, it's getting cut off a little bit,
-
so we'll want to change our y limits.
-
Notice it's no longer going from 0 to 60,
-
because these are all now
proportions.
-
So, I'm going to make my y limit go
-
from zero to one, which, with bars one unit wide,
is the max it can be.
-
Yeah, that was probably a little too high,
but I mean, it still gets the point across.
-
So as you can see, almost a third
of the data is all contained here.
-
Then we've got 0.7% of the data here.
-
Since each bar is one unit wide,
if you add up all of these numbers,
-
they sum to one.
-
Just like a
-
probability density function:
-
the area under its curve,
-
over its whole range,
-
needs to add up to one.
-
So this is a way that you can kind of
show that, with your own data.
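-
A sketch of the density version, assuming the built-in `iris` data. With `freq = FALSE` the bar heights are densities, so it is the total area of the bars (height times width) that equals one; the heights themselves match proportions only when each bin is one unit wide.

```r
# Assumption: built-in iris data frame.
h <- hist(iris$Petal.Length,
          breaks = 8,
          freq   = FALSE,      # plot densities instead of counts
          labels = TRUE,       # print each bar's density above it
          col    = "skyblue",
          ylim   = c(0, 1))

# Area of each bar = density * bin width; the areas add up to 1.
total_area <- sum(h$density * diff(h$breaks))
total_area
```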
-
Now, say you want to create a histogram
of the petal length variable,
-
but remember that there were
three different species as well.
-
There was setosa,
versicolor, and virginica.
-
If you wanted to show a histogram
of the petal length for
-
just the setosa species,
-
there isn't a quick and easy way
to do that.
-
You just have to create
a separate histogram for each.
-
Say like, you wanted a histogram
of just the petal length for setosas,
-
just for virginicas,
and just for versicolors.
-
This is how we can do that.
-
I'm going to copy and paste
some code that I have,
-
just so you can see.
-
So, histograms
for the petal length variable,
-
I didn't quite do all of the different
specifications and customizations,
-
but I did enough to make it look good.
-
So this is what it looks like
for all of the petal lengths,
-
not separated by species.
-
If you want to separate them out
by species, you can do that
-
by using bracket notation
and logical operators.
-
So bracket notation
is how you subset data.
-
So, what we're doing here
is we're saying,
-
"I want the petal length, the
numerical variable, to be plotted,
-
but I only want the values
that meet a certain criterion."
-
And the criterion is that we want
-
only the ones
where the species is equal to setosa.
-
So this will now plot a histogram
of all of the petal lengths,
-
but only for the ones that are setosas.
-
So what's interesting here
is you can see that
-
all of the setosas have
really small petal lengths.
-
And I do the exact same thing here.
-
I just change each of these
to versicolor and virginica.
-
And the colors, too.
-
And I have it so that they're all
on the same scale as well,
-
so that it's easier to compare.
-
That's also really important
to make sure you do.
-
If you're going to make
comparative histograms,
-
make sure they're all on the same scale.
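-
A sketch of the per-species histograms, assuming the built-in `iris` data with its `Species` column. The loop, the colors, the shared `breaks`, and the axis limits are illustrative choices rather than the code pasted in the video; sharing them across all three plots is what keeps the histograms on the same scale.

```r
# Assumption: built-in iris data; colors and limits are hypothetical choices.
species_cols   <- c(setosa = "skyblue", versicolor = "orange", virginica = "purple")
n_per_species  <- integer(0)

for (sp in names(species_cols)) {
  # Bracket notation with a logical test keeps only this species' petal lengths.
  petal_sp <- iris$Petal.Length[iris$Species == sp]

  h <- hist(petal_sp,
            breaks = seq(0, 8, by = 0.5),  # same bins for every species
            main   = paste("Histogram of Petal Length:", sp),
            xlab   = "Petal Length",
            xlim   = c(0, 8),              # same x scale for every species
            ylim   = c(0, 40),             # same y scale for every species
            col    = species_cols[[sp]])

  n_per_species[sp] <- sum(h$counts)       # flowers plotted for this species
}

n_per_species
```

Each pass through the loop draws one plot, so in an interactive session the three histograms appear one after another.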
-
Here's what versicolor would look like.
-
So the versicolor ones are all right here.
-
And the virginicas
have the biggest petal lengths.
-
So as you can see here,
they start at about 4.5.
-
These have some around 4.5,
but they're definitely in this middle range.
-
And then
the setosas are all the way over here.
-
And so if you compare that to this,
this is obviously the setosas.
-
These are the versicolors,
and these are kind of more the virginicas.
-
There is also a way,
-
using
-
other packages and things in R,
to specify it even further.
-
So you could maybe even show that,
this is setosa, versicolor, virginica.
-
But for right now,
-
we're just gonna show you
how you can do each one separately.
-
This is probably the easiest way
to go about doing that.
-
Alright, that is everything
-
you should need to know
about creating histograms
-
for your numerical data in R.
-
And we will see you in the next video.