-
NARRATOR: Hi! In this video,
-
we are going to be talking about
graphical data summaries
-
for numerical data.
-
We are going to be
-
using the iris dataset
-
for this.
-
And the first graphical data summary
we're going to talk about is the histogram.
-
So, if you remember from the iris dataset,
-
there were four different
numerical variables you could choose from.
-
Sepal length, sepal width,
petal length and petal width.
-
And I'm going to use
the petal length variable today.
-
So, to create a histogram
of a numeric variable,
-
you will use the function, "hist()."
-
And then, you can just go ahead and put in
-
whatever data you want
to produce a histogram for.
-
Since this is already
a numerical variable,
-
it's good to go!
-
All you need is just this.
-
And here you can see a histogram,
-
of all the various petal lengths.
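A minimal sketch of the call described above, assuming the data is R's built-in iris data frame and that the petal lengths are accessed as iris$Petal.Length (the video does not show the exact code):

```r
# iris ships with base R; Petal.Length is one of its four numeric columns.
petal_length <- iris$Petal.Length

# Draw the default histogram. hist() also returns an object
# holding the bin edges and the count in each bin.
h <- hist(petal_length)
```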
-
There are a lot of them
in between one and one and a half,
-
and, you know, a fair amount over here.
-
Definitely not
normally distributed,
-
not like a bell curve.
-
It's kind of got a couple of peaks.
-
So, if you want to customize
your histogram,
-
here are some additional arguments
that you can put in to make it
-
prettier, I guess.
[chuckles]
-
So the first one is
similar to categorical data,
-
when we were doing the graphical summaries.
-
You can create a title.
-
So there is a default title here,
"Histogram of..."
-
But if you don't really want
this R syntax to be showing,
-
you can totally change it.
-
So...
-
Let's do...
-
"Histogram of Iris Petal Lengths."
-
There we go, much better.
-
We can also add
-
a label for the x-axis,
-
if we didn't want
to have the default one there, as well.
-
You can also change the y-label,
-
but I'm going to keep it just like that.
-
So, and you know,
these are all available to you,
-
but you don't have to change all of them.
-
Just kind of whatever you feel, so.
-
I'm going to call this, "Petal Length."
-
There we go, looks much better down here.
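A sketch of setting the title and the x-axis label, assuming the standard main and xlab arguments of hist() and the iris$Petal.Length column:

```r
h <- hist(iris$Petal.Length,
          main = "Histogram of Iris Petal Lengths",  # replaces the default title
          xlab = "Petal Length")                     # replaces the default x-axis label
```

The y-axis label can be changed the same way with ylab; it is left at its default here.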
-
You can change the limits
of your x-axis and your y-axis.
-
So obviously, with the Y,
-
it's not quite reaching
all the way up there.
-
So I'm going to change that.
-
And then, I like to always have
a little buffer on the sides, too,
-
so maybe we could have it
go from zero to eight or something.
-
So for the x-axis, I'm going to change
my limits to go from 0 to 8.
-
And my y-axis, I'm going to have
the axis go from 0 to 60.
-
Now,
-
you just kind of play around
and find out which one looks good,
-
and figure out
what you like the best.
-
There really is no right or wrong
-
magic number.
-
Making it look nice
is all we're really doing.
-
Alright, so now you can see here.
-
Maybe 60 was a little too much, but.
-
Eh, it works.
-
But now you can see there are
buffers on the sides.
-
You can easily see
where it starts and ends.
-
So, I probably actually might take
that in a little more
-
and squish that down
a little more, but, y'know.
-
For our purposes, it's probably just fine.
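A sketch of the axis limits chosen here, assuming hist() is given the usual xlim and ylim arguments, each as a two-element vector:

```r
h <- hist(iris$Petal.Length,
          main = "Histogram of Iris Petal Lengths",
          xlab = "Petal Length",
          xlim = c(0, 8),    # x-axis runs from 0 to 8, leaving a buffer on both sides
          ylim = c(0, 60))   # y-axis runs from 0 to 60, above the tallest bar
```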
-
There is another argument you can use
that's new, called "breaks,"
-
where "breaks" specifies
a suggested number of bins.
-
It can't always match it perfectly.
-
And by bins,
I mean how many boxes
-
it's using
to create the histogram.
-
So if I were to say, like, eight,
-
it's going to try and do
its very best to create it
-
so there are only eight boxes.
So there are one, two,
-
three,
4, 5, 6, 7, 8, 9, 10, 11, 12
-
right now. Let's see
how well it can do with eight.
-
If you specify seven
it might do six or eight,
-
just kind of depending on how the data is,
it does its very best to match,
-
whatever number you put there.
-
So you see here we've got
one, two, three, four, five, six.
-
So yeah, it definitely
changed the shape
-
of the graphic.
You know, the bars are wider.
-
But it didn't give us exactly eight.
-
So with that one,
you just kind of have to do your very best.
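A sketch of what happens with a single number for breaks, assuming the iris$Petal.Length column. R treats the number as a suggestion and rounds the bin edges to "pretty" values, which is why the bin count can differ from the request:

```r
# Ask for roughly 8 bins.
h <- hist(iris$Petal.Length, breaks = 8)

# Inspect what R actually chose.
h$breaks           # the bin edges
length(h$counts)   # the number of bins actually drawn
```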
-
If you want to learn more
about how you can be
-
more specific and
set the actual breakpoints,
-
you should take
statistical visualization,
-
or you can research how to do that,
-
or you can use the "ggplot2" package.
That's pretty great as well.
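One possible way to set the breakpoints yourself, which the video does not walk through: hist() also accepts a vector of bin edges for breaks. The edges below are an assumption chosen only so that the range 1 to 7 splits into exactly eight equal bins:

```r
# Nine edges from 1 to 7 give exactly eight bins, each 0.75 wide.
# The edges must cover every value in the data.
edges <- seq(1, 7, by = 0.75)

h <- hist(iris$Petal.Length, breaks = edges)
```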
-
Another thing you can do is
you can say labels is equal to true,
-
and what this will do is
put the counts
-
for each bin on top of the bars.
-
So you can see that there were
50 in between one and two,
-
one between two and three,
15 between three and four, etc...
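A sketch of the labels argument, assuming the same iris$Petal.Length column and the breaks = 8 setting from earlier:

```r
h <- hist(iris$Petal.Length,
          breaks = 8,
          ylim   = c(0, 60),   # leaves room for the labels above the bars
          labels = TRUE)       # print each bin's count on top of its bar

# The same counts are available in the returned object.
h$counts
```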
-
And then the last argument,
-
or actually the second to last argument:
you can always change the color.
-
And I'm going to pick
a color called "skyblue."
-
Doesn't that look pretty?
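A sketch of the color argument, assuming hist()'s col argument and R's built-in color name "skyblue" (written as one word):

```r
bar_color <- "skyblue"   # one of the named colors listed by colors()

h <- hist(iris$Petal.Length,
          breaks = 8,
          labels = TRUE,
          ylim   = c(0, 60),
          col    = bar_color)   # fill color of the bars
```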
Alright,
-
the one other thing we will talk about
is that sometimes you could have
-
a frequency histogram,
where you just say how many were
-
in between one and two,
two and three, and put them in there.
-
Or you can do one based
on probability densities.
-
So meaning,
-
you know,
what proportion of the data lies in here.
-
So instead of having these be
-
frequency numbers, they would be like
proportions of the whole.
-
So you know this has got quite a bit.
-
So this could be like, you know,
almost a third of the data,
-
or a quarter of the data could be
all in just this section.
-
And so you can also make a
histogram be like that as well.
-
So I'm going to,
-
copy this,
-
and make it be very similar.
-
But I'm going to
say it's a density histogram now.
-
I'm going to temporarily
take out the "ylim" argument.
-
And then we'll see
-
how that works right now.
-
And then the only other argument you
would need to change is "freq,"
-
or frequency, "f" "r" "e" "q".
-
Say it's equal to false, meaning
do not make it a frequency histogram,
-
and see what that does.
-
So now you can see that
this is about a third of the data.
-
Oh, it's getting cut off a little bit, so
we'll want to change our "y" limits.
-
Notice it's not going from zero to 60,
-
because these are all now
percentages or proportions.
-
So I'm going to make my "y" limit go
-
from zero to one, which is the max
that a proportion can be.
-
Yeah, that was probably a little too high,
but I mean, it still gets the point across.
-
So as you can see, almost
a third of the data is all contained here.
-
Then we've got about 0.7% of the data here.
-
If you were to add up all of these numbers
together, it would sum to one.
-
Just like a probability
density function.
-
The area under its curve,
-
over its whole
range, needs to sum up to one.
-
So this is a way that you can kind of
show that, with your own data.
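A sketch of the density version, assuming the same iris$Petal.Length column and breaks = 8. The bar heights are densities. They read as proportions here only because each bin happens to be one unit wide, and in general it is the total area (height times width) that sums to one:

```r
h <- hist(iris$Petal.Length,
          breaks = 8,
          freq   = FALSE,      # plot densities instead of counts
          ylim   = c(0, 1),
          labels = TRUE,
          col    = "skyblue")

# Area of each bar = density * bin width; the areas add up to 1.
bin_width  <- diff(h$breaks)
total_area <- sum(h$density * bin_width)
```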
-
Now, say you want to create
a histogram of the petal length variable,
-
but remember that there were three
different species as well.
-
There was setosa,
versicolor and virginica.
-
If you wanted to show a histogram
of just the petal length for, like,
-
just the setosa species,
-
there isn't a
quick and easy way to do that.
-
You just have to create
a separate histogram for each. So,
-
say you wanted a histogram
of just the petal length
-
for setosas, just for virginicas,
and just for versicolors.
-
This is how we can do that.
-
I'm going to copy and paste
some code that I have,
-
just so you can see.
-
So histograms for the
petal length variable,
-
I didn't quite do all of the different
specifications and customizations,
-
but I did enough to make it look good.
-
So this is what it looks like
for all of the petal lengths,
-
not separated by species.
-
If you want to separate them out
by species, you can do that,
-
by using bracket notation
and logical operators.
-
So bracket notation
is how you subset data.
-
So what we're doing here is we're
saying I want the petal length,
-
the variable, the numerical
variable to be plotted.
-
But I only want the values
that meet a certain criteria.
-
And the criterion is that we want
-
only the ones where the
species is equal to setosa.
-
So this will now plot a,
-
histogram of all of the petal lengths,
but only for the ones that are setosas.
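A sketch of the bracket-notation subset described here, assuming the Species and Petal.Length columns of the built-in iris data frame. The title and label text are placeholders, not the narrator's exact code:

```r
# Logical vector: TRUE for rows where the species is setosa.
is_setosa <- iris$Species == "setosa"

# Keep only the petal lengths for which the condition is TRUE.
setosa_petal_length <- iris$Petal.Length[is_setosa]

h <- hist(setosa_petal_length,
          main = "Petal Lengths: Setosa",   # placeholder title
          xlab = "Petal Length")
```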
-
So what's interesting here is
you can see that,
-
all of the setosas have
really small petal lengths.
-
And I do the exact same thing
-
here. I just change each of these
to versicolor and virginica,
-
and the colors, too.
-
And I have it so that they're all
on the same scale as well,
-
so that it's easier to compare.
-
That's also really important
to make sure you do,
-
if you're going to make comparative
histograms make sure they're all,
-
on the same scale.
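A sketch of one histogram per species on a shared scale. The colors, the axis limits, and the 0.5-wide bin edges are assumptions for illustration; the video's pasted code is not shown in the transcript:

```r
# Hypothetical color choice for each species.
species_colors <- c(setosa = "skyblue", versicolor = "orange", virginica = "purple")

# Identical bin edges and axis limits for every plot make the three comparable.
shared_breaks <- seq(0, 8, by = 0.5)

counts_by_species <- list()
for (sp in names(species_colors)) {
  x <- iris$Petal.Length[iris$Species == sp]   # subset to one species
  h <- hist(x,
            breaks = shared_breaks,
            xlim   = c(0, 8),
            ylim   = c(0, 40),
            main   = paste("Petal Lengths:", sp),
            xlab   = "Petal Length",
            col    = species_colors[[sp]])
  counts_by_species[[sp]] <- h$counts          # keep the counts for inspection
}
```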
-
Here's what versicolor would look like.
-
So the versicolor ones are all right here.
-
And the virginicas, are the
biggest of the petal lengths.
-
So as you can kind of see
here, they start at about 4.5.
-
These have some around 4.5,
but they're definitely in this middle range.
-
And then
the setosas are all the way over here.
-
And so if you compare that to this,
this is obviously the setosas.
-
These are the versicolors and
these are kind of more the virginicas.
-
There is also a way,
-
in other packages and things in "R,"
to specify it even further.
-
So you could maybe even show that
this is setosa, versicolor, virginica.
-
But for right now we're just going to show
you how you can do each one separately.
-
This is probably the easiest way
to go about doing that.
-
Alright, that is everything
-
you should need to know about creating
histograms for your numerical data in "R".
-
And we will see you in the next video.