-
NARRATOR: Hi! In this video,
-
we are going to be talking about
graphical data summaries
-
for numerical data.
-
We are going to be
-
using the iris dataset
-
for this.
-
And the first graphical data summary
we're going to talk about is the histogram.
-
So, if you remember from the iris dataset,
-
there were four different
numerical variables you could choose from.
-
Sepal length, sepal width,
petal length and petal width.
-
And I'm going to use
the petal length variable today.
-
So, to create a histogram
of a numeric variable,
-
you will use the function "hist()".
-
And then, you can just go ahead and put in
-
whatever data you want
to produce a histogram for.
-
Since this is already
a numerical variable,
-
it's good to go!
-
All you need is just this.
-
And here you can see a histogram,
-
of all the various petal lengths.
-
A lot of them are
in between one and one and a half,
-
and there's a fair amount over here.
-
Definitely not
normally distributed,
-
not like a bell curve.
-
It's kind of got a couple of peaks.
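A minimal sketch of the call described above, assuming the built-in iris data frame that ships with R, where the petal length column is named Petal.Length. Storing the result in h is an addition for illustration; the video only draws the plot.

```r
# iris is built into R; Petal.Length is its numeric petal length column
petal_length <- iris$Petal.Length

# Draw the default histogram; hist() also returns the bin details
h <- hist(petal_length)

# Inspect the breakpoints R chose and the count in each bin
h$breaks
h$counts
```

Looking at h$counts is a quick way to confirm what the plot shows, such as the tall bars at the low end and the second cluster further right.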
-
So, if you want to customize
your histogram,
-
here are some additional arguments
that you can put in to make it
-
prettier, I guess.
[chuckles]
-
So the first one is
similar to categorical data.
-
When we were doing the graphical summaries,
you could create a title.
-
So there is a default title
here, "Histogram of" followed by our syntax.
-
But if you don't really want
our syntax to be showing,
-
you can totally change it. So,
-
let's do,
-
Histogram of iris petal lengths.
-
There we go, much better.
-
We can also add
a label for the "x" axis,
-
if we didn't want to have that there.
You can also change the "y" label,
-
but I'm going to keep it just like that.
-
And, you know,
these are all available to you,
-
but you don't have to change all of them,
just whichever ones you feel like. So,
-
I'm going to call this "Petal Length".
-
There we go, looks much better down here.
-
You can change the limits of
your "x" axis and your "y" axis.
-
So obviously with the "y" it's
not quite reaching all the way up there.
-
So I'm going to change that,
-
and then I like to always have
a little buffer on the sides too.
-
So maybe we could have it
go from zero to eight or something.
-
So for the "x" axis I'm going to change
my limits to go from zero to eight.
-
And for my "y" axis, I'm going to have the
axis go from zero to 60.
-
Now you kind of just play around
and find out which one looks good,
-
and figure out
what you like the best.
-
There really is no right or wrong
magic number.
-
Making it look nice is all
we're really doing.
-
Alright, so now you can see here maybe 60
was a little too much, but, eh, it works.
-
But now you can see there are buffers
on the sides.
-
You can easily see
where it starts and ends.
-
So I probably actually might take that in
a little more and squish that down,
-
a little more but, you know, for our
purposes it's probably just fine.
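The customizations so far can be sketched in one call. The title, label, and limits follow what is said in the video; main, xlab, xlim, and ylim are the standard hist() arguments for them, and the name h_custom is only for illustration.

```r
# Same histogram with a custom title, x-axis label, and axis limits
h_custom <- hist(iris$Petal.Length,
                 main = "Histogram of Iris Petal Lengths",  # replaces the default title
                 xlab = "Petal Length",                     # x-axis label
                 xlim = c(0, 8),                            # buffer on both sides of the data
                 ylim = c(0, 60))                           # room above the tallest bar
```

A ylab argument works the same way if the default "Frequency" label should change too.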
-
There is another
-
argument that you can use
that's new, and it's called "breaks".
-
"breaks" specifies
a suggested number of bins.
-
It can't always match it perfectly.
-
And by bins, I mean how many boxes
-
it's using
to create the histogram.
-
So if I were to say eight,
-
it's going to try and do
its very best to create it
-
so there are only eight boxes.
Right now there's one, two,
-
three,
4, 5, 6, 7, 8, 9, 10, 11, 12 boxes.
-
Let's see
how well it can do with eight.
-
If you specify seven
it might do six or eight,
-
just depending on how the data is.
It does its very best to match
-
whatever number you put there.
-
So you see here we got
one, two, three, four, five, six.
-
So it definitely
changed the shape
-
of the graphic.
The bars are wider.
-
But it didn't give us exactly eight.
-
So with that one,
you just have to do your very best.
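A short sketch of the "breaks" suggestion, comparing the default binning with breaks = 8. The names h_default and h_eight are made up for illustration. R treats a single number here as a suggestion and rounds to tidy breakpoints, which is why the video ends up with six bins rather than eight.

```r
# Default binning, computed without drawing the plot
h_default <- hist(iris$Petal.Length, plot = FALSE)

# Ask for about eight bins; R may pick a nearby number instead
h_eight <- hist(iris$Petal.Length, breaks = 8)

# Compare how many bins each version actually used
length(h_default$counts)
length(h_eight$counts)

# The breakpoints R settled on for the breaks = 8 request
h_eight$breaks
```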
-
If you want to learn more
about how you can be
-
more specific and
set the actual breakpoints,
-
you should take
statistical visualization,
-
or you can research how to do that,
-
or you can use the "ggplot2" package,
and that's pretty great as well.
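One way to set the actual breakpoints in base R, as a hedged sketch: hist() also accepts a vector of bin edges for "breaks", and then it uses exactly those edges. The edges below (0 to 8 in steps of 1) are an assumption chosen for illustration, not values from the video.

```r
# Assumed bin edges: nine edges give exactly eight bins of width 1
bin_edges <- seq(0, 8, by = 1)

# With a vector of edges, hist() uses them as given
h_exact <- hist(iris$Petal.Length, breaks = bin_edges)

# Number of bins actually used
length(h_exact$counts)
```

The edges have to cover the whole range of the data, otherwise hist() stops with an error.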
-
Another thing you can do is
you can say labels is equal to true,
-
and what this will do is
put the counts
-
for each bin on top of the bars.
-
So you can see that there were
50 in between one and two,
-
one between two and three,
15 between three and four, etc...
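A sketch of the labels argument, assuming the breaks = 8 and ylim settings from earlier in the video are still in place. The name h_labeled is only for illustration.

```r
# labels = TRUE prints each bin's count above its bar
h_labeled <- hist(iris$Petal.Length,
                  breaks = 8,
                  labels = TRUE,
                  ylim = c(0, 60))  # headroom so the top label is not cut off

# The same counts that appear on top of the bars
h_labeled$counts
```

In the video the first three labels read 50, 1, and 15; h_labeled$counts lets you check the numbers without reading them off the plot.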
-
And then the last argument,
-
or actually the second to last argument,
is color. You can always change the color.
-
And I'm going to pick
a color called sky blue.
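A minimal sketch of changing the bar color with the col argument. The spelling "skyblue" is the name as it appears in R's built-in colors() list; the name h_blue is only for illustration.

```r
# col sets the fill color of the bars
h_blue <- hist(iris$Petal.Length,
               breaks = 8,
               col = "skyblue")

# colors() lists every built-in color name you can choose from
head(colors())
```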
-
Doesn't that look pretty.
Alright,
-
the one other thing we will talk about
is that sometimes you could have
-
a frequency histogram,
where you just count how many were
-
in between one and two,
two and three, and plot those counts.
-
Or you can base it
on probability densities.
-
So meaning this shows,
-
you know,
what proportion of the data lies in here.
-
So instead of these being
-
frequency numbers, they would be the
proportion of the whole.
-
So you know this has got quite a bit.
-
So this could be like, you know,
almost a third of the data,
-
or a quarter of the data could be
all in just this section.
-
And so you can also make a
histogram be like that as well.
-
So I'm going to,
-
copy this,
-
and make it be very similar.
-
But I'm going to
make it a density histogram now.
-
I'm going to temporarily
take out the "ylim" argument.
-
And then we'll see
-
how that works right now.
-
And then the only other argument you
would need to change is "freq",
-
for frequency, "f" "r" "e" "q".
-
Set it equal to false, meaning
do not make it a frequency histogram,
-
and see what that does.
-
So now you can see that
this is about a third of the data.
-
Oh, it's getting cut off a little bit, so
we'll want to change our "y" limits.
-
Notice it's not going from zero to 60,
-
because these are all now
proportions rather than counts.
-
So I'm going to make my "y" limit go
-
from zero to one, which is the max
that a proportion can be.
-
Yeah, that was probably a little too high,
but I mean it still gets the point across.
-
So as you can see, almost
a third of the data is all contained here.
-
Then we've got 0.7% of the data here.
-
If you were to add up all of these numbers
together, they would sum to one,
since each bar here is one unit wide.
-
Just like a probability
density function.
-
The area under its curve,
-
over its whole range,
needs to add up to one.
-
So this is a way that you can
show that with your own data.
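A sketch of the density version, reusing the settings described in the video plus a check that the bar areas add up to one. The name h_density and the area check are additions for illustration.

```r
# freq = FALSE plots densities instead of counts
h_density <- hist(iris$Petal.Length,
                  breaks = 8,
                  freq = FALSE,
                  labels = TRUE,
                  col = "skyblue",
                  main = "Histogram of Iris Petal Lengths",
                  xlab = "Petal Length",
                  xlim = c(0, 8),
                  ylim = c(0, 1))

# Area of each bar is height times width; the areas sum to 1
bar_areas <- h_density$density * diff(h_density$breaks)
sum(bar_areas)
```

When the bars are one unit wide, each height equals the proportion of the data in that bin. With narrower or wider bins the heights are no longer proportions, but the total area is still one.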
-
Now, say you want to create
a histogram of the petal length variable,
-
but remember that there were three
different species as well.
-
There was setosa,
versicolor and virginica.
-
If you wanted to show a histogram
of the petal length for
-
just the setosa species,
-
there isn't a
quick and easy way to do that.
-
You just have to create
a separate histogram for each. So,
-
say you wanted a histogram
of just the petal length
-
for setosas, just for virginicas,
and just for versicolors.
-
This is how we can do that.
-
I'm going to copy and paste
some code that I have,
-
just so you can see.
-
So histograms for the
petal length variable,
-
I didn't quite do all of the different
specifications and customizations,
-
but I did enough to make it look good.
-
So this is what it looks like
for all of the petal lengths,
-
not separated by species.
-
If you want to separate them out
by species, you can do that,
-
by using bracket notation
and logical operators.
-
So bracket notation
is how you subset data.
-
So what we're doing here is we're
saying I want the petal length,
-
the numerical
variable, to be plotted.
-
But I only want the values
that meet a certain criterion.
-
And the criterion is that we want
-
only the ones where the
species is equal to setosa.
-
So this will now plot a
-
histogram of the petal lengths,
but only for the ones that are setosas.
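The bracket notation described above can be sketched like this. The variable name setosa_lengths and the title are assumptions; the video's own pasted code is not shown in the transcript.

```r
# The logical test is TRUE for setosa rows and FALSE for the rest
is_setosa <- iris$Species == "setosa"

# Brackets keep only the petal lengths where the test is TRUE
setosa_lengths <- iris$Petal.Length[is_setosa]
length(setosa_lengths)

# Histogram of just the setosa petal lengths
hist(setosa_lengths,
     main = "Petal Lengths: setosa",
     xlab = "Petal Length")
```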
-
So what's interesting here is
you can see that,
-
all of the setosas have
really small petal lengths.
-
And I do the exact same thing
-
here. I just change each of these
to versicolor and virginica,
-
and the colors too.
-
And I have it so that they're all
on the same scale as well,
-
so that it's easier to compare.
-
That's also really important
to make sure you do.
-
If you're going to make comparative
histograms, make sure they're all
-
on the same scale.
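One way to keep the three histograms on the same scale, as a hedged sketch: fix the bin edges and axis limits once and reuse them for every species. The limits, bin width, and colors below are assumptions for illustration, not the values used in the video.

```r
# Shared settings so every plot uses the same bins and axes (assumed values)
bin_edges <- seq(0, 8, by = 0.5)
x_limits <- c(0, 8)
y_limits <- c(0, 50)
species_colors <- c(setosa = "skyblue", versicolor = "orange", virginica = "purple")

# Draw one histogram per species, reusing the shared settings
species_names <- levels(iris$Species)
hists <- lapply(species_names, function(sp) {
  hist(iris$Petal.Length[iris$Species == sp],
       breaks = bin_edges,
       xlim = x_limits,
       ylim = y_limits,
       col = species_colors[[sp]],
       main = paste("Petal Lengths:", sp),
       xlab = "Petal Length")
})
```

Because the bins and axes match, a bar at the same position means the same thing in all three plots, which makes the shift from setosa to virginica easy to see.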
-
Here's what versicolor would look like.
-
So the versicolor ones are all right here.
-
And the virginicas have the
biggest petal lengths.
-
So as you can see
here, they start at about 4.5.
-
These ones have some around 4.5,
but they're definitely in this middle range.
-
And then
the setosas are all the way over here.
-
And so if you compare that to this,
this is obviously the setosas.
-
These are the versicolors, and
these are more the virginicas.
-
There is also a way,
-
using other packages in "R",
for you to specify it even further.
-
So you could maybe even show that
this is setosa, versicolor, virginica.
-
But for right now we're just going to show
you how you can do each one separately.
-
This is probably the easiest way
to go about doing that.
-
Alright, that is everything
-
you should need to know about creating
histograms for your numerical data in "R".
-
And we will see you in the next video.