-
Hi. In
this video we are going to be talking
-
about graphical data
summaries for numerical data.
-
We are going to be using the iris dataset
-
for this,
and the first graphical data summary
-
we're going to talk about is the histogram.
-
So if you remember from the iris dataset,
-
there were four different
numerical variables you could choose from:
-
sepal length,
sepal width, petal length, and petal width.
-
And I'm going to use the petal length
-
variable today.
-
So to create a histogram
-
of a numeric variable,
-
you will use the function hist.
-
And then you can just go ahead and put in
-
whatever data
you want to produce a histogram for.
-
Since this is already
a numerical variable it's good to go.
-
All you need is just this.
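-
As a rough sketch of that first call, assuming the built-in iris data frame that ships with R, where the petal length column is named Petal.Length (the object name h is just a placeholder for this illustration):

```r
# iris is part of base R's datasets package, so nothing needs to be loaded.
# Petal.Length is a numeric column, so it can be passed to hist() directly.
h <- hist(iris$Petal.Length)

# hist() draws the plot and also returns the bin information it used.
print(h$counts)
```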
-
And here you can see
-
a histogram
of all the various petal lengths.
-
There's a lot of them
that are in between 1 and 1 and a half
-
and, you know, a fair amount over here.
-
Definitely not really normally
distributed, not like a bell curve.
-
Kind of got a couple of peaks.
-
So if you want to customize your
-
histogram,
here are some additional arguments
-
that you can put in to make it
-
prettier,
-
I guess.
-
So the first
one is similar to categorical data.
-
When we were doing the graphical summaries,
you could create a title.
-
So there is a default title
here, histogram.
-
But if you don't really want this,
our R syntax, to be showing,
-
you can totally change it. So
-
let's do
-
Histogram of Iris Petal Lengths.
-
There we go.
-
Much better.
-
We can also add
-
a label for the x axis,
-
if we didn't want to have that there. As
well, you can also change the y label.
-
But I'm going to keep it just like that.
-
And, you know,
these are all available to you,
-
but you don't have to change all of them,
just whatever you feel like. So
-
I'm going to call this Petal Length.
-
There we go. Looks much better down here.
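-
A minimal sketch of the title and axis label arguments, assuming the same built-in iris data frame; main, xlab, and ylab are the standard base R names for these, and the exact wording of the strings is just what was typed in this video:

```r
# main sets the title; xlab sets the x-axis label.
# ylab is left alone here, so the default y-axis label is kept.
h_labeled <- hist(iris$Petal.Length,
                  main = "Histogram of Iris Petal Lengths",
                  xlab = "Petal Length")
```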
-
You can change the limits of your x axis
and your y axis.
-
So obviously with y it's
not quite reaching all the way up there.
-
So I'm going to change that.
-
And then I like to always have
a little buffer on the sides too.
-
So maybe we could have it
go from 0 to 8 or something.
-
So for the x axis I'm going to change
-
my limits to go from 0 to 8.
-
And for my y
-
axis, I'm going to have the axis go from
-
0 to 60.
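-
Those limits could be written roughly like this, assuming the same iris data; xlim and ylim each take a two-element vector, and the numbers are just the ones picked here, not special values:

```r
# xlim and ylim are c(lower, upper) pairs for each axis.
# 0 to 8 leaves a little buffer on either side of the petal lengths.
h_limits <- hist(iris$Petal.Length,
                 main = "Histogram of Iris Petal Lengths",
                 xlab = "Petal Length",
                 xlim = c(0, 8),
                 ylim = c(0, 60))
```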
-
Now you kind of just play around
and find out which one looks good,
-
and figure out
what you like the best.
-
There really is no right or wrong
-
magic number.
-
Making it look nice is all
-
we're really doing.
-
All right.
-
So now you can see here maybe 60
was a little too much, but yeah, it works.
-
But now you can see there's buffers
on the side.
-
You can easily see
where it starts and ends.
-
So I probably actually might take that
-
in a little more
and squish that down a little more. But,
-
you know, for our purposes
it's probably just fine.
-
There is another
-
argument that you can use
that's new. It's called breaks,
-
and breaks specifies
a suggested number of bins.
-
It can't always match it perfectly.
-
So when I say breaks, think bins,
meaning how many boxes
-
it's using
to create the histogram.
-
So if I were to say, like, eight,
-
it's going to try
and do its very best to create it
-
so there's only eight boxes.
-
Right now there's one, two,
-
there'd technically be three, 4, 5, 6, 7, 8, 9,
10, 11, 12.
-
Right.
-
Now let's
see how well it can do with eight.
-
If you specify seven it might do 6 or 8.
-
Just kind of depending on how the data is,
it does its very best to match
-
whatever number you put there.
-
So you see here we got one, 2, 3, 4,
five, six.
-
So yeah.
-
It definitely made it so that the shape
-
of the graphic looks different.
-
You know, the bars are wider.
-
But it didn't quite give us exactly eight.
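-
Here is a small sketch of that comparison, assuming the same iris data. When breaks is a single number, base R treats it only as a suggestion and rounds the bin edges to "pretty" values, which is why the result can differ from what was asked for. The object names are placeholders, and plot = FALSE is used only so the bins can be counted without drawing anything:

```r
# Bins chosen by default versus bins chosen when suggesting eight.
h_default <- hist(iris$Petal.Length, plot = FALSE)
h_eight   <- hist(iris$Petal.Length, breaks = 8, plot = FALSE)

# Number of bars in each version, and the bin edges R actually used.
print(length(h_default$counts))
print(length(h_eight$counts))
print(h_eight$breaks)
```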
-
So with that one,
you just kind of have to do your very best.
-
If you want to learn more
about how you can be
-
more specific
and set the actual breakpoints,
-
you should take statistics
or visualization,
-
or you can research how to do that,
-
or you can use the ggplot2 package.
-
And that's pretty great as well.
-
Another thing you can do is
-
you can say labels is equal to true.
-
And what this will do is
put the counts
-
in each category on top of the bars.
-
So you can
-
see that there were 50 in between 1
and 2, one
-
between 2 and 3, 15 between 3 and 4, etc.
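-
A sketch of the labels argument, assuming the same iris data and the breaks = 8 suggestion from before; labels = TRUE is a real hist() option that prints each bar's count above it:

```r
# labels = TRUE writes the count for each bin on top of its bar.
h_counts <- hist(iris$Petal.Length,
                 main = "Histogram of Iris Petal Lengths",
                 xlab = "Petal Length",
                 xlim = c(0, 8),
                 ylim = c(0, 60),
                 breaks = 8,
                 labels = TRUE)

# The same numbers are stored in the returned object.
print(h_counts$counts)
```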
-
And then the last argument,
-
or actually the second-to-last argument,
is that you can always change the color.
-
And I'm going to pick
a color called skyblue.
-
Doesn't that look pretty?
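-
The color is set with the col argument. A minimal sketch, assuming the same iris data; "skyblue" is one of R's built-in color names, and colors() lists the others:

```r
# col fills the bars; any name from colors() or a hex string works.
h_color <- hist(iris$Petal.Length,
                main = "Histogram of Iris Petal Lengths",
                xlab = "Petal Length",
                xlim = c(0, 8),
                ylim = c(0, 60),
                breaks = 8,
                labels = TRUE,
                col = "skyblue")
```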
-
All right.
-
The one other thing we will talk about
is that sometimes you could have
-
a frequency histogram,
where you just count how many were
-
in between 1 and 2, 2 and 3,
and put them in there.
-
Or you can do one
-
based off
of probability densities,
-
meaning
-
what proportion of the data lies in here.
-
So instead of having these be
-
frequency numbers, they would be like
a proportion of the whole.
-
So you know this has got quite a bit.
-
So this could be like,
you know, almost a third of the data
-
or a quarter of the data could be all in
just this section.
-
And so you can also make a histogram
be like that as well.
-
So I'm going to
-
copy this
-
and make it very similar.
-
But I'm going to
just say it's a density histogram.
-
Now I'm going to temporarily
take out the ylim
-
argument
-
and then see
-
how that works right now.
-
And then the only other argument
you would need to change is freq,
-
for frequency. Set freq
-
equal to FALSE,
meaning do not make it a frequency histogram,
-
and see what that does.
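-
As a sketch of that change, assuming the same iris data and the same breaks = 8 suggestion: freq = FALSE switches the y axis from counts to densities, and ylim is left out here just as it was temporarily taken out above:

```r
# freq = FALSE plots densities instead of counts.
# With labels = TRUE, the numbers on top of the bars are now densities too.
h_density <- hist(iris$Petal.Length,
                  main = "Histogram of Iris Petal Lengths",
                  xlab = "Petal Length",
                  xlim = c(0, 8),
                  breaks = 8,
                  labels = TRUE,
                  col = "skyblue",
                  freq = FALSE)

print(round(h_density$density, 3))
```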
-
So now you can see that
this is about a third of the data.
-
Oh, it's getting cut off a little bit,
so we want to change our y limits.
-
Notice it's not going from 0 to 60,
-
because these are all now percentages
or proportions.
-
So I'm going to make my y limit be
-
from 0 to 1, which is the max that
-
a proportion can be. So,
-
yeah, that was probably a little too high.
-
But I mean, it still gets the point across.
-
So as you can see, almost
a third of the data is all contained here.
-
And about 0.7% of the data is here.
-
If you were to add up all of these numbers
together, it would sum to one.
-
Just like
-
a probability distribution function:
-
the area under its curve,
-
over the whole of
its range, needs to sum up to one.
-
So this is a way that you can kind of
show that,
-
with your own data.
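-
One way to check that claim, sketched under the same assumptions as before. Strictly, it is each bar's height times its width that adds up to one; with bars one unit wide, as here, that is the same as adding the heights. The name total_area is just a placeholder:

```r
# Density histogram with the y axis running from 0 to 1.
h_prop <- hist(iris$Petal.Length,
               main = "Histogram of Iris Petal Lengths",
               xlab = "Petal Length",
               xlim = c(0, 8),
               ylim = c(0, 1),
               breaks = 8,
               labels = TRUE,
               col = "skyblue",
               freq = FALSE)

# Area of each bar = density * bin width; the areas should total one.
total_area <- sum(h_prop$density * diff(h_prop$breaks))
print(total_area)
```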
-
Now if you want to,
-
create
a histogram of the petal length variable.
-
But remember that there were three
different species as well.
-
There was setosa,
versicolor and virginica.
-
If you wanted to show a histogram
of just the petal length for like
-
just the setosa variable
or the setosa species,
-
excuse me, there isn't a quick and easy
way to do that.
-
You just kind of have to create
a separate histogram for each. So
-
Say you wanted a histogram
of just the petal length
-
just for setosa, just for virginica,
and just for versicolor.
-
This is how we can do that.
-
I'm going to copy
-
and paste some code that I have
-
just so you can see.
-
So, histograms
for the petal length variable.
-
I didn't quite do all of the different
specifications and customizations,
-
but I did enough to make it look good.
-
So this is what it looks like
for all of the petal lengths,
-
not separated by species.
-
If you want to separate them out
by species, you can do that
-
by using bracket notation
-
and logical operators.
-
So bracket
notation is how you subset data.
-
So what we're doing here
is we're saying I want the petal length
-
variable
the numerical variable to be plotted.
-
But I only want the values
that meet a certain criteria.
-
And the criterion is that we want
-
only the ones where the
species is equal to setosa.
-
So this will now plot a
-
histogram of the petal lengths,
but only for the ones that are
-
setosas.
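-
The pasted code is not shown in this transcript, so here is only a sketch of what that kind of subsetting could look like, assuming the built-in iris data frame with its Species column; the title, limits, and color are placeholders:

```r
# iris$Species == "setosa" is TRUE or FALSE for every row.
# Putting it in brackets keeps only the petal lengths where it is TRUE.
setosa_lengths <- iris$Petal.Length[iris$Species == "setosa"]

hist(setosa_lengths,
     main = "Petal Lengths: setosa",
     xlab = "Petal Length",
     xlim = c(0, 8),
     col = "skyblue")
```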
-
So what's interesting here is
you can see that
-
all of the setosas
have really small petal lengths.
-
And I do the exact same thing
-
here. I just change each of these
to versicolor and virginica,
-
and the colors too.
-
And I have it so that they're all
on the same scale as well,
-
so that it's easier to compare.
-
That's also really important
to make sure you do.
-
If you're going to make comparative
-
histograms,
make sure they're all on the same scale.
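-
A sketch of one way to keep the three histograms on the same scale, assuming the built-in iris data. The loop, the stacked layout from par(mfrow = ...), the fill colors, and working out a shared y limit from the data are all choices made for this illustration, not the code used in the video:

```r
species <- levels(iris$Species)
fills <- c("skyblue", "lightgreen", "pink")  # hypothetical color choices

# Shared limits so that all three panels can be compared directly.
x_limits <- c(0, 8)
max_count <- max(sapply(species, function(s) {
  max(hist(iris$Petal.Length[iris$Species == s], plot = FALSE)$counts)
}))
y_limits <- c(0, max_count)

# Stack the three histograms in one column, then restore the old layout.
old_par <- par(mfrow = c(3, 1))
for (i in seq_along(species)) {
  hist(iris$Petal.Length[iris$Species == species[i]],
       main = paste("Petal Lengths:", species[i]),
       xlab = "Petal Length",
       xlim = x_limits,
       ylim = y_limits,
       col = fills[i])
}
par(old_par)
```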
-
Here's what versicolor would look like.
-
So the versicolor ones are all right here.
-
And the virginicas
-
have the biggest of the petal lengths.
-
So as you can kind of see
here, they start at about 4.5.
-
These, they have some around 4.5,
-
but they're definitely in this middle range.
-
And then
the setosas are all the way over here.
-
And so if you compare that to this,
this is obviously the setosas.
-
These are the versicolors.
-
And these are kind of more the virginicas.
-
There is also a way
-
to bring
-
in other packages and things in order
to specify it even further.
-
So you could maybe even show
that this is setosa,
-
versicolor, virginica. But for right now,
-
we're just going to show you
how you can do each one separately.
-
This is probably the easiest way
to go about doing that.
-
All right.
-
That is everything
-
you should need to know about creating
histograms for your numerical data in R.
-
And we will see you in the next video.