
Histograms in R

  • 0:02 - 0:05
    Hi. In
    this video we are going to be talking
  • 0:05 - 0:09
    about graphical data
    summary for numerical data.
  • 0:10 - 0:15
    We are going to be using the iris dataset
  • 0:16 - 0:19
    for this,
    and the first graphical data summary
  • 0:19 - 0:22
    we're going to talk about is histograms.
  • 0:24 - 0:27
    So if you remember from the iris dataset,
  • 0:28 - 0:31
    there were four different
    numerical variables you could choose from:
  • 0:32 - 0:35
    sepal length,
    sepal width, petal length, and petal width.
  • 0:36 - 0:39
    And I'm going to use the petal length
  • 0:40 - 0:42
    variable today.
  • 0:42 - 0:45
    So to create a histogram
  • 0:45 - 0:48
    of a numeric variable
  • 0:48 - 0:51
    you will use the function hist.
  • 0:52 - 0:54
    And then you can just go ahead and put in
  • 0:54 - 0:57
    whatever data
    you want to produce a histogram for.
  • 0:58 - 1:01
    Since this is already
    a numerical variable it's good to go.
  • 1:02 - 1:03
    All you need is just this.
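
A minimal sketch of the call described here. The transcript does not show the exact code typed, so reaching the column as `iris$Petal.Length` and storing the result in a variable named `h` are assumptions made for illustration.

```r
# iris ships with base R, so nothing needs to be installed or loaded.
# Petal.Length is already numeric, so it can go straight into hist().
h <- hist(iris$Petal.Length)

# hist() invisibly returns the bins it drew; printing them is a quick check.
h$breaks  # bin edges
h$counts  # number of flowers in each bin
```

Keeping the returned object is optional; calling `hist(iris$Petal.Length)` on its own draws the same plot.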
  • 1:05 - 1:06
    And here you can see
  • 1:06 - 1:09
    a histogram
    of all the various petal lengths.
  • 1:10 - 1:13
    There's a lot of them
    that are in between 1 and 1 and a half
  • 1:13 - 1:16
    and, you know, a fair amount over here.
  • 1:16 - 1:20
    Definitely not really normally
    distributed, not like a bell curve.
  • 1:21 - 1:24
    Kind of got a couple of peaks.
  • 1:24 - 1:27
    So if you want to customize your
  • 1:28 - 1:32
    histogram,
    here are some additional arguments
  • 1:32 - 1:35
    that you can put in to make it
  • 1:37 - 1:38
    more pretty,
  • 1:38 - 1:39
    I guess.
  • 1:39 - 1:43
    So the first
    one is similar to categorical data,
  • 1:43 - 1:47
    when we were doing the graphical summaries:
    you can create a title.
  • 1:48 - 1:51
    So there is a default title
    here, histogram.
  • 1:51 - 1:54
    But if you don't really want this
    R syntax to be showing,
  • 1:54 - 1:57
    you can totally change it. So
  • 1:58 - 2:01
    let's do
  • 2:02 - 2:04
    "Histogram of Iris Petal Lengths".
  • 2:06 - 2:07
    There we go.
  • 2:07 - 2:09
    Much better.
  • 2:09 - 2:12
    We can also add
  • 2:12 - 2:15
    a label for the x axis,
  • 2:15 - 2:19
    if we didn't want to have that there as well.
    You can also change the y label,
  • 2:19 - 2:22
    but I'm going to keep it just like that.
  • 2:25 - 2:27
    So, you know,
    these are all available to you,
  • 2:27 - 2:32
    but you don't have to change all of them,
    just kind of whatever you feel. So
  • 2:34 - 2:36
    I'm going to call this "Petal Length".
  • 2:36 - 2:39
    There we go. Looks much better down here.
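
A sketch of the title and axis-label arguments just described, assuming the same `iris$Petal.Length` call as before. `main`, `xlab`, and `ylab` are the standard base R plotting arguments; the variable name `h_titled` is ours.

```r
h_titled <- hist(iris$Petal.Length,
                 main = "Histogram of Iris Petal Lengths",  # replaces the default title
                 xlab = "Petal Length")                     # replaces the default x-axis label
# ylab = "..." would relabel the y axis the same way; it is left at its default here.

# The default title is built from the code passed to hist(), which is why
# the R syntax shows up when main is not supplied.
h_titled$xname
```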
  • 2:39 - 2:43
    You can change the limits of your x axis
    and your y axis.
  • 2:43 - 2:47
    So obviously with y it's
    not quite reaching all the way up there.
  • 2:48 - 2:49
    So I'm going to change that.
  • 2:49 - 2:52
    And then I like to always have
    a little buffer on the sides too.
  • 2:52 - 2:55
    So maybe we could have it
    go from 0 to 8 or something.
  • 2:56 - 3:00
    So for the x axis I'm going to change
  • 3:00 - 3:03
    my limits to go from 0 to 8.
  • 3:05 - 3:06
    And for my y
  • 3:06 - 3:09
    axis I'm going to have the axis go from
  • 3:09 - 3:12
    0 to 60.
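
A sketch of the axis limits chosen here, assuming they are passed through the standard `xlim` and `ylim` arguments as two-element vectors. The variable name `h_lim` is ours.

```r
h_lim <- hist(iris$Petal.Length,
              main = "Histogram of Iris Petal Lengths",
              xlab = "Petal Length",
              xlim = c(0, 8),    # x axis runs from 0 to 8, leaving a buffer on each side
              ylim = c(0, 60))   # y axis runs from 0 to 60, above the tallest bar

# Checking the tallest bar is one way to pick a sensible upper y limit.
max(h_lim$counts)
```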
  • 3:12 - 3:17
    Now it's kind of just play around
    and find out which one looks good,
  • 3:17 - 3:20
    and kind of figure out
    what you like the best.
  • 3:20 - 3:23
    There really is no right or wrong
  • 3:23 - 3:24
    magic number.
  • 3:24 - 3:26
    Just kind of to make it look nice is all
  • 3:26 - 3:29
    we're really doing.
  • 3:30 - 3:30
    All right.
  • 3:30 - 3:35
    So now you can see here maybe 60
    was a little too much, but yeah, it works.
  • 3:35 - 3:38
    But now you can see there's buffers
    on the side.
  • 3:38 - 3:42
    You can easily see
    where it starts and ends.
  • 3:43 - 3:45
    So I probably actually might take that
  • 3:45 - 3:48
    in a little more
    and squish that down a little more. But,
  • 3:49 - 3:51
    you know, for our purposes
    it's probably just fine.
  • 3:53 - 3:55
    There is another
  • 3:55 - 3:58
    argument that you can use
    that's new, that's called breaks,
  • 3:59 - 4:03
    where breaks kind of specifies
    a suggested number of bins.
  • 4:04 - 4:07
    It can't always match it perfectly.
  • 4:07 - 4:12
    And by bins
    I mean how many boxes
  • 4:12 - 4:15
    it's using
    to create the histogram.
  • 4:16 - 4:19
    So if I were to say, like, eight,
  • 4:19 - 4:21
    it's going to try
    and do its very best to create it
  • 4:21 - 4:23
    so there's only eight boxes.
  • 4:23 - 4:25
    So there's one, two,
  • 4:25 - 4:29
    there would technically be three, four, five, six, seven, eight, nine,
    ten, 11, 12.
  • 4:29 - 4:29
    Right.
  • 4:29 - 4:32
    Now let's
    see how well it can do with eight.
  • 4:33 - 4:36
    If you specify seven it might do 6 or 8.
  • 4:36 - 4:40
    Just kind of depending on how the data is,
    it does its very best to match
  • 4:41 - 4:42
    whatever number you put there.
  • 4:43 - 4:47
    So you see here we got one, two, three, four,
    five, six.
  • 4:49 - 4:49
    So yeah.
  • 4:49 - 4:53
    So it definitely made it so that the shape
  • 4:53 - 4:56
    of the graphic looks different.
  • 4:56 - 4:59
    The, you know, the bars are wider.
  • 4:59 - 5:02
    But it didn't quite do exactly eight.
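
A sketch of the `breaks` behaviour just shown. A single number passed to `breaks` is only a suggestion, because `hist()` rounds the bin edges to "pretty" values. The variable names `h_default` and `h8` are ours.

```r
h_default <- hist(iris$Petal.Length)              # default number of bins
h8        <- hist(iris$Petal.Length, breaks = 8)  # ask for about eight bins

# Compare how many bars were actually drawn in each case.
length(h_default$counts)  # the video counts twelve bars by default
length(h8$counts)         # and six bars after asking for eight

# The bin edges show why: they land on round numbers.
h8$breaks
```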
  • 5:03 - 5:06
    So with that one,
    you just kind of have to do your very best.
  • 5:08 - 5:10
    If you want to learn more
    about how you can kind of be
  • 5:10 - 5:15
    more specific
    on how you can set the actual breakpoints,
  • 5:16 - 5:19
    you could take a statistics
    or visualization course,
  • 5:19 - 5:22
    or you can research how to do that,
  • 5:22 - 5:26
    or you can use the ggplot2 package.
  • 5:26 - 5:29
    And that's pretty great as well.
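
One base R way to set the actual breakpoints, offered as a sketch rather than as what the video goes on to show: `breaks` also accepts a vector of bin edges. The edges 0 through 8 and the names `edges` and `h_exact` are illustrative choices; the edges only need to cover the full range of the data.

```r
# Explicit bin edges: 0, 1, 2, ..., 8 gives exactly eight bins of width 1.
edges <- seq(0, 8, by = 1)

h_exact <- hist(iris$Petal.Length,
                breaks = edges,
                main = "Histogram of Iris Petal Lengths",
                xlab = "Petal Length")

length(h_exact$counts)  # one count per bin, so one fewer than the number of edges
```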
  • 5:30 - 5:32
    Another thing you can do is
  • 5:32 - 5:35
    you can say labels is equal to true.
  • 5:35 - 5:38
    And what this will do is that it
    will put the counts
  • 5:38 - 5:41
    in each category on top of the bars.
  • 5:42 - 5:43
    So you can
  • 5:43 - 5:47
    see that there were 50 in between 1
    and 2, one
  • 5:47 - 5:52
    between 2 and 3, 15 between 3 and 4, etc.
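
A sketch of the `labels` argument, assuming it is combined with the `breaks = 8` call from earlier so the bins match the counts read out here. The variable name `h_lab` is ours.

```r
h_lab <- hist(iris$Petal.Length,
              breaks = 8,
              labels = TRUE,     # print each bin's count above its bar
              xlim = c(0, 8),
              ylim = c(0, 60),   # headroom so the top label is not clipped
              main = "Histogram of Iris Petal Lengths",
              xlab = "Petal Length")

# The same numbers are available without reading them off the plot.
h_lab$counts
```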
  • 5:55 - 5:56
    Another.
  • 5:56 - 5:57
    And then the last argument,
  • 5:57 - 6:00
    or actually the second to last argument,
    is you can always change the color.
  • 6:01 - 6:04
    And I'm going to pick
    a color called sky blue.
  • 6:06 - 6:09
    Doesn't that look pretty?
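
A sketch of the color argument, assuming the standard `col` argument is used. In R the built-in color name is written as one word, `"skyblue"`; the variable name `h_col` is ours.

```r
h_col <- hist(iris$Petal.Length,
              breaks = 8,
              labels = TRUE,
              col = "skyblue",   # fill color for the bars
              xlim = c(0, 8),
              ylim = c(0, 60),
              main = "Histogram of Iris Petal Lengths",
              xlab = "Petal Length")

# colors() lists every built-in color name that col will accept.
"skyblue" %in% colors()
```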
  • 6:09 - 6:09
    All right.
  • 6:09 - 6:13
    The one other thing we will talk about
    is sometimes you could have
  • 6:13 - 6:16
    a frequency histogram
    where you just say how many were
  • 6:16 - 6:20
    in between 1 and 2, 2 and 3,
    and put them in there.
  • 6:20 - 6:23
    Or you can do one
  • 6:23 - 6:25
    based off
    of kind of probability densities.
  • 6:26 - 6:29
    So meaning this is
  • 6:29 - 6:33
    probability, you know,
    what proportion of the data lies in here.
  • 6:33 - 6:34
    So instead of having these be
  • 6:34 - 6:38
    frequency numbers they would be like
    proportion of the whole.
  • 6:38 - 6:41
    So you know this has got quite a bit.
  • 6:41 - 6:45
    So this could be like,
    you know, almost a third of the data
  • 6:45 - 6:49
    or a quarter of the data could be all in
    just this section.
  • 6:49 - 6:53
    And so you can also make a histogram
    be like that as well.
  • 6:54 - 6:55
    So I'm going to
  • 6:56 - 6:59
    copy this
  • 6:59 - 7:01
    and make it be very similar.
  • 7:01 - 7:04
    But I'm going to kind of
    just say it's a density function.
  • 7:04 - 7:08
    Now I'm going to temporarily
    take out the ylim
  • 7:09 - 7:11
    argument
  • 7:11 - 7:14
    and then kind of see
  • 7:15 - 7:17
    how that works right now.
  • 7:17 - 7:20
    And then the only other argument
    you would need to change: just say freq,
  • 7:20 - 7:23
    for frequency, and say
  • 7:23 - 7:27
    it's equal to FALSE,
    meaning do not make it a frequency table,
  • 7:28 - 7:31
    and see what that does.
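
A sketch of the density version, assuming the earlier call is copied with `ylim` removed and `freq = FALSE` added. The variable name `h_dens` is ours.

```r
h_dens <- hist(iris$Petal.Length,
               breaks = 8,
               freq = FALSE,     # plot densities instead of counts
               labels = TRUE,    # labels now show densities, not counts
               col = "skyblue",
               xlim = c(0, 8),
               main = "Histogram of Iris Petal Lengths",
               xlab = "Petal Length")

# With bins of width 1, each density equals the proportion of data in that bin.
h_dens$density
```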
  • 7:32 - 7:35
    So now you can see that
    this is about a third of the data.
  • 7:35 - 7:40
    Oh, it's getting cut off a little bit,
    so we want to change our y limits.
  • 7:40 - 7:42
    Notice it's not going from 0 to 60,
  • 7:42 - 7:45
    because these are all now percentages
    or proportions.
  • 7:46 - 7:49
    So I'm going to make my y limit be
  • 7:52 - 7:55
    from 0 to 1, which is the max that it
  • 7:57 - 7:59
    kind of can be. So,
  • 7:59 - 8:01
    yeah, that was probably a little too high.
  • 8:01 - 8:03
    But I mean, it still gets the point across.
  • 8:03 - 8:07
    So as you can see, almost
    a third of the data is all contained here.
  • 8:08 - 8:12
    And we've got 0.7% of the data here.
  • 8:12 - 8:16
    If you were to add up all of these numbers
    together, it would sum to one.
  • 8:17 - 8:20
    Just like
  • 8:20 - 8:23
    a probability distribution function:
  • 8:24 - 8:27
    the area under its whole curve,
  • 8:27 - 8:32
    over all of
    its range, needs to sum up to one.
  • 8:32 - 8:35
    So this is a way that you can kind of
    show that,
  • 8:35 - 8:38
    with your own data.
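
A small check of the "sums to one" claim, sketched under the assumption that the same `breaks = 8` bins are used. What sums to one in general is the total bar area (density times bin width). The bar heights alone sum to one here only because every bin has width 1. The variable names are ours.

```r
# plot = FALSE computes the bins without drawing anything.
h_area <- hist(iris$Petal.Length, breaks = 8, plot = FALSE)

widths     <- diff(h_area$breaks)            # width of each bin
total_area <- sum(h_area$density * widths)   # area of all bars together
total_area

# With the default half-unit bins the heights no longer sum to one,
# but the area still does.
h_half <- hist(iris$Petal.Length, plot = FALSE)
sum(h_half$density)
sum(h_half$density * diff(h_half$breaks))
```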
  • 8:41 - 8:43
    Now, say you want to
  • 8:43 - 8:47
    create
    a histogram of the petal length variable,
  • 8:47 - 8:50
    but remember that there were three
    different species as well.
  • 8:50 - 8:53
    There was setosa,
    versicolor and virginica.
  • 8:53 - 8:58
    If you wanted to show a histogram
    of just the petal length for like
  • 8:58 - 9:02
    just the setosa variable
    or the setosa species,
  • 9:02 - 9:06
    excuse me, there isn't a quick and easy
    way to do that.
  • 9:06 - 9:10
    You just kind of have to create
    a separate histogram for each. So
  • 9:11 - 9:14
    say, like, you wanted a histogram
    of just the petal length
  • 9:14 - 9:17
    for setosa, just for virginica,
    and just for versicolor.
  • 9:19 - 9:22
    This is how we can do that.
  • 9:22 - 9:24
    I'm going to copy
  • 9:24 - 9:27
    and paste some code that I have
  • 9:27 - 9:30
    just so you can see.
  • 9:33 - 9:36
    So histograms
    for the petal length variable
  • 9:36 - 9:41
    I didn't quite do all of the different
    specifications and customizations,
  • 9:41 - 9:44
    but I did enough to make it look good.
  • 9:45 - 9:48
    So this is what it looks like
    for all of the petal lengths,
  • 9:48 - 9:51
    not separated by species.
  • 9:52 - 9:55
    If you want to separate them out
    by species, you can do that
  • 9:55 - 9:58
    by using bracket notation
  • 9:58 - 10:00
    and logical operators.
  • 10:00 - 10:03
    So bracket
    notation is how you subset data.
  • 10:03 - 10:07
    So what we're doing here
    is we're saying I want the petal length
  • 10:07 - 10:10
    variable
    the numerical variable to be plotted.
  • 10:10 - 10:14
    But I only want the values
    that meet a certain criteria.
  • 10:14 - 10:17
    And the criterion is we want
  • 10:17 - 10:21
    only the ones where the
    species is equal to setosa.
  • 10:22 - 10:23
    So this will now plot a
  • 10:23 - 10:28
    histogram of all of the petal lengths,
    but only for the ones that are
  • 10:28 - 10:31
    setosas.
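
A sketch of the bracket notation and logical test described here. The pasted code is not shown in the transcript, so the exact arguments, the title text, and the name `setosa_len` are assumptions.

```r
# iris$Species == "setosa" is TRUE for setosa rows and FALSE otherwise;
# putting it inside [ ] keeps only the petal lengths where it is TRUE.
setosa_len <- iris$Petal.Length[iris$Species == "setosa"]

h_setosa <- hist(setosa_len,
                 main = "Petal Lengths: Setosa",   # illustrative title
                 xlab = "Petal Length",
                 xlim = c(0, 8),
                 col = "skyblue")

length(setosa_len)  # how many flowers passed the test
range(setosa_len)   # smallest and largest setosa petal length
```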
  • 10:33 - 10:35
    So what's interesting here is
    you can see that
  • 10:35 - 10:38
    all of the setosas
    have really small petal lengths.
  • 10:40 - 10:43
    And I do the exact same thing
  • 10:43 - 10:46
    here. I just change each of these
    to versicolor and virginica,
  • 10:46 - 10:48
    and the colors too.
  • 10:48 - 10:52
    And I have it so that they're all
    on the same scale as well,
  • 10:52 - 10:54
    so that it's easier to compare.
  • 10:54 - 10:56
    That's also really important
    to make sure you do.
  • 10:56 - 10:57
    If you're going to make comparative
  • 10:57 - 11:00
    histograms,
    make sure they're all on the same scale.
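
A sketch of one histogram per species on a shared scale. The video pastes three separate calls; the loop, the specific colors, and the limits below are our own illustrative choices that produce the same kind of plots.

```r
species  <- levels(iris$Species)                  # "setosa", "versicolor", "virginica"
bar_cols <- c("skyblue", "salmon", "palegreen")   # one illustrative color per species

# Shared limits so the three plots can be compared directly.
shared_xlim <- c(0, 8)
shared_ylim <- c(0, 50)  # each species has 50 flowers, so no bar can exceed 50

hs <- lapply(seq_along(species), function(i) {
  hist(iris$Petal.Length[iris$Species == species[i]],
       main = paste("Petal Lengths:", species[i]),
       xlab = "Petal Length",
       xlim = shared_xlim,
       ylim = shared_ylim,
       col  = bar_cols[i])
})
```

Each call draws its own plot, so in an interactive session the three histograms appear one after another.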
  • 11:02 - 11:05
    Here's what versicolor would look like.
  • 11:05 - 11:08
    So the versicolor ones are all right here.
  • 11:09 - 11:12
    And the virginicas
  • 11:12 - 11:15
    are the biggest of the petal lengths.
  • 11:15 - 11:19
    So as you can kind of see
    here they start at about 4.5.
  • 11:21 - 11:24
    These, they have some in the 4.5s,
  • 11:24 - 11:27
    but it's definitely in this middle range.
  • 11:27 - 11:30
    And then
    the setosas are all the way over here.
  • 11:31 - 11:34
    And so if you compare that to this,
    this is obviously the setosas,
  • 11:36 - 11:38
    these are the versicolors,
  • 11:38 - 11:41
    and these are kind of more the virginicas.
  • 11:41 - 11:44
    There is a way to also
  • 11:45 - 11:47
    bring
  • 11:47 - 11:52
    in other packages and things in order
    for you to specify it even further.
  • 11:52 - 11:55
    So you could maybe even show
    that this is setosa,
  • 11:56 - 11:58
    versicolor, virginica. But for right now,
  • 11:58 - 12:01
    we're just going to show you
    how you can do each one separately.
  • 12:01 - 12:04
    This is probably the easiest way
    to go about doing that.
  • 12:06 - 12:07
    All right.
  • 12:07 - 12:09
    That is everything
  • 12:09 - 12:13
    you should need to know about creating
    histograms for your numerical data in R.
  • 12:14 - 12:16
    And we will see you in the next video.
Title:
Histograms in R
Video Language:
English
Duration:
12:18
Utah_State_University edited English subtitles for Histograms in R