Histograms in R

  • 0:02 - 0:03
    NARRATOR: Hi! In this video,
  • 0:03 - 0:07
    we are going to be talking about
    graphical data summary
  • 0:07 - 0:09
    for numerical data.
  • 0:10 - 0:13
    We are going to be, um,
  • 0:13 - 0:15
    using the iris dataset,
  • 0:16 - 0:17
    for this.
  • 0:17 - 0:22
    And the first graphical data summary
    we're going to talk about are histograms.
  • 0:24 - 0:27
    So, if you remember from the iris dataset,
  • 0:28 - 0:31
    there were four different
    numerical variables you could choose from.
  • 0:32 - 0:36
    Sepal length, sepal width,
    petal length and petal width.
  • 0:36 - 0:41
    And I'm going to use
    the petal length, um, variable today.
  • 0:43 - 0:47
    So, to create a histogram of a--
    of a numeric var-variable,
  • 0:48 - 0:50
    you will use the function, "hist()."
  • 0:52 - 0:54
    And then, you can just go ahead and put in
  • 0:54 - 0:57
    whatever data you want
    to produce a histogram for.
  • 0:58 - 1:00
    Since this is already
    a numerical variable,
  • 1:01 - 1:01
    it's good to go!
  • 1:02 - 1:03
    All you need is just this.
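A minimal sketch of the call described here, assuming the built-in `iris` data frame from base R, where the petal length column is named `Petal.Length`. The variable name `h` is only for illustration; `hist()` draws the plot and also returns the bin information invisibly.

```r
# iris ships with base R (datasets package), so nothing needs to be loaded.
# hist() draws the histogram and invisibly returns an object of class "histogram".
h <- hist(iris$Petal.Length)

# The returned object holds the bin edges and the count in each bin.
print(h$breaks)
print(h$counts)
```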
  • 1:05 - 1:07
    And here you can see a histogram
  • 1:08 - 1:10
    of all the various petal lengths.
  • 1:10 - 1:13
    There's a lot of them that are
    in between one and one and a half,
  • 1:13 - 1:16
    and, y'know, fair amount over here.
  • 1:17 - 1:19
    Definitely not really
    normally distributed,
  • 1:19 - 1:20
    not like a bell curve.
  • 1:21 - 1:23
    Kind of, got a couple of peaks.
  • 1:24 - 1:29
    So, if you want to customize
    your...histogram,
  • 1:29 - 1:35
    here are some additional arguments
    that you can put in, to make it,
  • 1:37 - 1:39
    more pretty, I guess.
    [chuckles]
  • 1:39 - 1:43
    So first one is,
    similar to categorical data
  • 1:43 - 1:45
    —when we were doing the graphical.
  • 1:45 - 1:47
    You can create a title.
  • 1:48 - 1:50
    So there is a default title here,
    "Histogram of..."
  • 1:51 - 1:54
    But if you don't really want
    this R syntax to be showing,
  • 1:54 - 1:55
    you can totally change it.
  • 1:55 - 1:56
    So...
  • 1:58 - 1:59
    Let's do...
  • 2:02 - 2:05
    "Histogram of Iris Petal Lengths."
  • 2:06 - 2:08
    There we go, much better.
  • 2:09 - 2:11
    We can also, um, add,
  • 2:12 - 2:15
    x-labels for the x-axis,
  • 2:15 - 2:18
    so if we didn't want
    to have that there, as well.
  • 2:18 - 2:19
    You can also change the y-label,
  • 2:19 - 2:22
    but I'm going to keep it just like that.
  • 2:25 - 2:27
    So, and you know,
    these are all available to you,
  • 2:27 - 2:29
    but you don't have to change all of them.
  • 2:29 - 2:32
    Just kind of whatever you feel, so.
  • 2:34 - 2:36
    I'm going to call this, "Petal Length."
  • 2:36 - 2:38
    There we go, looks much better down here.
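The title and x-axis label changes above can be sketched as follows. `main` and `xlab` are standard `hist()` arguments; the exact code typed in the video is not shown in the transcript, so treat this as an approximation of it.

```r
# main replaces the default "Histogram of iris$Petal.Length" title;
# xlab replaces the default x-axis label, which is also the raw R expression.
h <- hist(iris$Petal.Length,
          main = "Histogram of Iris Petal Lengths",
          xlab = "Petal Length")
```

The y-axis label could be changed the same way with `ylab`, which is left at its default here.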
  • 2:39 - 2:43
    You can change the limits
    of your x-axis and your y-axis.
  • 2:43 - 2:45
    So obviously, with the Y,
  • 2:45 - 2:47
    it's not quite reaching
    all the way up there.
  • 2:48 - 2:49
    So I'm going to change that.
  • 2:49 - 2:52
    And then, I like to always have
    a little buffer on the sides, too,
  • 2:52 - 2:55
    so maybe we could have it
    go from zero to eight or something.
  • 2:56 - 3:03
    So for the x-axis, I'm going to change
    my limits to go from 0 to 8.
  • 3:05 - 3:12
    And my y-axis, I'm going to have
    the axis go from 0 to 60.
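A sketch of the axis limits just described, assuming the same call as before. `xlim` and `ylim` each take a two-element vector giving the lower and upper end of the axis.

```r
# xlim and ylim only change the visible plotting window;
# they do not change how the data are binned.
h <- hist(iris$Petal.Length,
          main = "Histogram of Iris Petal Lengths",
          xlab = "Petal Length",
          xlim = c(0, 8),
          ylim = c(0, 60))
```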
  • 3:12 - 3:13
    Now,
  • 3:13 - 3:17
    you just kind of just play around
    and find out which one looks good,
  • 3:17 - 3:19
    and kind of when you figured out
    what you like the best.
  • 3:20 - 3:23
    There really is no right or wrong, um,
  • 3:23 - 3:24
    magic number.
  • 3:24 - 3:28
    Just kind of--to make it look nice,
    is all we're really kind of doing.
  • 3:30 - 3:32
    Alright, so now you can see here.
  • 3:32 - 3:34
    Maybe 60 was a little too much, but.
  • 3:34 - 3:35
    Eh, it works.
  • 3:35 - 3:38
    But now you can see there's a side,
    buffers on the side.
  • 3:38 - 3:42
    You can see it easily,
    where it easily starts and ends.
  • 3:43 - 3:46
    So, I probably actually might take
    that in a little more
  • 3:46 - 3:49
    and squish that down
    a little more, but, y'know.
  • 3:49 - 3:51
    For our purposes, it's probably just fine.
  • 3:53 - 3:58
    There is another argument that you can do
    that's new that's called "breaks,"
  • 3:59 - 4:03
    where "breaks" kind of specifies
    a suggested number of bins.
  • 4:04 - 4:07
    It can't always perfectly say like,
  • 4:07 - 4:09
    so if I want to say breaks--
  • 4:09 - 4:12
    And by bins,
    I'm meaning how many boxes,
  • 4:12 - 4:15
    it's using and how many
    to create the histogram.
  • 4:16 - 4:19
    So if I were to say like eight,
  • 4:19 - 4:21
    it's going to try and do
    its very best to create it,
  • 4:21 - 4:25
    so there's only eight boxes.
    So there's one, two...
  • 4:25 - 4:29
    There would technically be three, four, five, six,
    seven, eight, nine, ten, eleven, twelve
  • 4:29 - 4:32
    right now. Let's see
    how well it can do with eight.
  • 4:33 - 4:36
    If you specify seven
    it might do six or eight,
  • 4:36 - 4:40
    just kind of depending on how the data is,
    it does its very best to match,
  • 4:40 - 4:42
    whatever number you put there.
  • 4:43 - 4:47
    So you see here we got
    one, two, three, four, five, six.
  • 4:49 - 4:53
    So yeah, so it definitely
    made it so that the shape
  • 4:53 - 4:58
    of the graphic looks different.
    The, you know, the bars are wider.
  • 4:59 - 5:02
    But it didn't quite do exactly as eight.
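The behavior described here can be sketched as below. A single number passed to `breaks` is treated by `hist()` as a suggestion, and the bin edges are rounded to "pretty" values. The second call is an addition not shown in the video: passing a vector of breakpoints (here a hypothetical `seq(0, 8, by = 1)`) is one way to fix the bins exactly, as long as the vector covers the whole range of the data.

```r
# A single number is only a suggestion: asking for 8 bins yields 6 here,
# because the bin edges are rounded to whole numbers.
h_suggested <- hist(iris$Petal.Length, breaks = 8,
                    main = "Histogram of Iris Petal Lengths",
                    xlab = "Petal Length")
print(length(h_suggested$counts))

# A vector of breakpoints is used as given (illustrative choice of edges).
h_exact <- hist(iris$Petal.Length, breaks = seq(0, 8, by = 1),
                main = "Histogram of Iris Petal Lengths",
                xlab = "Petal Length")
print(length(h_exact$counts))
```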
  • 5:03 - 5:06
    So that one is,
    you just kind of have to do your very best.
  • 5:08 - 5:10
    If you want to learn more
    about how you can kind of be,
  • 5:10 - 5:15
    more specific on how you
    can set the actual breakpoints,
  • 5:16 - 5:19
    you should take
    statistical visualization,
  • 5:19 - 5:22
    or, you can research how to do that,
  • 5:22 - 5:29
    or you can use the "ggplot" package.
    And that's pretty great as well.
  • 5:30 - 5:35
    Another thing you can do is
    you can say labels is equal to true,
  • 5:35 - 5:38
    and what this will do is that it
    will put the counts in,
  • 5:38 - 5:41
    in each category on top of the bars.
  • 5:43 - 5:47
    So you can see that there were
    50 in between one and two,
  • 5:47 - 5:52
    one between two and three,
    15 between three and four, etc...
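A sketch of the labelled histogram, assuming `breaks = 8` and the axis limits from earlier are still in place. `labels = TRUE` prints each bin's count above its bar, and the same counts are available in the returned object.

```r
# labels = TRUE writes the count for each bin on top of its bar.
h <- hist(iris$Petal.Length,
          breaks = 8,
          labels = TRUE,
          main = "Histogram of Iris Petal Lengths",
          xlab = "Petal Length",
          xlim = c(0, 8),
          ylim = c(0, 60))

# The first three counts match the ones read off the plot: 50, 1, 15.
print(h$counts)
```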
  • 5:55 - 5:57
    Another-- and then the last argument,
  • 5:57 - 6:00
    or actually the second to last argument,
    is you can always change the color.
  • 6:01 - 6:04
    And I'm going to pick
    a color called sky blue.
  • 6:06 - 6:09
    Doesn't that look pretty?
    Alright,
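Putting the customizations so far together might look like this. The color is written as `"skyblue"`, which is the spelling listed by R's `colors()`; the other argument values are the ones mentioned in the video.

```r
# col fills the bars; any name listed by colors() can be used.
h <- hist(iris$Petal.Length,
          breaks = 8,
          labels = TRUE,
          col = "skyblue",
          main = "Histogram of Iris Petal Lengths",
          xlab = "Petal Length",
          xlim = c(0, 8),
          ylim = c(0, 60))
```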
  • 6:09 - 6:13
    the one other thing we will talk about
    is sometimes you could have
  • 6:13 - 6:16
    a frequency histogram
    where you just say how many were
  • 6:16 - 6:20
    in between one and two,
    two and three, and put them in there.
  • 6:20 - 6:25
    Or you can do a, based off
    of kind of probability densities.
  • 6:26 - 6:29
    So meaning this is,
  • 6:29 - 6:33
    probably, you know,
    what proportion of the data lies in here.
  • 6:33 - 6:34
    So instead of having these be
  • 6:34 - 6:38
    frequency numbers, they would be like
    proportion of the whole.
  • 6:38 - 6:41
    So you know this has got quite a bit.
  • 6:41 - 6:45
    So this could be like, you know,
    almost a third of the data,
  • 6:45 - 6:49
    or a quarter of the data could be
    all in just this section.
  • 6:49 - 6:53
    And so you can also make a
    histogram be like that as well.
  • 6:54 - 6:55
    So I'm going to,
  • 6:56 - 6:59
    copy this,
  • 6:59 - 7:01
    and make it be very similar.
  • 7:01 - 7:05
    But I'm going to kind of just
    say it's a density function now.
  • 7:05 - 7:10
    I'm going to temporarily
    take out the "y" lim argument.
  • 7:11 - 7:14
    And then we'll kind of see,
  • 7:15 - 7:17
    how that works right now.
  • 7:17 - 7:20
    And then the only other argument you
    would need to change is just say freq,
  • 7:20 - 7:23
    or frequency "f" "r" "e" "q",
  • 7:23 - 7:27
    say it's equal to false meaning
    do not make it a frequency table,
  • 7:28 - 7:30
    and see what that does.
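A sketch of the density version, assuming the previous call was copied with `ylim` removed and `freq = FALSE` added. With `freq = FALSE` the bar heights are densities, so that the bar areas (height times bin width) sum to one. In this plot every bin has width 1, which is why each height can also be read directly as a proportion of the data.

```r
# freq = FALSE plots densities instead of counts.
# ylim is left out so R picks the y-axis range itself.
h <- hist(iris$Petal.Length,
          breaks = 8,
          labels = TRUE,
          col = "skyblue",
          main = "Histogram of Iris Petal Lengths",
          xlab = "Petal Length",
          xlim = c(0, 8),
          freq = FALSE)

# Height of the first bar: 50 of 150 observations in a bin of width 1.
print(h$density[1])
```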
  • 7:32 - 7:35
    So now you can see that
    this is about a third of the data.
  • 7:35 - 7:40
    Oh it's getting cut off a little bit because
    we'll want to change our "y" limits.
  • 7:40 - 7:42
    Notice it's not going from zero to 60,
  • 7:42 - 7:45
    and because these are all now
    percentages or proportions.
  • 7:46 - 7:49
    So I'm going to make my "y" limit be,
  • 7:52 - 7:58
    from zero to one, which is the max
    that it kind of can be, so.
  • 7:59 - 8:03
    Yeah, that was probably a little too high,
    but I mean it still gets the point across.
  • 8:03 - 8:07
    So as you can see, almost
    a third of the data is all contained here.
  • 8:08 - 8:12
    Then we've got 0.7% of the data is here.
  • 8:12 - 8:16
    If you were to add up all of these numbers
    together, it would sum to one.
  • 8:17 - 8:23
    Just like the, a probability
    distribution function.
  • 8:24 - 8:27
    The area under all of its curve,
  • 8:27 - 8:32
    and it's in, it's kind of, in
    its range is, needs to sum up to one.
  • 8:32 - 8:36
    So this is a way that you can kind of
    show that, with your own data.
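The "sums to one" claim can be checked numerically. This is a small sketch using `plot = FALSE` so that only the bin information is computed; the variable names are illustrative. Strictly, it is the bar areas that sum to one. The heights themselves sum to one here only because every bin has width 1.

```r
# Compute the bins without drawing anything.
h <- hist(iris$Petal.Length, breaks = 8, plot = FALSE)

bin_widths  <- diff(h$breaks)               # width of each bin
total_area  <- sum(h$density * bin_widths)  # area under the histogram
proportions <- h$counts / sum(h$counts)     # share of the data in each bin

print(bin_widths)
print(total_area)
print(round(proportions, 3))
```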
  • 8:41 - 8:47
    Now if you want to, create
    a histogram of the petal length variable,
  • 8:47 - 8:50
    but remember that there were three
    different species as well.
  • 8:50 - 8:53
    There was setosa,
    versicolor and virginica.
  • 8:53 - 8:58
    If you wanted to show a histogram
    of just the petal length for like,
  • 8:58 - 9:02
    just the setosa variable
    of the setosa species,
  • 9:02 - 9:06
    excuse me, there isn't a
    quick and easy way to do that.
  • 9:06 - 9:10
    You just kind of have to create
    a separate histogram for each. So,
  • 9:11 - 9:14
    say, like you wanted a histogram
    of just the petal length,
  • 9:14 - 9:17
    for setosas just for virginicas
    and just for versicolors.
  • 9:19 - 9:21
    This is how we can do that.
  • 9:22 - 9:26
    I'm going to copy and paste
    some code that I have,
  • 9:27 - 9:29
    just so you can see.
  • 9:33 - 9:36
    So histograms for the
    petal length variable,
  • 9:36 - 9:41
    I didn't quite do all of the different
    specifications and customizations,
  • 9:41 - 9:44
    but I did enough to make it look good.
  • 9:45 - 9:48
    So this is what it looks like
    for all of the petal lengths,
  • 9:48 - 9:51
    not separated by species.
  • 9:52 - 9:55
    If you want to separate them out
    by species, you can do that,
  • 9:55 - 10:00
    by using bracket notation
    and logical operators.
  • 10:00 - 10:03
    So bracket notation
    is how you subset data.
  • 10:03 - 10:07
    So what we're doing here is we're
    saying I want the petal length,
  • 10:07 - 10:10
    the variable, the numerical
    variable to be plotted.
  • 10:10 - 10:14
    But I only want the values
    that meet a certain criteria.
  • 10:14 - 10:17
    And the criteria is we want it,
  • 10:17 - 10:21
    only the ones where the
    species is equal to setosa.
  • 10:22 - 10:23
    So this will now plot a
  • 10:23 - 10:29
    histogram of all of the petal lengths,
    but only for the ones that are setosas.
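The bracket notation described above can be sketched like this. The pasted code is not shown in the transcript, so the variable name `setosa_lengths` and the title are assumptions; the subsetting pattern itself is standard R.

```r
# iris$Species == "setosa" is a logical vector (TRUE for setosa rows).
# Putting it inside [ ] keeps only the petal lengths where it is TRUE.
setosa_lengths <- iris$Petal.Length[iris$Species == "setosa"]

h <- hist(setosa_lengths,
          main = "Petal Length: setosa",
          xlab = "Petal Length",
          xlim = c(0, 8))
```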
  • 10:33 - 10:35
    So what's interesting here is
    you can see that
  • 10:35 - 10:38
    all of the setosas have
    really small petal lengths.
  • 10:40 - 10:43
    And I do the exact same thing,
  • 10:43 - 10:46
    here, I just change each of these
    to versicolor and virginica,
  • 10:46 - 10:48
    and the colors, too.
  • 10:48 - 10:52
    And I have it so that they're all
    on the same scale as well,
  • 10:52 - 10:54
    so that it's easier to compare.
  • 10:54 - 10:56
    That's also really important
    to make sure you do:
  • 10:56 - 10:59
    if you're going to make comparative
    histograms, make sure they're all
  • 10:59 - 11:00
    on the same scale.
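One way to keep the three histograms on the same scale is to give each call the same `breaks`, `xlim`, and `ylim`. This is a sketch, not the code pasted in the video: the colors, the half-unit bin edges, the axis limits, and the names `species_colors`, `shared_breaks`, and `hists` are all illustrative choices.

```r
# Hypothetical colors, one per species.
species_colors <- c(setosa = "skyblue", versicolor = "orange", virginica = "purple")

# Shared bin edges so bars are directly comparable across plots.
shared_breaks <- seq(0, 8, by = 0.5)

hists <- list()
for (sp in names(species_colors)) {
  hists[[sp]] <- hist(iris$Petal.Length[iris$Species == sp],
                      breaks = shared_breaks,
                      main = paste("Petal Length:", sp),
                      xlab = "Petal Length",
                      xlim = c(0, 8),
                      ylim = c(0, 40),
                      col = species_colors[[sp]])
}
```

Each pass through the loop draws one histogram, so in an interactive session the three plots appear one after another.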
  • 11:02 - 11:05
    Here's what versicolor would look like.
  • 11:05 - 11:08
    So the versicolor ones are all right here.
  • 11:09 - 11:14
    And the virginicas, are the
    biggest of the petal lengths.
  • 11:15 - 11:19
    So as you can kind of see
    here they start at about 4.5.
  • 11:21 - 11:27
    These start-- they have some in the 4.5,
    but it's definitely in this middle range.
  • 11:27 - 11:30
    And then the petal,
    the setosas are all the way over here.
  • 11:31 - 11:34
    And so if you compare that to this
    this is obviously the setosas.
  • 11:36 - 11:40
    These are the versicolors and
    these are kind of more the virginicas.
  • 11:41 - 11:44
    There is a way to also,
  • 11:45 - 11:47
    in to poss-
  • 11:47 - 11:52
    in other packages and things in "R"
    for you to even specify it even further.
  • 11:52 - 11:57
    So you could maybe even show that
    this is setosa, versicolor, virginica.
  • 11:57 - 12:01
    But for right now we're just going to show
    you how you can do each one separately.
  • 12:01 - 12:04
    This is probably the easiest way
    to go about doing that.
  • 12:06 - 12:09
    Alright, that is everything
  • 12:09 - 12:13
    you should need to know about creating
    histograms for your numerical data in "R".
  • 12:14 - 12:16
    And we will see you in the next video.
Title:
Histograms in R
Video Language:
English
Duration:
12:18
Utah_State_University edited English subtitles for Histograms in R