What is Statistics? A Beginner's Guide to Statistics (Data Analytics)!

0:00 - 0:03

PROFESSOR: If you want to
finally understand statistics,
0:03 - 0:05

this is the place to be.
0:05 - 0:09

After this video, you will
know what statistics is,
0:09 - 0:11

what descriptive
statistics is, and what
0:11 - 0:13

inferential statistics is.
0:13 - 0:16

So let's start with
the first question.
0:16 - 0:17

What is statistics?
0:17 - 0:21

Statistics deals with
the collection, analysis,
0:21 - 0:23

and presentation of data.
0:23 - 0:24

An example.
0:24 - 0:28

We would like to investigate
whether gender has an influence
0:28 - 0:30

on the preferred newspaper.
0:30 - 0:35

Then gender and newspaper are
our so-called variables that we
0:35 - 0:36

want to analyze.
0:36 - 0:39

In order to analyze whether
gender has an influence
0:39 - 0:44

on the preferred newspaper,
we first need to collect data.
0:44 - 0:46

To do this, we create
a questionnaire
0:46 - 0:50

that asks about gender
and preferred newspaper.
0:50 - 0:54

We will then send out the
survey and wait two weeks.
0:54 - 0:59

Afterwards, we can display the
received answers in a table.
0:59 - 1:03

In this table, we have one
column for each variable,
1:03 - 1:06

one for gender and
one for newspaper.
1:06 - 1:09

On the other hand, each
row is the response
1:09 - 1:11

of one surveyed person.
1:11 - 1:16

The first respondent is male
and stated New York Post,
1:16 - 1:19

the second is female
and stated USA Today,
1:19 - 1:21

and so on and so forth.
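
As a rough sketch, the table described here could be held in Python as a list of rows, one per respondent. Only the first two rows come from the example; the names `responses`, `gender`, and `newspaper` are assumptions made for illustration.

```python
# Each dict is one row (one respondent); each key is one column (one variable).
responses = [
    {"gender": "male", "newspaper": "New York Post"},
    {"gender": "female", "newspaper": "USA Today"},
]

# Reading one column means collecting one variable across all rows.
newspapers = [row["newspaper"] for row in responses]
print(newspapers)
```
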
1:21 - 1:24

Of course, the data does not
have to be from a survey.
1:24 - 1:28

The data can also come from
an experiment in which you,
1:28 - 1:32

for example, want to study the
effect of two drugs on blood
1:32 - 1:33

pressure.
1:33 - 1:34

Now, the first step is done.
1:34 - 1:39

We have collected data and we
can start analyzing the data.
1:39 - 1:41

But what do we actually
want to analyze?
1:41 - 1:44

We did not survey the
entire population,
1:44 - 1:46

but we took a sample.
1:46 - 1:49

Now the big question
is, do we just
1:49 - 1:51

want to describe
the sample data,
1:51 - 1:53

or do we want to
make a statement
1:53 - 1:55

about the whole population?
1:55 - 1:59

If our aim is limited to the
sample itself, i.e. we only
1:59 - 2:01

want to describe
the collected data,
2:01 - 2:04

we will use
descriptive statistics.
2:04 - 2:08

Descriptive statistics will
provide a detailed summary
2:08 - 2:09

of the sample.
2:09 - 2:11

However, if we want
to draw conclusions
2:11 - 2:16

about the population as a whole,
inferential statistics are used.
2:16 - 2:19

This approach allows us
to make educated guesses
2:19 - 2:22

about the population
based on the sample data.
2:22 - 2:26

Let us take a closer
look at both methods,
2:26 - 2:28

starting with
descriptive statistics.
2:28 - 2:31

Why is descriptive
statistics so important?
2:31 - 2:34

Let's say a company
wants to know how
2:34 - 2:36

its employees travel to work.
2:36 - 2:40

So the company creates a
survey to answer this question.
2:40 - 2:42

Once enough data
has been collected,
2:42 - 2:46

this data can be analyzed
using descriptive statistics.
2:46 - 2:49

But what is
descriptive statistics?
2:49 - 2:53

Descriptive statistics aims
to describe and summarize
2:53 - 2:55

a data set in a meaningful way.
2:55 - 2:59

But it is important to note
that descriptive statistics only
2:59 - 3:02

describe the collected data
without drawing conclusions
3:02 - 3:05

about a larger population.
3:05 - 3:09

Put simply, just because we know
how some people from one company
3:09 - 3:13

get to work, we cannot
say how all employees
3:13 - 3:15

of the company get to work.
3:15 - 3:17

This is the task of
inferential statistics,
3:17 - 3:19

which we will discuss later.
3:19 - 3:22

To describe data,
we now
3:22 - 3:25

look at the four
key components--
3:25 - 3:28

measures of central tendency,
measures of dispersion,
3:28 - 3:30

frequency tables and charts.
3:30 - 3:34

Let's start with the first one,
measures of central tendency.
3:34 - 3:37

Measures of central
tendency are, for example,
3:37 - 3:40

the mean, the
median, and the mode.
3:40 - 3:42

Let's first have a
look at the mean.
3:42 - 3:47

The arithmetic mean is the sum
of all observations divided
3:47 - 3:49

by the number of observations.
3:49 - 3:50

An example.
3:50 - 3:53

Imagine we have the test
scores of five students.
3:53 - 3:57

To find a mean score,
we sum up all the scores
3:57 - 3:59

and divide by the
number of scores.
3:59 - 4:04

The mean test score of these
five students is therefore 86.6.
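
The five scores themselves are not in the transcript, so the ones below are hypothetical, chosen only so that the result matches the 86.6 just mentioned. The sketch computes the mean once by hand and once with Python's `statistics.mean`.

```python
from statistics import mean

# Hypothetical scores, invented so that the mean comes out at 86.6.
scores = [80, 85, 88, 90, 90]

# Sum of all observations divided by the number of observations.
mean_by_hand = sum(scores) / len(scores)

print(mean_by_hand, mean(scores))
```
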
4:04 - 4:06

What about the median?
4:06 - 4:11

When the values in a data set
are arranged in ascending order,
4:11 - 4:13

the median is the middle value.
4:13 - 4:16

If there is an odd
number of data points,
4:16 - 4:19

the median is simply
the middle value.
4:19 - 4:22

If there is an even
number of data points,
4:22 - 4:26

the median is the average
of the two middle values.
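
A small sketch of both cases, using made-up heights in centimeters. `statistics.median` sorts the values itself, so the input order does not matter.

```python
from statistics import median

# Hypothetical heights in cm.
odd_heights = [160, 165, 170, 175, 180]        # five values
even_heights = [160, 165, 170, 175, 180, 185]  # six values

median_odd = median(odd_heights)    # the single middle value
median_even = median(even_heights)  # average of the two middle values

print(median_odd, median_even)
```
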
4:26 - 4:28

It is important to
note that the median is
4:28 - 4:32

resistant to extreme
values or outliers.
4:32 - 4:34

Let's look at this example.
4:34 - 4:37

No matter how tall
the last person is,
4:37 - 4:40

the person in the middle remains
the person in the middle.
4:40 - 4:43

So the median does not change.
4:43 - 4:46

But if we look at the mean,
how tall the last person is
4:46 - 4:49

does have an effect on it.
4:49 - 4:52

The mean is therefore
not robust to outliers.
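
To see this with numbers, here is a sketch with hypothetical heights in which only the last person changes. The median stays put while the mean moves.

```python
from statistics import mean, median

# Hypothetical heights in cm; only the last person differs between the lists.
heights = [160, 165, 170, 175, 180]
heights_with_outlier = [160, 165, 170, 175, 250]

median_before = median(heights)
median_after = median(heights_with_outlier)
mean_before = mean(heights)
mean_after = mean(heights_with_outlier)

print("median:", median_before, "->", median_after)
print("mean:  ", mean_before, "->", mean_after)
```
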
4:52 - 4:54

Let's continue with the mode.
4:54 - 4:57

The mode refers to
the value or values
4:57 - 5:01

that appear most frequently
in a set of data.
5:01 - 5:05

For example, if 14 people
travel to work by car,
5:05 - 5:10

six by bike, five walk, and
five take public transport,
5:10 - 5:15

then car occurs most often
and is therefore the mode.
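
With the counts from this example, the mode can be read off with `collections.Counter`. This is one possible way to do it, not the only one.

```python
from collections import Counter

# Counts taken from the example: 14 car, 6 bike, 5 walk, 5 public transport.
counts = Counter({"car": 14, "bike": 6, "walk": 5, "public transport": 5})

# most_common(1) returns the most frequent category together with its count.
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)
```
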
5:15 - 5:16

Great.
5:16 - 5:19

Let's continue with the
measures of dispersion.
5:19 - 5:22

Measures of dispersion
describe how spread out
5:22 - 5:24

the values in a data set are.
5:24 - 5:27

Measures of dispersion
are, for example,
5:27 - 5:30

the variance and standard
deviation, the range,
5:30 - 5:32

and the interquartile range.
5:32 - 5:35

Let's start with the
standard deviation.
5:35 - 5:38

The standard deviation
indicates the average distance
5:38 - 5:41

between each data
point and the mean.
5:41 - 5:43

But what does that mean?
5:43 - 5:46

Each person has some
deviation from the mean.
5:46 - 5:49

Now we want to know
how much the persons
5:49 - 5:52

deviate from the mean
value on average.
5:52 - 5:56

In this example, the average
deviation from the mean value
5:56 - 5:58

is 11.5 centimeters.
5:58 - 6:01

To calculate the
standard deviation,
6:01 - 6:03

we can use this equation.
6:03 - 6:08

Sigma is the standard deviation,
n is the number of persons,
6:08 - 6:12

x_i is the height of
each person, and x bar
6:12 - 6:15

is the mean value
of all persons.
6:15 - 6:18

But attention, there are two
slightly different equations
6:18 - 6:20

for the standard deviation.
6:20 - 6:24

The difference is that we
have once 1 divided by n,
6:24 - 6:28

and once 1 divided
by n minus 1.
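
The equations shown on screen are not part of the transcript. Written out in the usual textbook form, with the symbols named above, they would look roughly like this. The symbol s for the n minus 1 version is a common convention and an assumption here.

```latex
% Dividing by n: used when the data cover the whole population
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}

% Dividing by n - 1: used to estimate the standard deviation from a sample
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
```
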
6:28 - 6:31

To keep it simple,
if our survey doesn't
6:31 - 6:33

cover the whole
population, we always
6:33 - 6:37

use the equation with n minus 1 to estimate
the standard deviation.
6:37 - 6:41

Likewise, if we have
conducted a clinical study,
6:41 - 6:44

then we also use the equation
with n minus 1 to estimate
6:44 - 6:45

the standard deviation.
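
In Python's standard library the two versions are separate functions: `statistics.pstdev` divides by n and `statistics.stdev` divides by n minus 1. The heights below are hypothetical.

```python
from statistics import pstdev, stdev

# Hypothetical heights in cm.
heights_cm = [160, 165, 170, 175, 180]

population_sd = pstdev(heights_cm)  # divides by n
sample_sd = stdev(heights_cm)       # divides by n - 1

print(round(population_sd, 2), round(sample_sd, 2))
```

The n minus 1 version is always a little larger, and the gap shrinks as the sample grows.
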
6:45 - 6:48

But what is the difference
between the standard deviation
6:48 - 6:49

and the variance?
6:49 - 6:52

As we now know, the
standard deviation
6:52 - 6:56

is the quadratic mean of
the distance from the mean.
6:56 - 6:59

The variance now is the
squared standard deviation.
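
A quick check of that relationship, again with hypothetical heights and the sample (n minus 1) versions from Python's `statistics` module:

```python
import math
from statistics import stdev, variance

# Hypothetical heights in cm.
heights_cm = [160, 165, 170, 175, 180]

sample_sd = stdev(heights_cm)
sample_variance = variance(heights_cm)

# The variance is the standard deviation squared (up to rounding error).
print(math.isclose(sample_sd ** 2, sample_variance))
```
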
6:59 - 7:02

If you want to know more details
about the standard deviation
7:02 - 7:05

and the variance,
please watch our video.
7:05 - 7:08

Let's move on to range
and interquartile range.
7:08 - 7:10

It is easy to understand.
7:10 - 7:12

The range is simply
the difference
7:12 - 7:16

between the maximum
and minimum value.
7:16 - 7:20

The interquartile range represents
the middle 50% of the data.
7:20 - 7:24

It is the difference between
the first quartile, Q1,
7:24 - 7:27

and the third quartile, Q3.
7:27 - 7:31

Therefore, 25% of the
values are smaller
7:31 - 7:36

than Q1 and
25% of the values are larger than Q3.
7:36 - 7:39

The interquartile
range contains exactly
7:39 - 7:42

the middle 50% of the values.
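
Here is a sketch of both measures on hypothetical commuting distances. Note that there are several conventions for computing quartiles, so other tools may give slightly different values for Q1 and Q3; this uses the default method of `statistics.quantiles`.

```python
from statistics import quantiles

# Hypothetical distances to work in km.
distances_km = [2, 3, 5, 7, 8, 10, 12, 15, 18, 25, 40]

# Range: maximum minus minimum.
value_range = max(distances_km) - min(distances_km)

# n=4 splits the data into quarters and returns Q1, the median, and Q3.
q1, q2, q3 = quantiles(distances_km, n=4)
iqr = q3 - q1

print("range:", value_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```
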
7:42 - 7:44

Before we get to
the last two points,
7:44 - 7:47

let's briefly compare
measures of central tendency
7:47 - 7:49

and measures of dispersion.
7:49 - 7:53

Let's say we measure the
blood pressure of patients.
7:53 - 7:56

Measures of central tendency
provide a single value
7:56 - 7:59

that represents the
entire data set,
7:59 - 8:03

helping to identify a central
value around which data
8:03 - 8:05

points tend to cluster.
8:05 - 8:09

Measures of dispersion, like the
standard deviation, the range,
8:09 - 8:12

and the interquartile
range, indicate
8:12 - 8:15

how spread out the
data points are,
8:15 - 8:18

whether they are closely
packed around the center
8:18 - 8:19

or spread far from it.
8:19 - 8:22

In summary, while measures
of central tendency
8:22 - 8:27

provide a central point of the
data set, measures of dispersion
8:27 - 8:30

describe how the data is
spread around the center.
8:30 - 8:32

Let's move on to tables.
8:32 - 8:36

Here we will have a look at the
most important ones, frequency
8:36 - 8:38

tables and contingency tables.
8:38 - 8:43

A frequency table displays
how often each distinct value
8:43 - 8:45

appears in a data set.
8:45 - 8:48

Let's have a closer look at
the example from the beginning.
8:48 - 8:52

A company surveyed its
employees to find out
8:52 - 8:53

how they get to work.
8:53 - 8:57

The options given were
car, bicycle, walk,
8:57 - 8:58

and public transport.
8:58 - 9:01

Here are the results
from 30 employees.
9:01 - 9:06

The first answered car, the next
walk, and so on and so forth.
9:06 - 9:10

Now we can create a frequency
table to summarize this data.
9:10 - 9:15

To do this, we simply enter
the four possible options, car,
9:15 - 9:19

bicycle, walk, and public
transport in the first column,
9:19 - 9:22

and then count how
often they occurred.
9:22 - 9:26

From the table, it is evident
that the most common mode
9:26 - 9:30

of transport among the employees
is by car, with 14 employees
9:30 - 9:32

preferring it.
9:32 - 9:35

The frequency table thus
provides a clear and concise
9:35 - 9:37

summary of the data.
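
The counting step could be sketched like this. The counts (14, 6, 5, 5) are the ones mentioned in the video; the list `answers` itself is rebuilt from those counts, since the 30 individual responses are not in the transcript.

```python
from collections import Counter

# Raw answers rebuilt from the stated counts; the original order is unknown.
answers = (
    ["car"] * 14
    + ["bicycle"] * 6
    + ["walk"] * 5
    + ["public transport"] * 5
)

frequency = Counter(answers)
total = len(answers)

# One row per option: name, absolute frequency, percentage.
for option, count in frequency.most_common():
    print(f"{option:<17} {count:>3} {count / total:>7.1%}")
```
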
9:37 - 9:39

But what if we
have not only one,
9:39 - 9:42

but two categorical variables?
9:42 - 9:45

This is where the contingency
table, also called crosstab,
9:45 - 9:46

comes in.
9:46 - 9:50

Imagine the company doesn't
have one factory, but two.
9:50 - 9:53

One in Detroit and
one in Cleveland.
9:53 - 9:57

So we also asked the employees
at which location they work.
9:57 - 10:00

If we want to display
both variables,
10:00 - 10:03

we can use a contingency table.
10:03 - 10:07

A contingency table provides
a way to analyze and compare
10:07 - 10:10

the relationship between
two categorical variables.
10:10 - 10:14

The rows of a contingency
table represent the categories
10:14 - 10:18

of one variable, while the
columns represent the categories
10:18 - 10:20

of another variable.
10:20 - 10:23

Each cell in the
table shows the number
10:23 - 10:26

of observations that fall into
the corresponding category
10:26 - 10:27

combination.
10:27 - 10:31

For example, the first cell
shows that car and Detroit
10:31 - 10:33

were answered six times.
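
A contingency table can be sketched by counting pairs of answers. Only the cell "car and Detroit = 6" comes from the example; every other split between the two sites is invented here for illustration.

```python
from collections import Counter

# Hypothetical (transport, site) answers. Only car/Detroit = 6 is from the video.
answers = (
    [("car", "Detroit")] * 6 + [("car", "Cleveland")] * 8
    + [("bicycle", "Detroit")] * 4 + [("bicycle", "Cleveland")] * 2
    + [("walk", "Detroit")] * 3 + [("walk", "Cleveland")] * 2
    + [("public transport", "Detroit")] * 2
    + [("public transport", "Cleveland")] * 3
)

# Each cell counts how often one combination of categories occurred.
cells = Counter(answers)

transports = ["car", "bicycle", "walk", "public transport"]
sites = ["Detroit", "Cleveland"]

# Rows are the categories of one variable, columns those of the other.
print(f"{'':<17}" + "".join(f"{site:>10}" for site in sites))
for transport in transports:
    row = "".join(f"{cells[(transport, site)]:>10}" for site in sites)
    print(f"{transport:<17}{row}")
```
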
10:33 - 10:35

And what about the charts?
10:35 - 10:38

Let's take a look at
the most important ones.
10:38 - 10:41

To do this, let's
simply use datatab.net.
10:41 - 10:44

If you like, you can
load this sample data
10:44 - 10:47

set with the link in
the video description.
10:47 - 10:50

Or you just copy your
own data into this table.
10:50 - 10:54

Here below you can see the
variables-- distance to work,
10:54 - 10:56

mode of transport, and site.
10:56 - 10:59

Datatab gives you a hint about
the level of measurement,
10:59 - 11:02

but you can also change it here.
11:02 - 11:05

Now, if we only click
on Mode of Transport,
11:05 - 11:08

we get a frequency
table and we can also
11:08 - 11:11

display the percentage values.
11:11 - 11:16

If we scroll down, we get a
bar chart and a pie chart.
11:16 - 11:19

Here on the left, we can
adjust further settings.
11:19 - 11:22

For example, we can
specify whether we
11:22 - 11:26

want to display the frequencies
or the percentage values,
11:26 - 11:31

or whether the bars should
be vertical or horizontal.
11:31 - 11:35

If you also select Site,
we get a cross-table here
11:35 - 11:39

and a grouped bar
chart for the diagrams.
11:39 - 11:42

Here we can specify
whether we want the chart
11:42 - 11:45

to be grouped or stacked.
11:45 - 11:48

If we click on Distance to
Work and Mode of Transport,
11:48 - 11:52

we get a bar chart where
the height of the bar
11:52 - 11:55

shows the mean value of
the individual groups.
11:55 - 11:59

Here we can also
display the dispersion.
11:59 - 12:03

We also get a histogram,
a box plot, a violin plot,
12:03 - 12:05

and a raincloud plot.
12:05 - 12:09

If you would like to know more
about what a box plot, a violin
12:09 - 12:13

plot, and a raincloud plot are,
take a look at my videos.
12:13 - 12:16

Let's continue with
inferential statistics.
12:16 - 12:18

At the beginning, we
briefly go through what
12:18 - 12:21

inferential statistics
is, and then I'll
12:21 - 12:24

explain the six key
components to you.
12:24 - 12:27

So what is inferential
statistics?
12:27 - 12:31

Inferential statistics allows
us to make a conclusion
12:31 - 12:36

or inference about a population
based on data from a sample.
12:36 - 12:39

What is the population,
and what is the sample?
12:39 - 12:43

The population is the whole
group we are interested in.
12:43 - 12:45

If you want to study
the average height
12:45 - 12:48

of all adults in
the United States,
12:48 - 12:52

then the population would be all
adults in the United States.
12:52 - 12:55

The sample is a smaller
group we actually study
12:55 - 12:57

chosen from the population.
12:57 - 13:02

For example, 150 adults were
selected from the United States.
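
To make the two terms concrete, here is a sketch that simulates a population and draws a random sample of 150 from it. The population size and the height distribution are invented numbers, not real data.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 20,000 adult heights in cm (invented distribution).
population = [rng.gauss(170, 10) for _ in range(20_000)]

# The sample: 150 adults chosen at random, without replacement.
sample = rng.sample(population, 150)

population_mean = mean(population)
sample_mean = mean(sample)

print(round(population_mean, 1), round(sample_mean, 1))
```

The sample mean will usually land close to the population mean, but not exactly on it. That gap is what inferential statistics has to account for.
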
13:02 - 13:04

And now we want
to use the sample
13:04 - 13:07

to make a statement
about the population.
13:07 - 13:10

And here are the six
steps for doing that.
13:10 - 13:12

Number one, hypothesis.
13:12 - 13:16

First, we need a statement, a
hypothesis that we want to test.
13:16 - 13:19

For example, we want to
know whether a drug will
13:19 - 13:22

have a positive effect
on blood pressure
13:22 - 13:25

in people with high
blood pressure.
13:25 - 13:26

But what's next?
13:26 - 13:28

In our hypothesis,
we stated that we
13:28 - 13:31

would like to study people
with high blood pressure.
13:31 - 13:35

So our population is all
people with high blood pressure
13:35 - 13:37

in, for example, the US.
13:37 - 13:42

Obviously, we cannot collect
data from the whole population,
13:42 - 13:44

so we take a sample
from the population.
13:44 - 13:48

Now we use this sample to make a
statement about the population.
13:48 - 13:50

But how do we do that?
13:50 - 13:53

For this we need
a hypothesis test.
13:53 - 13:57

Hypothesis testing is a
method for testing a claim
13:57 - 14:00

about a parameter in
a population using
14:00 - 14:01

data measured in a sample.
14:01 - 14:02

Great.
14:02 - 14:04

That's exactly what we need.
14:04 - 14:06

There are many different
hypothesis tests,
14:06 - 14:09

and at the end of this
video, I will give you
14:09 - 14:11

a guide on how to
find the right test.
14:11 - 14:13

And of course, you
can find videos
14:13 - 14:17

about many more hypothesis
tests on our channel.
14:17 - 14:19

But how does a
hypothesis test work?
14:19 - 14:22

When we conduct a
hypothesis test,
14:22 - 14:25

we start with the research
hypothesis, also called
14:25 - 14:27

alternative hypothesis.
14:27 - 14:31

This is the hypothesis we are
trying to find evidence for.
14:31 - 14:33

In our case, the
research hypothesis
14:33 - 14:36

is that the drug has an
effect on blood pressure.
14:36 - 14:40

But we cannot test this
hypothesis directly with
14:40 - 14:42

the classical hypothesis test.
14:42 - 14:44

So we test the
opposite hypothesis
14:44 - 14:47

that the drug has no
effect on blood pressure.
14:47 - 14:49

But what does that mean?
14:49 - 14:54

First, we assume that the drug
has no effect in the population.
14:54 - 14:56

We therefore assume
that, in general,
14:56 - 15:00

people who take the drug and
people who don't take the drug
15:00 - 15:03

have the same blood
pressure on average.
15:03 - 15:06

If we now take a random
sample and it turns out
15:06 - 15:09

that the drug has a large
effect in the sample,
15:09 - 15:15

then we can ask how likely it
is to draw such a sample, or one
15:15 - 15:20

that deviates even more if the
drug actually has no effect.
15:20 - 15:23

So in reality, on average,
there is no difference
15:23 - 15:24

in the population.
15:24 - 15:29

If this probability is very
low, we can ask ourselves whether
15:29 - 15:32

the drug does have an effect
in the population,
15:32 - 15:36

and we may have enough evidence
to reject the null hypothesis
15:36 - 15:38

that the drug has no effect.
15:38 - 15:42

And it is this probability
that is called the p-value.
15:42 - 15:45

Let's summarize this
in three simple steps.
15:45 - 15:48

Number one, the null
hypothesis states
15:48 - 15:51

that there is no difference
in the population.
15:51 - 15:54

Number two, the
hypothesis test calculates
15:54 - 15:58

how much the sample deviates
from the null hypothesis.
15:58 - 16:02

Number three, the p-value
indicates the probability
16:02 - 16:07

of getting a sample that
deviates as much as our sample,
16:07 - 16:10

or one that even deviates
more than our sample,
16:10 - 16:13

assuming the null
hypothesis is true.
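
The three steps can be made concrete with a permutation test. This is a stand-in chosen here because it needs no formulas, not the test used in the video. If the drug has no effect, the group labels are interchangeable, so we shuffle them many times and count how often the shuffled difference is at least as large as the observed one. The blood pressure values and the function name `permutation_p_value` are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    # Step 2: how much does our sample deviate from "no difference"?
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Step 1: under the null hypothesis, the labels carry no information.
        rng.shuffle(pooled)
        difference = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if difference >= observed:
            at_least_as_extreme += 1
    # Step 3: share of shuffles that deviate as much as our sample or more.
    # Counting the observed sample itself keeps the estimate above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical systolic blood pressure readings (mmHg).
drug = [128, 131, 125, 135, 129, 127, 132, 126]
placebo = [142, 139, 145, 138, 147, 141, 144, 140]

p_value = permutation_p_value(drug, placebo)
print(p_value)
```
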
16:13 - 16:17

But at what point is the
p-value small enough for us
16:17 - 16:19

to reject the null hypothesis?
16:19 - 16:23

This brings us to the next
point, statistical significance.
16:23 - 16:27

If the p-value is less than
a predetermined threshold,
16:27 - 16:30

the result is considered
statistically significant.
16:30 - 16:34

This means that the result
is unlikely to have occurred
16:34 - 16:37

by chance alone, and that
we have enough evidence
16:37 - 16:39

to reject the null hypothesis.
16:39 - 16:43

This threshold is often 0.05.
16:43 - 16:45

Therefore, a small
p-value suggests
16:45 - 16:48

that the observed
data, our sample,
16:48 - 16:51

is inconsistent with
the null hypothesis.
16:51 - 16:54

This leads us to reject the
null hypothesis in favor
16:54 - 16:56

of the alternative hypothesis.
16:56 - 16:59

A large p-value suggests
that the observed data
16:59 - 17:02

is consistent with
the null hypothesis,
17:02 - 17:04

and we will not reject it.
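
The decision rule itself is a one-line comparison. In this sketch the function name `is_significant` is made up, and the default threshold of 0.05 is the common choice mentioned above, not a fixed law.

```python
def is_significant(p_value, alpha=0.05):
    """Return True if the p-value is below the chosen threshold alpha."""
    return p_value < alpha


for p in (0.003, 0.20):
    if is_significant(p):
        print(p, "-> reject the null hypothesis")
    else:
        print(p, "-> do not reject the null hypothesis")
```
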
17:04 - 17:07

But note, there is always
a risk of making an error.
17:07 - 17:11

A small p-value does not prove
that the alternative hypothesis
17:11 - 17:12

is true.
17:12 - 17:16

It is only saying that it is
unlikely to get such a result,
17:16 - 17:21

or a more extreme one, when the
null hypothesis is true.
17:21 - 17:23

And again, if the null
hypothesis is true,
17:23 - 17:26

there is no difference
in the population.
17:26 - 17:29

And the other way
around, a large p-value
17:29 - 17:32

does not prove that the
null hypothesis is true.
17:32 - 17:36

It is only saying that it is
likely to get such a result,
17:36 - 17:40

or a more extreme one, when the
null hypothesis is true.
17:40 - 17:42

So there are two
types of errors,
17:42 - 17:45

which are called type
I and type II errors.
17:45 - 17:47

Let's start with
the type I error.
17:47 - 17:50

In hypothesis testing,
a type I error
17:50 - 17:54

occurs when a true null
hypothesis is rejected.
17:54 - 17:58

So in reality, the null
hypothesis is true,
17:58 - 18:01

but we make the decision to
reject the null hypothesis.
18:01 - 18:06

In our example, it means that
the drug actually had no effect.
18:06 - 18:10

So in reality, there is no
difference in blood pressure.
18:10 - 18:12

Whether the drug
is taken or not,
18:12 - 18:15

the blood pressure remains
the same in both cases.
18:15 - 18:19

But our sample happened to
be so far off the true value
18:19 - 18:23

that we mistakenly thought
the drug was working.
18:23 - 18:28

A type II error occurs when
a false null hypothesis is not
18:28 - 18:29

rejected.
18:29 - 18:32

So in reality, the null
hypothesis is false,
18:32 - 18:36

but we make the decision not
to reject the null hypothesis.
18:36 - 18:40

In our example, this means
the drug actually did work.
18:40 - 18:42

There is a difference
between those
18:42 - 18:45

who have taken the drug
and those who have not.
18:45 - 18:50

But it was just a coincidence
that the sample taken did not
18:50 - 18:53

show much difference,
and we mistakenly
18:53 - 18:56

thought the drug
was not working.
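
Both error types can be watched in a simulation. The sketch below repeats a made-up drug study many times, once with no true effect and once with a true effect of 5 mmHg, and counts the wrong decisions. It uses a simple z-test with a known standard deviation, which is an assumption made to keep the code short; all numbers and function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def z_test_p_value(group_a, group_b, sigma):
    """Two-sided p-value for a difference in means when sigma is known."""
    standard_error = sigma * sqrt(1 / len(group_a) + 1 / len(group_b))
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_effect, n=20, sigma=10.0, alpha=0.05, runs=2000, seed=1):
    """Share of simulated studies in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(runs):
        placebo = [rng.gauss(150, sigma) for _ in range(n)]
        drug = [rng.gauss(150 - true_effect, sigma) for _ in range(n)]
        if z_test_p_value(drug, placebo, sigma) < alpha:
            rejections += 1
    return rejections / runs


# Null hypothesis is true: every rejection is a type I error.
type_1_rate = rejection_rate(true_effect=0)

# Null hypothesis is false: every non-rejection is a type II error.
type_2_rate = 1 - rejection_rate(true_effect=5)

print("type I error rate: ", type_1_rate)
print("type II error rate:", type_2_rate)
```

The type I error rate should come out near the threshold of 0.05. The type II error rate depends on the size of the effect and the sample, and with these small groups it is quite high.
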
18:56 - 18:58

And now I'll show
you how Datatab
18:58 - 19:01

helps you to find a
suitable hypothesis test,
19:01 - 19:05

and of course, calculates it and
interprets the results for you.
19:05 - 19:10

Let's go to datatab.net and
copy your own data in here.
19:10 - 19:13

We will just use this
example data set.
19:13 - 19:15

After copying your
data into the table,
19:15 - 19:18

the variables appear down here.
19:18 - 19:21

Datatab automatically
tries to determine
19:21 - 19:24

the correct level
of measurement,
19:24 - 19:27

but you can also
change it up here.
19:27 - 19:32

Now we just click on hypothesis
testing and select the variables
19:32 - 19:34

we want to use for
the calculation
19:34 - 19:36

of a hypothesis test.
19:36 - 19:40

Datatab will then suggest a
suitable test, for example,
19:40 - 19:44

in this case, a chi-square
test, or in that case,
19:44 - 19:47

an analysis of variance.
19:47 - 19:51

Then you will see the
hypotheses and the results.
19:51 - 19:54

If you're not sure how
to interpret the results,
19:54 - 19:56

click on Summary in words.
19:56 - 19:59

Further, you can
check the assumptions
19:59 - 20:03

and decide whether you want
to calculate a parametric
20:03 - 20:05

or a nonparametric test.
20:05 - 20:07

You can find out the
difference between
20:07 - 20:12

parametric and nonparametric
tests in my next video.
20:12 - 20:16

Thanks for watching, and I
hope you enjoyed the video.
20:16 - 20:20

Video Language:: English
Duration:: 20:21
