-
PROFESSOR: If you want to
finally understand statistics,
-
this is the place to be.
-
After this video, you will
know what statistics is,
-
what descriptive
statistics is, and what
-
inferential statistics is.
-
So let's start with
the first question.
-
What is statistics?
-
Statistics deals with
the collection, analysis,
-
and presentation of data.
-
An example.
-
We would like to investigate
whether gender has an influence
-
on the preferred newspaper.
-
Then gender and newspaper are
our so-called variables that we
-
want to analyze.
-
In order to analyze whether
gender has an influence
-
on the preferred newspaper,
we first need to collect data.
-
To do this, we create
a questionnaire
-
that asks about gender
and preferred newspaper.
-
We will then send out the
survey and wait two weeks.
-
Afterwards, we can display the
received answers in a table.
-
In this table, we have one
column for each variable,
-
one for gender and
one for newspaper.
-
Each row, in turn,
is the response
-
of one surveyed person.
-
The first respondent is male
and stated New York Post,
-
the second is female
and stated USA Today,
-
and so on and so forth.
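-
As a rough sketch, the table described here could be held in code as a list of rows,
one row per respondent and one key per variable. Only the two answers mentioned
above are included; the key names "gender" and "newspaper" are chosen for illustration.

```python
# Each dictionary is one row (one respondent);
# each key is one column (one variable).
responses = [
    {"gender": "male", "newspaper": "New York Post"},
    {"gender": "female", "newspaper": "USA Today"},
]

# Reading one column means collecting the same key from every row.
genders = [row["gender"] for row in responses]
newspapers = [row["newspaper"] for row in responses]

print(genders)
print(newspapers)
```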
-
Of course, the data does not
have to be from a survey.
-
The data can also come from
an experiment in which you,
-
for example, want to study the
effect of two drugs on blood
-
pressure.
-
Now, the first step is done.
-
We have collected data and we
can start analyzing the data.
-
But what do we actually
want to analyze?
-
We did not survey the
entire population,
-
but we took a sample.
-
Now the big question
is, do we just
-
want to describe
the sample data,
-
or do we want to
make a statement
-
about the whole population?
-
If our aim is limited to the
sample itself, i.e. we only
-
want to describe
the collected data,
-
we will use
descriptive statistics.
-
Descriptive statistics will
provide a detailed summary
-
of the sample.
-
However, if we want
to draw conclusions
-
about the population as a whole,
inferential statistics is used.
-
This approach allows us
to make educated guesses
-
about the population
based on the sample data.
-
Let us take a closer
look at both methods,
-
starting with
descriptive statistics.
-
Why is descriptive
statistics so important?
-
Let's say a company
wants to know how
-
its employees travel to work.
-
So the company creates a
survey to answer this question.
-
Once enough data
has been collected,
-
this data can be analyzed
using descriptive statistics.
-
But what is
descriptive statistics?
-
Descriptive statistics aims
to describe and summarize
-
a data set in a meaningful way.
-
But it is important to note
that descriptive statistics only
-
describes the collected data
without drawing conclusions
-
about a larger population.
-
Put simply, just because we know
how some people from one company
-
get to work, we cannot
say how all employees
-
of the company get to work.
-
This is the task of
inferential statistics,
-
which we will discuss later.
-
To describe data,
we now
-
look at the four
key components--
-
measures of central tendency,
measures of dispersion,
-
frequency tables and charts.
-
Let's start with the first one,
measures of central tendency.
-
Measures of central
tendency are, for example,
-
the mean, the
median, and the mode.
-
Let's first have a
look at the mean.
-
The arithmetic mean is the sum
of all observations divided
-
by the number of observations.
-
An example.
-
Imagine we have the test
scores of five students.
-
To find the mean score,
we sum up all the scores
-
and divide by the
number of scores.
-
The mean test score of these
five students is therefore 86.6.
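-
The individual scores are only shown on screen, so the five values below are
hypothetical, picked so that their mean comes out at 86.6. The calculation itself
is just the sum divided by the count.

```python
# Hypothetical test scores; the transcript does not list the actual values.
scores = [92, 85, 78, 88, 90]

# Arithmetic mean: sum of all observations divided by their number.
mean_score = sum(scores) / len(scores)

print(mean_score)
```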
-
What about the median?
-
When the values in a data set
are arranged in ascending order,
-
the median is the middle value.
-
If there is an odd
number of data points,
-
the median is simply
the middle value.
-
If there is an even
number of data points,
-
the median is the average
of the two middle values.
-
It is important to
note that the median is
-
resistant to extreme
values or outliers.
-
Let's look at this example.
-
No matter how tall
the last person is,
-
the person in the middle remains
the person in the middle.
-
So the median does not change.
-
But if we look at the mean,
how tall the last person is
-
does have an effect on it.
-
The mean is therefore
not robust to outliers.
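-
A small sketch of this comparison, using made-up heights in centimeters:
replacing the tallest person with an extreme value leaves the median untouched
but pulls the mean upward.

```python
from statistics import mean, median

# Hypothetical heights in centimeters, already in ascending order.
heights_cm = [160, 165, 170, 175, 180]
with_outlier = [160, 165, 170, 175, 250]  # last person is now extremely tall

# Odd number of values: the median is the middle value.
median_before = median(heights_cm)
median_after = median(with_outlier)

# The mean uses every value, so the outlier shifts it.
mean_before = mean(heights_cm)
mean_after = mean(with_outlier)

# Even number of values: the median is the average of the two middle values.
median_even = median([160, 165, 170, 175])

print(median_before, median_after)
print(mean_before, mean_after)
print(median_even)
```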
-
Let's continue with the mode.
-
The mode refers to
the value or values
-
that appear most frequently
in a set of data.
-
For example, if 14 people
travel to work by car,
-
six by bike, five walk, and
five take public transport,
-
then car occurs most often
and is therefore the mode.
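-
The commute example can be checked in a few lines. The counts are the ones given
above; the order of the individual answers is not known, so the list is simply
built from the counts. `multimode` returns every most frequent value, which matters
when there is a tie.

```python
from statistics import multimode

# Answers rebuilt from the counts in the example (order is arbitrary).
answers = (
    ["car"] * 14
    + ["bike"] * 6
    + ["walk"] * 5
    + ["public transport"] * 5
)

# The mode is the value (or values) that occurs most often.
modes = multimode(answers)
print(modes)

# With a tie, more than one value is returned.
tied = multimode(["walk", "bike", "walk", "bike"])
print(tied)
```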
-
Great.
-
Let's continue with the
measures of dispersion.
-
Measures of dispersion
describe how spread out
-
the values in a data set are.
-
Measures of dispersion
are, for example,
-
the variance and standard
deviation, the range,
-
and the interquartile range.
-
Let's start with the
standard deviation.
-
The standard deviation
indicates the average distance
-
between each data
point and the mean.
-
But what does that mean?
-
Each person has some
deviation from the mean.
-
Now we want to know
how much the persons
-
deviate from the mean
value on average.
-
In this example, the average
deviation from the mean value
-
is 11.5 centimeters.
-
To calculate the
standard deviation,
-
we can use this equation.
-
Sigma is the standard deviation,
n is the number of persons,
-
x i is the height of
each person, and x bar
-
is the mean value
of all persons.
-
But be careful: there are two
slightly different equations
-
for the standard deviation.
-
The difference is that one
uses 1 divided by n,
-
and the other 1 divided
by n minus 1.
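-
The equations themselves are only shown on screen. Written out in the usual
textbook form, they would look as follows. The symbol s for the sample version
is an assumption here; the video only names sigma.

```latex
% Version with 1/n (the data are the whole population):
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}

% Version with 1/(n-1) (the data are a sample from a larger population):
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2}
```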
-
To keep it simple,
if our survey doesn't
-
cover the whole
population, we always
-
use the equation with n minus 1
to estimate the standard deviation.
-
Likewise, if we have
conducted a clinical study,
-
then we also use this
equation to estimate
-
the standard deviation.
-
But what is the difference
between the standard deviation
-
and the variance?
-
As we now know, the
standard deviation
-
is the quadratic mean of
the distance from the mean.
-
The variance now is the
squared standard deviation.
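-
Both versions of the standard deviation, and their link to the variance, can be
tried out with Python's standard library. The heights are hypothetical;
`pstdev` divides by n and `stdev` divides by n minus 1.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical heights in centimeters.
heights_cm = [160, 165, 170, 175, 180]

# Treating the data as the whole population: divide by n.
population_sd = pstdev(heights_cm)
population_var = pvariance(heights_cm)

# Treating the data as a sample: divide by n - 1.
sample_sd = stdev(heights_cm)
sample_var = variance(heights_cm)

# The variance is the squared standard deviation in both cases.
print(math.isclose(population_sd ** 2, population_var))
print(math.isclose(sample_sd ** 2, sample_var))

# The sample version is always a little larger than the population version.
print(population_sd, sample_sd)
```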
-
If you want to know more details
about the standard deviation
-
and the variance,
please watch our video.
-
Let's move on to range
and interquartile range.
-
Both are easy to understand.
-
The range is simply
the difference
-
between the maximum
and minimum value.
-
The interquartile range represents
the middle 50% of the data.
-
It is the difference between
the first quartile, Q1,
-
and the third quartile, Q3.
-
Therefore, 25% of the
values lie below Q1,
-
the lower end of the interquartile range, and
25% of the values lie above Q3, its upper end.
-
The interquartile
range contains exactly
-
the middle 50% of the values.
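-
A minimal sketch with hypothetical heights. Note that there are several
conventions for computing quartiles; this uses the default method of
`statistics.quantiles`, so other tools may give slightly different values for
Q1 and Q3 on the same data.

```python
from statistics import quantiles

# Hypothetical heights in centimeters.
heights_cm = [150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200]

# Range: difference between the maximum and the minimum value.
value_range = max(heights_cm) - min(heights_cm)

# n=4 splits the data into quarters and returns Q1, Q2 (the median) and Q3.
q1, q2, q3 = quantiles(heights_cm, n=4)

# Interquartile range: difference between Q3 and Q1.
iqr = q3 - q1

print(value_range)
print(q1, q3, iqr)
```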
-
Before we get to
the last two points,
-
let's briefly compare
measures of central tendency
-
and measures of dispersion.
-
Let's say we measure the
blood pressure of patients.
-
Measures of central tendency
provide a single value
-
that represents the
entire data set,
-
helping to identify a central
value around which data
-
points tend to cluster.
-
Measures of dispersion, like the
standard deviation, the range,
-
and the interquartile
range, indicate
-
how spread out the
data points are,
-
whether they are closely
packed around the center
-
or spread far from it.
-
In summary, while measures
of central tendency
-
provide a central point of the
data set, measures of dispersion
-
describe how the data is
spread around the center.
-
Let's move on to tables.
-
Here we will have a look at the
most important ones, frequency
-
tables and contingency tables.
-
A frequency table displays
how often each distinct value
-
appears in a data set.
-
Let's have a closer look at
the example from the beginning.
-
A company surveyed its
employees to find out
-
how they get to work.
-
The options given were
car, bicycle, walk,
-
and public transport.
-
Here are the results
from 30 employees.
-
The first answered car, the next
walk, and so on and so forth.
-
Now we can create a frequency
table to summarize this data.
-
To do this, we simply enter
the four possible options, car,
-
bicycle, walk, and public
transport in the first column,
-
and then count how
often they occurred.
-
From the table, it is evident
that the most common mode
-
of transport among the employees
is by car, with 14 employees
-
preferring it.
-
The frequency table thus
provides a clear and concise
-
summary of the data.
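-
Counting the answers is all a frequency table does, so it can be sketched with a
`Counter`. The counts are those from the example; the order of the 30 individual
answers is not given in the video, so the list is rebuilt from the counts.

```python
from collections import Counter

# The 30 answers, rebuilt from the counts in the example.
answers = (
    ["car"] * 14
    + ["bicycle"] * 6
    + ["walk"] * 5
    + ["public transport"] * 5
)

# Count how often each distinct value appears.
counts = Counter(answers)
total = len(answers)

# Print one row per option: name, frequency, and percentage.
for option, count in counts.most_common():
    print(f"{option:<18}{count:>3}{count / total:>8.1%}")
```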
-
But what if we
have not only one,
-
but two categorical variables?
-
This is where the contingency
table, also called crosstab,
-
comes in.
-
Imagine the company doesn't
have one factory, but two.
-
One in Detroit and
one in Cleveland.
-
So we also asked the employees
at which location they work.
-
If we want to display
both variables,
-
we can use a contingency table.
-
A contingency table provides
a way to analyze and compare
-
the relationship between
two categorical variables.
-
The rows of a contingency
table represent the categories
-
of one variable, while the
columns represent the categories
-
of another variable.
-
Each cell in the
table shows the number
-
of observations that fall into
the corresponding category
-
combination.
-
For example, the first cell
shows that car and Detroit
-
were answered six times.
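-
A contingency table counts pairs of answers instead of single answers. In the
sketch below, only the cell for car and Detroit (six) comes from the video; all
other cell counts are hypothetical and merely chosen to add up to the totals of
the frequency table.

```python
from collections import Counter

# (mode of transport, site) for each employee.
# Only ("car", "Detroit") = 6 is taken from the example; the rest is made up.
pairs = (
    [("car", "Detroit")] * 6
    + [("car", "Cleveland")] * 8
    + [("bicycle", "Detroit")] * 4
    + [("bicycle", "Cleveland")] * 2
    + [("walk", "Detroit")] * 3
    + [("walk", "Cleveland")] * 2
    + [("public transport", "Detroit")] * 2
    + [("public transport", "Cleveland")] * 3
)

# Each cell is the number of observations with that combination.
cells = Counter(pairs)

modes = ["car", "bicycle", "walk", "public transport"]
sites = ["Detroit", "Cleveland"]

# Rows are the categories of one variable, columns those of the other.
print(f"{'':<18}" + "".join(f"{site:>11}" for site in sites))
for mode in modes:
    row = "".join(f"{cells[(mode, site)]:>11}" for site in sites)
    print(f"{mode:<18}{row}")
```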
-
And what about the charts?
-
Let's take a look at
the most important ones.
-
To do this, let's
simply use datatab.net.
-
If you like, you can
load this sample data
-
set with the link in
the video description.
-
Or you can just copy your
own data into this table.
-
Here below you can see the
variables-- distance to work,
-
mode of transport, and site.
-
Datatab gives you a hint about
the level of measurement,
-
but you can also change it here.
-
Now, if we only click
on Mode of Transport,
-
we get a frequency
table and we can also
-
display the percentage values.
-
If we scroll down, we get a
bar chart and a pie chart.
-
Here on the left, we can
adjust further settings.
-
For example, we can
specify whether we
-
want to display the frequencies
or the percentage values,
-
or whether the bars should
be vertical or horizontal.
-
If you also select Site,
we get a crosstab here
-
and a grouped bar
chart for the diagrams.
-
Here we can specify
whether we want the chart
-
to be grouped or stacked.
-
If we click on Distance to
Work and Mode of Transport,
-
we get a bar chart where
the height of the bar
-
shows the mean value of
the individual groups.
-
Here we can also
display the dispersion.
-
We also get a histogram,
a box plot, a violin plot,
-
and a raincloud plot.
-
If you would like to know more
about what a box plot, a violin
-
plot, and a raincloud plot are,
take a look at my videos.
-
Let's continue with
inferential statistics.
-
First, we will
briefly go through what
-
inferential statistics
is, and then I'll
-
explain the six key
components to you.
-
So what is inferential
statistics?
-
Inferential statistics allows
us to draw a conclusion
-
or inference about a population
based on data from a sample.
-
What is the population,
and what is the sample?
-
The population is the whole
group we are interested in.
-
If you want to study
the average height
-
of all adults in
the United States,
-
then the population would be all
adults in the United States.
-
The sample is a smaller
group we actually study
-
chosen from the population.
-
For example, 150 adults were
selected from the United States.
-
And now we want
to use the sample
-
to make a statement
about the population.
-
And here are the six
steps for doing that.
-
Number one, hypothesis.
-
First, we need a statement, a
hypothesis that we want to test.
-
For example, we want to
know whether a drug will
-
have a positive effect
on blood pressure
-
in people with high
blood pressure.
-
But what's next?
-
In our hypothesis,
we stated that we
-
would like to study people
with high blood pressure.
-
So our population is all
people with high blood pressure
-
in, for example, the US.
-
Obviously, we cannot collect
data from the whole population,
-
so we take a sample
from the population.
-
Now we use this sample to make a
statement about the population.
-
But how do we do that?
-
For this we need
a hypothesis test.
-
Hypothesis testing is a
method for testing a claim
-
about a parameter in
a population using
-
data, measured in a sample.
-
Great.
-
That's exactly what we need.
-
There are many different
hypothesis tests,
-
and at the end of this
video, I will give you
-
a guide on how to
find the right test.
-
And of course, you
can find videos
-
about many more hypothesis
tests on our channel.
-
But how does a
hypothesis test work?
-
When we conduct a
hypothesis test,
-
we start with the research
hypothesis, also called
-
alternative hypothesis.
-
This is the hypothesis we are
trying to find evidence for.
-
In our case, the
research hypothesis
-
is that the drug has an
effect on blood pressure.
-
But we cannot test this
hypothesis directly with
-
the classical hypothesis test.
-
So we test the
opposite hypothesis,
-
the so-called null hypothesis, that the drug
has no effect on blood pressure.
-
But what does that mean?
-
First, we assume that the drug
has no effect in the population.
-
We therefore assume
that, in general,
-
people who take the drug and
people who don't take the drug
-
have the same blood
pressure on average.
-
If we now take a random
sample and it turns out
-
that the drug has a large
effect in the sample,
-
then we can ask how likely it
is to draw such a sample, or one
-
that deviates even more if the
drug actually has no effect.
-
That is, if in reality, on average,
there is no difference
-
in the population.
-
If this probability is very
low, we can ask ourselves whether
-
the drug does have an effect
in the population,
-
and we may have enough evidence
to reject the null hypothesis
-
that the drug has no effect.
-
And it is this probability
that is called the p-value.
-
Let's summarize this
in three simple steps.
-
Number one, the null
hypothesis states
-
that there is no difference
in the population.
-
Number two, the
hypothesis test calculates
-
how much the sample deviates
from the null hypothesis.
-
Number three, the p-value
indicates the probability
-
of getting a sample that
deviates as much as our sample,
-
or one that even deviates
more than our sample,
-
assuming the null
hypothesis is true.
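-
To make the three steps concrete, here is a minimal sketch of one particular
hypothesis test, an exact permutation test. This test is not covered in the video
and is used here only because it computes the p-value by direct counting. The
blood pressure values and the function name `permutation_p_value` are
hypothetical.

```python
from itertools import combinations
from statistics import mean

# Hypothetical systolic blood pressure values.
drug = [125, 127, 128, 129, 130, 132]
placebo = [138, 139, 140, 141, 142, 145]


def permutation_p_value(group_a, group_b):
    """Share of all possible regroupings whose mean difference is at
    least as large as the observed one (two-sided)."""
    pooled = group_a + group_b
    # Step 2: how much does our sample deviate from "no difference"?
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    total = 0
    # Step 1: under the null hypothesis the group labels do not matter,
    # so every split of the pooled values is equally plausible.
    for chosen in combinations(range(len(pooled)), len(group_a)):
        chosen = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-9:
            extreme += 1

    # Step 3: the p-value is the share of splits at least as extreme.
    return extreme / total


p_value = permutation_p_value(drug, placebo)
print(p_value)
```

Because every value in the first group is lower than every value in the second,
only the observed split and its mirror image are this extreme, so the p-value is
very small.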
-
But at what point is the
p-value small enough for us
-
to reject the null hypothesis?
-
This brings us to the next
point, statistical significance.
-
If the p-value is less than
a predetermined threshold,
-
the result is considered
statistically significant.
-
This means that the result
is unlikely to have occurred
-
by chance alone, and that
we have enough evidence
-
to reject the null hypothesis.
-
This threshold is often 0.05.
-
Therefore, a small
p-value suggests
-
that the observed
data, our sample,
-
is inconsistent with
the null hypothesis.
-
This leads us to reject the
null hypothesis in favor
-
of the alternative hypothesis.
-
A large p-value suggests
that the observed data
-
is consistent with
the null hypothesis,
-
and we will not reject it.
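-
The decision rule itself is a single comparison. A small sketch, with the
function name `decide` chosen for illustration and 0.05 as the default threshold
mentioned above:

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the predetermined significance threshold."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "do not reject the null hypothesis"


print(decide(0.003))
print(decide(0.40))
```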
-
But note, there is always
a risk of making an error.
-
A small p-value does not prove
that the alternative hypothesis
-
is true.
-
It is only saying that it is
unlikely to get such a result,
-
or a more extreme one, when the
null hypothesis is true.
-
And again, if the null
hypothesis is true,
-
there is no difference
in the population.
-
And the other way
around, a large p-value
-
does not prove that the
null hypothesis is true.
-
It is only saying that it is
likely to get such a result,
-
or a more extreme one, when the
null hypothesis is true.
-
So there are two
types of errors,
-
which are called type
I and type II errors.
-
Let's start with
the type I error.
-
In hypothesis testing,
a type I error
-
occurs when a true null
hypothesis is rejected.
-
So in reality, the null
hypothesis is true,
-
but we make the decision to
reject the null hypothesis.
-
In our example, it means that
the drug actually had no effect.
-
So in reality, there is no
difference in blood pressure.
-
Whether the drug
is taken or not,
-
the blood pressure remains
the same in both cases.
-
But our sample happened to
be so far off the true value
-
that we mistakenly thought
the drug was working.
-
A type II error occurs when
a false null hypothesis is not
-
rejected.
-
So in reality, the null
hypothesis is false,
-
but we make the decision not
to reject the null hypothesis.
-
In our example, this means
the drug actually did work.
-
There is a difference
between those
-
who have taken the drug
and those who have not.
-
But it was just a coincidence
that the sample taken did not
-
show much difference,
and we mistakenly
-
thought the drug
was not working.
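-
Both error types can be made visible with a small simulation. This is only a
sketch under simplifying assumptions that are not from the video: blood pressure
is normally distributed with a known standard deviation of 10, each group has
20 people, and a simple two-sided z-test is used. All names and numbers are
hypothetical.

```python
import random
from statistics import NormalDist

ALPHA = 0.05   # significance threshold
SIGMA = 10.0   # assumed known standard deviation
N = 20         # people per group
RUNS = 2000    # number of simulated studies


def z_test_p_value(a, b, sigma=SIGMA):
    """Two-sided p-value for a difference in means with known sigma."""
    standard_error = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (sum(a) / len(a) - sum(b) / len(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_effect, rng):
    """Share of simulated studies in which the null hypothesis is rejected."""
    rejections = 0
    for _ in range(RUNS):
        control = [rng.gauss(140, SIGMA) for _ in range(N)]
        treated = [rng.gauss(140 - true_effect, SIGMA) for _ in range(N)]
        if z_test_p_value(treated, control) < ALPHA:
            rejections += 1
    return rejections / RUNS


rng = random.Random(0)

# The drug has no effect: every rejection is a type I error.
type_1_rate = rejection_rate(0.0, rng)

# The drug lowers blood pressure by 5: every non-rejection is a type II error.
type_2_rate = 1 - rejection_rate(5.0, rng)

print(type_1_rate)
print(type_2_rate)
```

With a threshold of 0.05, the type I error rate should land close to 5%. The
type II error rate depends on the size of the effect and on the sample size.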
-
And now I'll show
you how Datatab
-
helps you to find a
suitable hypothesis test,
-
and of course, calculates it and
interprets the results for you.
-
Let's go to datatab.net and
copy your own data in here.
-
We will just use this
example data set.
-
After copying your
data into the table,
-
the variables appear down here.
-
Datatab automatically
tries to determine
-
the correct level
of measurement,
-
but you can also
change it up here.
-
Now we just click on hypothesis
testing and select the variables
-
we want to use for
the calculation
-
of a hypothesis test.
-
Datatab will then suggest a
suitable test, for example,
-
in this case, a chi-square
test, or in that case,
-
an analysis of variance.
-
Then you will see the
hypotheses and the results.
-
If you're not sure how
to interpret the results,
-
click on Summary in words.
-
Further, you can
check the assumptions
-
and decide whether you want
to calculate a parametric
-
or a nonparametric test.
-
You can find out the
difference between
-
parametric and nonparametric
tests in my next video.
-
Thanks for watching, and I
hope you enjoyed the video.
-