Unit 21 (presentation): Sampling Distributions & Central Limit Theorem Notes
The kinds of questions we have dealt with so
far are of the type: What is the probability that an individual chosen at
random from a population has a systolic BP of 130 or greater? Despite the fact
that we are using probability, this type of question is descriptive. It tells
you something about the population and utilizes population information, but it
does not allow you to make the kind of inferences that answer important
scientific questions.
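To make this concrete, here is a minimal sketch of such a descriptive question in Python. The normal shape and the population values (mean 120 mmHg, standard deviation 15 mmHg) are hypothetical numbers chosen only for illustration; they are not taken from these notes.

```python
# A minimal sketch of a descriptive probability question.
# The population mean (120) and SD (15) are hypothetical illustration values.
from scipy.stats import norm

mu, sigma = 120, 15                       # assumed population parameters (mmHg)
p = norm.sf(130, loc=mu, scale=sigma)     # P(X >= 130), the upper-tail area
print(f"P(systolic BP >= 130) = {p:.3f}")
```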
When we do research studies, we are
generally not interested in answering questions about individuals. Our interest
is in the characteristics of samples and what they tell us about the underlying
population. In a clinical trial, for example, we might be interested in knowing
whether the mean blood pressure of a sample treated with a new medication for
high blood pressure is lower than that of a sample not treated with this
medication. Since individuals may vary in their response to treatment, asking
questions about such individuals is unlikely to provide reliable information on
the effectiveness of the medication.
So, in order to use
inferential statistics, we need to concentrate on the characteristics of
samples. These characteristics could
include several different types of sample statistics, including the mean, the
standard deviation and the median. Over the next several classes, we are going
to talk about statistical tests that rely primarily on sample means. Just like
we can use a distribution of observations in a population to ask questions
about individuals, we can use a distribution of sample means to ask questions
about samples.
Suppose we choose a sample at random of 50
students from a large university and compute the mean systolic blood pressure
of that group. Then we put them back into the pool and choose another sample of
50 and do the same thing, and we do this over and over and over again. Let’s
say we do this a million times. We end up with 1,000,000 different independent
estimates of the mean systolic
blood pressure of students at this
university. We could plot these
estimates as a relative frequency distribution.
What we are plotting is not the blood pressures of individual students, as we did before, but rather the relative frequency distribution of means of
random samples of 50 students each. The distribution we obtain is called the sampling distribution of the mean.
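The short sketch below mimics this thought experiment on a computer. The "population" is a simulated stand-in (a hypothetical normal population of systolic blood pressures with mean 120 and SD 15), and the number of repetitions is cut from 1,000,000 to 10,000 so it runs quickly; both choices are assumptions for illustration only.

```python
# A minimal sketch of building a sampling distribution of the mean.
# The simulated population and its parameters (mean 120, SD 15) are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=120, scale=15, size=100_000)  # stand-in population

n_samples, n = 10_000, 50            # 10,000 repeated samples of 50 students each
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(n_samples)
])

# The relative frequency distribution of these means is the (approximate)
# sampling distribution of the mean.
print("mean of the sample means:", round(sample_means.mean(), 2))
print("SD of the sample means  :", round(sample_means.std(ddof=1), 2))
```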
Now, it’s very unlikely that we would have
the patience or the time to select 1,000,000 samples of 50 students each and
measure their blood pressure. This would keep us out of trouble for a few
lifetimes. A true sampling distribution would require us to obtain an infinite
number of samples – this literally would take us forever.
The concept of a sampling distribution is
just that—it is a concept. Theoretically, such a distribution exists, but we
would prefer not to ever have to generate it. And it will turn out, because of
something called the central limit theorem, that we
will never have to select a very large number of samples and obtain their means
in order to use the concept of a sampling distribution, because we will always be able to model sampling distributions of the mean using normal curves.
Incidentally, we can define sampling distributions for
other statistics as well including the median, the variance, the odds ratio,
the risk ratio, and the 25th percentile. In fact, we can obtain a sampling
distribution for any statistic that we can measure in a sample. Most of these sampling distributions, though
not all, will be approximately normally distributed. For purposes of this course, we will be
interested primarily in sampling distributions for the means of repeatedly
drawn samples.
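As a small illustration that the same idea works for other statistics, the sketch below builds an approximate sampling distribution of the median rather than the mean. Again, the simulated population and the repetition count are hypothetical choices made only for illustration.

```python
# A minimal sketch of a sampling distribution for a statistic other than the
# mean (here, the median); the simulated population is hypothetical.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=120, scale=15, size=100_000)

sample_medians = np.array([
    np.median(rng.choice(population, size=50, replace=False))
    for _ in range(10_000)
])
print("center of the sampling distribution of the median:",
      round(sample_medians.mean(), 2))
```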
The sampling distribution of the mean has
some interesting properties. Its mean, which is
referred to as the expected value of x bar, is the mean of the underlying
population, µ.
The standard deviation of the sampling distribution of x bar is called the STANDARD ERROR OF THE MEAN, and it equals the population standard deviation sigma divided by the square root of n, the size of the samples that are repeatedly selected from the population. It is very important to
understand that this is not N, the number of observations in the population;
rather it is n, the number of observations in each of the samples for which a
mean was obtained to generate the sampling distribution.
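In symbols, where µ and σ are the population mean and standard deviation and n is the size of each repeatedly drawn sample, these two properties are:

$$
E[\bar{x}] = \mu, \qquad \sigma_{\bar{x}} = \operatorname{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
$$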
Without deriving it rigorously, which is beyond this class, the reason that the standard error of the mean is the population standard deviation divided by the square root of n is the following: for the population distribution, we are concerned with the variation of individual observations. For the sampling distribution, we are concerned with the variation of sample means, and each mean is the sum of n independent observations divided by n. Summing n independent observations multiplies the variance by n, while dividing by n divides the variance by n squared. Consequently the variance of the sampling distribution is the population variance reduced by a factor of n, and the standard deviation is reduced by the square root of n. A brief sketch of this calculation is given below.
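Here is that sketch, using only the facts that variances of independent observations add and that dividing a quantity by n divides its variance by n squared:

$$
\operatorname{Var}(\bar{x}) = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
= \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(x_i)
= \frac{n\sigma^{2}}{n^{2}}
= \frac{\sigma^{2}}{n},
\qquad
\operatorname{SE}(\bar{x}) = \sqrt{\operatorname{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}}
$$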
You can think of this in the extreme.
Suppose the size of each repeatedly drawn sample is 1. Then each sample is
equivalent to a single observation in the parent population, and it is not
surprising that the standard error of the mean is sigma divided by the square
root of one, that is, sigma itself.
On the other hand, assume that the sample size is the same as the population size (and that we sample without replacement). Then every time I draw a sample, it is in fact the same sample. The means of these samples will all be the same, meaning that the standard error of the mean, that is, the standard deviation of the sampling distribution, would be 0; there would be no variation in the means of the samples. Because populations are usually considered to be very large or effectively infinite in size, we would be dividing the population standard deviation by the square root of a very large number and obtaining essentially 0 as the standard error of the mean.
You will note that the STANDARD ERROR OF THE
MEAN decreases as the size of the individual samples increases. Later on we
will see that, from a practical point of view, this means that the larger the sample I take, the more reliable the estimate of the population mean derived from the sample mean will be.
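As a quick numerical illustration, using a hypothetical population standard deviation of 20: quadrupling the sample size cuts the standard error in half.

$$
n = 25:\ \operatorname{SE} = \frac{20}{\sqrt{25}} = 4,
\qquad
n = 100:\ \operatorname{SE} = \frac{20}{\sqrt{100}} = 2
$$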
Obviously, if my sample were the whole
population, as I just mentioned, then each sample mean is guaranteed to be
exactly the same as the population mean and the standard error of that mean
would be 0.
The critical step in this process, the one
that allows me to never have to obtain a sampling distribution of the mean, is
the CENTRAL LIMIT THEOREM, which states that sampling distributions of certain
classes of statistics, including the mean and the median, will approach a
normal distribution as the sample size increases regardless of the shape of the
sampled population. The key is that each sample consists of observations drawn independently at random from the population.
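Stated for the mean, and using the notation above, the theorem says that for sufficiently large n the sampling distribution of x bar is approximately normal with mean µ and standard error σ/√n, whatever the shape of the population:

$$
\bar{x} \;\overset{\text{approx.}}{\sim}\; N\!\left(\mu,\ \frac{\sigma^{2}}{n}\right) \quad \text{for large } n
$$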
The reason this is true for the sampling
distribution of the mean is that the means of repeatedly drawn random samples
each represent the sum of a large number of independent events (the randomly
selected observations from the population that constitute each sample). As we
noted in a previous presentation on the normal distribution, if the elements of
a sample or population represent the sum of a large number of independent
events and there is no force operating to constrain the distribution obtained,
then the relative frequency distribution of these elements will be normally
distributed. This is the case for the
sampling distribution of the mean.
How large do the repeatedly drawn samples need to be before the distribution of their means approaches normality? The population
distribution in the top panel is clearly very non-normal. However, even with a
sample size of 10 for repeatedly drawn samples from this population, the
distribution of the sample means is fairly normal, and for sample sizes of 30
and 50, it is remarkably normal.
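The sketch below reproduces this kind of picture numerically. A strongly skewed exponential population stands in for the non-normal population in the figure, and the skewness of the sample means (which is near 0 for a normal-like shape) is used as a rough summary; the population choice and the 10,000 repetitions are assumptions made purely for illustration.

```python
# A minimal sketch of the central limit theorem in action: means of samples
# drawn from a very non-normal (exponential) population become close to normal
# as the sample size n grows. The population choice here is hypothetical.
import numpy as np

rng = np.random.default_rng(2)
population = rng.exponential(scale=1.0, size=200_000)   # strongly right-skewed

def skewness(x):
    """Sample skewness; values near 0 suggest an approximately normal shape."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

print(f"population skewness = {skewness(population):.2f}   (very non-normal)")
for n in (10, 30, 50):
    means = np.array([rng.choice(population, size=n).mean()
                      for _ in range(10_000)])
    print(f"n = {n:2d}: skewness of the sample means = {skewness(means):.2f}")
```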