Unit 21 (presentation): Sampling Distributions & Central Limit Theorem Notes

 

The kinds of questions we have dealt with so far are of the type: What is the probability that an individual chosen at random from a population has a systolic BP of 130 or greater? Even though we are using probability, this type of question is descriptive. It tells you something about the population and uses population information, but it does not allow you to make the kind of inferences that answer important scientific questions.
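
To make that concrete, here is a minimal sketch of how such a descriptive probability could be computed in Python, assuming, purely for illustration, that systolic BP in the population is normal with a mean of 120 mmHg and a standard deviation of 15 mmHg (these parameter values are placeholders, not course data).

```python
from scipy.stats import norm

# Assumed population parameters (placeholder values, not course data)
mu = 120.0     # assumed population mean systolic BP (mmHg)
sigma = 15.0   # assumed population standard deviation (mmHg)

# P(X >= 130) for a single randomly chosen individual:
# the upper-tail area of the normal curve beyond 130
p = norm.sf(130, loc=mu, scale=sigma)
print(f"P(SBP >= 130) = {p:.3f}")
```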

 

When we do research studies, we are generally not interested in answering questions about individuals. Our interest is in the characteristics of samples and what they tell us about the underlying population. In a clinical trial, for example, we might be interested in knowing whether the mean blood pressure of a sample treated with a new medication for high blood pressure is lower than that of a sample not treated with this medication. Since individuals may vary in their response to treatment, asking questions about such individuals is unlikely to provide reliable information on the effectiveness of the medication.

 

So, in order to use inferential statistics, we need to concentrate on the characteristics of samples. These characteristics could include several different sample statistics, including the mean, the standard deviation, and the median. Over the next several classes, we are going to talk about statistical tests that rely primarily on sample means. Just as we can use a distribution of observations in a population to ask questions about individuals, we can use a distribution of sample means to ask questions about samples.

 

Suppose we choose a random sample of 50 students from a large university and compute the mean systolic blood pressure of that group. Then we put them back into the pool and choose another sample of 50 and do the same thing, and we do this over and over and over again. Let’s say we do this a million times. We end up with 1,000,000 different independent estimates of the mean systolic blood pressure of students at this university. We could plot these estimates as a relative frequency distribution. What we are plotting is not the blood pressures of individual students AS WE DID BEFORE, but rather the relative frequency distribution of means of random samples of 50 students each. The distribution we obtain is called the sampling distribution of the mean.
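
A small simulation can stand in for the million repetitions. The sketch below, in Python, uses an assumed normal blood-pressure population (mean 120, standard deviation 15, both placeholders) and 10,000 repeated samples of 50; the collected means approximate the sampling distribution of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "population" of student systolic BPs (placeholder values, not real data)
mu, sigma = 120.0, 15.0
n = 50          # size of each repeatedly drawn sample
reps = 10_000   # far fewer than a million, but enough to see the shape

# Draw many independent samples of size n and record each sample's mean
sample_means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

# The relative frequency distribution (histogram) of these means approximates
# the sampling distribution of the mean
rel_freq, bin_edges = np.histogram(sample_means, bins=30, density=True)
print("average of the sample means:", round(sample_means.mean(), 2))
print("spread of the sample means: ", round(sample_means.std(ddof=1), 2))
```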

 

Now, it’s very unlikely that we would have the patience or the time to select 1,000,000 samples of 50 students each and measure their blood pressure. This would keep us out of trouble for a few lifetimes. A true sampling distribution would require us to obtain an infinite number of samples – this literally would take us forever.

 

The concept of a sampling distribution is just that: a concept. Theoretically, such a distribution exists, but we would prefer not to ever have to generate it. And it will turn out that, because of something called the central limit theorem, we will never have to select a very large number of samples and obtain their means in order to use the concept of a sampling distribution: we will always be able to model sampling distributions of the mean using normal curves.

 

Incidentally, we can define sampling distributions for other statistics as well, including the median, the variance, the odds ratio, the risk ratio, and the 25th percentile. In fact, we can obtain a sampling distribution for any statistic that we can measure in a sample. Most of these sampling distributions, though not all, will be approximately normally distributed. For purposes of this course, we will be interested primarily in sampling distributions for the means of repeatedly drawn samples.
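
The same repeat-and-collect recipe works for any sample statistic. As a quick illustration (still with the assumed normal blood-pressure population from the sketch above), the snippet below builds sampling distributions for the median and the 25th percentile instead of the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 120.0, 15.0, 50, 10_000   # placeholder values for illustration

# reps independent samples of size n from the assumed population
samples = rng.normal(mu, sigma, size=(reps, n))

# Compute any statistic per sample; the collection of values is its sampling distribution
medians = np.median(samples, axis=1)         # sampling distribution of the median
p25s = np.percentile(samples, 25, axis=1)    # sampling distribution of the 25th percentile

print("spread of the sample medians:         ", round(medians.std(ddof=1), 2))
print("spread of the sample 25th percentiles:", round(p25s.std(ddof=1), 2))
```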

 

The sampling distribution of the mean has some interesting properties. Its mean, which is referred to as the expected value of x bar, is the mean of the underlying population, µ.

 

The standard deviation of the sampling distribution of x bar is called the STANDARD ERROR OF THE MEAN, and it equals the population standard deviation sigma divided by the square root of n, the size of the samples that are repeatedly selected from the population. It is very important to understand that this is not N, the number of observations in the population; rather it is n, the number of observations in each of the samples for which a mean was obtained to generate the sampling distribution.
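
Continuing the simulation idea, a quick empirical check (again with assumed placeholder values for µ and sigma) shows both properties at once: the average of the sample means lands near µ, and their standard deviation lands near sigma divided by the square root of n.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 120.0, 15.0, 50, 100_000   # placeholder values for illustration

# Means of many repeatedly drawn samples of size n
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print("average of the sample means:", round(means.mean(), 3),
      "(population mean mu =", mu, ")")
print("spread of the sample means: ", round(means.std(ddof=1), 3),
      "(sigma / sqrt(n) =", round(sigma / n ** 0.5, 3), ")")
```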

 

Without deriving it algebraically, which is beyond this class, the reason that the standard error of the mean is the population standard deviation divided by the square root of n is the following:

For a population distribution, one is concerned about the variation of a set of individual numbers. For the sampling distribution, one is concerned about the variation in the means of samples, and each mean is the sum of n independent observations divided by n, the size of the sample. Averaging n independent observations cuts the variance by a factor of n compared to the population, so the standard deviation is cut by the square root of n.
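
For anyone who does want to see the algebra (not required for this class), here is a compact sketch using the standard rules for variances of sums of independent observations:

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i
\qquad\Longrightarrow\qquad
\operatorname{Var}(\bar{X})
  = \frac{1}{n^{2}}\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\, n\sigma^{2}
  = \frac{\sigma^{2}}{n},
\qquad
\operatorname{SE}(\bar{X}) = \sqrt{\frac{\sigma^{2}}{n}} = \frac{\sigma}{\sqrt{n}}.
```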

 

You can think of this in the extreme. Suppose the size of each repeatedly drawn sample is 1. Then each sample is equivalent to a single observation in the parent population, and it is not surprising that the standard error of the mean is sigma divided by the square root of one, which is just sigma itself.

 

 

On the other hand, assume that the sample size is the same as the population size. Then every time I draw a sample, it is in fact the same sample. The means of these samples will all be the same, so the standard error of the mean, the standard deviation of the sampling distribution, would be 0; that is, there should be no variation in the means of the samples. Because populations are usually considered to be large or infinite in size, dividing the population standard deviation by the square root of a very large number gives a standard error of the mean that is essentially 0.

 

You will note that the STANDARD ERROR OF THE MEAN decreases as the size of the individual samples increases. Later on we will see that, from a practical point of view, this means that the larger the sample I take, the more reliable the estimate of the population mean derived from the sample mean will be.
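
As a quick numerical illustration (sigma = 15 is again only a placeholder), the snippet below tabulates sigma divided by the square root of n for a few sample sizes; note that quadrupling the sample size cuts the standard error in half.

```python
import math

sigma = 15.0   # assumed population standard deviation (placeholder value)
for n in (10, 40, 160, 640):
    print(f"n = {n:3d}   standard error of the mean = {sigma / math.sqrt(n):.2f}")
```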

 

Obviously, if my sample were the whole population, as I just mentioned, then each sample mean would be exactly the same as the population mean, and the standard error of that mean would be 0.

 

The critical step in this process, the one that allows me to never have to obtain a sampling distribution of the mean, is the CENTRAL LIMIT THEOREM, which states that sampling distributions of certain classes of statistics, including the mean and the median, will approach a normal distribution as the sample size increases regardless of the shape of the sampled population. The key is that each sample is drawn independently of the previous one from a population.

 

The reason this is true for the sampling distribution of the mean is that the means of repeatedly drawn random samples each represent the sum of a large number of independent events (the randomly selected observations from the population that constitute each sample). As we noted in a previous presentation on the normal distribution, if the elements of a sample or population represent the sum of a large number of independent events, and there is no force operating to constrain the distribution obtained, then the relative frequency distribution of these elements will be approximately normally distributed. This is the case for the sampling distribution of the mean.

 

How large does the sample size need to be before the distribution of the sample means approaches normality? The population distribution shown in the top panel of the accompanying figure is clearly very non-normal. However, even with a sample size of 10 for repeatedly drawn samples from this population, the distribution of the sample means is fairly normal, and for sample sizes of 30 and 50 it is remarkably normal.
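
A simulation in the spirit of that figure is sketched below, using a right-skewed exponential population as an assumed stand-in for the non-normal population in the top panel; the skewness of the sample means shrinks toward 0 (the value for a normal curve) as the sample size grows from 10 to 30 to 50.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
reps = 20_000   # number of repeatedly drawn samples (illustrative choice)

# Assumed stand-in population: exponential with mean 1, which is strongly
# right-skewed (skewness about 2), unlike a normal curve (skewness 0)
for n in (10, 30, 50):
    sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    print(f"sample size n = {n:2d}   skewness of the sample means = {skew(sample_means):.2f}")
```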