Biostatistics I: Unit 01 - Populations and Samples Notes
Slide 1
In this learning unit, we will cover populations and samples.
Slide 2
The most important organizing concept of biostatistics
is the notion of populations and samples.
Using samples to make inferences about populations can save us inordinate
amounts of time and effort.
Therefore, an understanding of the differences between populations and samples
is essential.
Slide 3
I mentioned earlier that we could learn about the attitudes of students at a
university about a rise
in tuition by asking every one of the students to respond to a questionnaire or
alternatively by
taking a small sample of students and asking the same questions. The first
approach is a population
approach – that is, we gather information on everyone in the population. The
second is a sampling
approach. If we had access to everyone in every population and could obtain
data from them,
there would be no need for statistics. We wouldn’t have to estimate anything –
because we
could describe the population with certainty.
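The contrast between the two approaches can be sketched in a short simulation. This is an illustration only: the population of 20,000 students, the 62% opposition rate, and the sample size of 200 are made-up values, not figures from this course.

```python
import random

# Hypothetical population: 20,000 students; True means the student
# opposes the tuition increase. The 62% rate is invented for illustration.
rng = random.Random(1)
population = [rng.random() < 0.62 for _ in range(20_000)]


def proportion(responses):
    """Share of responses that are True."""
    return sum(responses) / len(responses)


# Population approach: ask everyone, so the result is known exactly.
population_value = proportion(population)

# Sampling approach: ask 200 students chosen at random and estimate.
sample = rng.sample(population, 200)
sample_estimate = proportion(sample)

print(f"population value: {population_value:.3f}")
print(f"sample estimate:  {sample_estimate:.3f}")
```

The sample estimate will usually be close to the population value but not identical to it, which is why estimation is needed whenever we rely on a sample.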
Slide 4
After this unit, you will understand the differences
between populations and samples as well as
the different ways populations and samples are described in biostatistics.
Slide 5
Let’s start with populations. There are two types of
populations: popular and statistical.
Each has unique characteristics.
Slide 6
Even statisticians disagree about what a population is. In common use, the
term population
refers to a set of persons or things. Populations defined in this way are called
popular populations.
They might include the population of students at a university, the population
of seniors in the
United States, the population of persons in Florida who if tested would be
positive for HIV,
the population of alligators in Florida, or the population of deer in Michigan
that carry
the tick responsible for Lyme disease.
Slide 7
The second kind of population called a statistical
population is made up of characteristics
of persons or things. Statistical populations, for example, could include the
set of all blood
pressures of students at a university, the set of antibody titers against HIV
of persons living
in
Slide 8 Self-assessment
Slide 9
Both uses of the word population – popular and
statistical – are correct, but there
are important reasons to distinguish them. When we are analyzing a
characteristic such as the titer
of HIV antibodies, we might know some more things about the persons with
different titers of
antibodies, such as their age or place of residence, but generally there are a
lot of things about
them that we don’t know. We might not know their sexual history or sexual
preference, for example.
The analyses we conduct in biostatistics are based on what is observed. Other
characteristics that
we do not observe are extraneous to our analyses and tests. Therefore, often
what we are studying
is not the people themselves, but rather some observable characteristics of
these people. In that sense,
we are studying statistical rather than popular populations.
While we usually study statistical populations, it is permissible to use the
popular population to
describe the entity being studied, as long as it is understood that it is the
characteristics of individuals
in this population that are the object of study, rather than the individuals
themselves.
Slide 10
Before we move on, we need to introduce 2 new terms:
data and variables. We use the word data to
refer to recordings of measurements made on characteristics. For example, the
names of people and their
blood pressures constitute data. While characteristics could be constant (for
example, the presence of a
brain in all humans), they are more likely to take on different values (for
example, names or blood pressures)
in which case we refer to them as variables, meaning simply entities that can
vary. Note that the name of a
particular person does not vary and therefore is not a variable, but his or her
blood pressure does vary from
time to time and therefore is a variable.
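The distinction between a constant characteristic and a variable can be sketched in a few lines. The records below are hypothetical repeated blood pressure readings on one person; the name, the values, and the helper `varies` are invented for illustration.

```python
# Hypothetical repeated measurements on one person (values invented).
records = [
    {"name": "Pat", "systolic_bp": 118},
    {"name": "Pat", "systolic_bp": 126},
    {"name": "Pat", "systolic_bp": 121},
]


def varies(records, characteristic):
    """Return True if the characteristic takes more than one value."""
    return len({record[characteristic] for record in records}) > 1


print(varies(records, "name"))         # False: constant for this person
print(varies(records, "systolic_bp"))  # True: changes between readings
```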
Slide 11
Now that we have introduced populations, data and variables, we can turn to
samples. A sample is simply
any subset of a population. For example, it could be the sample of patients
with lung cancer seen
at a regional cancer center this past year.
Slide 12
Such a sample may have unique characteristics that
distinguish it from the entire population of people
with lung cancer. For example, the patterns of referral to the regional cancer
center may yield
a group of patients that is younger or older, or more or less affluent, than
the population as a whole. Such samples are said to be biased;
that is, they do not fairly reflect the population of all such patients. For the
purposes of this course
we will focus on simple random samples, where the members of the sample can be
thought of
as being drawn from a hat from all possible members.
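The difference between a simple random sample and a biased sample can be sketched as follows. All of the numbers are assumptions made for illustration: the ages are simulated, and the rule that only patients under 65 are referred is a hypothetical referral pattern, not one described in these notes.

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical population: ages of 5,000 lung cancer patients (invented).
ages = [rng.randint(40, 90) for _ in range(5_000)]

# Simple random sample: every patient is equally likely to be chosen,
# like drawing names from a hat.
simple_random = rng.sample(ages, 100)

# Biased sample: suppose (hypothetically) that only patients under 65
# are referred to the regional center, and 100 of them are seen.
referred = [age for age in ages if age < 65]
biased = rng.sample(referred, 100)

print(f"population mean age:    {mean(ages):.1f}")
print(f"simple random sample:   {mean(simple_random):.1f}")
print(f"referral-biased sample: {mean(biased):.1f}")
```

Under this assumed referral rule the biased sample is always younger on average than the population, no matter how many patients are seen, whereas the simple random sample misses the population mean only by chance.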
Slide 13 Self-assessment
Slide 14
Most of the populations we concern ourselves with in epidemiology are very
large, and for purposes of the
analyses we conduct can be considered to be infinite in size; most of the
samples we use are small.
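One way to see why a very large population can be treated as infinite is the finite population correction, a standard factor (not introduced in these notes) that scales the standard error of a sample mean when sampling without replacement from a population of size N. The sketch below uses arbitrary illustrative sizes and shows the factor approaching 1 as the population grows relative to the sample.

```python
import math


def finite_population_correction(population_size, sample_size):
    """sqrt((N - n) / (N - 1)) for sampling without replacement."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))


sample_size = 100  # arbitrary illustrative sample size
for population_size in (200, 1_000, 100_000, 10_000_000):
    fpc = finite_population_correction(population_size, sample_size)
    print(f"N = {population_size:>10,}  correction = {fpc:.4f}")
```

When the correction is essentially 1, the calculations for an infinite population and for the actual finite population give practically the same answer.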
Slide 15
Populations in some cases may not be well defined or able to be enumerated. We
can still make inferences
from a sample of a population to that population even though the population is
not well defined.
Slide 16
We can define the following population: all people
currently living with
Alzheimer’s disease. Although we know that such a population exists, we cannot
enumerate it, because
we don’t know who these people are. However, we could select a sample of 100
individuals over 65
to whom we can administer a medication to prevent Alzheimer’s disease. The
sample is well defined,
because we can enumerate who is and who is not in the sample. We don’t need to
be able to identify
everyone in a population to make inferences about the efficacy of a preventive
medication in that population.
Slide 17 Self-assessment
Slide 18
There are 2 types of populations: popular and statistical. Often what we are
studying is not the people
themselves, but rather some observable characteristics of these people. Samples
are subsets of populations.
Most populations are very large; most samples are small. We do not need to be able
to enumerate a
population to be able to make inferences about its characteristics from
samples.