COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
More Precise Definition of Simple Random Sample
In applying statistical techniques, we are usually interested in random variables defined on the population under study. In the examples mentioned on the preceding page, we might be interested in:
Example 1: The difference in blood pressure with and without taking a certain drug.
Example 2: The number of heart bypass surgeries
performed in a particular year, or the number of such surgeries that
are successful, or the number in which the patient has complications of
surgery, etc.
Example 3: The number that comes up on the die.
If we take a sample of units from the population, we have a corresponding sample of values of the random variable. For example, if the random variable (let's call it Y) is the difference in blood pressure with and without taking the drug, then the sample will consist of values we can call
y1, y2, ..., yn,
where
n = the number of people in the sample from the population,
and where, listing the people in the sample as person 1, person 2, etc.,
y1 = the difference in blood pressures (that is, the value of Y) for the first person in the sample,
y2 = the difference in blood pressures (that is, the value of Y) for the second person in the sample,
etc.
If the next paragraph is too complex or mathematical for you, just skip to The Bottom Line below.
Abstractly, we can think of this situation as describing n random
variables Y1, Y2, ..., Yn as follows:
Y1 is defined as the value of Y for the first person in a sample of the population;
Y2 is defined as the value of Y for the second person in a sample of the population;
etc.
The difference between using the small y's and the large Y's is that when we use the small y's, we are thinking of a fixed sample of size n from the population, but when we use the large Y's, we are thinking of letting the sample vary (but always with size n).
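One way to picture the distinction between the small y's and the large Y's is a small simulation. The sketch below is illustrative only: the population values are made up, and `draw_sample` is a hypothetical helper, not something from the original text. It draws with replacement so that the draws do not constrain one another.

```python
import random

def draw_sample(population, n, rng):
    """Draw n values with replacement: one realization of (Y1, ..., Yn)."""
    return [rng.choice(population) for _ in range(n)]

rng = random.Random(0)

# Hypothetical population of blood-pressure differences (made-up numbers).
population = [-12, -8, -5, -3, 0, 1, 2, 4, 7, 10]
n = 4

# One fixed sample: these are the small y's (y1, ..., yn).
y = draw_sample(population, n, rng)
print("fixed sample:", y)

# Letting the sample vary: each repetition gives a new value of Y1.
first_values = [draw_sample(population, n, rng)[0] for _ in range(10_000)]
mean_Y1 = sum(first_values) / len(first_values)
population_mean = sum(population) / len(population)
print("average of Y1 over many samples:", mean_Y1)
print("population mean:", population_mean)
```

The list `y` is a single fixed set of numbers, whereas `first_values` records how Y1 behaves as the sample changes. Over many repetitions its average should settle near the population mean.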
Precise definition of simple random sample of a random variable:
"The sample y1, y2, ..., yn is a simple random sample" means that the associated random variables Y1, Y2, ..., Yn are independent.
Intuitively, "independent" means that the values of any subset of the random variables Y1, Y2, ..., Yn do not influence the values of the other random variables in the list. This is the mathematical formulation of "random process" in this situation.
Connection with the initial definition of simple random sample
Recall Example 3 above: We are tossing a die; the number that comes up on the die is our random variable Y. If we use Moore and McCabe's definition of simple random sample, our population is all possible tosses of the die. Our simple random sample is n different tosses. The different tosses of the die are independent events, which means that in the precise definition above, the random variables Y1, Y2, ..., Yn are indeed independent: The numbers that come up in some tosses in no way influence the numbers that come up in other tosses.
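The independence of die tosses can be checked informally by simulation. The sketch below assumes a fair six-sided die simulated with Python's standard `random` module; the variable names are illustrative. It compares the overall frequency of a six on the second toss with the frequency of a six on the second toss when the first toss was also a six.

```python
import random

rng = random.Random(1)
trials = 200_000

# Each pair is (Y1, Y2): the numbers on the first and second toss.
pairs = [(rng.randint(1, 6), rng.randint(1, 6)) for _ in range(trials)]

# Unconditional relative frequency of a six on the second toss.
p_second_six = sum(1 for _, second in pairs if second == 6) / trials

# Relative frequency of a six on the second toss, given a six on the first.
after_six = [second for first, second in pairs if first == 6]
p_second_six_given_first_six = (
    sum(1 for second in after_six if second == 6) / len(after_six)
)

print("P(Y2 = 6)          ~", round(p_second_six, 3))
print("P(Y2 = 6 | Y1 = 6) ~", round(p_second_six_given_first_six, 3))
```

If the tosses are independent, both estimates should be close to 1/6. Knowing the first toss tells us nothing about the second.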
Compare this with Example 2: Our population is all hospitals in the U.S. that perform heart bypass surgery. If we use Moore and McCabe's definition of simple random sample of size n, we end up with n distinct hospitals. This means that when we have chosen the first hospital in our simple random sample, we cannot choose it again to be in our simple random sample. Thus the events "Choose the first hospital in the sample; choose the second hospital in the sample; ..." are not independent events: The choice of first hospital restricts the choice of the second and subsequent hospitals in the sample.
If we now consider the random variable Y = the number of heart bypass surgeries performed in 2008, then a consequence is that the random variables Y1, Y2, ..., Yn are not independent.
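The dependence can be made concrete with a tiny, made-up population. The sketch below is an illustration under stated assumptions: the five surgery counts are invented, and `covariance` is a hypothetical helper. It enumerates every equally likely ordered pair (Y1, Y2) and computes the exact covariance, once for sampling without replacement and once for sampling with replacement.

```python
from fractions import Fraction
from itertools import permutations, product

def covariance(pairs):
    """Exact covariance of (Y1, Y2) when every pair in `pairs` is equally likely."""
    pairs = list(pairs)
    k = len(pairs)
    mean_first = Fraction(sum(a for a, _ in pairs), k)
    mean_second = Fraction(sum(b for _, b in pairs), k)
    mean_product = Fraction(sum(a * b for a, b in pairs), k)
    return mean_product - mean_first * mean_second

# Hypothetical surgery counts for a made-up population of 5 hospitals.
surgeries = [120, 340, 90, 510, 260]

# Without replacement: ordered pairs of two distinct hospitals.
cov_without = covariance(permutations(surgeries, 2))

# With replacement: the same hospital may be chosen twice.
cov_with = covariance(product(surgeries, repeat=2))

print("covariance without replacement:", cov_without)
print("covariance with replacement:", cov_with)
```

A nonzero covariance is enough to show that Y1 and Y2 are dependent. Without replacement the covariance is negative: drawing a hospital with a high count first makes a high count on the second draw slightly less likely. With replacement the covariance is zero, as independence requires.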
The Bottom Line: In many cases, the Moore-McCabe definition does not coincide with the more precise definition. More specifically, the Moore-McCabe definition allows sampling without replacement, whereas the more precise definition requires sampling with replacement.
The Bad News: The precise definition is the one that is used in the mathematical theorems that justify the procedures of statistical inference.
The Good News: If the population is large enough, the Moore-McCabe definition is close enough for all practical purposes. Unfortunately, the question "How large is large enough?" does not have a simple answer.
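One common way to get a rough feel for the effect of population size is the standard finite population correction, which is not part of the discussion above and is offered here only as a sketch. For a population of size N and a sample of size n, the variance of the sample mean under sampling without replacement equals the with-replacement variance multiplied by (N - n) / (N - 1). The function name and the sizes below are illustrative assumptions.

```python
def finite_population_correction(N, n):
    """Ratio of Var(sample mean) without replacement to that with replacement."""
    if not 1 <= n <= N or N < 2:
        raise ValueError("need N >= 2 and 1 <= n <= N")
    return (N - n) / (N - 1)

n = 100  # illustrative sample size
for N in (200, 1_000, 10_000, 1_000_000):
    ratio = finite_population_correction(N, n)
    print(f"N = {N:>9,}: variance ratio = {ratio:.4f}")
```

A ratio close to 1 suggests that the two sampling schemes behave almost identically for estimating a mean. How close is close enough still depends on the application, and this ratio addresses only the variance of the sample mean, not every inference procedure.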