
COMMON ~~MISTEAKS~~ MISTAKES IN USING STATISTICS: Spotting and Avoiding Them

# More Precise Definition of Simple Random Sample

In applying statistical techniques, we are usually interested in random variables defined on the population under study. For the examples mentioned on the preceding page, we might be interested in:

Example 1: The difference in blood pressure with and without taking a certain drug.
Example 2: The number of heart bypass surgeries performed in a particular year, or the number of such surgeries that are successful, or the number in which the patient has complications of surgery, etc.
Example 3: The number that comes up on the die.

If we take a sample of units from the population, we have a corresponding sample of values of the random variable.

For example, if the random variable (let's call it Y) is difference in blood pressure with and without taking the drug, then the sample will consist of values we can call
y1, y2, ..., yn,
where
n = number of people in the sample from the population
and where the people in the sample are listed as person 1, person 2, etc., so that
y1 = the difference in blood pressures (that is, the value of Y) for the first person in the sample,
y2 = the difference in blood pressures (that is, the value of Y) for the second person in the sample,
etc.

»» If the next paragraph is too complex or mathematical for you, just skip to The Bottom Line below.

Abstractly, we can think of this situation as describing n random variables Y1, Y2, ..., Yn as follows:
Y1 is defined as the value of Y for the first person in a sample of the population;
Y2 is defined as the value of Y for the second person in a sample of the population;
etc.
The difference between using the small y's and the large Y's is that when we use the small y's we are thinking of a fixed sample of size n from the population, but when we use the large Y's, we are thinking of letting the sample vary (but always with size n).

Precise definition of simple random sample of a random variable:

"The sample y1, y2, ..., yn is a simple random sample" means that the associated random variables Y1, Y2, ..., Yn are independent.

Intuitively, "independent" means that the values of any subset of the random variables Y1, Y2, ..., Yn do not influence the values of the other random variables in the list. This is the mathematical formulation of "random process" in this situation.
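For readers who want the formal statement behind this intuition, the standard textbook definition of independence can be written as below. This is the usual general formulation, not something spelled out on this page: the random variables Y1, Y2, ..., Yn are independent if, for every choice of real numbers a_1, ..., a_n (symbols introduced here only for illustration), the joint probability factors into a product.

```latex
\[
P(Y_1 \le a_1,\; Y_2 \le a_2,\; \ldots,\; Y_n \le a_n)
  = P(Y_1 \le a_1)\, P(Y_2 \le a_2) \cdots P(Y_n \le a_n)
\]
```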

Connection with the initial definition of simple random sample

Recall Example 3 above:

We are tossing a die;  the number that comes up on the die is our random variable Y.

If we use Moore and McCabe's definition of simple random sample, our population is all possible tosses of the die. Our simple random sample is n different tosses. The different tosses of the die are independent events, which means that in the precise definition above, the random variables Y1, Y2, ..., Yn are indeed independent: The numbers that come up in some tosses in no way influence the numbers that come up in other tosses.

Compare this with Example 2:

Our population is all hospitals in the U.S. that perform heart bypass surgery.

If we use Moore and McCabe's definition of simple random sample of size n, we end up with n distinct  hospitals.

This means that once we have chosen the first hospital in our simple random sample, we cannot choose it again. Thus the events "choose the first hospital in the sample," "choose the second hospital in the sample," and so on are not independent events: The choice of the first hospital restricts the choice of the second and subsequent hospitals in the sample.

If we now consider the random variable Y = the number of heart bypass surgeries performed in 2008, then a consequence is that the random variables Y1, Y2, ..., Yn are not independent.
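The lack of independence can be checked directly on a toy population. The sketch below is only an illustration: the five surgery counts are made-up numbers, and the helper name `covariance_of_first_two` is hypothetical. It enumerates every equally likely choice of the first two hospitals, once with replacement and once without, and computes the covariance of Y1 and Y2 in each case. A nonzero covariance is enough to show that two random variables are not independent.

```python
from itertools import permutations, product

# Hypothetical population: surgery counts at five hospitals (made-up numbers).
population = [120, 340, 95, 410, 230]


def covariance_of_first_two(pairs):
    """Covariance of (Y1, Y2) when every pair in `pairs` is equally likely."""
    pairs = list(pairs)
    m = len(pairs)
    mean1 = sum(a for a, _ in pairs) / m
    mean2 = sum(b for _, b in pairs) / m
    return sum((a - mean1) * (b - mean2) for a, b in pairs) / m


# With replacement: every ordered pair of hospitals, repeats allowed.
cov_with = covariance_of_first_two(product(population, repeat=2))

# Without replacement: every ordered pair of two distinct hospitals.
cov_without = covariance_of_first_two(permutations(population, 2))

print("covariance with replacement:   ", cov_with)
print("covariance without replacement:", cov_without)
```

With replacement the covariance is zero, as independence requires. Without replacement it is negative: a large count for the first hospital makes a large count for the second slightly less likely, because that hospital is no longer available.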

The Bottom Line:

In many cases, the Moore-McCabe definition does not coincide with the more precise definition.

More specifically, the Moore-McCabe definition allows sampling without replacement, whereas the more precise definition requires sampling with replacement.

The Bad News: The precise definition is the one that is used in the mathematical theorems that justify the procedures of statistical inference.

The Good News: If the population is large enough, the Moore-McCabe definition is close enough for all practical purposes.

Unfortunately, the question, "How large is large enough?" does not have a simple answer.
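One way to see why population size matters, without claiming to settle how large is large enough, is to look at how strongly the first two draws are correlated when sampling without replacement. The sketch below is an illustration under stated assumptions: the populations are simply the numbers 0, 1, ..., N-1, and the function name `correlation_without_replacement` is hypothetical. It computes the exact correlation of Y1 and Y2 by enumerating all ordered pairs of distinct units.

```python
from itertools import permutations


def correlation_without_replacement(values):
    """Exact correlation of (Y1, Y2) for the first two draws without replacement."""
    n_units = len(values)
    # Each single draw is uniform over the population, so Y1 and Y2
    # both have the population mean and the population variance.
    mean = sum(values) / n_units
    var = sum((v - mean) ** 2 for v in values) / n_units
    pairs = list(permutations(values, 2))  # ordered pairs of distinct units
    cov = sum((a - mean) * (b - mean) for a, b in pairs) / len(pairs)
    return cov / var


# Hypothetical populations of increasing size: values 0, 1, ..., size - 1.
results = {
    size: correlation_without_replacement(list(range(size)))
    for size in (5, 50, 500)
}

for size, corr in results.items():
    print(f"population size {size:>3}: correlation of Y1 and Y2 = {corr:.5f}")
```

The correlation stays negative but shrinks toward zero as the population grows. This is only one measure of dependence, and how small it needs to be depends on the procedure being used and on the sample size relative to the population.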