COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them

Overview of Frequentist Confidence Intervals
The General Situation:
- We are considering a random variable Y.
- We are interested in a certain parameter (e.g., a proportion, mean, regression coefficient, or variance) associated with the random variable Y.
- We do not know the value of the parameter.
- Goal 1: We would like to estimate the unknown parameter, using data from a sample.
- Goal 2: We would also like to get some sense of how good our estimate is.

The first goal is usually easier than the second.
Example: If the parameter we are interested in estimating is the mean of the random variable, then we can estimate it using a sample mean.¹
Important note on terminology:
- The mean of the random variable Y is also called the expected value or the expectation of Y. It is denoted E(Y). It is also called the population mean, often denoted µ. It is what we do not know in this example.
- A sample mean is typically denoted ȳ (read "y-bar"). It is calculated from a sample y₁, y₂, ..., yₙ of values of Y by the familiar formula ȳ = (y₁ + y₂ + ... + yₙ)/n.
- The population mean µ and a sample mean ȳ are usually not the same. Confusing them is a common mistake.
- Note that I have written "the population mean" but "a sample mean". A sample mean depends on the sample chosen. Since there are many possible samples, there are many possible sample means.
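To make the distinction concrete, here is a minimal Python sketch (the population and its values are made up for illustration) showing one population mean but different sample means for different samples:

    import random

    random.seed(1)

    # A hypothetical "population": heights (in inches) of adult American males.
    population = [random.gauss(69.0, 3.0) for _ in range(100_000)]
    mu = sum(population) / len(population)  # the one population mean

    # Two different simple random samples of size n = 25
    # give two different sample means.
    n = 25
    ybar1 = sum(random.sample(population, n)) / n
    ybar2 = sum(random.sample(population, n)) / n

    print(f"population mean mu = {mu:.2f}")
    print(f"sample mean 1      = {ybar1:.2f}")
    print(f"sample mean 2      = {ybar2:.2f}")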
The idea for the second goal: Although we typically have just one sample at hand when we do statistics, the reasoning used in frequentist (classical) inference depends on thinking about all possible suitable samples of the same size n. Which samples are considered "suitable" will depend on the particular statistical procedure to be used. Each statistical procedure has model assumptions that are needed to ensure that the reasoning behind the procedure is sound. The model assumptions determine which samples are "suitable." (Cf. Overview of Frequentist Hypothesis Testing.)
Example: The parameter we are interested in estimating is the population mean µ = E(Y) of the random variable Y.
- In this case, "suitable sample" turns out to be "simple random sample" (i.e., the model assumptions for the particular procedure require a simple random sample).
- So we collect a simple random sample, say of size n, consisting of observations y₁, y₂, ..., yₙ. (For example, if Y is "height of an adult American male", we take a simple random sample of n adult American males; y₁, y₂, ..., yₙ are their heights.)
- We use the sample mean ȳ = (y₁ + y₂ + ... + yₙ)/n as our estimate of µ. (This is an example of a point estimate -- a numerical estimate with no indication of how good the estimate is.)
- But to get an idea of how good our estimate is, we look at all possible simple random samples of size n from Y. (In the specific example, we consider all possible simple random samples of adult American males and, for each sample of men, the list of their heights.)
- One way we can get a sense of how good our estimate is in this situation is to consider the sample means for all possible simple random samples of size n from Y. This amounts to defining a new random variable, which we will call Ȳₙ (read "Y-bar sub n"). We can describe the random variable Ȳₙ as "the sample mean of a simple random sample of size n from Y", or perhaps more clearly as "pick a simple random sample of size n from Y and calculate its sample mean". Note that each value of Ȳₙ is an estimate of the population mean µ.
- This new random variable Ȳₙ has a distribution. This is called a sampling distribution, since it arises from considering varying samples. The distribution of Ȳₙ gives us information about the variability (as samples vary) of our method of estimating the population mean µ. (See the summary chart and picture of both distributions; see also the Rice Virtual Lab in Statistics' Sampling Distribution Simulation to visualize sampling distributions for a variety of parameters and a variety of distributions.)
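In the same spirit as those simulations, here is a small Python sketch (using the same hypothetical height population as above) that approximates the sampling distribution of Ȳₙ by repeatedly picking a simple random sample and recording its sample mean:

    import random
    import statistics

    random.seed(2)

    population = [random.gauss(69.0, 3.0) for _ in range(100_000)]
    n = 25

    # "Pick a simple random sample of size n from Y and calculate its
    # sample mean", repeated many times: an approximation to the
    # distribution of Y-bar sub n.
    sample_means = [statistics.mean(random.sample(population, n))
                    for _ in range(10_000)]

    print(f"mean of the sample means      = {statistics.mean(sample_means):.3f}")
    print(f"std. dev. of the sample means = {statistics.stdev(sample_means):.3f}")

The mean of the sample means comes out close to µ, and their standard deviation is much smaller than the population's (here roughly 3/√25 = 0.6), which is what the procedure described next exploits.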
- We don't know the sampling distribution (the distribution of Ȳₙ) exactly (in particular, it will depend on µ, which we don't know), but the model assumptions² will tell us enough so that it is possible to do the following:
- If we specify a probability (we'll use 0.95 to illustrate), we can find a number a so that
(*) The probability that Ȳₙ lies between µ - a and µ + a is 0.95.
Caution: It is important to get the reference category straight here. This amounts to keeping in mind what is a random variable and what is a constant. Here, Ȳₙ is the random variable (that is, the sample is varying), whereas µ and a are constants.³
- A little algebraic manipulation allows us to restate (*) as
(**) The probability that µ lies between Ȳₙ - a and Ȳₙ + a is 0.95.
(The manipulation: Ȳₙ lies within distance a of µ exactly when µ lies within distance a of Ȳₙ.)
Caution: It is again important to get the reference category correct here. It hasn't changed: it is still the sample that is varying, not µ. So the probability refers to Ȳₙ, not to µ. Thinking that the probability refers to µ is a common mistake in interpreting confidence intervals. It may help to restate (**) as
(***) The probability that the interval from Ȳₙ - a to Ȳₙ + a contains µ is 0.95.
Demos such as those posted by bioconsulting, R. Webster, or W. H. Freeman can help reinforce the correct interpretation.⁴ The Rice Virtual Lab in Statistics' Confidence Interval Simulation shows both 95% and 99% confidence intervals.
- We are now faced with two possibilities (assuming the model assumptions are indeed all true):
1) The sample we have taken is one of the 95% for which the interval from Ȳₙ - a to Ȳₙ + a contains µ.
2) Our sample is one of the 5% for which the interval from Ȳₙ - a to Ȳₙ + a does not contain µ.
Unfortunately, we can't know which of these two possibilities is true.
- Nonetheless, we calculate the values of Ȳₙ - a and Ȳₙ + a for the sample we have, and call the resulting interval a 95% confidence interval for µ. We can say that we have obtained the confidence interval by using a procedure which, for 95% of all simple random samples from Y of the given size, produces an interval containing the parameter we are estimating. Unfortunately, we can't know whether the sample we have used is one of the 95% of "good" samples that yield a confidence interval containing the true mean µ, or one of the 5% of "bad" samples that yield a confidence interval that does not contain the true mean µ. We can just say that we have used a procedure that "works" 95% of the time.
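The "works 95% of the time" claim can be checked by simulation. The sketch below (hypothetical normal height population again, with a computed by the large-sample z-procedure of Note 2) counts how often the interval from ȳ - a to ȳ + a actually contains µ:

    import random
    import statistics

    random.seed(3)

    population = [random.gauss(69.0, 3.0) for _ in range(100_000)]
    mu = statistics.mean(population)
    n, z = 100, 1.96  # 1.96 is the z multiplier for 95% confidence

    trials, hits = 10_000, 0
    for _ in range(trials):
        sample = random.sample(population, n)
        ybar = statistics.mean(sample)
        a = z * statistics.stdev(sample) / n ** 0.5  # half-width of the interval
        if ybar - a <= mu <= ybar + a:
            hits += 1

    print(f"fraction of intervals containing mu: {hits / trials:.3f}")

The printed fraction should come out close to 0.95; for any single interval, though, we still cannot tell whether it is one of the "good" ones.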
In general: We can follow a similar procedure for many other situations to obtain confidence intervals for parameters.
- Each type of confidence interval procedure has its own model assumptions; if the model assumptions are not true, we are not sure that the procedure does what is claimed. However, some procedures are robust, to some degree, to some departures from model assumptions -- i.e., the procedure still does approximately what is intended if the model assumption is not too far from true. Robustness depends on the particular procedure; there are no "one size fits all" rules. See Using an Inappropriate Method of Analysis for more details.
- We can decide on the "level of confidence" we want; that is, we can choose 90%, 99%, etc., rather than 95%. Just which level of confidence is appropriate depends on the circumstances.
- The confidence level determines the percentage of samples for which the procedure results in an interval containing the true parameter.
- However, a higher level of confidence will produce a wider confidence interval -- that is, a less precise estimate. So there is a trade-off between degree of confidence and precision of the estimate (see the numerical sketch just after this list).
- Sometimes the best we can do is a procedure that only gives approximate confidence intervals -- that is, the sampling distribution can be described only approximately.
- If the sampling distribution is not symmetric, we can't expect the confidence interval to be symmetric around the estimate. There may then be slightly different procedures for calculating the two endpoints of the confidence interval.
- There are variations such as "upper confidence limits" or "lower confidence limits", where we are only interested in estimating how large or how small the parameter might be.
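The trade-off between confidence level and interval width mentioned above can be seen numerically. This sketch uses the half-width a = z·s/√n of the large-sample z-interval from Note 2, with hypothetical values s = 3 and n = 100 (1.645, 1.96, and 2.576 are the standard normal multipliers for the three levels):

    # Half-width a = z * s / sqrt(n) of the confidence interval,
    # for several confidence levels; s and n are hypothetical.
    s, n = 3.0, 100
    for level, z in [(0.90, 1.645), (0.95, 1.96), (0.99, 2.576)]:
        a = z * s / n ** 0.5
        print(f"{level:.0%} confidence: half-width a = {a:.3f}")

Higher confidence requires a larger multiplier z, and hence a wider interval.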
Notes:
1. The sample will need to be a simple random sample in order for the sample mean to be a reasonable estimate of the population mean; see also biased sampling.
2. The inference procedure behind this example is known as the "large-sample z-procedure for the mean." It is only an approximate procedure, but it is quite good for large sample sizes. Since it is simpler than the t-procedure, which is an "exact" procedure for inference for a mean, it is a good example for illustrating the basic idea of a sampling distribution, especially if we include a model assumption that Y is normally distributed. This, plus the other model assumptions, will imply that Ȳₙ also has a normal distribution. The only problem then is that we don't know the standard deviation of the sampling distribution -- but for n large enough, the sample standard deviation is close enough to the standard deviation of Y to give a good approximation to a.
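For concreteness, here is a minimal Python sketch of this z-procedure (the data are hypothetical, and 1.96 is the multiplier for 95% confidence):

    import random
    import statistics
    from math import sqrt

    def z_interval(sample, z=1.96):
        """Approximate 95% confidence interval for the mean, by the
        large-sample z-procedure: ybar +/- z * s / sqrt(n), with the
        sample standard deviation s standing in for the unknown one."""
        n = len(sample)
        ybar = statistics.mean(sample)
        a = z * statistics.stdev(sample) / sqrt(n)
        return ybar - a, ybar + a

    # Hypothetical data: 40 observed heights (in inches).
    random.seed(4)
    heights = [random.gauss(69.0, 3.0) for _ in range(40)]
    low, high = z_interval(heights)
    print(f"approximate 95% confidence interval: ({low:.2f}, {high:.2f})")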
3. Stating (*) as above may help keep the reference category straight. Restating it as "The probability that µ - a < Ȳₙ < µ + a is 0.95" may prompt the common mistake of thinking that µ is the variable. This highlights an important difference between frequentist and Bayesian statistics: in frequentist statistics, parameters are assumed to be constant but unknown, whereas in Bayesian inference for parameters, the parameter may be assumed to be a random variable, with different values corresponding to different "states of nature".
4. Still another way to rephrase this: define two new random variables Lₙ = Ȳₙ - a and Rₙ = Ȳₙ + a. Then the probability that a simple random sample from Y will have the property that the interval from Lₙ to Rₙ contains µ is 0.95.