COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them

Overview of Frequentist Confidence Intervals
The General Situation:
- We are considering a random variable Y.
- We are interested in a certain parameter (e.g., a proportion, mean, regression coefficient, or variance) associated with the random variable Y.
- We do not know the value of the parameter.
- Goal 1: We would like to estimate the unknown parameter, using data from a sample.
- Goal 2: We would also like to get some sense of how good our estimate is.

The first goal is usually easier than the second.
Example: If the parameter we are interested in estimating is the mean of the random variable, then we can estimate it using a sample mean.¹
Important note on terminology:
- The mean of the random variable Y is also called the expected value or the expectation of Y. It is denoted E(Y). It is also called the population mean, often denoted µ. It is what we do not know in this example.
- A sample mean is typically denoted ȳ (read "y-bar"). It is calculated from a sample y₁, y₂, ..., yₙ of values of Y by the familiar formula ȳ = (y₁ + y₂ + ... + yₙ)/n.
- The population mean µ and a sample mean ȳ are usually not the same. Confusing them is a common mistake.
- Note that I have written "the population mean" but "a sample mean". A sample mean depends on the sample chosen. Since there are many possible samples, there are many possible sample means.
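To make the distinction concrete, here is a minimal Python sketch (the population and its values are made up for illustration) showing one population mean but different sample means for different samples:

    import random

    random.seed(1)

    # A hypothetical "population": heights (in inches) of adult American males.
    population = [random.gauss(69.0, 3.0) for _ in range(100_000)]
    mu = sum(population) / len(population)  # the one population mean

    # Two different simple random samples of size n = 25
    # give two different sample means.
    n = 25
    ybar1 = sum(random.sample(population, n)) / n
    ybar2 = sum(random.sample(population, n)) / n

    print(f"population mean mu = {mu:.2f}")
    print(f"sample mean 1      = {ybar1:.2f}")
    print(f"sample mean 2      = {ybar2:.2f}")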
The idea for the second goal: Although we typically have just one sample at hand when we do statistics, the reasoning used in frequentist (classical) inference depends on thinking about all possible suitable samples of the same size n. Which samples are considered "suitable" will depend on the particular statistical procedure to be used. Each statistical procedure has model assumptions that are needed to ensure that the reasoning behind the procedure is sound. The model assumptions determine which samples are "suitable." (Cf. Overview of Frequentist Hypothesis Testing.)
Example: The parameter we are interested in estimating is the population mean µ = E(Y) of the random variable Y.
- In this case, "suitable sample" turns out to be "simple random sample" (i.e., the model assumptions for the particular procedure require a simple random sample).
- So we collect a simple random sample, say of size n, consisting of observations y₁, y₂, ..., yₙ. (For example, if Y is "height of an adult American male", we take a simple random sample of n adult American males; y₁, y₂, ..., yₙ are their heights.)
- We use the sample mean ȳ = (y₁ + y₂ + ... + yₙ)/n as our estimate of µ. (This is an example of a point estimate -- a numerical estimate with no indication of how good the estimate is.)
- But to get an idea of how good our estimate is, we look at all possible simple random samples of size n from Y. (In the specific example, we consider all possible simple random samples of adult American males and, for each sample of men, the list of their heights.)
- One way we can get a sense of how good our estimate is in this situation is to consider the sample means for all possible simple random samples of size n from Y. This amounts to defining a new random variable, which we will call Ȳₙ (read "Y-bar sub n"). We can describe the random variable Ȳₙ as "the sample mean of a simple random sample of size n from Y", or perhaps more clearly as "pick a simple random sample of size n from Y and calculate its sample mean". Note that each value of Ȳₙ is an estimate of the population mean µ.
- This new random variable Ȳₙ has a distribution. This is called a sampling distribution, since it arises from considering varying samples. The distribution of Ȳₙ gives us information about the variability (as samples vary) of our method of estimating the population mean µ. (See the summary chart and picture of both distributions; see also the Rice Virtual Lab in Statistics' Sampling Distribution Simulation to visualize sampling distributions for a variety of parameters and a variety of distributions.)
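In the same spirit as those simulations, here is a small Python sketch (using the same hypothetical height population as above) that approximates the sampling distribution of Ȳₙ by repeatedly picking a simple random sample and recording its sample mean:

    import random
    import statistics

    random.seed(2)

    population = [random.gauss(69.0, 3.0) for _ in range(100_000)]
    n = 25

    # "Pick a simple random sample of size n from Y and calculate its
    # sample mean", repeated many times: an approximation to the
    # distribution of Y-bar sub n.
    sample_means = [statistics.mean(random.sample(population, n))
                    for _ in range(10_000)]

    print(f"mean of the sample means      = {statistics.mean(sample_means):.3f}")
    print(f"std. dev. of the sample means = {statistics.stdev(sample_means):.3f}")

The mean of the sample means comes out close to µ, and their standard deviation is much smaller than the population's (here roughly 3/√25 = 0.6), which is what the procedure described next exploits.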
- We don't know the sampling distribution (the distribution of Ȳₙ) exactly (in particular, it will depend on µ, which we don't know), but the model assumptions² will tell us enough so that it is possible to do the following:
- If we specify a probability (we'll use 0.95 to illustrate), we can find a number a so that
(*) The probability that Ȳₙ lies between µ - a and µ + a is 0.95.
Caution: It is important to get the reference category straight here. This amounts to keeping in mind what is a random variable and what is a constant. Here, Ȳₙ is the random variable (that is, the sample is varying), whereas µ and a are constants.³
- A little algebraic manipulation allows us to restate (*) as
(**) The probability that µ lies between Ȳₙ - a and Ȳₙ + a is 0.95.
(The manipulation: Ȳₙ lies within distance a of µ exactly when µ lies within distance a of Ȳₙ.)
Caution: It is again important to get the reference category correct here. It hasn't changed: it is still the sample that is varying, not µ. So the probability refers to Ȳₙ, not to µ. Thinking that the probability refers to µ is a common mistake in interpreting confidence intervals. It may help to restate (**) as
(***) The probability that the interval from Ȳₙ - a to Ȳₙ + a contains µ is 0.95.
Demos such as those posted by bioconsulting, R. Webster, or W. H. Freeman can help reinforce the correct interpretation.⁴ The Rice Virtual Lab in Statistics' Confidence Interval Simulation shows both 95% and 99% confidence intervals.
- We are now faced with two possibilities (assuming the model assumptions are indeed all true):
1) The sample we have taken is one of the 95% for which the interval from Ȳₙ - a to Ȳₙ + a contains µ.
2) Our sample is one of the 5% for which the interval from Ȳₙ - a to Ȳₙ + a does not contain µ.
Unfortunately, we can't know which of these two possibilities is true.
- Nonetheless, we calculate the values of Ȳₙ - a and Ȳₙ + a for the sample we have, and call the resulting interval a 95% confidence interval for µ. We can say that we have obtained the confidence interval by using a procedure which, for 95% of all simple random samples from Y of the given size, produces an interval containing the parameter we are estimating. Unfortunately, we can't know whether the sample we have used is one of the 95% of "good" samples that yield a confidence interval containing the true mean µ, or one of the 5% of "bad" samples that yield a confidence interval that does not contain the true mean µ. We can just say that we have used a procedure that "works" 95% of the time.
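The "works 95% of the time" claim can be checked by simulation. The sketch below (hypothetical normal height population again, with a computed by the large-sample z-procedure of Note 2) counts how often the interval from ȳ - a to ȳ + a actually contains µ:

    import random
    import statistics

    random.seed(3)

    population = [random.gauss(69.0, 3.0) for _ in range(100_000)]
    mu = statistics.mean(population)
    n, z = 100, 1.96  # 1.96 is the z multiplier for 95% confidence

    trials, hits = 10_000, 0
    for _ in range(trials):
        sample = random.sample(population, n)
        ybar = statistics.mean(sample)
        a = z * statistics.stdev(sample) / n ** 0.5  # half-width of the interval
        if ybar - a <= mu <= ybar + a:
            hits += 1

    print(f"fraction of intervals containing mu: {hits / trials:.3f}")

The printed fraction should come out close to 0.95; for any single interval, though, we still cannot tell whether it is one of the "good" ones.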
In general: We can follow a similar procedure for many other situations to obtain confidence intervals for parameters.
- Each type of confidence interval procedure has its own model assumptions; if the model assumptions are not true, we are not sure that the procedure does what is claimed. However, some procedures are robust, to some degree, to some departures from model assumptions -- i.e., the procedure still does approximately what is intended if the model assumption is not too far from true. Robustness depends on the particular procedure; there are no "one size fits all" rules. See Using an Inappropriate Method of Analysis for more details.
- We can decide on the "level of confidence" we want; that is, we can choose 90%, 99%, etc., rather than 95%. Just which level of confidence is appropriate depends on the circumstances.
- The confidence level determines the percentage of samples for which the procedure results in an interval containing the true parameter.
- However, a higher level of confidence will produce a wider confidence interval -- that is, a less precise estimate. So there is a trade-off between degree of confidence and precision of the estimate (see the numerical sketch just after this list).
- Sometimes the best we can do is a procedure that only gives approximate confidence intervals -- that is, the sampling distribution can be described only approximately.
- If the sampling distribution is not symmetric, we can't expect the confidence interval to be symmetric around the estimate. There may then be slightly different procedures for calculating the two endpoints of the confidence interval.
- There are variations such as "upper confidence limits" or "lower confidence limits", where we are only interested in estimating how large or how small the parameter might be.
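The trade-off between confidence level and interval width mentioned above can be seen numerically. This sketch uses the half-width a = z·s/√n of the large-sample z-interval from Note 2, with hypothetical values s = 3 and n = 100 (1.645, 1.96, and 2.576 are the standard normal multipliers for the three levels):

    # Half-width a = z * s / sqrt(n) of the confidence interval,
    # for several confidence levels; s and n are hypothetical.
    s, n = 3.0, 100
    for level, z in [(0.90, 1.645), (0.95, 1.96), (0.99, 2.576)]:
        a = z * s / n ** 0.5
        print(f"{level:.0%} confidence: half-width a = {a:.3f}")

Higher confidence requires a larger multiplier z, and hence a wider interval.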
Notes:
1. The sample will need to be a simple random sample in order for the sample mean to be a reasonable estimate of the population mean; see also biased sampling.
2. The inference procedure behind this example is known as the "large-sample z-procedure for the mean." It is only an approximate procedure, but it is quite good for large sample sizes. Since it is simpler than the t-procedure, which is an "exact" procedure for inference for a mean, it is a good example for illustrating the basic idea of a sampling distribution, especially if we include a model assumption that Y is normally distributed. This, plus the other model assumptions, will imply that Ȳₙ also has a normal distribution. The only problem then is that we don't know the standard deviation of the sampling distribution -- but for n large enough, the sample standard deviation is close enough to the standard deviation of Y to give a good approximation to a.
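For concreteness, here is a minimal Python sketch of this z-procedure (the data are hypothetical, and 1.96 is the multiplier for 95% confidence):

    import random
    import statistics
    from math import sqrt

    def z_interval(sample, z=1.96):
        """Approximate 95% confidence interval for the mean, by the
        large-sample z-procedure: ybar +/- z * s / sqrt(n), with the
        sample standard deviation s standing in for the unknown one."""
        n = len(sample)
        ybar = statistics.mean(sample)
        a = z * statistics.stdev(sample) / sqrt(n)
        return ybar - a, ybar + a

    # Hypothetical data: 40 observed heights (in inches).
    random.seed(4)
    heights = [random.gauss(69.0, 3.0) for _ in range(40)]
    low, high = z_interval(heights)
    print(f"approximate 95% confidence interval: ({low:.2f}, {high:.2f})")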
3. Stating (*) as above may help keep the reference category straight. Restating it as "The probability that µ - a < Ȳₙ < µ + a is 0.95" may prompt the common mistake of thinking that µ is the variable. This highlights an important difference between frequentist and Bayesian statistics: in frequentist statistics, parameters are assumed to be constant but unknown, whereas in Bayesian inference for parameters, the parameter may be assumed to be a random variable, with different values corresponding to different "states of nature".
4. Still another way to rephrase this: define two new random variables Lₙ = Ȳₙ - a and Rₙ = Ȳₙ + a. Then the probability that a simple random sample from Y will have the property that the interval from Lₙ to Rₙ contains µ is 0.95.