COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them

- We are considering a random variable Y.
- We are interested in a certain parameter (e.g., a proportion, or mean, or regression coefficient, or variance) associated with the random variable Y.
- We do not know the value of the parameter.
- Goal 1: We would like to estimate the unknown parameter, using data from a sample.
- Goal 2: We would also like to get some sense of how good our estimate is.

Example: If the parameter we are interested in estimating is the mean of the random variable, then we can estimate it using a sample mean.
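A minimal sketch of this idea in code. The population here is hypothetical (Y normally distributed with mean 70 and standard deviation 3, chosen only for illustration); in real work the mean is unknown, but fixing it in a simulation lets us compare the estimate to the truth:

```python
import random
import statistics

random.seed(1)

# Hypothetical population: Y ~ Normal(mu=70, sigma=3).
# In practice mu is unknown; we fix it here only so we can compare.
mu, sigma = 70.0, 3.0

# Draw one simple random sample of size n and estimate mu by the sample mean.
n = 25
sample = [random.gauss(mu, sigma) for _ in range(n)]
y_bar = statistics.mean(sample)

print(round(y_bar, 2))  # typically close to 70, but not exactly 70
```

A different seed (that is, a different sample) would give a different sample mean, which is exactly the point of the terminology note below.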

Important note on terminology:

- The mean of the random variable Y is also called the expected value or the expectation of Y. It is denoted E(Y). It is also called the population mean, often denoted µ. It is what we do not know in this example.
- A sample mean is typically denoted ȳ (read "y-bar"). It is calculated from a sample y_1, y_2, ... , y_n of values of Y by the familiar formula ȳ = (y_1 + y_2 + ... + y_n)/n.
- The population mean µ and a sample mean ȳ are usually not the same. Confusing them is a common mistake.
- Note that I have written "the population mean" but "a sample mean". A sample mean depends on the sample chosen. Since there are many possible samples, there are many possible sample means.

The idea for the second goal: Although we typically have just one sample at hand when we do statistics, the reasoning used in frequentist (classical) inference depends on thinking about all possible suitable samples of the same size n. Which samples are considered "suitable" will depend on the particular statistical procedure to be used. Each statistical procedure has model assumptions that are needed to ensure that the reasoning behind the procedure is sound. The model assumptions determine which samples are "suitable." (Cf. Overview of Frequentist Hypothesis Testing.)

Example: The parameter we are interested in estimating is the population mean µ = E(Y) of the random variable Y.

- In this case, "suitable sample" turns out to be "simple random sample" (i.e., the model assumptions for the particular procedure require a simple random sample).
- So we collect a simple random sample, say of size n, consisting of observations y_1, y_2, ... , y_n. (For example, if Y is "height of an adult American male", we take a simple random sample of n adult American males; y_1, y_2, ... , y_n are their heights.)
- We use the sample mean ȳ = (y_1 + y_2 + ... + y_n)/n as our estimate of µ. (This is an example of a point estimate -- a numerical estimate with no indication of how good the estimate is.)
- But to get an idea of how good our estimate is, we look at all possible simple random samples of size n from Y. (In the specific example, we consider all possible simple random samples of adult American males, and for each sample of men, the list of their heights.)
- One way we can get a sense of how good our estimate is in this situation is to consider the sample means for all possible simple random samples of size n from Y. This amounts to defining a new random variable, which we will call Ȳ_n (read "Y-bar sub n"). We can describe the random variable Ȳ_n as "sample mean of a simple random sample of size n from Y", or perhaps more clearly as "pick a simple random sample of size n from Y and calculate its sample mean". Note that each value of Ȳ_n is an estimate of the population mean µ.
- This new random variable Ȳ_n has a distribution. This is called a sampling distribution, since it arises from considering varying samples. The distribution of Ȳ_n gives us information about the variability (as samples vary) of our method of estimating the population mean µ. (See the Rice Virtual Lab in Statistics' Sampling Distribution Simulation to visualize sampling distributions for a variety of parameters and a variety of distributions.)
- We don't know the sampling distribution (the distribution of Ȳ_n) exactly (in particular, it will depend on µ, which we don't know), but the model assumptions^{2} will tell us enough so that it is possible to do the following:

- If we specify a probability (we'll use .95 to illustrate), we can find a number a so that

(*) The probability that Ȳ_n lies between µ - a and µ + a is 0.95.
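For the large-sample z-procedure mentioned in note 2, with σ treated as known, the number a is 1.96·σ/√n (1.96 being the standard normal value that cuts off 2.5% in each tail). A simulation sketch, with µ, σ, and n chosen only for illustration, that approximates the probability in (*) by drawing many samples:

```python
import math
import random

random.seed(2)

# Hypothetical population parameters, fixed only for the simulation.
mu, sigma, n = 50.0, 10.0, 40
a = 1.96 * sigma / math.sqrt(n)  # half-width so that P(mu - a < Ybar_n < mu + a) should be about 0.95

# Approximate the sampling distribution of Ybar_n by drawing many samples of size n.
trials = 20_000
hits = 0
for _ in range(trials):
    y_bar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if mu - a < y_bar < mu + a:
        hits += 1

print(hits / trials)  # roughly 0.95
```

Note that in the loop it is y_bar that changes from trial to trial; mu and a never change.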

Caution: It is important to get the reference category straight here. This amounts to keeping in mind what is a random variable and what is a constant. Here, Ȳ_n is the random variable (that is, the sample is varying), whereas µ and a are constants.^{3}

- A little algebraic manipulation allows us to restate (*) as

(**) The probability that µ lies between Ȳ_n - a and Ȳ_n + a is 0.95.

Caution: It is again important to get the reference category correct here. It hasn't changed: it is still the sample that is varying, not µ. So the probability refers to Ȳ_{n}, not to µ. Thinking that the probability
refers to µ
is a common mistake in
interpreting confidence intervals. It may help to restate (**)
as

(***) The probability that the interval from Ȳ_{n} - a to Ȳ_{n} + a
contains µ
is .95.

A demo such as those posted by bioconsulting, R. Webster or W. H. Freeman can help reinforce the correct interpretation.^{4}
The Rice Virtual Lab in Statistics' Confidence
Interval Simulation shows both 95% and 99% confidence intervals.

Caution: It is again important to get the reference category correct here. It hasn't changed: it is still the sample that is varying, not µ. So the probability refers to Ȳ

(***) The probability that the interval from Ȳ

A demo such as those posted by bioconsulting, R. Webster or W. H. Freeman can help reinforce the correct interpretation.
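The same point can be reinforced with a few lines of code, in the spirit of those demos: build the interval from each of several independent samples and see that most, but usually not all, of the intervals contain µ. Each single interval either contains µ or it doesn't; the 95% describes the procedure. (A sketch with σ assumed known and hypothetical parameter values.)

```python
import math
import random
import statistics

random.seed(4)

# Hypothetical population, fixed only so we can check the intervals.
mu, sigma, n = 100.0, 15.0, 30
a = 1.96 * sigma / math.sqrt(n)

# Build the interval (y_bar - a, y_bar + a) from 20 independent samples
# and record whether each one contains mu.
results = []
for _ in range(20):
    y_bar = statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    results.append(y_bar - a < mu < y_bar + a)

print(results.count(True), "of", len(results), "intervals contain mu")
```

In the long run about 95% of such intervals contain µ; in any particular batch of 20 the count varies.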

- We are now faced with two possibilities (assuming the model assumptions are indeed all true):

1) The sample we have taken is one of the 95% for which the interval from Ȳ_n - a to Ȳ_n + a contains µ.

2) Our sample is one of the 5% for which the interval from Ȳ_n - a to Ȳ_n + a does not contain µ.

Unfortunately, we can't know which of these two possibilities is true.

- Nonetheless, we calculate the values of Ȳ_n - a and Ȳ_n + a for the sample we have, and call the resulting interval a 95% confidence interval for µ. We can say that we have obtained the confidence interval by using a procedure which, for 95% of all simple random samples from Y of the given size, produces an interval containing the parameter we are estimating. Unfortunately, we can't know whether or not the sample we have used is one of the 95% of "good" samples that yield a confidence interval containing the true mean µ, or whether it is one of the 5% of "bad" samples that yield a confidence interval that does not contain the true mean µ. We can just say that we have used a procedure that "works" 95% of the time.

In general: We can follow a similar procedure for many other situations to obtain confidence intervals for parameters.

- Each type of confidence interval procedure has its own model assumptions; if the model assumptions are not true, we are not sure that the procedure does what is claimed. However, some procedures are robust to some degree to some departures from model assumptions -- i.e., the procedure works pretty much as intended if the model assumptions are not too far from true. Robustness depends on the particular procedure; there are no "one size fits all" rules; see Using an Inappropriate Method of Analysis for more details.
- We can decide on the "level of confidence" we want; that is, we can choose 90%, 99%, etc. rather than 95%. Just which level of confidence is appropriate depends on the circumstances.
- The confidence level determines the percentage of samples for which the procedure results in an interval containing the true parameter.
- However, a higher level of confidence will produce a wider confidence interval -- that is, less certainty in our estimate. So there is a trade-off between degree of confidence and degree of certainty.
- Sometimes the best we can do is a procedure that only gives approximate confidence intervals -- that is, the sampling distribution can be described only approximately.
- If the sampling distribution is not symmetric, we can't expect the confidence interval to be symmetric around the estimate. There may be slightly different procedures for calculating the endpoints of the confidence interval.
- There are variations such as "upper confidence limits" or "lower confidence limits" where we are only interested in estimating how large or how small the estimate might be.
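For a one-sided limit, the entire 5% error probability goes on one side, so for the z-procedure the standard normal value 1.645 (cutting off 5% in one tail) replaces 1.96. A sketch, again with σ assumed known and illustrative values:

```python
import math
import random
import statistics

random.seed(5)

# Hypothetical population and one simple random sample from it.
mu, sigma, n = 10.0, 2.0, 50
sample = [random.gauss(mu, sigma) for _ in range(n)]
y_bar = statistics.mean(sample)

# 95% upper confidence limit: for 95% of samples, mu < y_bar + 1.645*sigma/sqrt(n).
upper = y_bar + 1.645 * sigma / math.sqrt(n)
# 95% lower confidence limit: for 95% of samples, mu > y_bar - 1.645*sigma/sqrt(n).
lower = y_bar - 1.645 * sigma / math.sqrt(n)

print(round(lower, 2), round(upper, 2))
```

Each one-sided limit is closer to ȳ than the endpoint of the two-sided 95% interval would be, since it only has to guard against error in one direction.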

Notes:

1. The sample will need to be a simple random sample in order for the sample mean to be a reasonable estimate of the population mean; see also biased sampling.

2. The inference procedure behind this example is known as the "large-sample z-procedure for the mean." It is only an approximate procedure, but it is quite good for large sample sizes. Since it is simpler than the t-procedure that is an "exact" procedure for inference for a mean, it is a good example for illustrating the basic idea of sampling distribution, especially if we include a model assumption that Y is normally distributed. This, plus the other model assumptions, will imply that Ȳ_n is also normally distributed, with mean µ and standard deviation σ/√n (where σ is the standard deviation of Y).

3. Stating (*) as above may help keep the reference category straight. Restating it as

"The probability that µ - a < Ȳ_n < µ + a is 0.95"

may prompt the common mistake of thinking that µ is the variable. This highlights an important difference between frequentist and Bayesian statistics: In frequentist statistics, parameters are assumed to be constant but unknown, whereas in Bayesian inference for parameters, the parameter may be assumed to be a random variable, with different values corresponding to different "states of nature".

4. Still another way to rephrase this: Define two new random variables L = Ȳ_n - a and U = Ȳ_n + a; then (***) says that the probability that the interval from L to U contains µ is 0.95.