COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
Using Plots to Check Model Assumptions
Overall Cautions:
1. Unfortunately, these methods are typically better at telling you when a model assumption does not fit than when it does.
2. Different techniques have different model assumptions, so additional model checking plots may be needed; be sure to consult a good reference for the particular technique you are considering using.
General Rule of Thumb:
First check any independence assumptions, then any equal variance
assumption, then any assumption on distribution (e.g., normal) of
variables.
Rationale:
Techniques are usually least robust to departures from independence [1]
and most robust to departures from normality [2, 3].
Suggestions and Guidelines for Checking Specific Model Assumptions
Checking for Independence
Independence assumptions are usually formulated in terms of error terms rather than in terms of the outcome variables. For example, in simple linear regression, the model equation is
Y = α + βx + ε,
where Y is the outcome (response) variable and ε denotes the error term (also a random variable). It is the error terms that are assumed to be independent [4], not the values of the response variable.
We do not know the values of the error terms ε, so we can only plot the residuals e_i (defined as the observed value y_i minus the fitted value, according to the model), which approximate the error terms.
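As a minimal sketch of how residuals are obtained, the following fits the simple linear regression model by least squares and subtracts the fitted values from the observed ones. The data and the function names (fit_simple_linear, residuals) are invented for illustration and are not part of the original discussion.

```python
from statistics import fmean

def fit_simple_linear(x, y):
    """Least-squares estimates (a, b) for the model y = a + b*x + error."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def residuals(x, y, a, b):
    """Residual e_i = observed y_i minus fitted value a + b*x_i."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

a, b = fit_simple_linear(x, y)
e = residuals(x, y, a, b)
print(f"estimated intercept = {a:.3f}, estimated slope = {b:.3f}")
print("residuals:", [round(ei, 3) for ei in e])
```

The list e is what would be plotted against time, spatial, or predictor variables in the checks that follow.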
Rule of Thumb:
To check independence, plot residuals against any time variables
present (e.g., order of observation), any spatial variables
present, and any variables used in the technique (e.g., factors,
regressors). A pattern that is not random suggests lack of independence.
Rationale:
Dependence on time or spatial variables is a common source of lack of independence, but the
other plots might also detect lack of independence.
Comments:
1. Because time or
spatial correlations are so frequent, it is
important when making observations to record
any time or spatial variables that could conceivably influence results.
This not only allows you to make the residual plots to detect
possible lack of independence, but also allows you to change to a
technique incorporating additional time or spatial variables if lack of
independence is detected in these plots.
2. Because the residuals from a least-squares fit with an intercept sum to zero, they are not themselves independent, so the plot gives only a very rough approximation to the behavior of the error terms.
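The guidance above is about plots. As a rough numeric companion (an addition here, not something the original text prescribes), the lag-1 autocorrelation of residuals taken in order of observation gives one number that tends to be far from zero when a time plot would show drift or regular alternation. The residual values and the function name lag1_autocorrelation are hypothetical.

```python
def lag1_autocorrelation(resid):
    """Sample lag-1 autocorrelation of residuals listed in time order."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [r - mean for r in resid]
    denom = sum(d * d for d in dev)
    num = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    return num / denom

# Hypothetical residuals in order of observation (invented for illustration).
drifting = [-1.2, -0.9, -0.7, -0.2, 0.1, 0.4, 0.8, 1.7]     # slow drift over time
alternating = [0.9, -1.1, 1.0, -0.8, 1.2, -1.0, 0.7, -0.9]  # sign flips each step

r_drift = lag1_autocorrelation(drifting)
r_alt = lag1_autocorrelation(alternating)
print(f"drifting residuals:    r1 = {r_drift:.2f}")
print(f"alternating residuals: r1 = {r_alt:.2f}")
```

A value near zero does not establish independence; like the plots, this is better at flagging a problem than at ruling one out.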
Checking for Equal Variance
Plot residuals against fitted
values (in most cases, these are the estimated conditional means,
according to the model), since it is not uncommon for
conditional variances to depend on conditional means, especially to
increase as conditional means increase. (This would show up as a funnel
or megaphone shape to the residual plot.)
Caution: Hypothesis tests for
equality of variance are often not reliable, since they also have model
assumptions and are typically not robust to departures from these
assumptions.
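The funnel shape described above can also be summarized crudely by comparing the spread of the residuals over the lower and upper halves of the fitted values. This is a descriptive sketch, not a hypothesis test, and it does not replace looking at the plot. The data and the function name spread_by_fitted_half are invented for illustration.

```python
from statistics import pstdev

def spread_by_fitted_half(fitted, resid):
    """Std. dev. of residuals in the lower and upper halves of the fitted values."""
    pairs = sorted(zip(fitted, resid))  # order the residuals by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return pstdev(low), pstdev(high)

# Hypothetical fitted values and residuals with spread growing as fitted values grow.
fitted = [1, 2, 3, 4, 5, 6, 7, 8]
resid = [0.1, -0.1, 0.2, -0.2, 0.9, -1.1, 1.5, -1.4]

low_sd, high_sd = spread_by_fitted_half(fitted, resid)
print(f"spread at low fitted values:  {low_sd:.2f}")
print(f"spread at high fitted values: {high_sd:.2f}")
```

A much larger spread in the upper half is the numeric counterpart of the megaphone opening to the right.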
Checking for Normality or Other Distribution
Caution:
A histogram (whether of
outcome values or of residuals) is not
a good way to check for normality, since histograms of
the same data but using different bin sizes
(class-widths) and/or different cut-points between the bins may look
quite different.
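To make the caution concrete, the sketch below bins one small data set twice, using the same bin width but different cut-points, and prints a text histogram for each. The data values and the helper histogram_counts are invented for illustration.

```python
import math

def histogram_counts(data, start, width, n_bins):
    """Count values falling in [start + k*width, start + (k+1)*width) for each bin k."""
    counts = [0] * n_bins
    for v in data:
        k = math.floor((v - start) / width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical sample, invented for illustration only.
data = [1.0, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1, 3.9, 4.0, 4.1, 5.9]

# Same bin width, different cut-points.
counts_a = histogram_counts(data, start=0.0, width=2.0, n_bins=3)  # bins 0-2, 2-4, 4-6
counts_b = histogram_counts(data, start=1.0, width=2.0, n_bins=3)  # bins 1-3, 3-5, 5-7

for label, counts in (("cut-points at 0, 2, 4, 6", counts_a),
                      ("cut-points at 1, 3, 5, 7", counts_b)):
    print(label)
    for c in counts:
        print("  " + "#" * c)
```

With the first set of cut-points the counts peak in the middle bin; with the second they are flat across two bins and then drop, although the data are identical.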
Instead, use a probability plot
(also known as a quantile plot
or Q-Q plot).
Most statistical software has a function for producing these.
Caution: Probability plots for
small data sets are often misleading; it is very hard to tell whether
or not a small data set comes from a particular distribution.
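For readers whose software lacks a built-in function, here is a sketch of the coordinates behind a normal Q-Q plot: the ordered observations paired with standard normal quantiles. The plotting positions (i - 0.5)/n are one common convention among several, and the sample and the function name normal_qq_points are invented for illustration.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Return (theoretical quantile, ordered observation) pairs for a normal Q-Q plot."""
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n are an assumed convention; others exist.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical sample, invented for illustration only.
sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.5, 5.6, 4.9, 5.2, 7.9]

points = normal_qq_points(sample)
for q, v in points:
    print(f"{q:7.3f}  {v:5.2f}")
```

Plotting the second column against the first gives the Q-Q plot; points lying close to a straight line are consistent with normality, subject to the caution above about small data sets.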
Checking for Linearity
When considering a simple
linear regression model, it is important to check the linearity
assumption -- i.e., that the conditional means of the response variable
are a linear function of the predictor variable. Graphing the response
variable vs the predictor can often give a good idea of whether or not
this is true. However, one or both of the following refinements may be
needed:
1. Plot residuals (instead of
response) vs. predictor. A non-random pattern suggests that a simple
linear model is not appropriate; you may need to transform the response
or predictor, or add a quadratic or higher term to the model.
2. Use a scatterplot smoother such as lowess (also known as loess) to
give a visual estimation of the conditional mean. Such smoothers are
available in many regression software packages. Caution: You may need to
choose a value of a smoothness parameter. Making it too large will
oversmooth; making it too small will not smooth enough.
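The effect of the smoothness parameter can be seen with a simplified lowess-style smoother: a tricube-weighted local linear fit at each point, without the robustness iterations that full lowess implementations usually add. This is a sketch under those stated simplifications; the parameter name frac (the share of points used in each local fit), the helper names, and the data are assumptions made for illustration.

```python
def tricube(u):
    """Tricube weight: (1 - u^3)^3 for 0 <= u < 1, and 0 beyond."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_linear_smooth(x, y, frac=0.5):
    """Tricube-weighted local linear fit evaluated at each x (no robustness steps)."""
    n = len(x)
    k = min(n, max(3, int(round(frac * n))))  # neighbours used per local fit
    smoothed = []
    for x0 in x:
        # The distance to the k-th nearest point sets the local bandwidth.
        h = sorted(abs(xi - x0) for xi in x)[k - 1]
        w = [tricube(abs(xi - x0) / h) if h > 0 else float(xi == x0) for xi in x]
        sw = sum(w)
        xw = sum(wi * xi for wi, xi in zip(w, x)) / sw
        yw = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(yw + slope * (x0 - xw))
    return smoothed

def roughness(values):
    """Sum of squared second differences: larger means a wigglier curve."""
    return sum((values[i - 1] - 2 * values[i] + values[i + 1]) ** 2
               for i in range(1, len(values) - 1))

# Hypothetical data: a straight-line trend plus zig-zag noise.
x = list(range(20))
y = [0.5 * i + (-1) ** i for i in x]

wide = local_linear_smooth(x, y, frac=1.0)    # large smoothness parameter
narrow = local_linear_smooth(x, y, frac=0.1)  # small smoothness parameter
print(f"roughness of raw data:       {roughness(y):.2f}")
print(f"roughness with frac = 1.0:   {roughness(wide):.2f}")
print(f"roughness with frac = 0.1:   {roughness(narrow):.2f}")
```

In this example the small value of frac simply reproduces the noisy data, while the large value irons out the zig-zag. On data with real curvature, a large value can flatten the curvature as well, which is the oversmoothing the caution refers to.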
When considering a linear regression with just two terms, plotting response (or residuals) against the two terms (making a three-dimensional graph) can help gauge suitability of a linear model, especially if your software allows you to rotate the graph.
Caution: It is not possible to gauge from scatterplots whether a linear model in more than two predictors is suitable. One way to address this problem is to try to transform the predictors to approximate multivariate normality [5]. This will ensure not only that a linear model is appropriate for all (transformed) predictors together, but also that a linear model is appropriate even when some transformed predictors are dropped from the model [6].
Notes:
1. Some techniques may merely require uncorrelated errors rather than
independent errors, but the model-checking plots needed are the same.
2. Robustness to departures from normality is related to the Central Limit Theorem,
since most estimators are linear combinations of the observations, and
hence approximately normal if the number of observations is large.
3. In this context, "robustness" can be formulated in terms of the effect of the departure from a model assumption on the Type I error rate. See Van Belle (2008) Statistical Rules of Thumb, pp. 173-177 and the references given there for more detail.
4. In some formulations of regression, the error terms are only assumed
to be uncorrelated, not necessarily independent.
5. See Cook and Weisberg (1999) Applied Regression Including Computing and Graphics, pp. 324-329 for one way to do this.
6. If a linear model fits with all predictors included, it is not necessarily true that a linear model will still fit when some predictors are dropped. For example, if E(Y | X_1, X_2) = 1 + 2X_1 + 3X_2 (so that a linear model fits when Y is regressed on both X_1 and X_2), but E(X_2 | X_1) = log(X_1), then it can be calculated that E(Y | X_1) = 1 + 2X_1 + 3log(X_1), which says that a linear model does not fit when Y is regressed on X_1 alone.
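The calculation in note 6 can be spelled out with the law of iterated expectations. The steps below read the conditional-mean assumption as E(X_2 | X_1) = log(X_1), which is what the stated conclusion requires.

```latex
\begin{aligned}
E(Y \mid X_1) &= E\bigl(E(Y \mid X_1, X_2) \mid X_1\bigr) \\
              &= E(1 + 2X_1 + 3X_2 \mid X_1) \\
              &= 1 + 2X_1 + 3\,E(X_2 \mid X_1) \\
              &= 1 + 2X_1 + 3\log(X_1).
\end{aligned}
```

The log term is what makes the conditional mean of Y nonlinear in X_1 once X_2 is dropped.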