Using Plots to Check Model Assumptions

1. Unfortunately, these methods are typically better at telling you when the model assumption does not fit than when it does.

2. Different techniques have different model assumptions, so additional model checking plots may be needed; be sure to consult a good reference for the particular technique you are considering using.

In a model equation such as Y = α + βx + ε (simple linear regression), Y is the outcome (response) variable and ε denotes the error term (also a random variable). It is the error terms that are assumed to be independent⁴, not the values of the response variable.

We do not know the values of the error terms ε, so we can only plot the residuals e_i (defined as the observed value y_i minus the fitted value, according to the model), which approximate the error terms.
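As a minimal sketch of this definition, the following Python computes residuals for a straight-line least-squares fit. The function names `fit_line` and `residuals` and the data values are made up for illustration; they are not from the original text.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for the line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y):
    """Residuals e_i = y_i - (fitted value at x_i)."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Made-up illustrative data, not from any real study.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
e = residuals(x, y)
```

The list `e` holds one residual per observation. These are the values that go on the vertical axis of the residual plots discussed in this section.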

Rule of Thumb: To check independence, plot residuals against any time variables present (e.g., order of observation), any spatial variables present, and any variables used in the technique (e.g., factors, regressors). A pattern that is not random suggests lack of independence.

Rationale: Dependence on time or on spatial variables is a common source of lack of independence, but the other plots might also detect lack of independence.

Comments:

1. Because time or spatial correlations are so frequent, it is important when making observations to record any time or spatial variables that could conceivably influence results. This not only allows you to make the residual plots to detect possible lack of independence, but also allows you to change to a technique incorporating additional time or spatial variables if lack of independence is detected in these plots.

2. Because the residuals from a least-squares fit with an intercept sum to zero, they are not themselves independent, so a residual plot gives only a very rough approximation to the behavior of the error terms.
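The rule of thumb above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are made up, the response is given an upward drift over time that the fitted model ignores, and the helper name `residuals_by_order` is hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for the line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals_by_order(x, y):
    """(observation order, residual) pairs, assuming x and y are in time order."""
    a, b = fit_line(x, y)
    return [(i, yi - (a + b * xi))
            for i, (xi, yi) in enumerate(zip(x, y), start=1)]


# Made-up data: 20 observations recorded in time order. The response
# depends on x but also drifts upward over time, which the model omits.
times = range(20)
x = [t % 5 for t in times]
y = [2.0 + 1.5 * xi + 0.3 * t for t, xi in zip(times, x)]

points = residuals_by_order(x, y)
```

Here the early residuals are all negative and the late ones all positive, which is the kind of non-random pattern that suggests lack of independence. The pairs in `points` can be passed to any scatter-plot routine (for example, matplotlib's `pyplot.scatter`) to draw the plot itself.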

Plot residuals against fitted values (in most cases, these are the estimated conditional means, according to the model), since it is not uncommon for conditional variances to depend on conditional means, especially to increase as conditional means increase. (This would show up as a funnel or megaphone shape to the residual plot.)

Caution: Hypothesis tests for equality of variance are often not reliable, since they also have model assumptions and are typically not robust to departures from these assumptions.
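A rough sketch of the residuals-versus-fitted-values check, in Python with made-up data whose spread grows with the mean. The comparison of average absolute residuals at the end is only a descriptive aid for reading the plot, not a formal test; the names used here are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for the line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals_vs_fitted(x, y):
    """(fitted value, residual) pairs for a straight-line fit."""
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    return [(f, yi - f) for f, yi in zip(fitted, y)]


# Made-up data: deviations from the line are proportional to x,
# so the conditional variance increases with the conditional mean.
x = list(range(1, 21))
y = [3.0 * xi + 0.5 * xi * (-1) ** xi for xi in x]

pairs = sorted(residuals_vs_fitted(x, y))  # ordered by fitted value
half = len(pairs) // 2
low_spread = sum(abs(e) for _, e in pairs[:half]) / half
high_spread = sum(abs(e) for _, e in pairs[half:]) / (len(pairs) - half)
```

In this example `high_spread` is clearly larger than `low_spread`, matching the funnel shape a scatter plot of `pairs` would show.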

Caution: A histogram (whether of outcome values or of residuals) is not a good way to check for normality, since histograms of the same data but using different bin sizes (class-widths) and/or different cut-points between the bins may look quite different.
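To illustrate the bin-width problem, here is a small Python sketch that counts the same made-up data set using two different bin widths. The helper `bin_counts` is hypothetical and assumes every value is at least `start`.

```python
import math


def bin_counts(data, start, width):
    """Counts for consecutive bins [start + k*width, start + (k+1)*width)."""
    n_bins = math.floor((max(data) - start) / width) + 1
    counts = [0] * n_bins
    for v in data:
        counts[math.floor((v - start) / width)] += 1
    return counts


# Made-up data, chosen so that the two binnings look different.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9]

narrow = bin_counts(data, start=0.5, width=2)  # [3, 5, 3, 5, 1]
wide = bin_counts(data, start=0.5, width=3)    # [6, 5, 6]
```

With width 2 the counts suggest two peaks; with width 3 the same data look roughly flat.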

Instead, use a probability plot (also known as a quantile plot or Q-Q plot). Most statistical software has a function for producing these.

Caution: Probability plots for small data sets are often misleading; it is very hard to tell whether or not a small data set comes from a particular distribution.
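For readers whose software lacks such a function, the points of a normal probability plot can be computed along the following lines. This is a sketch only: the plotting positions (i − 0.5)/n are one common convention among several and are an assumption here, and the residual values are made up.

```python
from statistics import NormalDist


def normal_qq_points(sample):
    """(theoretical standard-normal quantile, ordered observation) pairs."""
    n = len(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    ordered = sorted(sample)
    # Plotting positions (i - 0.5) / n for i = 1..n (an assumed convention).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Made-up residuals for illustration.
resid = [-1.8, -0.9, -0.4, -0.1, 0.0, 0.2, 0.5, 1.1, 1.6]
points = normal_qq_points(resid)
```

If the data are roughly normal, a scatter plot of `points` falls close to a straight line. With only nine values, as here, the caution above about small data sets applies.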

When considering a simple linear regression model, it is important to check the linearity assumption -- i.e., that the conditional means of the response variable are a linear function of the predictor variable. Graphing the response variable vs the predictor can often give a good idea of whether or not this is true. However, one or both of the following refinements may be needed:

1. Plot residuals (instead of response) vs. predictor. A non-random pattern suggests that a simple linear model is not appropriate; you may need to transform the response or predictor, or add a quadratic or higher term to the model.

2. Use a scatterplot smoother such as lowess (also known as loess) to give a visual estimation of the conditional mean. Such smoothers are available in many regression software packages. Caution: You may need to choose a value of a smoothness parameter. Making it too large will oversmooth; making it too small will not smooth enough.
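Both refinements can be sketched together in Python. The smoother below is not lowess itself but a simplified stand-in: a tricube-weighted local straight-line fit at each point, without the robustness iterations that full lowess adds. The function names, the parameter name `frac` (fraction of points in each neighborhood, playing the role of the smoothness parameter), and the data are all assumptions made for illustration.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for the line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def local_linear_smooth(x, y, frac=0.5):
    """Simplified lowess-style smoother: tricube-weighted local line at each x."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # points per neighborhood
    smoothed = []
    for x0 in x:
        h = sorted(abs(xi - x0) for xi in x)[k - 1]  # local bandwidth
        if h == 0:
            # All k nearest points share the same x: average their responses.
            same = [yi for xi, yi in zip(x, y) if xi == x0]
            smoothed.append(sum(same) / len(same))
            continue
        # Tricube weights; points at distance >= h get weight zero.
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Made-up curved data: a straight line is the wrong model here.
x = [float(i) for i in range(11)]
y = [xi ** 2 for xi in x]

a, b = fit_line(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # residuals vs. predictor
trend = local_linear_smooth(x, resid, frac=0.5)      # smoothed residual trend
```

In this example the smoothed residuals are positive at both ends of the predictor range and negative in the middle, a U-shaped pattern that points toward adding a quadratic term. Raising `frac` toward 1 smooths more heavily; lowering it follows the individual points more closely.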

Caution: It is not possible to gauge from scatterplots whether a linear model in more than two predictors is suitable. One way to address this problem is to try to transform the predictors to approximate multivariate normality.⁵ This will ensure not only that a linear model is appropriate for all (transformed) predictors together, but that a linear model is appropriate even when some transformed predictors are dropped from the model.⁶

COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
