Common Mistakes in Using Statistics: Spotting and Avoiding Them

Problems with Stepwise Model Selection Procedures
"... perhaps the most serious source of error lies in letting statistical procedures make decisions for you."

"Don't be too quick to turn on the computer. Bypassing the brain to compute by reflex is a sure recipe for disaster."

-- Good and Hardin, Common Errors in Statistics (and How to Avoid Them), pp. 3 and 152
Various algorithms have been developed for aiding in model selection.
Many of them are "automatic", in the sense that they have a "stopping
rule" (which it might be possible for the researcher to set or change
from a default value) based on criteria such as the value of a t-statistic
or an F-statistic. Others might be better termed "semi-automatic," in
the sense that they automatically list various options and values of
measures that might be used to help evaluate them.
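To make the idea of an "automatic" stopping rule concrete, here is a minimal sketch of forward selection with an "F-to-enter" threshold. This is an illustrative toy, not the algorithm of any particular package (which, as noted below, can differ even under the same name); the function name, the default threshold of 4.0, and the simulated data are all assumptions for the example.

```python
import numpy as np

def forward_selection(X, y, f_to_enter=4.0):
    """Toy forward selection: at each step, add the candidate predictor
    with the largest partial F-statistic for entering the current model,
    stopping when no candidate's F exceeds the threshold."""
    n, p = X.shape
    selected = []
    remaining = list(range(p))

    def rss(cols):
        # Residual sum of squares for OLS on an intercept plus X[:, cols].
        Z = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        r = y - Z @ beta
        return r @ r

    current_rss = rss(selected)
    while remaining:
        best = None
        for c in remaining:
            new_rss = rss(selected + [c])
            df_resid = n - len(selected) - 2  # intercept + existing + new term
            F = (current_rss - new_rss) / (new_rss / df_resid)
            if best is None or F > best[1]:
                best = (c, F, new_rss)
        if best[1] < f_to_enter:
            break  # stopping rule: no remaining term passes F-to-enter
        selected.append(best[0])
        remaining.remove(best[0])
        current_rss = best[2]
    return selected
```

The stopping rule is exactly the kind of automatic decision the quotations above warn about: a term whose F-statistic falls just below the threshold is silently discarded, however reasonable the resulting submodel might have been.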
Caution: Different regression software packages may use the same name (e.g., "Forward Selection" or "Backward Elimination") to designate different algorithms. Be sure to read the documentation to find out just what the algorithm does in the software you are using -- in particular, whether it has a stopping rule or is of the "semi-automatic" variety.
Cook and Weisberg1 (p. 280) comment:
"We do not recommend such
stopping rules for routine use since they can reject perfectly
reasonable submodels from further consideration. Stepwise procedures
are easy to explain, inexpensive to compute, and widely used. The
comparative simplicity of the results from stepwise regression with
model selection rules appeals to many analysts. But, such algorithmic
model selection methods must be used with caution."
They give an example (pp. 280-281) illustrating how stepwise regression algorithms will generally result in models suggesting that the remaining terms are more important than they really are, and that the R² values of the submodels obtained may be misleadingly large.
Ryan2 (pp. 269-273 and 284-286) elaborates on these points. One underlying problem with methods based on t or F statistics is that they effectively ignore problems of multiple inference.
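The multiple-inference problem can be seen in a short simulation: when many candidate predictors are screened and only the best is reported, the usual single-test t critical value is far too lenient. Below, both the response and all thirty candidate predictors are pure noise, yet the best-looking predictor clears the nominal 5% threshold in most runs. The sample sizes, seed, and critical value (~2.01 for roughly 48 degrees of freedom) are choices made for this illustration.

```python
import numpy as np

# Simulation: screen p noise predictors, report how often the *best*
# one looks "significant" at the unadjusted two-sided 5% level.
rng = np.random.default_rng(42)
n, p, reps = 50, 30, 200
t_crit = 2.01  # approx. two-sided 5% critical value, ~48 df
hits = 0

for _ in range(reps):
    X = rng.normal(size=(n, p))   # candidate predictors: pure noise
    y = rng.normal(size=n)        # response: independent of all of them
    # t-statistic of the slope in each simple regression of y on one column
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    sxx = (Xc ** 2).sum(axis=0)
    b = Xc.T @ yc / sxx
    resid = yc[:, None] - Xc * b
    s2 = (resid ** 2).sum(axis=0) / (n - 2)
    t = b / np.sqrt(s2 / sxx)
    if np.abs(t).max() > t_crit:
        hits += 1

print(f"best-of-{p} predictor 'significant' in {hits}/{reps} runs")
```

Since each individual test is "significant" about 5% of the time, the best of thirty is expected to clear the bar in roughly 1 - 0.95^30 ≈ 79% of runs, which is why a stepwise procedure judging entry by unadjusted t or F statistics will routinely admit noise variables.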
Alternatives to Stepwise Selection Methods
- There are various criteria that may be considered in evaluating models. One that has intuitive appeal is Mallows' Cp statistic. It is an estimate of mean squared prediction error, and can also be regarded as a measure that accounts for both bias and variance.3 Other aids include Akaike's Information Criterion (AIC) and its variations4 and added variable plots.
- And, of course, context can be important in deciding on a model. For example, the questions of interest can dictate that certain variables need to remain in the model, or the quality of the data can help decide which of two variables to retain. Several considerations may come into play in deciding on a model.
- Wei Liu has described methods using simultaneous
confidence bands that are useful in some situations for variable
selection or comparing regression models more generally.5
- Also, other
regression methods (e.g., Ridge Regression) may be useful instead of
Least Squares Regression.
- For more discussion of model selection methods, see Cook and Weisberg (Chapters 10, 11 and 17-20); Ryan (Chapters 7, 11, 12 and references therein); Berk (pp. 126-135); and P. I. Good and J. W. Hardin (2006), Common Errors in Statistics (And How to Avoid Them), Wiley (Chapters 10, 11, Appendix A).
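As a concrete sketch of one criterion-based alternative mentioned above, the snippet below compares every submodel of a small regression by AIC rather than adding terms one at a time. The Gaussian AIC form used here, n·log(RSS/n) + 2k (with k counting the coefficients plus the error variance), the simulated data, and the coefficient values are all assumptions for the example; real analyses would also weigh context, data quality, and the questions of interest, as discussed above.

```python
import numpy as np
from itertools import combinations

# Compare all 2^4 submodels of a 4-predictor regression by AIC.
rng = np.random.default_rng(1)
n = 80
X = rng.normal(size=(n, 4))
# True model uses only the first two predictors; x3, x4 are irrelevant.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

def aic(cols):
    """Gaussian AIC of the OLS fit on an intercept plus X[:, cols]."""
    Z = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = ((y - Z @ beta) ** 2).sum()
    k = Z.shape[1] + 1  # coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k

subsets = [c for r in range(5) for c in combinations(range(4), r)]
best = min(subsets, key=aic)
print("AIC-best subset of predictors:", best)
```

Unlike a stepwise search, this exhaustive comparison cannot discard a reasonable submodel before evaluating it, though it shares AIC's own limitations and becomes expensive as the number of predictors grows.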
Notes:
1. R. D. Cook and S. Weisberg (1999), Applied Regression Including Computing and Graphics, Wiley
2. T. Ryan (2009), Modern Regression Methods, Wiley
3. Mallows' statistic is discussed in, e.g., Cook and Weisberg (pp. 272-280), Ryan (pp. 273-277 and 279-283), and R. Berk (2004), Regression Analysis: A Constructive Critique, Sage (pp. 130-135); see also Lecture Notes on Selecting Terms
4. K. P. Burnham and D. R. Anderson (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed., Springer, has an extensive discussion of Akaike's Information Criterion and related methods. For some common mistakes in using AIC, see pp. 63, 66, 108, 119
5. W. Liu (2011), Simultaneous Inference in Regression, CRC Press. Liu also has Matlab® programs for implementing the procedures available from his website.
Last updated January 20, 2012