COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
Assuming linearity is
preserved when variables are dropped
One common mistake in using "variable selection"
methods is to assume that if one or more variables are dropped, then
the appropriate model using the remaining variables can be obtained
simply by deleting the dropped variables from the "full model" (i.e.,
the model with all the explanatory variables). This assumption is in general false.
Example: Suppose the true model is E(Y|X1, X2) = 1 + 2X1 + 3X2, so that a
linear model fits when Y is regressed on both X1 and X2. But if, in
addition, E(X2|X1) = log(X1), then taking the expectation of the full
model conditional on X1 alone gives E(Y|X1) = 1 + 2X1 + 3E(X2|X1) =
1 + 2X1 + 3log(X1), which shows that a linear model does not fit when Y
is regressed on X1 alone (and, in particular, that the model
E(Y|X1) = 1 + 2X1 is incorrect).
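
To see this concretely, here is a minimal simulation sketch of the example
(the distributions, noise levels, and sample size are assumptions chosen for
illustration). The full regression recovers the coefficients 1, 2, 3; the
regression on X1 alone fits a straight line where the true conditional mean
is 1 + 2X1 + 3log(X1), while adding log(X1) as a regressor recovers that mean.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# X1 is kept positive so that log(X1) is defined; E(X2 | X1) = log(X1).
x1 = rng.uniform(0.5, 5.0, size=n)
x2 = np.log(x1) + rng.normal(scale=0.3, size=n)

# True model: E(Y | X1, X2) = 1 + 2*X1 + 3*X2.
y = 1 + 2 * x1 + 3 * x2 + rng.normal(scale=0.5, size=n)

# Full model: Y on X1 and X2 -- coefficients come out near (1, 2, 3).
full = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0]
print("full model:      ", np.round(full, 2))

# Dropped-variable models: a straight line in X1 is the wrong mean function,
# but adding log(X1) recovers E(Y | X1) = 1 + 2*X1 + 3*log(X1).
linear = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0]
with_log = np.linalg.lstsq(np.column_stack([np.ones(n), x1, np.log(x1)]), y, rcond=None)[0]
print("linear in X1:    ", np.round(linear, 2))
print("X1 plus log(X1): ", np.round(with_log, 2))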
One method that sometimes works to get around this problem is to
transform the variables to have a multivariate normal distribution,
then work with the transformed variables. This will ensure that the
conditional means are a linear function of the transformed explanatory
variables, no matter which subset of explanatory variables is chosen.
Such a transformation is sometimes possible with some variant of a
Box-Cox transformation procedure. See, e.g., pp. 236 and 324-329 of Cook
and Weisberg's text [1] for more details.
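
As a rough sketch of that idea in code (assuming scipy is available):
scipy.stats.boxcox estimates a power transformation toward normality for one
variable at a time by maximum likelihood. Cook and Weisberg's procedure
chooses the powers jointly so that the transformed variables are close to
multivariate normal; the marginal, variable-by-variable version below is a
simplification.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two skewed, strictly positive explanatory variables (illustrative data).
X = np.column_stack([
    rng.lognormal(size=500),
    rng.gamma(shape=2.0, size=500),
])

# Box-Cox requires strictly positive data; each column gets its own
# maximum-likelihood power parameter lambda.
transformed = np.empty_like(X)
lambdas = []
for j in range(X.shape[1]):
    transformed[:, j], lam = stats.boxcox(X[:, j])
    lambdas.append(lam)

print("estimated powers:", np.round(lambdas, 2))

A regression would then use the columns of transformed in place of X; to the
extent the transformed variables are close to multivariate normal, the
conditional means stay linear under any subset of them.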
1. Cook, R. D., and Weisberg, S. (1999), Applied Regression Including
Computing and Graphics, Wiley.