COMMON MISTEAKS MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
Examples of Checking Model Assumptions Using Well-established
Facts or Theorems
Note: Here, "well-established" means well established by empirical
evidence and/or sound mathematical reasoning. This is not the same as
"well-accepted," since sometimes things may be well-accepted without
sound evidence.
1. Using laws of physics
Hooke's Law says that when
a weight that is not too large (below what is called the "elastic
limit") is placed on the end of a spring, the length of the (stretched)
spring is approximately a linear function of the weight. This tells us
that if we do an experiment with a spring by putting various weights
(below the elastic limit) on it and measuring the length of the
spring, we are justified in using a linear model,
Length = A×Weight + B
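As a rough illustration of fitting this linear model, the sketch below simulates spring measurements and estimates A and B by ordinary least squares. The values of TRUE_A and TRUE_B, the noise level, and the weights are hypothetical and chosen only for the example; fit_line is a helper written here, not part of any library.

```python
import random


def fit_line(x, y):
    """Ordinary least squares fit of y = a*x + b; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical spring, for illustration only.
TRUE_A = 0.05  # stretch per unit weight (metres per newton)
TRUE_B = 0.20  # unstretched length (metres)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Weights from 0.5 N to 10 N, assumed to lie below the elastic limit.
weights = [0.5 * k for k in range(1, 21)]

# Measured length = Hooke's Law prediction plus small measurement error.
lengths = [TRUE_A * w + TRUE_B + rng.gauss(0, 0.002) for w in weights]

a_hat, b_hat = fit_line(weights, lengths)
print(f"estimated A = {a_hat:.4f}, estimated B = {b_hat:.4f}")
```

Because the data were generated from a linear relationship, the estimates should land close to the hypothetical true values. With real measurements, Hooke's Law is what justifies choosing the linear form before looking at the data.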
2. Using the Central Limit Theorem
The Central Limit Theorem [1] says that for most distributions, a
linear combination (e.g., the sum or the mean) of a large enough
number of independent random variables is approximately normal. Thus,
if the random variable in question is the sum of many independent
random variables, then it is usually [2] safe to assume that it is
approximately normal.
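The following small simulation is one way to see this effect; it is a sketch, not a proof. It sums independent exponential random variables (an arbitrary choice of a skewed distribution) and reports the sample skewness of the sums, which should move toward 0, the skewness of a normal distribution, as the number of terms grows. The helper names, seed, and sample sizes are assumptions made for the example.

```python
import random
import statistics


def sample_skewness(values):
    """Sample skewness: the mean of the cubed standardized values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def simulate_sums(n_terms, n_sums, rng):
    """Return n_sums sums, each of n_terms independent Exponential(1) draws."""
    return [
        sum(rng.expovariate(1.0) for _ in range(n_terms))
        for _ in range(n_sums)
    ]


rng = random.Random(1)  # fixed seed so the run is repeatable

skew_by_n = {}
for n_terms in (1, 5, 50):
    sums = simulate_sums(n_terms, 20000, rng)
    skew_by_n[n_terms] = sample_skewness(sums)
    print(f"terms summed: {n_terms:3d}   sample skewness: {skew_by_n[n_terms]:.2f}")
```

For sums of independent exponential variables, the theoretical skewness is 2 divided by the square root of the number of terms, so the printed values should shrink roughly in that proportion.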
For example, an adult human's height (at least if we restrict to one
sex [3]) is the sum of many heights: the heights of the ankles, lower
legs, upper legs, pelvis, many vertebrae, and head. Empirical evidence
suggests that these heights vary roughly independently (e.g., the ratio
of height of lower leg to that of upper leg varies considerably). Thus
it is plausible by the Central Limit Theorem that human heights are
approximately normal. This is in fact supported by empirical evidence.
The Central Limit Theorem can also be used to reason that some
distributions are approximately lognormal -- that is, that the
logarithm of the random variable is approximately normal.
For example, the distribution of a pollutant might be determined by
successive independent dilutions of an original emission. In
mathematical terms, this says that the amount of pollution (call this
random variable Y) in a given small region is the product of
independent random variables. Since the logarithm of a product is the
sum of the logarithms, log Y is the sum of independent random
variables. If the number of successive dilutions is large enough, the
reasoning above shows that log Y is approximately normal, and hence
that Y is approximately lognormal. [4, 5]
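The dilution argument can be sketched in a short simulation. Here each dilution multiplies the pollutant amount by an independent factor drawn uniformly between 0.5 and 1; that distribution, the number of dilutions, and the initial amount are hypothetical choices made only for illustration. The sketch compares the sample skewness of Y with that of log Y.

```python
import math
import random
import statistics


def sample_skewness(values):
    """Sample skewness: the mean of the cubed standardized values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def diluted_amount(initial, n_dilutions, rng):
    """Apply n_dilutions independent multiplicative dilution factors."""
    amount = initial
    for _ in range(n_dilutions):
        amount *= rng.uniform(0.5, 1.0)  # hypothetical dilution factor
    return amount


rng = random.Random(2)  # fixed seed so the run is repeatable

# Y is a product of 20 independent factors times a fixed initial amount.
y_values = [diluted_amount(100.0, 20, rng) for _ in range(20000)]
log_y_values = [math.log(y) for y in y_values]

skew_y = sample_skewness(y_values)
skew_log_y = sample_skewness(log_y_values)
print(f"skewness of Y:     {skew_y:.2f}")
print(f"skewness of log Y: {skew_log_y:.2f}")
```

Under these assumptions Y itself should come out strongly right-skewed, while log Y should have skewness near 0, which is consistent with Y being approximately lognormal.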
1. Actually, there are several versions of the Central Limit Theorem,
essentially concerning different types of distributions. The paraphrase
given here is good enough for most practical purposes. See also the
Rice Virtual Lab in Statistics' Sampling
Distribution Simulation, which can be used to show how the version
of the Central Limit Theorem for means works for various distributions.
2. Notable exceptions occur when the random variables being summed
have "heavy tails" (also called leptokurtic), are strongly bimodal,
or are very strongly skewed (especially if the number of terms in the
sum is not large).
3. If we consider both sexes, then we lose independence, since the
average height for males is higher than the average height for females.
However, since the average height for males is not that much higher
than the average height for females, it turns out that the overall
distribution of heights for all adult humans is not far from normal --
the mode is a little off to one side, and the top is slightly wider
than for a normal distribution.
4. In practice, one would usually work with log Y, using a technique
that requires approximate normality.
5. For more about lognormal distributions, see Ott (1995) Environmental
Statistics and Data Analysis; van Belle (2008) Statistical Rules of
Thumb, pp. 88-90; and the Life is Lognormal website and further
references given there.
Updated Sept. 25, 2011