COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
Summary Statistics for Skewed Distributions
Measure of Center
When we focus on the mean of a variable, we are presumably trying
to capture what happens "on average," or perhaps "typically." The mean
is very appropriate for
this purpose when the distribution is symmetrical, and especially when
it is "mound-shaped," such as a normal distribution. For a symmetrical
distribution, the mean is in the middle; if the distribution is also
mound-shaped, then values near the mean are typical.
But if a distribution is skewed, then the mean is usually not in the
middle.
Example: The mean of the ten numbers 1, 1, 1, 2, 2, 3, 5, 8, 12, 17
is 52/10 = 5.2. Seven of the ten numbers are less than the mean, with
only three of the ten numbers greater than the mean.
A better
measure of the center for this distribution would be the median,
which in this case is (2+3)/2 = 2.5. Five of the numbers are less than
2.5, and five are greater.
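The arithmetic in this example is easy to check by machine. The following is a minimal sketch using Python's standard statistics module; the variable names are illustrative and are not part of the original discussion.

```python
import statistics

# The ten numbers from the example above.
data = [1, 1, 1, 2, 2, 3, 5, 8, 12, 17]

mean = statistics.mean(data)      # sum / count = 52 / 10
median = statistics.median(data)  # average of the two middle values, 2 and 3

# Count how many values fall on each side of each measure of center.
below_mean = sum(x < mean for x in data)
above_mean = sum(x > mean for x in data)
below_median = sum(x < median for x in data)
above_median = sum(x > median for x in data)

print(f"mean = {mean}: {below_mean} values below, {above_mean} above")
print(f"median = {median}: {below_median} values below, {above_median} above")
```

The median splits the values evenly, while the mean is pulled toward the long right tail.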
Notice that in this example, the mean is greater than the median. This is
common for a distribution that is skewed
to the right (that is, bunched up toward the left and with a
"tail" stretching toward the right).
Similarly, a distribution that is skewed to the left
(bunched up toward the right with a "tail" stretching toward the
left) typically has a mean smaller than its median. (See
http://www.amstat.org/publications/jse/v13n2/vonhippel.html
for discussion of exceptions.)
(Note that for a symmetrical distribution, such as a normal
distribution, the mean and median are the same.)
For a practical example (one I have often given my students):
Suppose a friend is considering
moving to Austin and asks you what
houses here typically cost. Would you tell her the mean or the median
house price? Housing prices (in Austin, at least -- think of all those
Dellionaires) are skewed to the right. Unless your friend is rich, the
median housing price would be
more useful than the mean
housing price (which would be larger than the median, thanks to the
Dellionaires' expensive houses).
In fact, many distributions that
occur in practical situations are skewed, not symmetric. (For
some examples, see the Life
is Lognormal! website.)
Implications for Applying Statistical Techniques
How do we work with skewed distributions when so many statistical
techniques
give information about the mean? First, note that most of these techniques assume that the
random variable in question has a distribution that is normal.
Many of these techniques are somewhat "robust" to departures from
normality -- that is, they still give pretty accurate results if the
random
variable has a distribution that is not too far from normal. But many common statistical techniques are not valid for strongly skewed
distributions. Two possible alternatives are:
I. Taking logarithms of the original variable.
Fortunately, many of the skewed random variables that arise
in applications are lognormal.
That means that the logarithm of the random variable is normal, and
hence most common statistical techniques can be applied to the logarithm of
the original variable. (With robust techniques, approximately lognormal
distributions can also be handled by taking logarithms.) However, doing
this may
require some care in interpretation. There are three common routes to
interpretation when dealing with logs of variables.
1. In many fields, it is common to
work with
the log of
the original outcome variable, rather than the original variable. Thus
one might do a hypothesis test for equality
of the means of the logs of
the variables. A difference in the means of the logs will tell
you that
the original distributions are different, which in some applications
may answer the question of interest.
2. For situations that require
interpretation
in terms of
the
original variable, we can often exploit the fact that the logarithm
transformation and its inverse, the exponential transformation,
preserve order. This implies that they take the median of one variable
to the median of another. So if a variable X is lognormal
and we take its logarithm, Y = log X (see Note 1), we
get a normal distribution, whose mean is the
same as its median. If we back-transform (by exponentiating, so X =
exp(Y); see Note 1), the median of Y
goes to the median of X. Thus statements about means for the
log-transformed
variable Y give us statements about medians for the original variable
X.
(Note that in this situation, the original variable X is skewed, so we
probably should be talking about its median rather than its mean
anyhow.) We can also back-transform a confidence interval for the mean
of Y to get a confidence interval for the median of X.
(Typically, a confidence interval for the mean of Y will be symmetric
about the estimated mean of Y, but the confidence interval for the median
of the original variable X that is obtained by back-transforming will
not be symmetric about the estimated median of the original variable.)
3. In some situations, we can use properties of logs to
say useful things when we back-transform. For example, if we regress
Y = log10(X) on U and get Y = a + bU + error, then we can say that
increasing U by one unit multiplies the median of X by a factor of 10^b
(see Note 2).
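The second route above (back-transforming a confidence interval for the mean of Y into one for the median of X) can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, the parameter values are made up, and a large-sample normal critical value stands in for the t critical value that a careful analysis would use.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical lognormal sample: log(X) is normal with mean 1.0 and sd 0.8.
x = [random.lognormvariate(1.0, 0.8) for _ in range(200)]

# Work on the log scale, where the distribution is normal.
y = [math.log(v) for v in x]
n = len(y)
y_mean = statistics.mean(y)
y_se = statistics.stdev(y) / math.sqrt(n)

# Large-sample 95% interval for the mean of Y (assumption: normal critical
# value, roughly 1.96, used in place of a t critical value).
z = statistics.NormalDist().inv_cdf(0.975)
y_lo, y_hi = y_mean - z * y_se, y_mean + z * y_se

# Back-transform. exp() preserves order, so the result is an interval
# for the median of X, not for its mean.
median_est = math.exp(y_mean)
x_lo, x_hi = math.exp(y_lo), math.exp(y_hi)

print(f"estimated median of X: {median_est:.3f}")
print(f"interval for the median of X: ({x_lo:.3f}, {x_hi:.3f})")
print(f"distance below: {median_est - x_lo:.3f}, above: {x_hi - median_est:.3f}")
```

The interval on the log scale is symmetric about the estimate, but the back-transformed interval extends farther above the estimated median than below it.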
Note: Not all skewed distributions are close
enough to lognormal to be handled using a log transformation.
Sometimes other transformations (e.g., square roots) can yield a
distribution that is close enough to normal to apply standard
techniques. However, interpretation will depend on the transformation
used.
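Returning to the third route in the list above, the multiplicative interpretation of a regression on log base 10 can be illustrated with a small simulation. This is a sketch under assumptions: the data are simulated, the values of a_true, b_true, and the error spread are invented for the illustration, and the least-squares fit is computed by hand rather than with a statistics package.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical model: log10(X) = a + b*U + error, with normal errors.
a_true, b_true = 1.0, 0.3
u = [i / 20 for i in range(200)]
x = [10 ** (a_true + b_true * ui + random.gauss(0, 0.2)) for ui in u]

# Regress Y = log10(X) on U by ordinary least squares.
y = [math.log10(xi) for xi in x]
n = len(u)
u_bar = sum(u) / n
y_bar = sum(y) / n
sxx = sum((ui - u_bar) ** 2 for ui in u)
sxy = sum((ui - u_bar) * (yi - y_bar) for ui, yi in zip(u, y))
b = sxy / sxx
a = y_bar - b * u_bar

# Back-transformed fitted values estimate the conditional median of X.
def fitted_median(u_value):
    return 10 ** (a + b * u_value)

# A one-unit increase in U multiplies the fitted median of X by 10**b.
factor = 10 ** b
print(f"estimated slope b = {b:.3f}, multiplicative factor 10^b = {factor:.3f}")
print(f"ratio of fitted medians at U = 4 and U = 3: "
      f"{fitted_median(4.0) / fitted_median(3.0):.3f}")
```

The ratio of fitted medians is the same for any one-unit step in U, which is what makes the factor 10^b a convenient summary.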
II. Quantile Regression Techniques
Standard regression estimates the mean of the conditional
distribution (conditioned on the values of the predictors) of the
response variable. For example, in simple linear regression, with one
predictor X and response variable Y, we calculate an equation y = a +
bx that tells us that when X takes on the value x, the mean of Y is approximately a + bx (see Note 3). Quantile
regression is a method for estimating conditional quantiles (see Note 4), including the median. For more on
quantile regression, see http://www.econ.uiuc.edu/~roger/research/rq/rq.html.
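The contrast between estimating a mean and estimating a median can be seen in the simplest possible setting, with no predictors at all. The sketch below is not a quantile regression; it only illustrates a standard background fact not spelled out above: the mean minimizes a sum of squared deviations, while the median minimizes a sum of absolute deviations, and other quantiles minimize an asymmetrically weighted version (often called the check or pinball loss). The function names and the grid search are illustrative choices.

```python
import statistics

def squared_loss(center, data):
    """Sum of squared deviations from a candidate center."""
    return sum((x - center) ** 2 for x in data)

def pinball_loss(center, data, tau):
    """Check ("pinball") loss for quantile level tau, with 0 < tau < 1."""
    total = 0.0
    for x in data:
        r = x - center
        # Values above the center are weighted by tau, values below by 1 - tau.
        total += tau * r if r >= 0 else (tau - 1) * r
    return total

# The ten numbers from the example near the top of this page.
data = [1, 1, 1, 2, 2, 3, 5, 8, 12, 17]
mean = statistics.mean(data)
median = statistics.median(data)

# Crude grid search over candidate centers 0.0, 0.1, ..., 20.0.
grid = [i / 10 for i in range(201)]
best_squared = min(grid, key=lambda c: squared_loss(c, data))
best_absolute = min(grid, key=lambda c: pinball_loss(c, data, 0.5))

print(f"mean = {mean}, minimizer of squared loss on the grid = {best_squared}")
print(f"median = {median}, a minimizer of absolute loss on the grid = {best_absolute}")
```

With an even number of values, every point between the two middle values (here 2 and 3) minimizes the absolute loss, so the grid search may report any point in that range. Quantile regression applies the same loss to a + bx instead of a single center.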
Measures of Spread
For a normal distribution, the standard
deviation is a
very appropriate measure of variability (or spread) of the
distribution. (Indeed, if you know a distribution is normal, then
knowing its mean and standard deviation tells you exactly which normal
distribution you have.) But for skewed distributions, the standard
deviation gives no information on the asymmetry. It is better to use
the first and third quartiles (see Note 4), since these will give some sense of the
asymmetry of the distribution.
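As a small illustration, the quartiles of the ten numbers used in the example under Measure of Center can be computed with Python's standard statistics module. This is a sketch only: software packages use several slightly different conventions for quartiles of small samples, and the default method of statistics.quantiles is just one of them.

```python
import statistics

# The ten numbers from the earlier example.
data = [1, 1, 1, 2, 2, 3, 5, 8, 12, 17]

sd = statistics.stdev(data)  # a single number, with no sign of asymmetry

# n=4 returns the three cut points: first quartile, median, third quartile.
q1, q2, q3 = statistics.quantiles(data, n=4)

lower_gap = q2 - q1  # distance from the first quartile up to the median
upper_gap = q3 - q2  # distance from the median up to the third quartile

print(f"standard deviation = {sd:.2f}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
print(f"median - Q1 = {lower_gap}, Q3 - median = {upper_gap}")
```

The third quartile lies much farther from the median than the first quartile does, which reflects the right skew that the standard deviation alone cannot show.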
Notes:
1. We could use logs base e, base 10, or even base 2. If we use log base
b, then "exp" will be the function "raise b to that power."
2. The mean of Y when U = u is E(Y | U = u) = a + bu, and the mean of Y
when U = u + 1 is E(Y | U = u + 1) = a + b(u + 1). Since Y is normal,
these means are also the medians of Y. Exponentiating (which here means
raising ten to the power, since we are working with log base 10) takes
medians of Y to medians of X, so
median(X | U = u) = 10^(a + bu)
median(X | U = u + 1) = 10^(a + b(u + 1)) = 10^b * 10^(a + bu),
which by the previous line is 10^b * median(X | U = u),
so
median(X | U = u + 1) / median(X | U = u) = 10^b.
3. If we are trying to predict what Y is when X =
x, our best estimate is also a + bx, but the estimate isn't as good for
Y as it is for the mean of Y. This leads to a common mistake: using the
confidence interval (which is appropriate for the conditional mean of Y
when X = x) to express our degree of uncertainty (or margin of error)
when we are predicting Y (not the conditional mean of Y) when X = x. If
we use a + bx to predict Y, then we need to use the prediction interval, which is
typically much wider than the confidence interval. In other words, we
have more uncertainty when predicting Y than when predicting its mean
-- which makes sense if you stop to think about it.
4. A quantile
(also known as a percentile) of
a distribution is the number that separates the values of the
distribution into a specified lower fraction and the corresponding
upper fraction. The median is the quantile corresponding to the
fraction 1/2. As in the example above, half
of the values are above the median, and half below. We could similarly
talk about the first quartile
(one quarter of the values below and three quarters above), the third quartile (three quarters of
the distribution below and one quarter above), the second quartile (just another name
for the median), the first quintile
(one fifth below and four fifths above), etc.
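The distinction drawn in Note 3 between a confidence interval for the conditional mean of Y and a prediction interval for Y itself can be made concrete with the usual simple linear regression formulas. The sketch below rests on assumptions: the data are simulated with invented parameter values, and a large-sample normal critical value stands in for the t critical value with n - 2 degrees of freedom.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical data from Y = 2 + 0.5*X + error.
x = [i / 5 for i in range(50)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1.0) for xi in x]

# Least-squares fit y = a + b*x.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Residual standard deviation with n - 2 degrees of freedom.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r * r for r in residuals) / (n - 2))

x0 = 4.0                 # the value of X at which we estimate or predict
estimate = a + b * x0    # same point estimate for the mean of Y and for Y

# Assumption: normal critical value used in place of the t critical value.
z = statistics.NormalDist().inv_cdf(0.975)
leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
ci_half = z * s * math.sqrt(leverage)      # for the conditional mean of Y
pi_half = z * s * math.sqrt(1 + leverage)  # for a new observation of Y

print(f"estimate at x = {x0}: {estimate:.2f}")
print(f"confidence interval half-width: {ci_half:.2f}")
print(f"prediction interval half-width: {pi_half:.2f}")
```

The extra 1 under the square root accounts for the variability of an individual observation around its mean, which is why the prediction interval is always the wider of the two.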
Last updated October 12, 2016