COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
Multiple Inference
"Recognize
that any frequentist statistical test has a random chance of indicating
significance when it is not really present. Running multiple tests on
the same data set at the same stage of an analysis increases the chance
of obtaining at least one invalid result. Selecting the one
"significant" result from a multiplicity of parallel tests poses a
grave risk of an incorrect conclusion. Failure to disclose the full
extent of tests and their results in such a case would be highly
misleading."
Performing more than one statistical inference procedure on the same data set is called multiple inference, or joint inference, or simultaneous inference, or multiple testing, or multiple comparisons, or the problem of multiplicity. Performing multiple inference without adjusting the Type I error rate accordingly is a common error in research using statistics.[1]
The Problem
Recall that if you perform a hypothesis test using a certain significance level (we will use 0.05 for
illustration), and if you obtain a
p-value less than 0.05,
then there are three possibilities:
- The model assumptions for the hypothesis
test are not
satisfied in the context of your data.
- The null hypothesis is false.
- Your sample happens to be one of the 5% of
samples
satisfying the appropriate model conditions for which the hypothesis
test gives you a Type I error.
Now suppose you are performing two
hypothesis tests using the same data.
Suppose that in fact
all model assumptions are satisfied and both null hypotheses are true. There
is in general no reason to believe that the samples giving a Type I
error for one test will also give a Type I error for the other test.[2]
So we need to consider the joint
Type I error rate:
Joint Type I error rate:
The probability that a randomly chosen
sample (of the given size,
satisfying the appropriate model assumptions) will give a Type I
error for at least one of the
hypothesis tests performed.
The joint Type I error rate is also known as the overall Type I error
rate, or
joint
significance level, or the simultaneous Type I
error rate, or the family-wise error rate
(FWER), or
the experiment-wise
error rate, etc. The acronym FWER is becoming more and more
common, so it will be used in the sequel, often along with another name
for the concept as well.
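To see how quickly the FWER can grow when tests are treated one at a time, here is a minimal illustration in Python. It assumes, purely for simplicity, that the tests are independent; tests run on the same data are usually dependent, so this is only a rough guide.

    # FWER for k independent tests, each run at significance level 0.05.
    # Independence is assumed only for this illustration.
    alpha = 0.05
    for k in (1, 2, 5, 10, 20):
        fwer = 1 - (1 - alpha) ** k   # P(at least one Type I error among k tests)
        print(f"{k:2d} tests: FWER = {fwer:.3f}")

With 10 independent tests at the 0.05 level, the chance of at least one Type I error is already about 40%.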
An especially serious form of neglect of the problem of
multiple inference is the one alluded to in the quote above: Trying
several tests and reporting just one significant test, without
disclosing how many tests were performed or correcting the significance
level to take into account the multiple inference.
The problem of multiple inference also occurs for confidence intervals.
In this
case, we need to focus on the confidence level. Recall that a 95% confidence interval is an interval obtained
by using a procedure which, for 95% of all suitably random samples, of
the given size from
the random variable and population of interest, produces an interval
containing the parameter we are
estimating (assuming the model assumptions are satisfied). In other
words, the procedure does what we want (i.e. gives
an
interval containing the true value of the parameter) for 95% of
suitable samples. If we
are using confidence intervals to estimate two parameters, there is no
reason to believe that the 95% of samples for which the procedure
"works" for one parameter (i.e. gives
an
interval containing the true value of the parameter) will be the same as the 95% of samples for
which the procedure "works" for the other parameter. If we are
calculating confidence intervals for more than one parameter, we can
talk about the joint
(or overall
or simultaneous or family-wise or
experiment-wise)
confidence level. For example, a group of confidence
intervals (for different parameters) has an overall 95% confidence
level (or 95%
family-wise
confidence level, etc.) if
the intervals are calculated using a procedure which, for 95% of all
suitably random samples, of the given size from the population of
interest, produces for each
parameter an interval containing that
parameter (assuming the model assumptions are satisfied).
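As a rough illustration of the same point for confidence intervals, the following sketch estimates by simulation how often two separately computed 95% intervals both cover their parameters at once. The setup (two independent normal samples of size 25, t-based intervals, and the particular seed) is an illustrative assumption, not something from the discussion above.

    # Sketch: joint coverage of two individually computed 95% confidence
    # intervals for two population means.  The normal-data setup, sample
    # size, and seed are illustrative assumptions only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, reps = 25, 10_000
    true_means = (0.0, 0.0)
    both_cover = 0
    for _ in range(reps):
        covered = []
        for mu in true_means:
            x = rng.normal(mu, 1.0, size=n)
            lo, hi = stats.t.interval(0.95, n - 1, loc=x.mean(), scale=stats.sem(x))
            covered.append(lo <= mu <= hi)
        both_cover += all(covered)
    print("joint coverage of the two 95% intervals:", both_cover / reps)

Because the two samples here are independent, the joint coverage comes out near 0.95 × 0.95 ≈ 0.90 rather than 0.95; with dependent estimates (the usual case in practice), the joint coverage could fall anywhere between roughly 0.90 and 0.95.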
What to do about it
Unfortunately, there is no simple formula to cover all cases: Depending
on the context, the samples giving Type I errors for two tests
might be the same, they might have no overlap, or they could be
somewhere in between. Various techniques for bounding the FWER (joint
Type I error rate) or otherwise dealing with the problem of
multiple inference have been devised for various special circumstances.[3] Only two methods will be discussed
here.
Bonferroni method:
Fairly basic probability calculations (Boole's inequality) show that if the sum of the Type I error rates for the individual tests is at most α, then the overall Type I error rate (FWER) for the combined tests will be at most α. (A small sketch of the adjustment appears after the examples below.)
- So, for example, if you are performing five
hypothesis tests and would like an FWER (overall significance level) of
at most 0.05,
then using significance level 0.01 for each test will give an FWER
(overall
significance level) of at most 0.05.
- Similarly, if you are finding confidence
intervals for
five parameters and want an overall confidence level of 95%, using the
99% confidence level for each confidence interval will give you overall
confidence level at least 95%. (Think of confidence level as 1 - α.)
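As a concrete illustration of the first bullet above, here is a minimal Python sketch; the p-values are made up solely for illustration.

    # Bonferroni method: to keep the FWER at most alpha over k tests,
    # run each individual test at level alpha / k.
    alpha = 0.05
    p_values = [0.003, 0.012, 0.021, 0.040, 0.310]   # five hypothetical tests
    per_test_level = alpha / len(p_values)            # 0.01, as in the example above
    for i, p in enumerate(p_values, start=1):
        verdict = "reject H0" if p < per_test_level else "do not reject H0"
        print(f"test {i}: p = {p:.3f} -> {verdict} (FWER held at or below {alpha})")

Note that tests 2 through 4, which would look "significant" at the unadjusted 0.05 level, no longer do after the adjustment; this is the conservatism discussed below.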
The Bonferroni method can be used as a fall-back method when no other
method is known
to apply. However, if a method that applies to the specific situation
is available, it will often be better (less conservative) than the
Bonferroni method.
The Bonferroni method is also useful for apportioning the overall Type
I error between different types of inference -- in particular, between
pre-planned inference (the
inference planned as part of the design of the study)
and "data-snooping"
inferences (inferences based on looking
at the data and noticing other things of interest).
For example, to achieve an overall Type I error rate of .05, one might
apportion an overall significance level of .04 to the pre-planned
inference and .01 to data-snooping. However, this apportioning
should be done before
analyzing the data.
Whichever method is used, it is
important to make the calculations based on the number of tests that
have been done, not just the number that are reported. (See Data Snooping for more discussion.)
False discovery rate:
An alternative to bounding Type I error was introduced by Benjamini and
Hochberg in 1995[4]: bounding the False Discovery Rate.
The False Discovery Rate (FDR) of a group of tests is the expected value[5] of the ratio of falsely rejected null hypotheses to all rejected null hypotheses (with the ratio taken to be 0 when no hypotheses are rejected).
Note that the family-wise error rate (FWER) focuses on the
possibility of making any
error among all the inferences performed, whereas the false discovery
rate (FDR) tells you what proportion of the rejected null hypotheses
are, on average, really true. Bounding the FDR rather than the
FWER may be a more reasonable choice when many inferences are
performed, especially if there is little expectation of harm from
falsely rejecting a null hypothesis. Thus bounding the FDR is increasingly being adopted in areas such as microarray gene-expression experiments and neuroimaging.
As with the FWER, there are various methods of actually bounding the
false discovery rate.[6]
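For concreteness, here is a rough Python sketch of the original Benjamini-Hochberg step-up procedure, with made-up p-values. It is only one of the methods alluded to above, and it was originally justified under independence-type conditions of the kind discussed in the references in note 6.

    # Benjamini-Hochberg step-up procedure for bounding the FDR at level q:
    # sort the p-values, find the largest rank i with p_(i) <= (i/m) * q,
    # and reject the null hypotheses with the i smallest p-values.
    import numpy as np

    def benjamini_hochberg(p_values, q=0.05):
        """Return a boolean array: True where the null hypothesis is rejected."""
        p = np.asarray(p_values)
        m = len(p)
        order = np.argsort(p)                         # ranks, smallest p first
        below = p[order] <= q * np.arange(1, m + 1) / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()            # largest qualifying rank (0-based)
            reject[order[: k + 1]] = True
        return reject

    p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p_vals, q=0.05))         # rejects only the two smallest here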
Efron has used the phrase "false discovery rate" in a slightly
different way in his development of empirical Bayes methods for dealing
with multiple inference.[7]
Multiple inference in regression:
Not accounting for multiple inference in regression is a common mistake. There are at least
three types of situations in which this often occurs:
1. Many stepwise variable selection methods
involve multiple inference. (The methods described in item (3) below
offer an alternative in some cases.)
2. An analysis may involve inference for more than one regression
coefficient. Often a good way to handle this is by using confidence regions.
These are generalizations of confidence intervals to more than one
dimension. For example, in simple linear regression, the confidence
region for the intercept β0 and
slope β1 is called a confidence ellipse. A 95%
confidence ellipse for β0 and β1
is an elliptical region obtained
by using a procedure which, for 95% of all random samples
suitable for the regression assumptions, produces a region containing the pair (β0, β1). (A rough sketch appears after this list.) Confidence regions are discussed
in a number of textbooks, and several software packages have the
capability of computing them.
3. An analysis may involve confidence intervals for conditional means at more than one value of the predictors. Many standard software packages
will allow the user to plot confidence
bands easily. These typically show confidence intervals for
conditional means that are calculated individually. However, when
considering more than one confidence interval, one needs instead simultaneous confidence bands,
which account for multiple inference. These are less well-known. A
discussion of several methods may be found in W. Liu (2011), Simultaneous Inference in Regression, CRC Press. Liu also has Matlab® programs for calculating the confidence bands available from his website. (See the sketch following this list for a comparison of individual and simultaneous intervals.)
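The following sketch (Python, with simulated data; it is not Liu's code, and the setup and variable names are illustrative assumptions) touches on items (2) and (3) above for simple linear regression: a membership test for the 95% confidence ellipse for (β0, β1), and a comparison of individual 95% confidence intervals for the conditional mean with the Working-Hotelling simultaneous band, one standard choice of simultaneous band.

    # Sketch illustrating items (2) and (3) in simple linear regression.
    # Standard least-squares formulas; the data are simulated for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 30)
    y = 2.0 + 0.5 * x + rng.normal(0, 1.0, x.size)   # true beta0 = 2, beta1 = 0.5

    X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    n, p = X.shape                                   # p = 2 parameters
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)   # residual variance estimate
    XtX = X.T @ X
    XtX_inv = np.linalg.inv(XtX)

    # (2) 95% confidence ellipse for (beta0, beta1), written as a membership test:
    # (beta_hat - beta)' X'X (beta_hat - beta) <= p * s2 * F_{0.95}(p, n - p).
    def in_confidence_ellipse(beta0, beta1, level=0.95):
        diff = beta_hat - np.array([beta0, beta1])
        return diff @ XtX @ diff <= p * s2 * stats.f.ppf(level, p, n - p)

    print("true (beta0, beta1) inside ellipse?", in_confidence_ellipse(2.0, 0.5))

    # (3) Individual 95% intervals vs. the Working-Hotelling simultaneous band
    # for the conditional mean at several predictor values.
    for x0 in (0.0, 2.5, 5.0, 7.5, 10.0):
        v = np.array([1.0, x0])
        se = np.sqrt(s2 * v @ XtX_inv @ v)                              # SE of fitted mean
        w_individual = stats.t.ppf(0.975, n - p) * se                   # one interval at a time
        w_simultaneous = np.sqrt(p * stats.f.ppf(0.95, p, n - p)) * se  # whole band at once
        print(f"x = {x0:4.1f}: individual ±{w_individual:.2f}, simultaneous ±{w_simultaneous:.2f}")

The simultaneous half-widths come out larger than the individual ones; that is the price of having all the intervals hold at once with 95% confidence.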
Subtleties and controversies
Bounding the overall Type I error rate (FWER) will reduce the
power of
the tests, compared to using individual Type I error rates.
Some researchers use this as an argument against multiple inference
procedures. The counterargument is the argument for
multiple inference procedures to begin with: Neglecting them will
produce excessive numbers of false findings, so that the "power" as
calculated from single tests is misleading.
Bounding the False Discovery Rate (FDR) will usually give higher power
than bounding the overall Type I error rate (FWER).
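A rough simulation can make this power difference concrete. The sketch below is illustrative only: the mix of 90 true and 10 false null hypotheses, the effect size, the seed, and the use of statsmodels' multipletests are all assumptions made for the example.

    # Comparing Bonferroni (FWER) control with Benjamini-Hochberg (FDR)
    # control on simulated one-sided z-test p-values.
    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(4)
    p_null = rng.uniform(size=90)                    # 90 true null hypotheses
    z_alt = rng.normal(loc=3.0, scale=1.0, size=10)  # 10 hypotheses with real effects
    p_alt = stats.norm.sf(z_alt)                     # one-sided p-values
    p = np.concatenate([p_null, p_alt])

    for method in ("bonferroni", "fdr_bh"):
        reject, *_ = multipletests(p, alpha=0.05, method=method)
        print(f"{method:10s}: {int(reject[90:].sum())} of 10 real effects detected, "
              f"{int(reject[:90].sum())} true nulls falsely rejected")

Typically the FDR-controlling procedure flags more of the real effects, at the cost of allowing a (bounded) proportion of false rejections.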
Consequently, it is important to consider the particular circumstances,
as in considering both Type I and Type II
errors in deciding significance levels. In particular, it is
important to consider the consequences of each type of error in the
context of the particular research. Examples:
- A research lab is using hypothesis
tests to screen
genes for possible candidates that may contribute to certain diseases.
Each gene identified as a possible candidate will undergo further
testing. If the results of the initial screening are not to be
published except in conjunction with the results of the secondary
testing, and if the secondary screening is inexpensive enough
that many second level tests can be run, then the researchers could
reasonably decide to ignore overall Type I error in the initial
screening tests, since there would be no harm or excessive expense in
having a high Type I error rate. However, if the secondary tests are
expensive, the researchers would reasonably decide to bound either
family-wise Type
I error rate or False Discovery Rate.
- Consider a variation of the situation in
Example 1: The
researchers are using hypothesis tests to screen genes as in Example 1,
but plan to publish the results of the screening without doing
secondary testing of the candidates identified. In this
situation, ethical considerations would warrant bounding either
the FWER or the FDR -- and
taking pains to emphasize in the published report that these results
are just of a preliminary screening for possible candidates, and that
these preliminary findings need to be confirmed by further
testing.
Notes:
1. A. M. Strasak et al. (The Use of Statistics in Medical Research, The American Statistician, February 1, 2007, 61(1): 47-55) report that, in an examination of 31 papers from the New England Journal of Medicine and 22 from Nature Medicine (all papers from 2004), 10 (32.3%) of those from NEJM and 6 (27.3%) of those from Nature Medicine were "Missing discussion of the problem of multiple significance testing if it occurred." These two journals are considered the top journals (by impact factor) in clinical science and in research and experimental medicine, respectively.
2. For a simulation illustrating this, see Jerry Dallal's demo. This simulates the results of 100 independent hypothesis tests, each
at 0.05 significance level. Click the "test/clear" button to see
the results of one set of 100 tests (that is, for one sample of data).
Click
the button two more times (first to clear
and then to do another simulation) to see the
results of another set of 100 tests (i.e., for another sample of data).
Notice as you continue to do this that (i) which tests give Type I errors (i.e., are statistically significant at the 0.05 level) varies from sample to sample, and (ii) which samples give Type I errors for a given test varies from test to test.
(To
see the latter point, it may help to focus just on the first column.)
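For readers who prefer code, here is a minimal stand-in (Python, not Dallal's program) in the same spirit: under a true null hypothesis the p-value is uniform on (0, 1), so drawing 100 of them and flagging those below 0.05 mimics one run of the demo.

    # Stand-in for the demo: several "samples", each consisting of 100
    # independent tests of a true null hypothesis at level 0.05.
    import numpy as np

    rng = np.random.default_rng(3)
    n_samples, n_tests, alpha = 5, 100, 0.05
    for s in range(n_samples):
        p_values = rng.uniform(size=n_tests)        # p-values are Uniform(0,1) under H0
        errors = np.flatnonzero(p_values < alpha)   # indices of Type I errors
        print(f"sample {s + 1}: {errors.size} Type I errors, at tests {errors.tolist()}")

On average about 5 of the 100 tests come out "significant" in each sample, and which ones do varies from sample to sample.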
3. Chapters 3 and 4 of B. Efron (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge (or see his Stats 329 notes) contain a summary of various attempts to deal with multiple inference.
Nichols, T. and S. Hayasaka (2003), Controlling the familywise error rate in functional neuroimaging: a comparative review, Statistical Methods in Medical Research 12, 419-446 (accessible at http://www.fil.ion.ucl.ac.uk/spm/doc/papers/NicholsHayasaka.pdf) gives a survey of Bonferroni-type methods and two other approaches (random field and permutation tests) to bounding FWER, focusing on applications in neuroimaging. They discuss model assumptions for each approach and present results of simulations to help users decide which method to use. The Mindhive webpage P threshold FAQ (accessible at http://mindhive.mit.edu/node/90 or http://mindhive.mit.edu/book/export/html/90; note: links from both pages seem to be broken) gives a less technical summary of the multiple-comparison problem, with summaries of some of Nichols and Hayasaka's discussion.
See also:
Hochberg, Y. and Tamhane, A. (1987), Multiple Comparison Procedures, Wiley.
Miller, R. G. (1981), Simultaneous Statistical Inference, 2nd Ed., Springer.
P. H. Westfall and S. S. Young (1993), Resampling-based Multiple Testing: Examples and Methods for p-Value Adjustment, Wiley.
B. Phipson and G. K. Smyth (2010), Permutation P-values Should Never Be Zero: Calculating Exact P-values when Permutations are Randomly Drawn, Statistical Applications in Genetics and Molecular Biology, Vol. 9, Iss. 1, Article 39, DOI: 10.2202/1544-6115.1585.
F. Bretz, T. Hothorn, P. Westfall (2010), Multiple Comparisons Using R, CRC Press.
S. Dudoit and M. J. van der Laan (2008), Multiple Testing Procedures with Applications to Genomics, Springer.
4. Y. Benjamini and Y. Hochberg (1995), Controlling the false discovery
rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society,
Series B (Methodological), Vol. 57, No. 1, 289-300.
5. "Expected value" is another term for mean of a distribution. Here,
the distribution is the sampling distribution of the ratio of falsely rejected hypotheses to all rejected hypotheses.
6. See, for example:
Y. Benjamini and D. Yekutieli
(2005), False Discovery
Rate–Adjusted Multiple Confidence Intervals for Selected
Parameters, Journal of the American
Statistical Association, March 1, 2005, 100(469): 71-81
Y. Benjamini and D. Yekutieli
(2001), The Control
of the False Discovery Rate in Multiple Testing under Dependency, The Annals of Statistics, Vol. 29, No. 4, 1165-1186.
Y. Benjamini and Y. Hochberg (1995) Controlling
the false discovery
rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society,
Series B (Methodological), Vol. 57, No. 1, 289-300.
7. B. Efron (2010), Large-Scale
Inference: Empirical Bayes Methods for Estimation, Testing, and
Prediction, Cambridge, or see his Stats 329 notes.
This page last revised 1/21/2013