COMMON ~~MISTEAKS~~ MISTAKES IN USING STATISTICS: Spotting and Avoiding Them

# Multiple Inference

"Recognize that any frequentist statistical test has a random chance of indicating significance when it is not really present. Running multiple tests on the same data set at the same stage of an analysis increases the chance of obtaining at least one invalid result. Selecting the one "significant" result from a multiplicity of parallel tests poses a grave risk of an incorrect conclusion. Failure to disclose the full extent of tests and their results in such a case would be highly misleading."
Professionalism Guideline 8, Ethical Guidelines for Statistical Practice, American Statistical Association, 1997

Performing more than one statistical inference procedure on the same data set is called multiple inference, or joint inference, or simultaneous inference, or multiple testing, or multiple comparisons, or the problem of multiplicity.

Performing multiple inference without adjusting the Type I error rate accordingly is a common error in research using statistics1.

The Problem

Recall that if you perform a hypothesis test using a certain significance level (we will use 0.05 for illustration), and if you obtain a p-value less than 0.05, then there are three possibilities:
1. The model assumptions for the hypothesis test are not satisfied in the context of your data.
2. The null hypothesis is false.
3. Your sample happens to be one of the 5% of samples satisfying the appropriate model conditions for which the hypothesis test gives you a Type I error.
Now suppose you are performing two hypothesis tests using the same data. Suppose that in fact all model assumptions are satisfied and both null hypotheses are true. There is in general no reason to believe that the samples giving a Type I error for one test will also give a Type I error for the other test.2 So we need to consider the joint Type I error rate:

Joint Type I error rate: The probability that a randomly chosen sample (of the given size, satisfying the appropriate model assumptions) will give a Type I error for at least one of the hypothesis tests performed.

The joint Type I error rate is also known as the overall Type I error rate, or  joint significance level, or the simultaneous Type I error rate, or the family-wise error rate (FWER), or the experiment-wise error rate, etc. The acronym FWER is becoming more and more common, so will be used in the sequel, often along with another name for the concept as well.
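For intuition, consider the idealized case of independent tests: if each of m tests is run at level α, the FWER is 1 − (1 − α)^m, which grows quickly with m. A small sketch (Python; independence is an assumption made for illustration, since tests on the same data are typically correlated):

```python
# FWER for m independent tests, each at significance level alpha:
# P(at least one Type I error) = 1 - (1 - alpha)**m.
# Independence is an idealization; tests run on the same data are
# usually correlated, so this is an illustration, not a general formula.
alpha = 0.05
for m in (1, 2, 5, 10, 100):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:3d} tests at level {alpha}: FWER = {fwer:.3f}")
```

Even at five tests, the chance of at least one false rejection is already above 20%.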

An especially serious form of neglect of the problem of multiple inference is the one alluded to in the quote above: Trying several tests and reporting just one significant test, without disclosing how many tests were performed or correcting the significance level to take into account the multiple inference.

The problem of multiple inference also occurs for confidence intervals. In this case, we need to focus on the confidence level. Recall that a 95% confidence interval is an interval obtained by using a procedure which, for 95% of all suitably random samples of the given size from the random variable and population of interest, produces an interval containing the parameter we are estimating (assuming the model assumptions are satisfied). In other words, the procedure does what we want (i.e., gives an interval containing the true value of the parameter) for 95% of suitable samples.

If we are using confidence intervals to estimate two parameters, there is no reason to believe that the 95% of samples for which the procedure "works" for one parameter (i.e., gives an interval containing the true value of the parameter) will be the same as the 95% of samples for which the procedure "works" for the other parameter. If we are calculating confidence intervals for more than one parameter, we can therefore talk about the joint (or overall or simultaneous or family-wise or experiment-wise) confidence level. For example, a group of confidence intervals (for different parameters) has an overall 95% confidence level (or 95% family-wise confidence level, etc.) if the intervals are calculated using a procedure which, for 95% of all suitably random samples of the given size from the population of interest, produces for each parameter an interval containing that parameter (assuming the model assumptions are satisfied).

Unfortunately, there is no simple formula to cover all cases: depending on the context, the samples giving Type I errors for two tests might be the same, they might have no overlap, or they could be somewhere in between. Various techniques for bounding the FWER (joint Type I error rate) or otherwise dealing with the problem of multiple inference have been devised for various special circumstances3. Only two methods will be discussed here.

Bonferroni method:

Fairly basic probability calculations can show that if the sum of the Type I error rates for different tests is less than α, then the overall Type I error rate (FWER) for the combined tests will be at most α.
• So, for example, if you are performing five hypothesis tests and would like an FWER (overall significance level) of at most 0.05, then using significance level 0.01 for each test will give an FWER (overall significance level) of at most 0.05.
• Similarly, if you are finding confidence intervals for five parameters and want an overall confidence level of 95%, using the 99% confidence level for each confidence interval will give you an overall confidence level of at least 95%. (Think of confidence level as 1 - α.)
The Bonferroni method can be used as a fall-back method when no other method is known to apply. However, if a method that applies to the specific situation is available, it will often be better (less conservative) than the Bonferroni method.
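As a concrete sketch (Python; the p-values below are hypothetical, chosen only for illustration), the Bonferroni method with five tests and a desired FWER of 0.05 compares each p-value to 0.05/5 = 0.01:

```python
# Bonferroni adjustment: to keep the FWER at most alpha over m tests,
# run each individual test at level alpha / m.
alpha = 0.05
p_values = [0.004, 0.013, 0.021, 0.048, 0.620]  # hypothetical p-values
m = len(p_values)
per_test_level = alpha / m  # 0.05 / 5 = 0.01 here
for p in p_values:
    verdict = "reject" if p < per_test_level else "do not reject"
    print(f"p = {p:.3f}: {verdict} at per-test level {per_test_level}")
```

Only p = 0.004 survives the correction, even though four of the five p-values fall below 0.05 individually.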

The Bonferroni method is also useful for apportioning the overall Type I error between different types of inference -- in particular, between pre-planned inference (the inference planned as part of the design of the study) and "data-snooping" inferences
(inferences based on looking at the data and noticing other things of interest). For example, to achieve an overall Type I error rate of .05, one might apportion an overall significance level of .04 to the pre-planned inference and .01 to data-snooping.  However, this apportioning should be done before analyzing the data.

Whichever method is used, it is important to make the calculations based on the number of tests that have been done, not just the number that are reported. (See Data Snooping for more discussion.)

False discovery rate:

An alternative to bounding Type I error was introduced by Benjamini and Hochberg in 19954: bounding the False Discovery Rate.

The False Discovery Rate (FDR) of a group of tests is the expected value5 of the ratio of falsely rejected hypotheses to all rejected hypotheses.

Note that the family-wise error rate (FWER) focuses on the possibility of making any error among all the inferences performed, whereas the false discovery rate (FDR) tells you what proportion of the rejected null hypotheses are, on average, really true. Bounding the FDR rather than the FWER may be a more reasonable choice when many inferences are performed, especially if there is little expectation of harm from falsely rejecting a null hypothesis. Thus it is increasingly being adopted in areas such as micro-array gene expression experiments or neuro-imaging.

As with the FWER, there are various methods of actually bounding the false discovery rate6.

Efron has used the phrase "false discovery rate" in a slightly different way in his development of empirical Bayes methods for dealing with multiple inference7.
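As one concrete method, the original Benjamini-Hochberg step-up procedure controls the FDR at a chosen level q (under independence): sort the m p-values in increasing order, find the largest index k with p(k) ≤ (k/m)q, and reject the hypotheses with the k smallest p-values. A sketch in Python (the p-values are hypothetical):

```python
# Benjamini-Hochberg step-up procedure at FDR level q.
# Sort the p-values; reject the k smallest, where k is the largest
# 1-based index with p_(k) <= (k / m) * q.
q = 0.05
p_values = [0.041, 0.001, 0.039, 0.008, 0.060, 0.074, 0.205, 0.042]  # hypothetical
m = len(p_values)
p_sorted = sorted(p_values)
k = 0
for i, p in enumerate(p_sorted, start=1):
    if p <= (i / m) * q:
        k = i  # the step-up rule keeps the *largest* such index
print(f"reject the {k} smallest p-values: {p_sorted[:k]}")
```

Here the Bonferroni cutoff 0.05/8 = 0.00625 would reject only p = 0.001, while Benjamini-Hochberg also rejects p = 0.008, illustrating the higher power of FDR control noted below.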

Multiple inference in regression:

Not accounting for multiple inference in regression is a common mistake. There are at least three types of situations in which this often occurs:

1. Many stepwise variable selection methods involve multiple inference. (The methods described in item (3) below offer an alternative in some cases.)

2. An analysis may involve inference for more than one regression coefficient. Often a good way to handle this is by using confidence regions. These are generalizations of confidence intervals to more than one dimension. For example, in simple linear regression, the confidence region for the intercept β0 and slope β1 is called a confidence ellipse. A 95% confidence ellipse for β0 and β1 is an elliptical region obtained by using a procedure which, for 95% of all random samples suitable for the regression assumptions, produces a region containing the pair (β0, β1). Confidence regions are discussed in a number of textbooks, and several software packages have the capability of computing them.

3. An analysis may consider confidence intervals for conditional means at more than one value of the predictors. Many standard software packages allow the user to plot confidence bands easily. These typically show confidence intervals for conditional means that are calculated individually. However, when considering more than one confidence interval, one needs instead simultaneous confidence bands, which account for multiple inference. These are less well known. A discussion of several methods may be found in W. Liu (2011), Simultaneous Inference in Regression, CRC Press. Liu also has Matlab® programs for calculating the confidence bands available from his website (click on the link to the book).
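To illustrate the difference in item (3), one classical simultaneous band, the Working-Hotelling band (among the methods treated in Liu's book), replaces the pointwise t multiplier with √(2F), which widens the band at every point. A sketch using SciPy (the sample size n = 30 is an arbitrary illustrative choice):

```python
# Pointwise vs. Working-Hotelling simultaneous multipliers for
# confidence bands on the conditional mean in simple linear regression.
# The half-width of the band at a point x is (multiplier) * SE(fitted mean at x).
from scipy import stats

n, alpha = 30, 0.05  # illustrative sample size and significance level
df = n - 2           # residual degrees of freedom in simple regression
t_mult = stats.t.ppf(1 - alpha / 2, df)               # pointwise interval
wh_mult = (2 * stats.f.ppf(1 - alpha, 2, df)) ** 0.5  # simultaneous band
print(f"pointwise multiplier:         {t_mult:.3f}")
print(f"Working-Hotelling multiplier: {wh_mult:.3f}")
```

The simultaneous multiplier is larger, so the band is wider everywhere; that is the price of covering the entire regression line at once rather than one conditional mean at a time.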

Subtleties and controversies

Bounding the overall Type I error rate (FWER) will reduce the power of the tests, compared to using individual Type I error rates. Some researchers use this as an argument against multiple inference procedures. The counterargument is the argument for multiple inference procedures to begin with: Neglecting them will produce excessive numbers of false findings, so that the "power" as calculated from single tests is misleading.

Bounding the False Discovery Rate (FDR) will usually give higher power than bounding the overall Type I error rate (FWER).

Consequently, it is important to consider the particular circumstances, as in considering both Type I and Type II errors in deciding significance levels.  In particular, it is important to consider the consequences of each type of error in the context of the particular research. Examples:
1. A research lab is using hypothesis tests to screen genes for possible candidates that may contribute to certain diseases. Each gene identified as a possible candidate will undergo further testing. If the results of the initial screening are not to be published except in conjunction with the results of the secondary testing, and if the secondary screening is inexpensive enough that many second-level tests can be run, then the researchers could reasonably decide to ignore overall Type I error in the initial screening tests, since there would be no harm or excessive expense in having a high Type I error rate. However, if the secondary tests are expensive, the researchers would reasonably decide to bound either the family-wise Type I error rate or the False Discovery Rate.
2. Consider a variation of the situation in Example 1: The researchers are using hypothesis tests to screen genes as in Example 1, but plan to publish the results of the screening without doing secondary testing of the candidates identified. In this situation, ethical considerations would warrant bounding either the FWER or the FDR -- and taking pains to emphasize in the published report that these results are just of a preliminary screening for possible candidates, and that these preliminary findings need to be confirmed by further testing.

Notes:
1. A. M. Strasak et al. (The Use of Statistics in Medical Research, The American Statistician, February 1, 2007, 61(1): 47-55) report that, in an examination of 31 papers from the New England Journal of Medicine and 22 from Nature Medicine (all papers from 2004), 10 (32.3%) of those from NEJM and 6 (27.3%) from Nature Medicine were "Missing discussion of the problem of multiple significance testing if occurred."
These two journals are considered the top journals (according to impact factor) in clinical science and in research and experimental medicine, respectively.

2. For a simulation illustrating this, see Jerry Dallal's demo. This simulates the results of 100 independent hypothesis tests, each at 0.05 significance level. Click the "test/clear" button to see the results of one set of 100 tests (that is, for one sample of data). Click the button two more times (first to clear and then to do another simulation) to see the results of another set of 100 tests (i.e., for another sample of data). Notice as you continue to do this that (i) which tests give Type I errors (i.e., are statistically significant at the 0.05 level) varies from sample to sample, and (ii) which samples give Type I errors for a given test varies from test to test. (To see the latter point, it may help to focus just on the first column.)
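A rough Python analogue of the simulation described above (a sketch, not Dallal's applet itself): when every null hypothesis is true, each test comes out "significant" with probability 0.05, so each simulated sample typically shows a few false positives, in different positions each time.

```python
# Simulate several samples' worth of 100 independent tests at level 0.05,
# with all null hypotheses true: each test commits a Type I error
# with probability 0.05.
import random

random.seed(1)  # arbitrary seed, so the sketch is reproducible
n_tests, alpha = 100, 0.05
for sample in range(1, 4):
    errors = [i for i in range(n_tests) if random.random() < alpha]
    print(f"sample {sample}: {len(errors)} false positives at tests {errors}")
```

Which tests come out significant changes from sample to sample, which is exactly the behavior the demo is meant to show; on average about 5 of the 100 tests are falsely significant.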

3. Chapters 3 and 4 of B. Efron (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge (or see his Stats 329 notes) contain a summary of various attempts to deal with multiple inference.
Nichols, T. and S. Hayasaka (2003), Controlling the familywise error rate in functional neuroimaging: a comparative review, Statistical Methods in Medical Research 12: 419-446 (accessible at http://www.fil.ion.ucl.ac.uk/spm/doc/papers/NicholsHayasaka.pdf) gives a survey of Bonferroni-type methods and two other approaches (random field and permutation tests) to bounding FWER, focusing on applications in neuroimaging. They discuss model assumptions for each approach and present results of simulations to help users decide which method to use. The Mindhive webpage P threshold FAQ (accessible at http://mindhive.mit.edu/node/90 or http://mindhive.mit.edu/book/export/html/90; note: links from both pages seem to be broken) gives a less technical summary of the multiple-comparison problem, with summaries of some of Nichols and Hayasaka's discussion.
Hochberg, Y. and Tamhane, A. (1987) Multiple Comparison Procedures, Wiley
Miller, R.G. (1981) Simultaneous Statistical Inference 2nd Ed., Springer
P. H. Westfall and S. S. Young (1993), Resampling-based Multiple Testing: Examples and Methods for p-Value Adjustment, Wiley
B. Phipson and G. K. Smyth (2010), Permutation P-values Should Never Be Zero: Calculating Exact P-values when Permutations are Randomly Drawn, Statistical Applications in Genetics and Molecular Biology, Vol. 9, Iss. 1, Article 39, DOI: 10.2202/1544-6155.1585
F. Bretz, T. Hothorn, P. Westfall (2010), Multiple Comparisons Using R, CRC Press
S. Dudoit and M. J. van der Laan (2008), Multiple Testing Procedures with Application to Genomics, Springer

4. Y. Benjamini and Y. Hochberg (1995), Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 57 No. 1, 289 - 300.

"Expected value" is another term for the mean of a distribution. Here, the distribution is the sampling distribution of the ratio of falsely rejected hypotheses to all rejected hypotheses.

6. See, for example:
Y. Benjamini and D. Yekutieli (2005), False Discovery Rate–Adjusted Multiple Confidence Intervals for Selected Parameters, Journal of the American Statistical Association, March 1, 2005, 100(469): 71-81
Y. Benjamini and D. Yekutieli (2001), The Control of the False Discovery Rate in Multiple Testing under Dependency, The Annals of Statistics, Vol. 29, No. 4, 1165-1186.
Y. Benjamini and Y. Hochberg (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 57 No. 1, 289 - 300

7. B. Efron (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge, or see his Stats 329 notes