COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them

## Common Mistakes involving Power

1. Rejecting a null hypothesis without considering practical significance.

A study with a large enough sample size will have high enough power to detect minuscule differences that are of no practical significance. Since power typically increases with sample size, practical significance is important to consider when interpreting a statistically significant result. See Type I and II Errors and Sample size calculations to plan an experiment (GraphPad.com) for examples.
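The point can be illustrated with a minimal sketch using a normal approximation to the power of a two-sided two-sample test; the effect size d = 0.05 is hypothetical, chosen to be far below any plausible practical significance:

```python
from statistics import NormalDist

def power_two_sample(d, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test to detect a
    standardized mean difference d with n subjects per group (normal
    approximation; an illustrative sketch, not an exact t-test power)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(d) * (n / 2) ** 0.5  # noncentrality under the alternative
    # probability the test statistic falls beyond either critical value
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)

# As n grows, power to "detect" this trivial difference approaches 1.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(power_two_sample(0.05, n), 3))
```

With a hundred thousand subjects per group, the test is nearly certain to declare this trivial difference significant; statistical significance alone says nothing about practical importance.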

2. Accepting a null hypothesis when a result is not statistically significant, without taking power into account.

Since smaller samples yield lower power, a small study may be unable to detect an important difference. If there is strong evidence that the procedure has high power to detect a difference of practical importance, then accepting the null hypothesis may be appropriate (see Note 1); otherwise it is not -- all we can legitimately say then is that we fail to reject the null hypothesis.
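A quick sketch (same normal approximation as above; the numbers are hypothetical) shows how weak a small study can be even against an effect that would matter in practice:

```python
from statistics import NormalDist

def power_two_sample(d, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (normal
    approximation; an illustrative sketch only)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(d) * (n / 2) ** 0.5
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)

# A practically important effect (d = 0.5, hypothetical) with only
# 20 subjects per group: power is roughly 0.35.
print(round(power_two_sample(0.5, 20), 2))
```

With power around one third, a non-significant result is entirely unsurprising even when the important difference is real, so it is weak evidence for the null hypothesis.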

3. Being convinced by a research study with low power.

As discussed under Detrimental Effects of Underpowered Studies, underpowered studies are likely to give inconsistent results and are often misleading.

4. Neglecting to do a power analysis/sample size calculation before collecting data.

Without a power analysis, you may end up with a result that does not really answer the question of interest: you may obtain a result that is not statistically significant, yet the study may have been unable to detect a difference of practical significance. You might also waste resources by using a sample size larger than is needed to detect a relevant difference.
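A basic sample size calculation can be sketched as follows (normal approximation for a two-sided two-sample test; the target effect size, power, and significance level are illustrative inputs that should come from the subject matter):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.8, alpha=0.05):
    """Approximate sample size per group so a two-sided two-sample
    z-test has the stated power against standardized difference d
    (normal approximation; a planning sketch, not an exact t-test rule)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# About 63 per group for d = 0.5 at 80% power, alpha = 0.05.
print(n_per_group(0.5))
```

Doing this calculation before gathering data shows both whether the planned study can detect the difference that matters and whether a smaller study would suffice.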

5. Neglecting to take multiple inference into account when calculating power.

If more than one inference procedure is used for a data set, then power calculations need to take that into account. Doing a power calculation for just one inference will result in an underpowered study. For more detail, see:
• Maxwell, S. E. and K Kelley (2011), Ethics and Sample Size Planning, Chapter 6 (pp. 159 - 183) in Panter, A. T. and S. K. Sterba, Handbook of Ethics in Quantitative Methodology, Routledge
• Maxwell, S.E. (2004), The persistence of underpowered studies in psychological research: Causes, consequences, and remedies, Psychological Methods 9 (2), 147 - 163.
For discussion of power analysis when using Efron's version of false discovery rate, see Section 5.4 of B. Efron (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge, or see his Stats 329 notes
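One common way multiple inference affects power is through the adjusted significance level. A minimal sketch, assuming a Bonferroni correction over a hypothetical number of tests and the same normal-approximation power function as above:

```python
from statistics import NormalDist

def power_two_sample(d, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (normal
    approximation; an illustrative sketch only)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(d) * (n / 2) ** 0.5
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)

m = 5  # hypothetical number of tests on the same data set
# Power for one test at the nominal level vs. at the Bonferroni level:
print(round(power_two_sample(0.5, 64, alpha=0.05), 2))      # single test
print(round(power_two_sample(0.5, 64, alpha=0.05 / m), 2))  # Bonferroni
```

A sample size adequate for a single test at the nominal level can leave each Bonferroni-adjusted test noticeably underpowered, so the adjusted level is the one that belongs in the planning calculation.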

6. Using standardized effect sizes rather than considering the particulars of the question being studied.

"Standardized effect sizes" (sometimes called "canned" effect sizes) are expressions involving more than one of the factors that needs to be taken into consideration in considering appropriate levels of Type I and Type II error in deciding on power and sample size. Examples
• Cohen's effect size d is the ratio of the raw effect size (e.g., the difference in means when comparing two groups) to the error standard deviation. But each of these typically needs to be considered individually in designing a study and determining power; it is not necessarily the ratio that is important (see Note 2).
• The correlation (or squared correlation) in regression. The correlation in simple linear regression involves three quantities: the slope, the y standard deviation, and the x standard deviation. Each of these three typically needs to be considered individually in designing the study and determining power and sample size. In multiple regression, the situation may be even more complex.
For specific examples illustrating these points, see:
Lenth, Russell V. (2001) Some Practical Guidelines for Effective Sample Size Determination, American Statistician, 55(3), 187 - 193 (Early draft available here.)
Lenth, Russell V. (2000) Two Sample-Size Practices that I Don't Recommend, comments from panel discussion at the 2000 Joint Statistical Meetings in Indianapolis.
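The point about Cohen's d above can be made concrete with a small sketch (the two scenarios and their numbers are hypothetical):

```python
# Two hypothetical study scenarios with identical Cohen's d = 0.5,
# built from very different raw ingredients.
scenarios = [
    {"raw_diff": 5.0, "sd": 10.0},  # e.g., a 5-point difference, SD 10
    {"raw_diff": 0.5, "sd": 1.0},   # e.g., a 0.5-point difference, SD 1
]
for s in scenarios:
    d = s["raw_diff"] / s["sd"]
    print(s, "-> d =", d)
```

Any power calculation driven by d alone treats these scenarios identically, yet whether a 5-point or a 0.5-point raw difference matters depends entirely on the subject matter; that judgment is lost once the two quantities are collapsed into one ratio.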

7. Confusing retrospective power and prospective power.

• Power as defined above for a hypothesis test is also called prospective or a priori power. It is a conditional probability, P(reject H0 | Ha), calculated without using the data to be analyzed. (In fact, it is best calculated before even gathering the data, and taken into account in the data-gathering plan.)
• Retrospective power is calculated after the data have been collected, using the data.
• Depending on how it is calculated, retrospective power might legitimately be used to estimate the power and sample size for a future study, but it cannot legitimately be used to describe the power of the study from which it is calculated.
• However, some methods of calculating retrospective power compute the power to detect the effect observed in the data -- which misses the whole point of considering practical significance. These methods typically yield simply a transformation of the p-value. See Lenth, Russell V. (2000), Two Sample-Size Practices that I Don't Recommend, for more detail.
• See J. M. Hoenig and D. M. Heisey (2001) "The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis," The American Statistician 55(1), 19-24 and the Stat Help Page "Retrospective (Observed) Power Analysis" for more discussion and further references.
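The "transformation of the p-value" point can be verified directly in a sketch: for a two-sided z-test, "observed power" computed from the observed effect is a deterministic function of the p-value alone (the function below is illustrative):

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """'Retrospective' power computed from the observed effect in a
    two-sided z-test: it depends only on the p-value, illustrating why
    it adds no information beyond the test result itself (sketch)."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(z_obs - z_crit) + nd.cdf(-z_obs - z_crit)

# A result exactly at p = alpha always has "observed power" of about 0.5.
print(round(observed_power(0.05), 2))
print(round(observed_power(0.20), 2))
```

Since the observed power is just a relabeling of the p-value, claims such as "the test was non-significant but the observed power was low, so the study was inconclusive" are circular: low observed power is guaranteed whenever p is large.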

Notes
1. In many cases, however, it would be best to use a test for equivalence.
2. See also Figure 1 of Richard H. Browne, The t-Test p Value and Its Relationship to the Effect Size and P(X>Y), The American Statistician, February 1, 2010, 64(1), p. 31. This shows that, for the two-sample t-test, Cohen's classification of "large" d as 0.8 still gives substantial overlap between the two distributions being compared; d needs to be close to 4 to result in minimal overlap of the distributions.

Last updated August 28, 2012