COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
Common Mistakes involving Power
1. Rejecting a null hypothesis without considering practical significance.
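A minimal numerical sketch of this point (the helper calculation and all numbers here are made up for illustration, not taken from the page's references): a difference far too small to matter in practice can still be highly statistically significant when the sample is very large.

    # Two-sample z statistic for a hypothetical mean difference of 0.2 units
    # (assumed practically negligible) with standard deviation 10 in each of
    # two very large groups.
    from scipy.stats import norm
    import math

    delta, sigma, n = 0.2, 10.0, 200_000       # made-up summary numbers
    z = delta / (sigma * math.sqrt(2.0 / n))   # two-sample z statistic
    p = 2 * norm.sf(abs(z))                    # two-sided p-value
    print(f"z = {z:.2f}, p = {p:.1e}")         # z ≈ 6.32, p ≈ 3e-10

The tiny p-value says only that the difference is unlikely to be exactly zero, not that a difference of 0.2 units matters in practice.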
2. Accepting a null hypothesis when a result is not statistically significant, without taking power into account.
Since smaller samples yield lower power, a study with a small sample may not be able to detect an important difference. If there is strong evidence that the procedure has high power to detect a difference of practical importance, then accepting the null hypothesis may be appropriate (see Note 1 below); otherwise it is not -- all we can legitimately say then is that we fail to reject the null hypothesis.
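To illustrate (a rough sketch using the normal approximation; the helper function and numbers are made up): the same difference of practical importance that a moderately large study detects reliably is usually missed by a small one.

    from scipy.stats import norm
    import math

    def approx_power(delta, sigma, n_per_group, alpha=0.05):
        """Approximate power of a two-sided two-sample comparison of means."""
        z_alpha = norm.isf(alpha / 2)   # two-sided critical value
        return norm.cdf(delta / (sigma * math.sqrt(2.0 / n_per_group)) - z_alpha)

    for n in (10, 100):                 # per-group sample sizes
        print(n, round(approx_power(delta=5, sigma=10, n_per_group=n), 2))
    # n = 10  -> power ≈ 0.20 (an important difference is very likely missed)
    # n = 100 -> power ≈ 0.94

With power around 0.20, a non-significant result says almost nothing about whether an important difference exists.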
3. Being convinced by a research study with low power.
4. Neglecting to do a power analysis/sample size calculation before collecting data.
Without a power analysis, you may end up with a result that does not really answer the question of interest: the result may not be statistically significant simply because the study did not have enough power to detect a difference of practical importance. You might also waste resources by using a sample size that is larger than needed to detect a relevant difference.
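A minimal sketch of such a calculation, using the usual normal approximation and assumed inputs (a smallest difference of practical importance of 5 units and a standard deviation of about 10 units; the helper function is illustrative, not from the cited references):

    from scipy.stats import norm
    import math

    def n_per_group(delta, sigma, alpha=0.05, power=0.80):
        """Approximate per-group n to detect a difference delta with the given power."""
        z_alpha = norm.isf(alpha / 2)   # two-sided critical value
        z_beta = norm.isf(1 - power)    # quantile corresponding to the target power
        return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

    print(n_per_group(delta=5, sigma=10))   # ≈ 63 per group for 80% power

Dedicated software or tables give more exact answers for specific tests, but a rough calculation like this done before collecting data is far better than none.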
5. Neglecting to take multiple inference into account when calculating power.
If more than one inference procedure will be used on a data set, then the power calculations need to take that into account: doing a power calculation as if only one inference were planned will result in an underpowered study (see the sketch after the references below). For more detail, see
- Maxwell, S. E. and K. Kelley (2011), Ethics and Sample Size Planning, Chapter 6 (pp. 159-183) in Panter, A. T. and S. K. Sterba (eds.), Handbook of Ethics in Quantitative Methodology, Routledge.
- Maxwell, S. E. (2004), The persistence of underpowered studies in psychological research: Causes, consequences, and remedies, Psychological Methods 9(2), 147-163.
For discussion of power analysis when using Efron's version of false discovery rate, see Section 5.4 of B. Efron (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge University Press, or see his Stats 329 notes.
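As a minimal sketch of the point above (normal approximation, made-up numbers, illustrative helper function): if k tests will be run with a Bonferroni correction, power must be computed at alpha/k rather than alpha, and a study sized for a single test comes up short.

    from scipy.stats import norm
    import math

    def approx_power(delta, sigma, n_per_group, alpha):
        z_alpha = norm.isf(alpha / 2)   # two-sided critical value
        return norm.cdf(delta / (sigma * math.sqrt(2.0 / n_per_group)) - z_alpha)

    n, delta, sigma, k = 64, 5, 10, 5   # n chosen to give ~80% power for ONE test
    print(round(approx_power(delta, sigma, n, alpha=0.05), 2))      # ≈ 0.81
    print(round(approx_power(delta, sigma, n, alpha=0.05 / k), 2))  # ≈ 0.60 after Bonferroni

To keep the planned power for all k comparisons, the sample size has to be recomputed at the adjusted significance level (or, for false discovery rate procedures, by the methods discussed in the references above).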
6. Using standardized effect sizes rather than considering the particulars of the question being studied.
"Standardized effect sizes"
(sometimes
called "canned" effect sizes) are expressions involving more
than one of the factors that needs to be taken into consideration in
considering appropriate levels of Type I and Type II error in deciding
on power and sample size. Examples:
- Cohen's effect size d is the ratio of the raw effect size (e.g., the difference in means when comparing two groups) to the error standard deviation. But each of these typically needs to be considered individually in designing a study and determining power; it is not necessarily the ratio that is important (see Note 2 below and the sketch after the references below).
- The correlation (or squared correlation) in regression. The
correlation in simple linear regression involves three quantities: the
slope, the y standard deviation, and the x standard deviation. Each of
these three typically needs to be considered individually in designing
the study and determining power and sample size. In multiple
regression, the situation may be even more complex.
For specific examples illustrating these points, see:
- Lenth, Russell V. (2001), Some Practical Guidelines for Effective Sample Size Determination, The American Statistician 55(3), 187-193. (An early draft is available online.)
- Lenth, Russell V. (2000), Two Sample-Size Practices that I Don't Recommend, comments from a panel discussion at the 2000 Joint Statistical Meetings in Indianapolis.
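A minimal numerical sketch of the first bullet above (made-up numbers, normal approximation, illustrative helper function): two situations with very different raw differences give the same Cohen's d, and therefore the same "canned" power, even though whether each raw difference matters must be judged on its own scale.

    from scipy.stats import norm
    import math

    def approx_power(d, n_per_group, alpha=0.05):
        """Power of a two-sample comparison depends on delta and sigma only through d."""
        return norm.cdf(d * math.sqrt(n_per_group / 2.0) - norm.isf(alpha / 2))

    study_a = dict(delta=5.0, sigma=10.0)   # e.g., 5 units against sd 10
    study_b = dict(delta=0.05, sigma=0.1)   # a raw difference 100 times smaller
    for s in (study_a, study_b):
        d = s["delta"] / s["sigma"]                          # both give d = 0.5
        print(d, round(approx_power(d, n_per_group=64), 2))  # same power ≈ 0.81

The power calculation cannot tell you whether 5 units or 0.05 units is a difference worth detecting; that judgment has to come from the subject matter.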
7. Confusing retrospective power and prospective power.
- Power as defined above for a hypothesis test is also called prospective or a priori power. It is a conditional probability, P(reject H0 | Ha is true), calculated without using the data to be analyzed. (In fact, it is best calculated before even gathering the data, and taken into account in the data-gathering plan.)
- Retrospective power is calculated after the data have been collected, using the data.
- Depending on how it is calculated, retrospective power might legitimately be used to estimate the power and sample size for a future study, but it cannot legitimately be used as describing the power of the study from which it is calculated.
- However, some methods of calculating retrospective power calculate the power to detect the effect observed in the data -- which misses the whole point of considering practical significance. These methods typically yield simply a transformation of the p-value (as the sketch below illustrates). See Lenth, Russell V. (2000), Two Sample-Size Practices that I Don't Recommend, for more detail.
- See J. M. Hoenig and D. M. Heisey (2001), "The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis," The American Statistician 55(1), 19-24, and the Stat Help Page "Retrospective (Observed) Power Analysis" for more discussion and further references.
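A minimal sketch of the "transformation of the p-value" point, written for a two-sided z-test (the helper function and the p-values fed to it are illustrative): "observed power" computed at the observed effect depends on the data only through the p-value, so it adds nothing beyond the test result itself.

    from scipy.stats import norm

    def observed_power(p_value, alpha=0.05):
        """'Observed power' for a two-sided z-test, as a function of the p-value alone."""
        z_obs = norm.isf(p_value / 2)    # recover |z| from the two-sided p-value
        z_alpha = norm.isf(alpha / 2)
        return norm.cdf(z_obs - z_alpha) + norm.cdf(-z_obs - z_alpha)

    for p in (0.01, 0.05, 0.20, 0.50):
        print(p, round(observed_power(p), 2))   # 0.73, 0.50, 0.25, 0.10
    # A p-value equal to alpha always maps to observed power 0.5;
    # a larger p-value maps to lower "observed power".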
Notes
1. In many cases, however, it would be best to use a test for equivalence. For more information, see:
2. See also Figure 1 of Richard H. Browne (2010), The t-Test p Value and Its Relationship to the Effect Size and P(X>Y), The American Statistician 64(1), p. 31. This shows that, for the two-sample t-test, Cohen's classification of d = 0.8 as "large" still gives substantial overlap between the two distributions being compared; d needs to be close to 4 to result in minimal overlap of the distributions.
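A minimal sketch checking this point numerically (overlap here means the overlapping area of two unit-variance normal distributions whose means differ by d; P(X > Y) is the probability that a random draw from the higher group exceeds one from the lower group; the values of d are illustrative):

    from scipy.stats import norm
    import math

    for d in (0.2, 0.5, 0.8, 2.0, 4.0):
        overlap = 2 * norm.cdf(-d / 2)          # overlapping area of N(0,1) and N(d,1)
        p_x_gt_y = norm.cdf(d / math.sqrt(2))   # P(X > Y) for independent draws
        print(f"d = {d}: overlap ≈ {overlap:.2f}, P(X > Y) ≈ {p_x_gt_y:.2f}")
    # d = 0.8 -> overlap ≈ 0.69, P(X > Y) ≈ 0.71;  d = 4.0 -> overlap ≈ 0.05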
Last updated August 28, 2012