COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
Data Snooping
Data snooping refers to statistical inference that the researcher decides
to perform after looking at the data (as contrasted with pre-planned
inference, which the researcher plans before looking at the data).
Data snooping can be done
professionally and ethically, or misleadingly and unethically, or
misleadingly out of ignorance. Misleading data snooping done out of
ignorance is a common error in using statistics. The problems with data
snooping are essentially the problems of multiple inference.
One way in which researchers unintentionally obtain misleading results
by data snooping is by failing to account for all of the data snooping they
engage in. In particular, in accounting for Type I error when data
snooping, you need to count not
just the actual hypothesis tests performed, but also all comparisons
looked at when deciding which post hoc (i.e., not pre-planned)
hypothesis tests to try.
Example:
A group of researchers plans to compare three dosages of a drug in a
clinical trial. There is no
pre-planned intent to compare effects broken down by sex, but the sex
of the subjects is recorded. The researchers have decided to have an
overall Type I error rate of 0.05, allowing 0.03 for the pre-planned
inferences and 0.02 for any data snooping they might decide to do. The
pre-planned comparison shows no statistically significant difference
between the three dosages when the data are not broken down by sex.
However, since the researchers have recorded the sex of the patients, they
decide to look at the outcomes broken down by combination of sex and dosage.
They notice that the results for women in the high-dosage group look much
better than the results for the men in the low-dosage group, and
perform a hypothesis test to check that out. In accounting for Type I
error, the researchers need to take the number of data-snooping inferences
performed as 15, not one. The reason is that they have looked at fifteen
comparisons: there are 3×2 = 6 dosage-by-sex combinations, and hence
(6×5)/2 = 15 pairs of dosage-by-sex combinations. Thus the significance
level for the post hoc test should not be 0.02, but 0.02/15 (see Note 1).
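The counting in this example can be checked with a short sketch. It is only an illustration of the arithmetic above: the dosage label "medium" is an assumption (the example names only the high and low dosages), and the even split of the snooping allowance is the Bonferroni approach that Note 1 refers to.

```python
from itertools import combinations

dosages = ["low", "medium", "high"]  # three dosages; "medium" is an assumed label
sexes = ["female", "male"]

# Every dosage-by-sex cell the researchers looked at: 3 x 2 = 6.
cells = [(dosage, sex) for dosage in dosages for sex in sexes]

# Every pair of cells is a comparison they implicitly examined: (6 x 5) / 2 = 15.
pairs = list(combinations(cells, 2))

snooping_alpha = 0.02  # Type I error reserved for data snooping in the example
per_test_alpha = snooping_alpha / len(pairs)  # Bonferroni: divide evenly

print(len(cells), len(pairs), round(per_test_alpha, 5))  # prints: 6 15 0.00133
```

The single test the researchers actually ran is therefore judged against roughly 0.0013, not 0.02.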
For some discussions of multiple inference and data snooping with a
humorous slant, see:
Suggestions for data snooping professionally and ethically
I. Educate yourself on the limitations of statistical inference: model
assumptions, the problems of Type I and Type II errors, power, and multiple
inference, including the "hidden comparisons" that may be involved
in data snooping (as in the above example).
II. Plan your study to take into account
the problems involving model assumptions, Type I and II errors, power,
and multiple inference. Some specifics to consider:
a. If you will be gathering data, decide before gathering the data:
- What questions you are trying to answer.
- How you will gather the data, and the inference procedures you intend to
use to help answer your questions. These need to be planned together, to
maximize the chances that the data will fit the model assumptions of the
inference procedures.
- Whether or not you will engage in data snooping.
- The Type I error rate (or the false discovery rate) and power that would
be appropriate (considering the consequences of Type I and Type II errors
in the situation you are studying). Be sure to allow some portion of Type I
error for any data snooping you think you might do. Then do a power
analysis to see what sample size is needed to meet these criteria.
- Take into account any relevant considerations such as intent-to-treat
analysis or how you will deal with missing data.
- If the sample size needed is too large for your resources, you will need
to either obtain additional resources or scale back the aims of your study.
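As a rough illustration of the power analysis step in (a), the sketch below uses the standard normal-approximation formula for a two-sided comparison of two group means with equal group sizes. The function name `n_per_group`, the standardized effect size of 0.5, and the 80% power target are all assumptions chosen for illustration; a real study would use the power calculation that matches its own inference procedure.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha, power):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes).

    effect_size is the difference in means divided by the common
    standard deviation (a hypothetical planning value).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Reserving part of an overall 0.05 for data snooping leaves less
# (here 0.03, as in the example above) for the pre-planned test.
n_full_alpha = n_per_group(effect_size=0.5, alpha=0.05, power=0.80)
n_reduced_alpha = n_per_group(effect_size=0.5, alpha=0.03, power=0.80)

print(n_full_alpha, n_reduced_alpha)
```

The point of the comparison is that setting aside Type I error for data snooping raises the sample size needed for the pre-planned inference, which is why the allowance has to be decided before the power analysis.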
b. If you plan to use
existing data, you will need to go
through a process similar to that in (a) before looking at the data:
- Decide what questions you are trying to
answer.
- Find out how the data were gathered.
- Decide on inference
procedures that i) will address your questions of interest and ii) have
model assumptions compatible with how the data were collected. If this turns out to be impossible, the
data are not suitable.
- Decide whether or not you will engage in data snooping.
- Decide the Type I error rate (or false discovery rate) and power that
would be appropriate (considering the consequences of Type I and Type II
errors in the situation you are studying). Remember to allow some portion
of Type I error for data snooping, if you are likely to engage in any.
Then do a power analysis to see what sample size is needed to meet these
criteria.
- Take into account any relevant considerations such as intent-to-treat
analysis or how you will deal with missing data.
- If the sample size needed is larger than the available data set, you will
need to either scale back the aims of your study, or find or create another
larger data set.
c. If data snooping is intended to be the purpose or an important part of
your study, then before you look at the data, divide it randomly into two
parts: one to be used for discovery purposes (generating hypotheses), the
other to be used for confirmatory purposes (testing hypotheses).
- Be careful to do the randomization in a manner that preserves the
structure of the data. For example, if you have students nested in schools
nested in school districts, you need to preserve the nesting: if a
particular student is assigned to one group (discovery or confirmatory),
then the student's school and school district need to be assigned to the
same group.
- Using a Type I error rate or false discovery rate may not be obligatory
in the discovery phase, but may be practical to help you keep the number of
hypotheses you generate down to a level that you will be able to test (with
a reasonable bound on Type I error rate or false discovery rate, and a
reasonable power) in the confirmatory phase.
- A preliminary consideration of Type I
errors and power
should be done to help you make sure that your confirmatory data set is
large enough. Be sure to then give further thought to consequences of
Type I and II errors for the hypotheses you generate with the discovery
data set, and set an overall Type I error rate (or false discovery
rate) for the confirmatory stage.
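The structure-preserving random split described in (c) can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `split_by_top_level`, the record layout, and the toy district, school, and student identifiers are all hypothetical. The idea is to randomize at the outermost level of nesting (here, districts) so that every school and student follows its district into the same part.

```python
import random


def split_by_top_level(records, top_key, seed=0, discovery_fraction=0.5):
    """Randomly assign whole top-level clusters (e.g., districts) to the
    discovery or confirmatory set, so nested units are never separated."""
    clusters = sorted({record[top_key] for record in records})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(clusters)
    n_discovery = round(len(clusters) * discovery_fraction)
    discovery_ids = set(clusters[:n_discovery])
    discovery = [r for r in records if r[top_key] in discovery_ids]
    confirmatory = [r for r in records if r[top_key] not in discovery_ids]
    return discovery, confirmatory


# Toy data (hypothetical): 4 districts x 2 schools x 3 students.
records = [
    {"district": d, "school": f"{d}-S{s}", "student": f"{d}-S{s}-P{p}"}
    for d in ["D1", "D2", "D3", "D4"]
    for s in range(2)
    for p in range(3)
]

discovery, confirmatory = split_by_top_level(records, "district", seed=42)

# No district (and hence no school or student) appears in both parts.
shared = {r["district"] for r in discovery} & {r["district"] for r in confirmatory}
print(len(discovery), len(confirmatory), shared)
```

Note that the fraction applies to clusters, not to individual records, so the two parts may differ in size when districts are unequal.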
III. Report your results carefully, aiming for honesty and transparency
- State clearly the questions you set out to
study.
- State your methods, and your reasons for choosing those methods (e.g.,
why you chose the inference procedures you used; why you chose the Type I
error rate and power that you used).
- Give details of how your data were
collected.
- State clearly what (if anything) was data snooping, and how you accounted
for it in the overall Type I error rate or false discovery rate.
- Include a "limitations" section, pointing out any limitations and
uncertainties in the analysis (e.g., if power was not large enough to
detect a practically significant difference; any uncertainty in whether
model assumptions were satisfied; if there was possible confounding; if
missing data created additional uncertainty, etc.).
- Be careful not to inflate or over-interpret conclusions, either in the
abstract or in the results or conclusions sections.
Notes:
1. This is assuming a Bonferroni
procedure. If another multiple
inference procedure is available, it might give an effective individual
significance level somewhat higher than 0.02/15.
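As one illustration of Note 1, the sketch below compares the Bonferroni level with the level given by the Šidák correction. The Šidák correction is a swapped-in example, not a procedure named on this page, and it is derived for independent tests; the fifteen comparisons in the example share groups and so are not independent, which means the figure is illustrative only.

```python
snooping_alpha = 0.02  # Type I error reserved for data snooping in the example
n_comparisons = 15     # pairs of dosage-by-sex combinations

# Bonferroni: divide the allowance evenly among the comparisons.
bonferroni_level = snooping_alpha / n_comparisons

# Sidak (assumes independent tests): solve 1 - (1 - level)**m = alpha.
sidak_level = 1 - (1 - snooping_alpha) ** (1 / n_comparisons)

print(bonferroni_level, sidak_level)
```

The Šidák level comes out slightly above 0.02/15, which is the kind of modest gain the note describes.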
This page last revised 11/3/2011