COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them

# Data Snooping

Data snooping refers to statistical inference that the researcher decides to perform after looking at the data (as contrasted with pre-planned inference, which the researcher plans before looking at the data).

Data snooping can be done professionally and ethically, or misleadingly and unethically, or misleadingly out of ignorance. Data snooping misleadingly out of ignorance is a common error in using statistics. The problems with data snooping are essentially the problems of multiple inference.

One way in which researchers unintentionally obtain misleading results by data snooping is in failing to account for all of the data snooping they engage in. In particular, in accounting for Type I error when data snooping, you need to count not just the actual hypothesis tests performed, but also all comparisons looked at when deciding which post hoc (i.e., not pre-planned) hypothesis tests to try.
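A small simulation can make this concrete. The sketch below is an illustration under stated assumptions, not part of the original discussion: it assumes six groups drawn from the same normal population (so every "significant" difference is a Type I error), a z-test with known variance, and a researcher who looks at all pairs and tests only the most extreme one. The function name and default parameters are hypothetical.

```python
import itertools
import random
from statistics import NormalDist, fmean


def snooping_error_rates(n_groups=6, n_per_group=20, alpha=0.02,
                         n_sims=2000, seed=0):
    """Estimate Type I error rates when only the most extreme pair is tested.

    Every group comes from the same N(0, 1) population, so any rejection
    is a false positive.
    """
    rng = random.Random(seed)
    n_pairs = n_groups * (n_groups - 1) // 2
    # Two-sided critical values for a z-test with known variance 1.
    z_naive = NormalDist().inv_cdf(1 - alpha / 2)
    z_bonf = NormalDist().inv_cdf(1 - alpha / (2 * n_pairs))
    se_diff = (2 / n_per_group) ** 0.5  # std. error of a difference in means

    naive_hits = 0
    bonf_hits = 0
    for _ in range(n_sims):
        means = [fmean([rng.gauss(0, 1) for _ in range(n_per_group)])
                 for _ in range(n_groups)]
        # "Snooping": inspect every pair, keep the one that looks most different.
        max_z = max(abs(a - b) / se_diff
                    for a, b in itertools.combinations(means, 2))
        naive_hits += max_z > z_naive   # counts the test as 1 comparison
        bonf_hits += max_z > z_bonf     # counts all pairs looked at
    return naive_hits / n_sims, bonf_hits / n_sims


naive_rate, bonferroni_rate = snooping_error_rates()
print(f"unadjusted: {naive_rate:.3f}, adjusted for all pairs: {bonferroni_rate:.3f}")
```

Under these assumptions the unadjusted rate comes out well above the nominal 0.02, while counting every comparison that was looked at keeps the rate near or below it.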

Example: A group of researchers plans to compare three dosages of a drug in a clinical trial. There is no pre-planned intent to compare effects broken down by sex, but the sex of the subjects is recorded. The researchers have decided to have an overall Type I error rate of 0.05, allowing 0.03 for the pre-planned inferences and 0.02 for any data snooping they might decide to do. The pre-planned comparison shows no statistically significant difference among the three dosages when the data are not broken down by sex. However, since the researchers have recorded the sex of the patients, they decide to look at the outcomes broken down by combination of sex and dosage. They notice that the results for women in the high-dosage group look much better than the results for men in the low-dosage group, and perform a hypothesis test to check that out. In accounting for Type I error, the researchers need to take the number of data-snooping inferences performed as 15, not one. The reason is that they have looked at fifteen comparisons: there are 3×2 = 6 dosage-by-sex combinations, and hence (6×5)/2 = 15 pairs of dosage-by-sex combinations. Thus the significance level for the post hoc test should not be 0.02, but 0.02/15 ≈ 0.0013 (see Note 1).
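The counting in this example can be checked with a few lines of Python. This is only a sketch of the arithmetic; the label "medium" for the middle dosage is an assumption made for illustration, since the example names only the high and low dosages.

```python
from itertools import combinations

dosages = ["low", "medium", "high"]   # "medium" is a hypothetical label
sexes = ["female", "male"]

# Every dosage-by-sex combination the researchers looked at.
groups = [(dosage, sex) for dosage in dosages for sex in sexes]

# Every pair of combinations that could have caught their eye.
pairs = list(combinations(groups, 2))

alpha_snooping = 0.02                         # Type I error reserved for snooping
per_test_alpha = alpha_snooping / len(pairs)  # Bonferroni share per comparison

print(len(groups), "groups,", len(pairs), "pairs")
print("significance level for the post hoc test:", per_test_alpha)
```

The single test actually performed (high-dosage women versus low-dosage men) is just one of the pairs in `pairs`; the adjustment is driven by the length of that list, not by the number of tests run.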

For some discussions of multiple inference and data snooping with a humorous slant, see:

Suggestions for data snooping professionally and ethically

I. Educate yourself on the limitations of statistical inference: model assumptions, the problems of Type I and Type II errors, power, and multiple inference, including the "hidden comparisons" that may be involved in data snooping (as in the above example).

II. Plan your study to take into account the problems involving model assumptions, Type I and Type II errors, power, and multiple inference. Some specifics to consider:

a. If you will be gathering data, decide before gathering the data:
• What questions you are trying to answer.
• How you will gather the data, and the inference procedures you intend to use to help answer your questions. These need to be planned together, to maximize the chances that the data will fit the model assumptions of the inference procedures.
• Whether or not you will engage in data snooping.
• The Type I error rate (or the false discovery rate) and power that would be appropriate (considering the consequences of Type I and Type II errors in the situation you are studying). Be sure to allow some portion of Type I error for any data snooping you think you might do. Then do a power analysis to see what sample size is needed to meet these criteria.
• Take into account any relevant considerations such as intent-to-treat analysis or how you will deal with missing data.
• If the sample size needed is too large for your resources, you will need to either obtain additional resources or scale back the aims of your study.
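As a rough illustration of the power-analysis step, the sketch below estimates the sample size per group for one pre-planned, two-sided comparison of two group means. It relies on assumptions not stated above: a normal approximation with equal group sizes and a standardized effect size `effect_size` that you must choose yourself. The function name is hypothetical, and a real study should use a power calculation matched to its actual inference procedure.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha, power):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d)^2,
    where d is the standardized difference (difference / common SD).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


# Reserving part of the Type I error for data snooping lowers the alpha
# available for the planned test, which raises the required sample size.
print(sample_size_per_group(effect_size=0.5, alpha=0.05, power=0.80))
print(sample_size_per_group(effect_size=0.5, alpha=0.03, power=0.80))
```

Comparing the two calls shows the cost of setting aside Type I error for snooping: the planned test runs at a stricter level and therefore needs more subjects for the same power.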
b. If you plan to use existing data, you will need to go through a process similar to that in (a) before looking at the data:
• Decide what questions you are trying to answer.
• Find out how the data were gathered.
• Decide on inference procedures that i) will address your questions of interest and ii) have model assumptions compatible with how the data were collected. If this turns out to be impossible, the data are not suitable.
• Decide whether or not you will engage in data snooping.
• Decide the Type I error rate (or false discovery rate) and power that would be appropriate (considering the consequences of Type I and Type II errors in the situation you are studying). Remember to allow some portion of Type I error for data snooping, if you are likely to engage in any. Then do a power analysis to see what sample size is needed to meet these criteria.
• Take into account any relevant considerations such as intent-to-treat analysis or how you will deal with missing data.
• If the sample size needed is larger than the available data set, you will need to either scale back the aims of your study, or find or create another, larger data set.
c. If data snooping is intended to be the purpose or an important part of your study, then before you look at the data, divide it randomly into two parts: one to be used for discovery purposes (generating hypotheses), the other to be used for confirmatory purposes (testing hypotheses).
• Be careful to do the randomization in a manner that preserves the structure of the data. For example, if you have students nested in schools nested in school districts, you need to preserve the nesting: if a particular student is assigned to one group (discovery or confirmatory), then the student's school and school district need to be assigned to the same group.
• Using a Type I error rate or false discovery rate may not be obligatory in the discovery phase, but may be practical to help you keep the number of hypotheses you generate down to a level that you will be able to test (with a reasonable bound on Type I error rate or false discovery rate, and a reasonable power) in the confirmatory phase.
• A preliminary consideration of Type I errors and power should be done to help you make sure that your confirmatory data set is large enough. Be sure to then give further thought to consequences of Type I and II errors for the hypotheses you generate with the discovery data set, and set an overall Type I error rate (or false discovery rate) for the confirmatory stage.
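The discovery/confirmatory split described in (c) can be sketched as follows. This is a minimal illustration, assuming the data are a list of dictionaries and that the outermost level of nesting (here a hypothetical `district` field) identifies the clusters; randomizing whole clusters keeps every student, school, and district together in one part. The function name, field names, and toy records are all made up for the example.

```python
import random


def split_by_cluster(records, cluster_key, discovery_fraction=0.5, seed=0):
    """Randomly assign whole clusters to a discovery or a confirmatory set."""
    rng = random.Random(seed)
    clusters = sorted({rec[cluster_key] for rec in records})
    rng.shuffle(clusters)
    n_discovery = round(len(clusters) * discovery_fraction)
    discovery_clusters = set(clusters[:n_discovery])
    discovery = [r for r in records if r[cluster_key] in discovery_clusters]
    confirmatory = [r for r in records if r[cluster_key] not in discovery_clusters]
    return discovery, confirmatory


# Toy data: 40 students in 8 schools nested in 4 districts (hypothetical).
records = [
    {"student": i, "school": f"S{i % 8}", "district": f"D{i % 4}"}
    for i in range(40)
]

discovery, confirmatory = split_by_cluster(records, cluster_key="district")
print(len(discovery), "discovery records,", len(confirmatory), "confirmatory records")
```

Splitting at the outermost level is the simplest way to preserve nesting, but it needs enough top-level clusters for both parts to be usable; do the split once, before looking at any outcomes.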
III. Report your results carefully, aiming for honesty and transparency
• State clearly the questions you set out to study.
• State your methods and your reasons for choosing those methods (e.g., why you chose the inference procedures you used; why you chose the Type I error rate and power that you used).
• Give details of how your data were collected.
• State clearly what (if anything) was data snooping, and how you accounted for it in the overall Type I error rate or false discovery rate.
• Include a "limitations" section, pointing out any limitations and uncertainties in the analysis (e.g., if power was not large enough to detect a practically significant difference; any uncertainty in whether model assumptions were satisfied; if there was possible confounding; if missing data created additional uncertainty).
• Be careful not to inflate or over-interpret conclusions, either in the abstract or in the results or conclusions sections.

Notes:
1. This assumes a Bonferroni procedure. If another multiple inference procedure is available, it might give an effective individual significance level somewhat higher than 0.02/15.
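To illustrate Note 1 with one concrete alternative, the sketch below compares the Bonferroni level with the Šidák level for 15 comparisons. The Šidák correction is chosen here purely as an example and is not named in the text above; it is exact when the tests are independent, and whether that assumption is reasonable for the dosage-by-sex comparisons is not addressed here.

```python
n_comparisons = 15
alpha_snooping = 0.02

# Bonferroni: split the error rate evenly across the comparisons.
bonferroni_level = alpha_snooping / n_comparisons

# Sidak: solve 1 - (1 - level)^m = alpha for the per-test level
# (assumes independent tests).
sidak_level = 1 - (1 - alpha_snooping) ** (1 / n_comparisons)

print("Bonferroni:", bonferroni_level)
print("Sidak:     ", sidak_level)
print("Sidak is less strict:", sidak_level > bonferroni_level)
```

The gain is small for an error rate this low, which is consistent with the note's wording of "somewhat higher".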