COMMON MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
Suggestions for Researchers
The most common error
in statistics is to assume that
statistical procedures can take the place of sustained effort.
Good and Hardin (2006), Common Errors in Statistics, p. 186
The hard part, and the one where training
is so poor, is the a priori thinking about the science of the matter
before data analysis -- even before data collection. It has been too
easy to collect data on a large number of variables in the hope that a
fast computer and sophisticated software will sort out the important
things -- the "significant" ones (the "just the numbers" approach).
Instead, a major effort should be mounted to understand the nature of
the problem by critical examination of the literature, talking with
others working on the general problem, and thinking deeply about
alternative hypotheses. Rather than "test" dozens of trivial matters
... there must be a more concerted effort to provide evidence on
meaningful questions that are important to a discipline. This is the
critical point: the common failure to address important science
questions in a fully competent fashion.
Burnham and Anderson (2002), Model Selection and Multimodel Inference, pp. 144-145
Throughout: Look
for, take into account, and report sources of uncertainty.
Specific
suggestions for planning research:
- Decide what questions you
will be studying.
- Trying to study too many
things at once is likely to create problems with multiple testing, so
it may be wise to limit your study.
- If you will be gathering data, think about how you will gather and analyze it before you begin collecting.
- Read reports on related
research, focusing on problems that were encountered and how you might
get around them and/or how you might plan your research to fill in gaps
in current knowledge in the area.
- If you are planning an
experiment, look for possible sources of variability and design
your experiment to take these into account as much as possible.
- If you are gathering
observational data, think about possible confounding factors and plan
your data gathering to reduce confounding.
- Be sure to record any temporal and spatial variables present, or any other variables that might influence the outcome, whether or not you initially plan to use them in your analysis.
- Also think about any
factors that might make the sample biased.
- You may need to limit
your study to a smaller population than originally intended.
- Think carefully about
what measures you will use.
- If your data gathering
involves asking questions, put careful thought into choosing and phrasing them. Then
check them out with a test-run and revise as needed.
- Think carefully about how you will randomize (for an experiment) or sample (for an observational study); see the randomization sketch after this list.
- Think carefully about
whether or not the model assumptions
of your intended method of analysis are likely to be reasonable.
- If not, revise either
your plan for data gathering or your plan for analysis, or both.
- Conduct a pilot study to troubleshoot and to obtain variance estimates for a power analysis.
- Decide on appropriate
levels of Type I and Type II error, taking
into account consequences of each type of error.
- Plan how to deal with multiple inferences, including "data snooping" questions that might arise later.
- Do a power analysis to estimate what sample size you need to detect meaningful differences; a sketch appears after this list.
- If you plan to use
existing data, modify the suggestions above, as in the suggestions
under Item II(b) of Data Snooping. See
Burnham and Anderson (2002) Model
Selection and Multimodel Inference for advice on model
selection. If your interest is causal inference, see Rubin, D. B. (2008), For objective causal inference, design trumps analysis, Annals of Applied Statistics 2(3), 808-840.
- For additional
suggestions, see Chapter 8 of van Belle (2008), Statistical Rules of Thumb.
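Two of the planning items above lend themselves to short illustrations. First, a minimal sketch in Python of reproducible randomization for a two-arm experiment; the subject IDs and the seed are hypothetical, and recording the seed makes the assignment reproducible.

    # A minimal sketch of randomizing 20 hypothetical subjects to two arms.
    import numpy as np

    rng = np.random.default_rng(42)  # record the seed for reproducibility
    subjects = [f"s{i:02d}" for i in range(20)]
    shuffled = rng.permutation(subjects)
    arm_a, arm_b = shuffled[:10], shuffled[10:]
    print("Arm A:", list(arm_a))
    print("Arm B:", list(arm_b))

Second, a power-analysis sketch, assuming a two-group comparison. The pilot standard deviation and the smallest meaningful difference are illustrative numbers only; the chosen alpha and power encode the Type I and Type II error levels decided on above.

    # A minimal power-analysis sketch; all numbers are illustrative.
    from statsmodels.stats.power import TTestIndPower

    pilot_sd = 8.0           # standard deviation estimated from a pilot study
    meaningful_diff = 4.0    # smallest difference of scientific interest
    effect_size = meaningful_diff / pilot_sd  # Cohen's d

    n_per_group = TTestIndPower().solve_power(
        effect_size=effect_size,
        alpha=0.05,    # chosen Type I error rate
        power=0.80,    # 1 minus chosen Type II error rate
        alternative="two-sided",
    )
    print(f"Approximate sample size per group: {n_per_group:.0f}")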
Specific
suggestions for analyzing data:
- Before doing
any formal analysis, ask whether or not the model assumptions of the procedure
are plausible in the context of the data.
- Plot the data (or residuals, as appropriate) whenever possible to get additional checks on whether or not model assumptions hold.
- If model assumptions appear to be violated, consider transformations of the data, or use alternative methods of analysis as appropriate.
- If more than one statistical inference is used, be sure to take that into account by using appropriate methodology for multiple inference; see the sketch after this list.
- If you use
hypothesis tests, be sure to calculate corresponding confidence
intervals as well.
- But be aware
that there may also be other sources of uncertainty not captured by
confidence intervals.
- Keep careful records of decisions made in data cleaning and in using software. [1]
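To make the multiple-inference and confidence-interval items above concrete, here is a minimal sketch in Python, assuming five two-group comparisons on simulated data; the Holm adjustment is one reasonable choice among several, and all names and numbers are illustrative.

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    group_a = rng.normal(10.0, 2.0, size=(5, 30))  # 5 outcomes, 30 subjects each
    group_b = rng.normal(10.5, 2.0, size=(5, 30))

    # One test per outcome means multiple inference, so adjust the p-values.
    raw_p = [stats.ttest_ind(a, b).pvalue for a, b in zip(group_a, group_b)]
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

    # Report a confidence interval alongside each test, not just a p-value.
    for i, (a, b) in enumerate(zip(group_a, group_b)):
        diff = a.mean() - b.mean()
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        dof = len(a) + len(b) - 2  # simple equal-variance approximation
        lo, hi = stats.t.interval(0.95, dof, loc=diff, scale=se)
        print(f"Outcome {i}: adjusted p = {adj_p[i]:.3f}, "
              f"95% CI for difference = ({lo:.2f}, {hi:.2f})")

Which adjustment (Bonferroni, Holm, or a false-discovery-rate method) is appropriate depends on the inference goals, and the choice should itself be reported.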
Specific
suggestions for writing up research:
Critics
may complain that we advocate interpreting reports not merely with a
grain of
salt but with an entire shaker; so be it. ... Neither society nor we
can afford
to be led down false pathways.
Good and Hardin (2006), Common Errors in Statistics, p.
119
Until a happier future
arrives, imperfections in models
require further thought, and routine disclosure of imperfections would
be
helpful.
David
Freedman (2008, p. 61)
- Aim for
transparency and reproducibility.
- Include enough detail so
the reader can critique both the data gathering and the analysis.
- Look for and report possible sources of bias [2] or other sources of additional uncertainty in results.
- For more detailed suggestions on recognizing and reporting bias, see Chapter 1 and pp. 113-115 of Good and Hardin (2006). All of Chapter 7 of that book is a good supplement to the suggestions here.
- Consider including a
"limitations" section, but be sure to reiterate or summarize the
limitations in stating conclusions -- including in the abstract.
- Include enough detail so
that another researcher could replicate both the data gathering and the
analysis.
- For example, "SAS Proc
Mixed was used" is not adequate detail. You
also need to explain which factors were fixed, which random, which
nested, etc. Refer to the notes you have made when performing the
analysis.
- If space limitations do not permit all the needed detail to be included in the paper itself, provide it on a website accompanying the article.
- Some journals now include
websites for supplementary information; publish in these when
possible.
- When citing sources, give
explicit page numbers, especially for books.
- Include discussion of why the analyses used are appropriate.
- If you do hypothesis
testing, be sure to report p-values (rather than just phrases such as
"significant at the .05 level") and also give confidence
intervals.
- In some situations, other measures such as "number needed to treat" would be appropriate. See pp. 151-153 of van Belle (2008) for more discussion.
- Be careful to use
language (both in the abstract and in the body of the article) that
expresses any uncertainty and limitations.
- If you have built a model, be sure to explain the decisions that went into the selection of that model.
- See Good and Hardin (2006, pp. 181-182) for more suggestions.
- For more suggestions and
details, see
- Chapters 8 and 9 of van
Belle (2008)
- Chapters 7 and 9 of Good
and Hardin (2006)
- Harris et al. (2009)
- Miller (2004)
- Robbins (2004)
- Strasak et al. (2007)
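As an illustration of the level of detail the "SAS Proc Mixed" item above asks for, here is a hedged sketch using Python's statsmodels rather than SAS, with hypothetical variable names and simulated data. The point is that the write-up states explicitly which factor is fixed and which is random, not merely which procedure was run.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    batches = [f"b{i}" for i in range(6)]
    df = pd.DataFrame({
        "batch": np.repeat(batches, 10),   # grouping factor (random intercept)
        "treatment": np.tile([0, 1], 30),  # fixed-effect factor
    })
    batch_effects = dict(zip(batches, rng.normal(0.0, 1.0, 6)))
    df["response"] = (2.0 + 0.5 * df["treatment"]
                      + df["batch"].map(batch_effects)
                      + rng.normal(0.0, 1.0, len(df)))

    # "treatment" is a fixed effect; "batch" enters as a random intercept.
    # Reporting exactly this specification, not just the software used,
    # gives a reader enough to replicate the analysis.
    model = smf.mixedlm("response ~ treatment", data=df, groups=df["batch"])
    print(model.fit().summary())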
References:
Burnham, K. P. and D. R. Anderson (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed., Springer
Freedman, D. (2008), Editorial: Oasis or Mirage?, Chance 21(1), pp. 59-61
Good, P. and J. Hardin (2006), Common Errors in Statistics (and How to Avoid Them), Wiley
Harris, A. H. S., R. Reeder and J. K. Hyun (2009), Common statistical and research design problems in manuscripts submitted to high-impact psychiatry journals: What editors and reviewers want authors to know, Journal of Psychiatric Research 43(15), 1231-1234
Miller, J. (2004), The Chicago Guide to Writing about Numbers: The Effective Presentation of Quantitative Information, University of Chicago Press
Robbins, N. (2004), Creating More Effective Graphs, Wiley
Strasak, A. M. et al. (2007), The Use of Statistics in Medical Research, The American Statistician 61(1), 47-55
van Belle, G. (2008), Statistical Rules of Thumb, Wiley
Notes:
[1] For more discussion, see:
[2] Nobel Laureate in Physics Richard Feynman offers good advice:
"The only way to have real success
in science ... is to describe the
evidence very carefully without regard to the way you feel it should
be. If you have a theory, you must try to explain what's good and
what's bad about it equally. In science, you learn a kind of standard
integrity and honesty."
What Do You Care What Other People Think? (1988), p. 217
"There is one feature I notice that
is generally missing in "cargo cult
science"... It's a kind of scientific integrity, a principle of
scientific thought that corresponds to a kind of utter honesty —
a kind of leaning over backwards... For example, if you're doing an
experiment, you should report everything that you think might make it
invalid — not only what you think is right about it... Details
that could throw doubt on your interpretation must be given, if you
know them. ... If you make a theory, for example, and advertise it, or
put it out, then you must also put down all the facts that disagree
with it, as well as those that agree with it. ... In summary, the idea
is to try to give
all of the
information to help others to judge the value of your contribution; not
just the information that leads to judgment in one particular direction
or another. ... The first principle is that you must not fool yourself
-- and you are the easiest person to fool. So you have to be very
careful about that."
"Cargo Cult Science", adapted from a commencement address given at Caltech (1974)
Last updated September 8, 2012