COMMON MISTEAKS MISTAKES IN USING STATISTICS: Spotting and Avoiding Them
Misunderstandings Involving Conditional Probabilities
The basic idea of conditional probabilities
A conditional probability is a probability with some condition imposed.
In practice, many probabilities we encounter are
conditional probabilities, although that is not always made explicit.
For example, the phrase "the probability of dying of a heart attack in
the next five years" is, in practice, typically ambiguous. Is the
writer talking about the world-wide probability, over both sexes and
all ages? Or is there an implicit assumption that the probability in
question just refers to adults living now in one particular country?
Being clear on the condition is important; for example, we would expect
the probability of dying of a heart attack in the next five years to be
much lower for men under 25 years of age than for men over 65 years of
age.
Misunderstandings arising from ignoring the condition
Many research studies involving people study a fairly restricted group.
Thus they result in conditional probabilities with a fairly restricted
condition. Unfortunately, all too often this restriction is not
emphasized enough. For example, a study of a cholesterol-lowering
medication might be restricted to men between the ages of 45 and 65 who
have previously had a heart attack. If a physician decides, on the
basis of this study, to prescribe the medication to a woman who is 70
years old and has no previous record of heart attacks, the physician is
extrapolating; the applicability of the study to this quite different
group of people is questionable.
Terminology and notation for conditional probabilities
A conditional probability is often expressed using the word "given" to
describe the condition. For example, the phrase "the probability of
dying of a heart attack in the next five years for men under 25 years
of age" would be expressed as "the probability of dying of a heart
attack in the next five years given that the person is a man under 25."
The notation P( ) is often used to express a probability of something.
To express a conditional probability, we use a vertical bar to stand
for "given". For example,
P(dying of a heart attack in the next five years | male under 25 years of age)
stands for "the probability of dying of a heart attack in the next five
years for men under 25 years of age," and is read "the probability of
dying of a heart attack in the next five years given male under 25
years of age" (which, admittedly, is not very good English).
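For readers who want the notation tied to a formula, the standard textbook definition can be written as below. The symbols A (the event) and B (the condition) are introduced here only for illustration and do not appear elsewhere on this page; the definition assumes that B has nonzero probability.

```latex
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}, \qquad P(B) > 0
```

In the running example, A would be "dying of a heart attack in the next five years" and B would be "male under 25 years of age."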
Confusion of reverse conditional probabilities
One common misunderstanding is confusing a conditional probability with
the reverse conditional probability, that is, with the conditional
probability that reverses the roles of the event (e.g., "dying of a
heart attack in the next five years") and the condition (e.g., "male
under 25 years of age"). For example, "the probability of dying of a
heart attack in the next five years for men under 25 years of age" is
talking about something quite different from "the probability of being
a male under 25 years of age if one dies of a heart attack." Sometimes
this is called confusion of the inverse.
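A small numerical sketch may help show how different the two probabilities can be. The counts below are invented purely for illustration (they are not real mortality data), and the helper name cond_prob is hypothetical.

```python
# Hypothetical joint counts for a population of 1,000,000 people.
# These numbers are made up for illustration; they are NOT real data.
counts = {
    ("male_under_25", "died"): 2,
    ("male_under_25", "survived"): 99_998,
    ("other", "died"): 1_998,
    ("other", "survived"): 898_002,
}


def cond_prob(table, event, given):
    """Return P(event | given) from a table of joint counts.

    `event` and `given` are predicates applied to each table key.
    """
    given_total = sum(n for key, n in table.items() if given(key))
    both = sum(n for key, n in table.items() if given(key) and event(key))
    return both / given_total


def is_young_male(key):
    return key[0] == "male_under_25"


def died(key):
    return key[1] == "died"


# P(dies of a heart attack | male under 25)
p_death_given_young = cond_prob(counts, died, is_young_male)
# P(male under 25 | dies of a heart attack): the reverse conditional
p_young_given_death = cond_prob(counts, is_young_male, died)

print(f"P(died | male under 25) = {p_death_given_young:.5f}")
print(f"P(male under 25 | died) = {p_young_given_death:.5f}")
```

With these made-up counts the two values differ by a factor of fifty, even though both are computed from the same table; only the group used as the denominator changes.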
One situation where this type of confusion is very common is in
connection with diagnostic tests for medical conditions.
Diagnostic tests typically have two outcomes, labeled "positive" and
"negative." For an ideal test, the outcome is "positive" exactly
when the patient has the disease being tested for, and "negative"
exactly when the patient does not have the disease.
Unfortunately, diagnostic tests are almost never perfect. Thus we talk
about their sensitivity and their specificity:
Sensitivity = the probability that a person tests positive if the disease is present = P(tests positive | has the disease)
Specificity = the probability that a person tests negative if the disease is absent = P(tests negative | disease absent)
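As a rough sketch, both quantities can be estimated from a table of test results. The counts and variable names below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical results for 1,000 people (invented for illustration).
true_positives = 90    # have the disease, test positive
false_negatives = 10   # have the disease, test negative
true_negatives = 855   # disease absent, test negative
false_positives = 45   # disease absent, test positive


def sensitivity(tp, fn):
    """P(tests positive | has the disease)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """P(tests negative | disease absent)."""
    return tn / (tn + fp)


sens = sensitivity(true_positives, false_negatives)
spec = specificity(true_negatives, false_positives)
print(f"sensitivity = {sens:.2f}")
print(f"specificity = {spec:.2f}")
```

Note that each quantity is computed within one group only: sensitivity uses only the people who have the disease, and specificity uses only the people who do not.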
Many people (physicians as well as patients) confuse the sensitivity
P(tests positive | has the disease) with the reverse conditional
probability P(has the disease | tests positive). This reverse
conditional probability is called the positive predictive value, also
denoted PPV:
PPV = Positive predictive value = the probability that someone has the disease if they test positive = P(has the disease | tests positive).
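The difference between the two can be seen by counting people rather than manipulating probabilities. The sketch below uses a hypothetical population of 10,000 in which 100 people have the disease; all counts are invented for illustration.

```python
# Hypothetical population of 10,000 people (numbers invented for illustration).
# 100 people have the disease; 9,900 do not.
true_positives = 99      # diseased, test positive
false_negatives = 1      # diseased, test negative
false_positives = 495    # not diseased, test positive
true_negatives = 9_405   # not diseased, test negative

# Sensitivity conditions on having the disease:
# P(tests positive | has the disease)
sensitivity = true_positives / (true_positives + false_negatives)

# PPV conditions on testing positive:
# P(has the disease | tests positive)
ppv = true_positives / (true_positives + false_positives)

print(f"sensitivity = {sensitivity:.3f}")
print(f"PPV         = {ppv:.3f}")
```

Here the sensitivity is 99/100, but the PPV is only 99/594, because most of the people who test positive come from the much larger group without the disease.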
In fact, the sensitivity and PPV can be very different. In particular,
the sensitivity might be very high while the PPV is low. The PPV
depends on the sensitivity, but also on the specificity and the
prevalence rate of the disease:
Prevalence rate = the proportion of the population having the disease.
Note that the prevalence rate refers to a specific reference category,
in this case a certain population. The prevalence rate will vary
according to the population. For example, in most countries, the
prevalence rate of HIV infection is greater for the population of
intravenous drug users than for the population at large.
The way in which the PPV depends on the sensitivity,
specificity, and prevalence rate is sufficiently involved to be
counterintuitive for most people. In particular, a test can have what
seem like high sensitivity and high specificity, yet have low PPV. For
more about this relationship, see the Notes.
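One way to explore the relationship is through Bayes' theorem, which gives PPV = (sensitivity x prevalence) / (sensitivity x prevalence + (1 - specificity) x (1 - prevalence)). The sketch below evaluates that formula for a hypothetical test with 99% sensitivity and 95% specificity at several assumed prevalence rates; the function name and all input values are illustrative assumptions, not figures from any study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return P(has the disease | tests positive) via Bayes' theorem.

    All three arguments are proportions between 0 and 1. The function
    assumes at least some positive results are possible (nonzero
    denominator).
    """
    # Proportion of the population that is diseased AND tests positive.
    true_pos = sensitivity * prevalence
    # Proportion of the population that is disease-free AND tests positive.
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical test characteristics (assumed for illustration only).
SENSITIVITY = 0.99
SPECIFICITY = 0.95

for prevalence in (0.001, 0.01, 0.1, 0.5):
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence = {prevalence:5.3f}  ->  PPV = {ppv:.3f}")
```

With these assumed values, the PPV at a prevalence of 1% is about 17%: only about one in six people who test positive actually has the disease, even though the test looks very accurate. The PPV rises as the prevalence rises, which is why the same test can perform quite differently in different populations.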
Notes:
The following references give more information on the relationship
between PPV, sensitivity, specificity, and prevalence rate:
"Positive predictive value," Wikipedia,
http://en.wikipedia.org/wiki/Positive_predictive_value,
accessed November 8, 2009.
Gives the formula relating PPV, sensitivity, specificity, and
prevalence rate, plus some examples and links to related discussions.
"Accuracy of Diagnostic Tests," RDTinfo,
http://www.rapid-diagnostics.org/accuracy.htm,
accessed November 10, 2009.
Gives the basic definitions and formulas, plus further references.
Gigerenzer, Gerd et al. (2007). "Helping doctors and patients make
sense of health statistics," Psychological Science in the Public
Interest, Vol. 8, No. 2, pp. 53-96.
Download from http://www.psychologicalscience.org/journals/index.cfm?journal=pspi&content=pspi/8_2
Discusses misunderstandings involving the positive predictive value as
well as other confusions that affect medical care. Also discusses ways
to explain the topics that can help improve understanding. A somewhat
shortened version has appeared as "Knowing your chances: What health
stats really mean," Scientific American Mind, April/May/June 2009,
pp. 44-51.
Nugent, William (2004). "The role of prevalence rates, sensitivity, and
specificity in assessment accuracy: Rolling the dice in social work
process," Journal of Social Service Research, 31 (2), 51-75.
Discusses questions of accuracy of diagnostic testing in the context of
mental disorders, focusing examples on major depressive disorder.
Includes discussion of when it is better to try to improve sensitivity
and when it is better to improve specificity. Although the explanations
of the math are sometimes not the best, the article is very worthwhile
in many ways. I have used it as the basis of a couple of assignments in
a course I have taught for students in a master's program for secondary
math teachers. (First assignment, second assignment)
Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for
Classification and Prediction, Oxford.
For readers interested in even more than the Nugent and Swets et al.
articles provide.
Swets, John A., Robyn Dawes, and John Monahan (2000). "Psychological
Science Can Improve Diagnostic Decisions," Psychological Science in the
Public Interest, 1(1).
Download from http://www.psychologicalscience.org/journals/index.cfm?journal=pspi&content=pspi/1_1
A fairly comprehensive discussion of the question of improving
diagnostic accuracy. Discusses several other applications (e.g.,
predicting violence, weather forecasting, aircraft cockpit warnings) as
well as medical diagnostics.