
Frequentist Hypothesis Tests and p-values

As discussed on the page Overview of Frequentist Hypothesis Tests, most commonly-used frequentist hypothesis tests involve the following elements:

   1. Model assumptions 
   2. Null and alternative hypotheses
   3. A test statistic. This needs to have the property that extreme values of the test statistic cast doubt on the null hypothesis.
   4. A mathematical theorem saying, "If the model assumptions and the null hypothesis are both true, then the sampling distribution of the test statistic has this particular form."

The exact details of these four elements will depend on the particular hypothesis test; see the page linked above for elaboration in the case of the large sample z-test for a single mean.

The discussion on that page does not go into much detail on the concept of sampling distribution. This page will elaborate on that. First, read the page
Overview of Frequentist Confidence Intervals, which illustrates the idea of a sampling distribution for the sample mean. Then read the following adaptation of that idea to the situation of a one-sided t-test for a population mean. This will illustrate the general concepts of p-value and hypothesis testing, as well as the sampling distribution for a hypothesis test.
[Figure: values of the sampling distribution of the t-statistic, with a red bar at about t = 0.5 and a green bar at about t = 2.5]

If the t-statistic lies at the red bar (around 0.5) in the picture, nothing is surprising; our data are consistent with the null hypothesis. But if the t-statistic lies at the green bar (around 2.5), then the data would be fairly unusual -- assuming the null hypothesis is true. So a t-statistic at the green bar would cast some reasonable doubt on the null hypothesis. A t-statistic even further to the right would cast even more doubt on the null hypothesis.1
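The picture above can be recreated by simulation. The sketch below (with illustrative values: hypothesized mean mu0 = 10, sample size n = 25, population standard deviation 2, none of which come from the page) repeatedly draws samples from a population in which the null hypothesis is true and records the resulting t-statistics, showing that values near the red bar are common while values at or beyond the green bar are rare:

```python
# Illustrative sketch: simulating the sampling distribution of the
# one-sample t-statistic when the null hypothesis is true.
import numpy as np

rng = np.random.default_rng(0)
mu0 = 10.0        # hypothesized population mean (made-up value)
n = 25            # sample size (made-up value)
reps = 100_000    # number of simulated samples

# Each row is one sample drawn from a population where H0 really holds.
samples = rng.normal(loc=mu0, scale=2.0, size=(reps, n))
t_stats = (samples.mean(axis=1) - mu0) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# A t-statistic at or beyond 0.5 (the "red bar") is common; one at or
# beyond 2.5 (the "green bar") is rare when H0 is true.
frac_beyond_red = (t_stats >= 0.5).mean()    # roughly 0.3
frac_beyond_green = (t_stats >= 2.5).mean()  # roughly 0.01
```

Plotting a histogram of `t_stats` reproduces the bell-shaped sampling distribution in the figure.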


We can quantify the idea of how unusual a test statistic is by the p-value. The general definition is:

p-value = the probability of obtaining a test statistic at least as extreme as the one from the data at hand, assuming the model assumptions and the null hypothesis are both true.

Recall that we are only considering samples of the same size as ours, drawn from the same random variable, that fit the model assumptions.2

So the definition of p-value, if we spell everything out, reads

p-value = the probability of obtaining a test statistic at least as extreme as the one from the data at hand, assuming the model assumptions are all true, and the null hypothesis is true, and the random variable is the same (including the same population), and the sample size is the same.
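For a one-sided t-test, this probability can be computed from the t sampling distribution. The sketch below uses a made-up sample and a made-up hypothesized mean (neither appears in the original page) to illustrate the calculation for Ha: µ > µ0:

```python
# Illustrative sketch: one-sided p-value for H0: mu = mu0 vs Ha: mu > mu0.
# The data and mu0 are made up for demonstration.
import numpy as np
from scipy import stats

data = np.array([10.9, 11.4, 10.2, 12.1, 11.7, 10.8, 11.3, 12.0])
mu0 = 10.0

n = data.size
t_stat = (data.mean() - mu0) / (data.std(ddof=1) / np.sqrt(n))

# p-value = P(T >= t_stat) assuming H0 is true, where T follows a
# t distribution with n - 1 degrees of freedom.
p_value = stats.t.sf(t_stat, df=n - 1)
```

Here `stats.t.sf` is the survival function (one minus the cumulative distribution function), i.e. the probability of a test statistic at least as large as the one observed.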

The interpretation of "at least as extreme as" depends on the alternative hypothesis:

  1. If the alternative is Ha: µ > µ0, "at least as extreme" means at least as large as the test statistic from the data at hand.
  2. If the alternative is Ha: µ < µ0, it means at least as small.
  3. If the alternative is two-sided (Ha: µ ≠ µ0), it means at least as large in absolute value.

Comment: The preceding discussion can be summarized as follows: if we obtain an unusually small p-value, then (at least) one of the following must be true:

  1. At least one of the model assumptions is not satisfied.
  2. The null hypothesis is false.
  3. The null hypothesis is true, but the sample we obtained happens to be one of the rare ones yielding such an extreme test statistic.

Note that, for samples of the same size,2 the smaller the p-value, the stronger the evidence against the null hypothesis, since a smaller p-value indicates a more extreme test statistic. If the p-value is small enough (and assuming all the model assumptions are met), rejecting the null hypothesis in favor of the alternate hypothesis can be considered a rational decision.


  1. How small is "small enough"? That is a judgment call.
  2. "Rejecting the null hypothesis" does not mean the null hypothesis is false or that the alternate hypothesis is true.

These comments are discussed further on the page Type I and II Errors.
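How "at least as extreme" translates into a p-value for each form of the alternative hypothesis can be shown concretely. The sketch below uses the green-bar value t = 2.5 and an assumed sample size of n = 25 (so 24 degrees of freedom); both numbers are illustrative:

```python
# Illustrative sketch: one p-value per form of the alternative hypothesis,
# for an observed t-statistic of 2.5 with 24 degrees of freedom (assumed values).
from scipy import stats

t_stat = 2.5
df = 24

p_greater = stats.t.sf(t_stat, df)              # Ha: mu > mu0 -> at least as large
p_less = stats.t.cdf(t_stat, df)                # Ha: mu < mu0 -> at least as small
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)   # Ha: mu != mu0 -> as large in |t|
```

Note that the two-sided p-value is exactly twice the one-sided "greater" p-value for the symmetric t distribution, and that `p_greater` and `p_less` sum to one.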

1. A little algebra will show that if t = (ȳ - µ0)/(s/√n) is unusually large, then so is ȳ, and vice-versa.

2. Comparing p-values for samples of different size is a common mistake. In fact, larger sample sizes are more likely to detect a difference, so are likely to result in smaller p-values than smaller sample sizes, even though the context being examined is exactly the same.
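This footnote can be demonstrated by simulation. In the sketch below (all numbers made up), the population mean genuinely differs from the hypothesized mean by the same amount in both cases, yet the larger sample produces a far smaller p-value:

```python
# Illustrative sketch: the same underlying difference from mu0 yields a much
# smaller p-value at a larger sample size, so p-values from samples of
# different sizes are not comparable. All values are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0 = 10.0
true_mean, true_sd = 10.5, 2.0   # the population mean really differs from mu0

p_values = {}
for n in (20, 2000):
    sample = rng.normal(true_mean, true_sd, size=n)
    t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
    p_values[n] = stats.t.sf(t_stat, df=n - 1)   # one-sided p-value, Ha: mu > mu0
```

The context being examined is identical in both cases; only the sample size, and hence the power to detect the difference, has changed.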

Last updated Jan 20, 2013