Hypothesis Testing - an Overview

Based on the “Statistical Consulting Cheatsheet” by Prof. Kris Sankaran

Many problems in consulting can be treated as elementary testing problems. First, let’s review some of the philosophy of hypothesis testing. Testing provides a principled framework for filtering out implausible scientific claims. It is a mathematical formalization of Karl Popper’s philosophy of falsification. The underlying principle is simple: “Reject the null hypothesis if the data are not consistent with it, where the strength of the discrepancy is formally quantified through the notion of p-value.”

The Objective: Measuring the strength of discrepancy by computing a p-value

Consequently, one of the main goals of hypothesis testing is to compute a p-value. A p-value can be defined as “the probability of observing an event at least as extreme as what I am observing under the null”, where the null is the default, “chance” scenario.
Example: Suppose that I want to assess whether Soda A is better than Soda B. I could run a survey, ask people to give a score to each soda, and average the results for each. Suppose there is no difference between the two: the difference between the two averages is then a random variable centered at 0. Conversely, if there is a true underlying difference (let’s call it \(\delta\)), then the difference between my two averages, \(\Delta = \bar{X}_a -\bar{X}_b\), is also a random variable, but centered at \(\delta\). The entire point of hypothesis testing is to quantify how extreme this difference \(\Delta\) has to be to “reject the null”, i.e., to say that it is unlikely for \(\Delta\) to be this extreme if the null (“there is no difference in sodas”) is true.

This is the concept at the core of p-values: a p-value of 0.04 means that, if the null were true, only 4% of repetitions of the survey would show a difference at least this big just by chance. If I am willing to accept that 4% is too small (statisticians usually abide by the convention that anything below 5% is unlikely to happen by chance alone), I can reject the null.
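To make the soda example concrete, here is a minimal sketch of one way to compute such a p-value: a permutation test on the difference of means. This is an illustration under stated assumptions, not the method prescribed by the cheatsheet. The scores are made up, and the function name `permutation_p_value` and its parameters are hypothetical. Under the null, the labels “A” and “B” are interchangeable, so shuffling them shows how large \(\Delta\) tends to be by chance alone.

```python
import random


def permutation_p_value(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for the difference in mean scores.

    Hypothetical helper for illustration: it estimates how often a random
    relabeling of the scores gives a difference at least as extreme as the
    observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(scores_a)

    def mean(values):
        return sum(values) / len(values)

    observed = mean(scores_a) - mean(scores_b)
    pooled = list(scores_a) + list(scores_b)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under the null, labels are exchangeable
        delta = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(delta) >= abs(observed) - 1e-12:  # small tolerance for float ties
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_permutations + 1)


# Made-up survey scores on a 1-10 scale (assumption, not real data).
soda_a = [7, 8, 6, 9, 7, 8, 7, 9]
soda_b = [6, 5, 7, 6, 5, 7, 6, 6]

delta, p_value = permutation_p_value(soda_a, soda_b)
print(f"observed difference = {delta:.3f}, p-value = {p_value:.4f}")
```

A permutation test is used here only because it mirrors the definition of the p-value directly; a two-sample t-test would be the more common parametric choice for this kind of data.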

A small p-value, typically < 0.05, indicates strong evidence against the null hypothesis; in this case you can reject the null hypothesis. On the other hand, a large p-value, > 0.05, indicates weak evidence against the null hypothesis; in this case, you do NOT reject the null hypothesis. Failing to reject the null is not the same as showing that it is true. The value 0.05 is the threshold usually employed by the community; you can think of it as a scientific convention for determining significance.

Importantly, the p-value is the probability of observing events as extreme as my observations under the null, but not the probability that the hypothesis is correct!

\[p_{\text{value}} = \mathbb{P}[\text{observations} \; \mid \; \text{hypothesis } H_0 ] \ne \mathbb{P}[ \text{hypothesis } H_0 \; \mid \; \text{observations} ]\]
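A small numerical sketch can show how far apart these two probabilities can be. It applies Bayes’ rule to the event “the test rejects at level \(\alpha\)”. All numbers are assumptions chosen for illustration: a 90% prior probability that the null is true, \(\alpha = 0.05\), and 80% power. The function name `prob_null_given_rejection` is hypothetical.

```python
def prob_null_given_rejection(prior_null, alpha, power):
    """P(H0 is true | the test rejected H0), via Bayes' rule.

    prior_null: assumed prior probability that H0 is true
    alpha:      P(reject | H0 true), the significance level
    power:      P(reject | H0 false)
    """
    # Total probability of a rejection, over both scenarios.
    p_reject = prior_null * alpha + (1 - prior_null) * power
    return prior_null * alpha / p_reject


# Illustrative assumptions, not values from any real study.
posterior = prob_null_given_rejection(prior_null=0.9, alpha=0.05, power=0.8)
print(f"P(reject | H0) = 0.05, but P(H0 | reject) = {posterior:.2f}")
```

Under these assumed numbers, the null is still true in roughly a third of the cases where the test rejects, even though the rejection rate under the null is only 5%.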


P-values should NOT be used as a “ranking”/“scoring” system for your hypotheses.

The Recipe

Of course, to determine what this p-value is, there are three essential steps:

While testing is fundamental to much of science, and to a lot of our work as consultants, there are some limitations we should always keep in mind:

The Ingredients

To find the right hypothesis test, we need to select the right “ingredients”. That requires answering at least four questions:

Prospective Measurements: Finally, if you haven’t taken your measurements yet and want to assess how many samples you would need to answer your question, see our page on power analysis.
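As a rough illustration of what such a calculation looks like, the sketch below uses the standard normal approximation for comparing two group means with a two-sided test: \(n = 2\,(z_{1-\alpha/2} + z_{\text{power}})^2 \sigma^2 / \delta^2\) per group. It assumes equal group sizes and a known common standard deviation \(\sigma\). The function name `samples_per_group` and the input values are hypothetical, and a proper power analysis may need a different formula for your design.

```python
import math
from statistics import NormalDist


def samples_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate sample size per group to detect a mean difference `delta`.

    Normal approximation for a two-sided, two-sample comparison of means,
    assuming equal group sizes and a known common standard deviation `sigma`.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    n = 2 * ((z_alpha + z_power) * sigma / delta) ** 2
    return math.ceil(n)  # round up to a whole number of participants


# Illustrative inputs: detect a half-point difference when scores have sd 1.
n_needed = samples_per_group(delta=0.5, sigma=1.0)
print(f"about {n_needed} participants per group")
```

Because the required sample size scales with \(1/\delta^2\), halving the difference you want to detect roughly quadruples the number of participants needed.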