Summary
Let’s say you’re going to run a randomized controlled trial for Burnz, a fat burner. You’ll want to assess changes in fat mass, but while you’re at it, you might also decide to measure other outcomes, such as changes in muscle mass and training volume. After all, if you can also market Burnz as a performance enhancer and muscle builder, all the better!
You get the results and find a statistically significant reduction in fat mass, but no change in muscle mass or training volume. While the marketing department may be a bit disappointed, facts are facts. At least you now have evidence that Burnz promotes fat loss, right?
Maybe not. By measuring three different outcomes (fat mass, muscle mass, training volume), you tested three different null hypotheses, thereby increasing the risk of a false positive.
Your significance threshold is directly related to the risk of a false positive. Say you set that threshold at 0.05, meaning you call statistically significant any effect whose p-value falls at or under 0.05 (p ≤ 0.05). Then, whenever there's no real effect (whenever the null hypothesis is true), you still have a 5% chance (0.05 ✕ 100) of getting a statistically significant result, i.e., a false positive, and a 95% chance (0.95 ✕ 100) of correctly finding nothing. Put another way, if Burnz actually does nothing, each test you run still has a 5% chance of handing you a false positive.
But if you test three different hypotheses, your chance of avoiding a false positive is 95% for each test, so only 86% for the trial as a whole (0.95 ✕ 0.95 ✕ 0.95 ≈ 0.86). The probability of getting at least one false positive has increased from 5% to 14%.
Therefore, the more hypotheses a study tests, the higher the risk of false positives. When a trial tests 10 hypotheses (which isn’t unusual), its chance of getting at least one false positive shoots to 40%. When it tests 100 hypotheses, it’s almost guaranteed to get at least one false positive.
The more hypotheses a study tests (the more outcomes it measures), the higher the risk of a statistically significant result being a false positive.
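If you'd like to check these numbers yourself, here's a minimal sketch (in Python, purely for illustration) of the calculation behind them. It assumes every null hypothesis is true (the supplement does nothing) and that the tests are independent.

```python
# Chance of at least one false positive (the family-wise error rate)
# when every null hypothesis is true and the tests are independent.
alpha = 0.05  # per-test significance threshold

for n_tests in (3, 10, 100):
    p_at_least_one = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests} tests -> {p_at_least_one:.0%} chance of at least one false positive")

# 3 tests -> 14% chance of at least one false positive
# 10 tests -> 40% chance of at least one false positive
# 100 tests -> 99% chance of at least one false positive
```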
If you find the math confusing, think of a coin toss. It’s pretty unlikely that a fair coin will land heads 5 times in a row if you just flip it 5 times. But if you flip it 500 times, it becomes pretty likely that, at some point in the sequence, you’ll get 5 heads in a row. Not correcting for multiple comparisons is like calling the coin unfair (i.e., saying a result isn’t due to chance) because you found a 5-head run in a whole lot of flips.
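If you'd rather see the coin-toss version in action, here's a rough simulation sketch; the flip counts and the number of simulated sequences are just illustrative choices.

```python
import random

def has_five_heads_in_a_row(n_flips, run_length=5):
    """Flip a fair coin n_flips times; return True if a run of run_length heads shows up."""
    streak = 0
    for _ in range(n_flips):
        if random.random() < 0.5:  # heads
            streak += 1
            if streak >= run_length:
                return True
        else:
            streak = 0
    return False

n_trials = 100_000
for n_flips in (5, 500):
    hits = sum(has_five_heads_in_a_row(n_flips) for _ in range(n_trials))
    print(f"{n_flips} flips: about {hits / n_trials:.0%} of sequences contain a 5-head run")

# Typically: 5 flips contain a 5-head run about 3% of the time (0.5 ** 5 ≈ 3%),
# while 500 flips contain one more than 99% of the time.
```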
Of the many ways to correct for multiple comparisons, the simplest is the Bonferroni correction: just divide your threshold by the number of hypotheses you tested and make the quotient your new threshold. So if you did three tests, as in our Burnz example above, your threshold would decrease from 0.05 to [0.05 ÷ 3 ≈] 0.017. If your p-value for the fat loss (the reduction in fat mass you found when analyzing the data from your trial) was, let's say, 0.02, then after correction this fat loss is no longer statistically significant (0.02 > 0.017).
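As a concrete sketch, here's that correction applied to the Burnz example; the fat-loss p-value of 0.02 comes from the scenario above, while the other two p-values are made up for illustration.

```python
alpha = 0.05
# Hypothetical p-values for the three outcomes; only the 0.02 comes from the example above.
p_values = {"fat mass": 0.02, "muscle mass": 0.40, "training volume": 0.75}

corrected_threshold = alpha / len(p_values)  # Bonferroni: 0.05 / 3 ≈ 0.017

for outcome, p in p_values.items():
    verdict = "significant" if p <= corrected_threshold else "not significant"
    print(f"{outcome}: p = {p} -> {verdict} at the corrected threshold of {corrected_threshold:.3f}")

# fat mass: p = 0.02 -> not significant at the corrected threshold of 0.017
```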
While the Bonferroni method is indeed simple, it's also harsh: it can too easily generate false negatives (that is, it can make you think there was no effect when there actually was one). More accurate methods for correcting for multiple comparisons exist (such as the Holm-Bonferroni method, or the Benjamini-Hochberg procedure, which controls the false discovery rate), but they're beyond the scope of this entry.
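For readers who want to peek further anyway, here's a brief sketch of how such corrections are often applied in practice with the statsmodels library, reusing the same hypothetical p-values as above (with these particular numbers, no outcome survives any of the corrections):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values: fat mass, muscle mass, training volume.
p_values = [0.02, 0.40, 0.75]

for method in ("bonferroni", "holm", "fdr_bh"):  # fdr_bh = Benjamini-Hochberg
    reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, "->", ["significant" if r else "not significant" for r in reject])
```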
The main takeaway of this entry is that the risk of false positives increases as more tests are performed. If the authors of a study don’t correct for the number of tests they perform (i.e., if they don’t correct for multiple comparisons), you can do a quick correction in your head or just take the statistically significant results they find with a grain of salt. 🧂
If a study doesn’t correct for multiple comparisons, you may want to take the statistically significant results it reports less seriously.