Summary
The p-value (explained here) can be interpreted as a measure of how surprising you’d find the data collected in a study under the assumption that the intervention/variable being studied made no difference to the outcome. But how surprising is too surprising to justify holding on to that null hypothesis?
Researchers can quantify what would make for a big enough surprise by setting a significance level for the p-value. If the p-value falls below this significance level, then the results are called statistically significant. The researchers can then reject the null hypothesis and state that their intervention (or experimental variable) actually made a difference to the outcome.
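To make this concrete, here’s a minimal sketch in Python of that decision rule (the data are simulated and all numbers are made up; a real study would use its own data and analysis plan): compute a p-value, compare it to the significance level chosen ahead of time, and reject or keep the null hypothesis accordingly.

```python
# A minimal sketch of the decision rule described above, using simulated (made-up) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical weight change (kg) in a supplement group and a placebo group.
supplement = rng.normal(loc=-1.0, scale=2.0, size=50)  # assumed modest real effect
placebo = rng.normal(loc=0.0, scale=2.0, size=50)      # no effect

alpha = 0.05  # significance level, chosen before looking at the data
t_stat, p_value = stats.ttest_ind(supplement, placebo)

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: statistically significant; reject the null hypothesis")
else:
    print(f"p = {p_value:.3f} >= {alpha}: not statistically significant; keep the null hypothesis")
```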
The significance level is a threshold for p-values: below it, results are called statistically significant, the null hypothesis is rejected, and the claim can be made that the experimental intervention (or variable) actually made a difference to the outcome.
It’s important to emphasize that a statistically significant result just means that the null hypothesis doesn’t account for the data well. It doesn’t mean that the result is of practical significance! For example, if a supplement called Better Weight helps people lose 0.0001 kg, a large enough experiment will reveal a statistically significant difference. But who cares? Losing that much weight wouldn’t make any practical difference. A statistically significant difference just says that there may actually be a difference; it can’t tell you whether that difference is important.
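To see the gap between statistical and practical significance, here’s a hedged simulation sketch (the “Better Weight” numbers are hypothetical, and the true difference is set to a trivially small value) showing how an enormous trial can turn a meaningless difference into a very small p-value.

```python
# A sketch showing statistical significance without practical significance.
# The true difference is set to a trivially small 0.005 kg (hypothetical numbers).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2_000_000  # an enormous (hypothetical) trial

better_weight = rng.normal(loc=-0.005, scale=1.0, size=n)  # "Better Weight" group
placebo = rng.normal(loc=0.0, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(better_weight, placebo)
print(f"observed difference: {better_weight.mean() - placebo.mean():.4f} kg")
print(f"p-value: {p_value:.2g}")  # tiny p-value, yet a practically meaningless effect
```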
Another reason to take statistical significance with a grain of salt is that statistically significant results are only as good as the methods that generated them — and those methods may not be solid. For example, there may be study design issues, or the authors may have used statistical methods that don’t apply to their study design. Whatever the reason, statistically significant results aren’t trustworthy if the study and data analysis aren’t. As the saying goes: garbage in, garbage out.
Statistical significance does not imply practical significance. Also, statistically significant results are only as trustworthy as the methods that generated them: study design and choice of an applicable statistical method matter. The devil’s often in the details, so just because something’s statistically significant doesn’t mean it’s based on good science.
Most research articles set a significance level of 0.05, meaning they treat p-values below this level as statistically significant. This number has a special meaning: over many experiments in which the null hypothesis is actually true, at most 5% will turn up statistically significant results (false positives: the null hypothesis gets rejected even though there’s no actual difference). While you can’t tell which specific results are false positives, keeping the false positive risk low by setting a low significance level can help you be wrong less often. However, this only holds if the assumptions of the analyses are met (they may not be!) and multiple comparisons are taken into account (they often aren’t!). But if all goes well, a 5% probability of rejecting the null hypothesis when you shouldn’t have isn’t a bad error rate.
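Here’s a quick simulation of what that error rate means in practice (a sketch with assumed group sizes and a two-sample t-test, not taken from any particular paper): when the intervention truly does nothing in every experiment, roughly 5% of experiments still cross the 0.05 threshold.

```python
# Simulate many experiments in which the null hypothesis is true:
# both groups come from the same distribution, so any "significant" result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    group_a = rng.normal(loc=0.0, scale=1.0, size=30)  # intervention does nothing
    group_b = rng.normal(loc=0.0, scale=1.0, size=30)  # placebo
    result = stats.ttest_ind(group_a, group_b)
    if result.pvalue < alpha:
        false_positives += 1

print(f"false positive rate: {false_positives / n_experiments:.3f}")  # close to 0.05
```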
The specific cutoff of 5% is a bit arbitrary, though. A higher significance level lets through more false positives, but maybe that’s a risk you’re willing to take if the cost of missing out on a possible effect outweighs the cost of being wrong. On the other hand, a lower significance level can make you more confident that an effect actually exists, so you don’t waste time and money, or risk side effects, on an intervention that doesn’t do anything. Significance is subjective at the end of the day! If you don’t want to miss out on an effect that would be really important to you, a higher significance level may be warranted; when reading papers, you can decide to presume that results with p-values under, say, 0.10 reflect real effects if you’re willing to take that risk. This may be a good idea if the intervention seems low-risk, isn’t expensive for you, or promises an effect you strongly want. On the other hand, if you don’t want to be fooled as often, a lower level is what you’re looking for; you can decide to presume that only results with p-values under, say, 0.02 are reliable. This is a wise way to go if there are possible side effects, if the supplement is expensive or inconvenient, or if the purported effect wouldn’t benefit you much.
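To illustrate the trade-off, here’s one more sketch (the sample sizes and the “modest real effect” are assumptions for the simulation, not numbers from any study): a looser cutoff like 0.10 catches more real effects but lets in more false positives, while a stricter cutoff like 0.02 does the reverse.

```python
# Compare cutoffs: how often does each one flag a "no effect" study (false positive)
# versus a study with an assumed modest real effect (correct detection)?
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_trials, n = 5_000, 30
null_pvals, real_pvals = [], []

for _ in range(n_trials):
    placebo = rng.normal(0.0, 1.0, n)
    no_effect = rng.normal(0.0, 1.0, n)     # intervention truly does nothing
    real_effect = rng.normal(0.5, 1.0, n)   # assumed modest real effect
    null_pvals.append(stats.ttest_ind(no_effect, placebo).pvalue)
    real_pvals.append(stats.ttest_ind(real_effect, placebo).pvalue)

null_pvals, real_pvals = np.array(null_pvals), np.array(real_pvals)
for alpha in (0.10, 0.05, 0.02):
    print(f"alpha = {alpha:.2f}: "
          f"false positives ~ {np.mean(null_pvals < alpha):.2f}, "
          f"real effects detected ~ {np.mean(real_pvals < alpha):.2f}")
```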
The typical 0.05 cutoff for statistical significance isn’t sacred — sometimes higher or lower cutoffs are warranted. When reading a paper, feel free to presume that effects with slightly higher p-values are reliable, as long as you’re willing to take the risk that they aren’t real. If you’re more averse to risk, feel free to tighten your standards and only presume that results with p-values even lower than the standard cutoff reflect real effects.