P-Value Controversies and Best Practices

The Replication Crisis and Its Causes

The replication crisis refers to the finding that a substantial proportion of published scientific findings cannot be replicated: when independent researchers repeat the original study, they obtain different results. The 2015 Reproducibility Project (Open Science Collaboration) attempted to replicate 100 psychology studies published in top journals. Only 36–39% were judged to have replicated, depending on the criterion (36% produced a statistically significant result), and the average effect size in the replications was about half that of the originals. The crisis has several statistical roots.

Publication bias: journals systematically publish statistically significant results (p < 0.05) and reject non-significant ones. The resulting literature overrepresents significant findings, false positives included, and hides null results.

P-hacking (questionable research practices, QRPs): researchers, consciously or unconsciously, test many hypotheses, keep collecting data until p < 0.05 is reached, or try multiple analyses and report only the significant one. Each of these exploits researcher degrees of freedom to produce significant results where no genuine effect exists.

Underpowered studies: many studies have samples too small to detect medium-sized effects reliably. When an underpowered study does produce a significant result, its effect size is likely to be inflated. This is the winner's curse: only sample estimates that happen to overshoot the true effect are large enough to reach significance, so the significant ones overstate it.

Low base rate of true effects: in exploratory research, the prior probability that any given hypothesis is true may be quite low. Even at p < 0.05, a significant result for a low-plausibility hypothesis has a high probability of being a false positive, so a literature built on such tests has a high false discovery rate.
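The interaction between base rate and power can be made concrete with a short calculation. The sketch below is illustrative only: the function name false_discovery_rate and the particular prior and power values are assumptions chosen for the example, not figures from the text. It applies Bayes' rule to the share of significant results that come from true null hypotheses, assuming every test uses the same significance threshold and the same power.

```python
def false_discovery_rate(prior: float, power: float, alpha: float = 0.05) -> float:
    """Expected share of significant results that are false positives.

    prior: assumed probability that a tested hypothesis is true
    power: assumed probability of detecting a true effect
    alpha: significance threshold (false positive rate under the null)
    """
    true_positives = power * prior            # true effects that reach significance
    false_positives = alpha * (1.0 - prior)   # null effects that reach significance
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical scenarios: a plausible, a long-shot, and a very unlikely hypothesis,
    # each tested with adequate (0.8) and poor (0.2) power.
    for prior in (0.5, 0.1, 0.01):
        for power in (0.8, 0.2):
            fdr = false_discovery_rate(prior, power)
            print(f"prior={prior:<5} power={power:<4} FDR={fdr:.2f}")
```

Under these assumptions, a prior of 0.1 with power 0.8 gives a false discovery rate of 0.36, and dropping power to 0.2 raises it to about 0.69, even though every result counted here passed p < 0.05.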
