In the big picture, the goal of science is to build up our knowledge of how the world works. This can take the form of being clear about what we know and what we don’t know, and it can also be seen when scientists try to quantify their degree of certainty (or uncertainty) about a particular result. In formal, peer-reviewed science this role is often performed by statistical significance testing: an attempt to quantitatively assess whether or not a particular result helps us to build our knowledge of the world.
The idea of a rational, dispassionate way to judge scientific results is alluring, but there are many ways in which this ideal does not perfectly translate into practice. Scientists increasingly have doubts about the extent to which traditional statistical significance testing can help us to continue building knowledge. And there is a broader concern that results in the scientific literature cannot be reliably reproduced, calling the whole endeavour into question.
This is obviously a serious issue and there has been a lot of ink spilled over the topic. My intention here is not to rehash what has been said better elsewhere, but instead to provide a list of references that I have found invaluable and that have shaped my own outlook on statistical significance testing and scientific progress. I’ve broken these references into two broad categories: reproducibility in science, and problems with statistical significance testing (in particular, with p-values).
Reproducibility in science
In 2005, John Ioannidis wrote that
the probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field.
Running simulations under this framework, and using what we know about research and peer review, led him to predict that “it is more likely for a research claim to be false than true” (Ioannidis 2005).
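To get a feel for where that prediction comes from, here is a minimal sketch (with illustrative numbers of my own, not the paper’s) of the positive predictive value formula at the heart of the argument, in its simplest form with bias and multiple testing teams set aside:

```python
# Positive predictive value (PPV) of a "significant" finding, following the
# simple (bias-free, single-team) case in Ioannidis (2005):
#   PPV = (1 - beta) * R / (R - beta * R + alpha)
# where alpha is the significance threshold, (1 - beta) is the study power,
# and R is the pre-study odds that a probed relationship is true.

def ppv(alpha: float, power: float, R: float) -> float:
    """Probability that a statistically significant claim is actually true."""
    beta = 1.0 - power
    return (power * R) / (R - beta * R + alpha)

# Illustrative numbers (mine, not the paper's): a field probing mostly
# long-shot hypotheses (1 true relationship per 10 tested) with 50% power.
print(ppv(alpha=0.05, power=0.5, R=0.1))  # 0.5 -- even odds of being true
```

Even before accounting for bias, a field that mostly tests long-shot hypotheses with modest power should expect roughly half of its “significant” findings to be false.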
A decade later, things had not improved. Denes Szucs and John Ioannidis (Szucs and Ioannidis 2017) teamed up to look at effect sizes and study power in cognitive neuroscience and psychology. They found that
p value errors positively correlated with journal impact factors
and, again, that
False report probability is likely to exceed 50% for the whole literature.
Drawing attention to the human side of the scientific endeavour, Regina Nuzzo argued that the gambler’s fallacy, asymmetrical attention, and our love of stories all feed into the difficulty of producing reproducible research. To combat these factors, she suggests focusing on transparency, using techniques such as blind data analysis, and building a team of rivals. (Nuzzo 2015)
Perhaps all is not lost, however. Amrhein, Trafimow, and Greenland proposed that inferential statistics should be seen as descriptions of the underlying research rather than evaluations of significance. Citing Ronald Fisher,
no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon.
They suggest, then, that the floodgates should be opened for reporting results where our questions made sense before we collected our data. And this should be paired with emphatically closing the door on trying to draw conclusions from the results of a single study. (Amrhein, Trafimow, and Greenland 2019)
In a similar vein, Chris Drummond argued that reproducibility should not be confused with replicability and that the former, by necessarily requiring changes between investigations, is a much stronger method for building a literature of reliable scientific results within our limitations for exploring the universe. (Drummond 2009)
Problems with statistical significance (and p-values in particular)
Regina Nuzzo concisely explained why p-values are at the root of so many troubles in her instant classic on statistical errors. Far from the first author to question the usefulness of p-values, she highlights that when
statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look.
Looking at the bigger picture, she underscores the difference between the statistical significance of a finding and its practical relevance, encapsulated by Geoff Cumming’s call for us to ask ‘How much of an effect is there?’ rather than ‘Is there an effect?’ (Nuzzo 2014)
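To make that distinction concrete, here is a small synthetic sketch (entirely my own example, not from the article): with a large enough sample, a negligible effect can clear the significance bar while the estimate itself tells a very different story.

```python
# Synthetic data, my own example: a tiny true effect measured with a large
# sample. The p-value answers "is there an effect?"; the estimate and its
# confidence interval answer "how much of an effect is there?"
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=5000)
treated = rng.normal(loc=0.05, scale=1.0, size=5000)  # true effect of 0.05 sd

_, p = stats.ttest_ind(treated, control)
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)

print(f"p = {p:.3f}")  # likely "significant" at this sample size
print(f"effect = {diff:.3f}, 95% CI = ({diff - 1.96 * se:.3f}, {diff + 1.96 * se:.3f})")
```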
One result of the difficulty with using p-values as a proxy for publication suitability? In an extensive 2019 analysis, John Ioannidis showed that the p-values reported in papers’ abstracts are more significant than those reported in the full text, and that papers accepted by more competitive journals in the basic sciences showed more evidence of “cherry picking” reported results. (Ioannidis 2019)
Around the same time, Amrhein, Greenland, and McShane argued that a core problem with statistical significance testing is that
bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different.
Instead of bright lines of demarcation between “significant” and “not significant”, it is important to remember that “[v]alues just outside the interval [of statistical significance] do not differ substantively from those just inside the interval.” (Amrhein, Greenland, and McShane 2019)
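A toy illustration of that point (the numbers are mine, not from the comment): two studies with essentially identical estimates can land on opposite sides of the p = 0.05 bright line purely because their standard errors differ slightly.

```python
# Numbers are my own invention: two near-identical results, two "verdicts".
from scipy import stats

def summarise(estimate: float, se: float) -> str:
    p = 2 * stats.norm.sf(abs(estimate / se))          # two-sided p-value
    lo, hi = estimate - 1.96 * se, estimate + 1.96 * se
    verdict = "significant" if p < 0.05 else "non-significant"
    return f"estimate = {estimate:.1f}, 95% CI = ({lo:.1f}, {hi:.1f}), p = {p:.3f} -> {verdict}"

print(summarise(8.0, 4.2))  # p ~ 0.057 -> "non-significant"
print(summarise(8.0, 3.9))  # p ~ 0.040 -> "significant"
```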
Troubles with p-values and their use in establishing the “significance” of scientific results led one journal to ban null hypothesis significance testing from its pages outright (Trafimow and Marks 2015) and led the American Statistical Association to put out a statement on the context, process, and purpose of p-values (Wasserstein and Lazar 2016), with highlights including:
We teach [null hypothesis significance testing with p < 0.05 indicating significance] because it’s what we do; we do it because it’s what we teach
P-values can indicate how incompatible the data are with a specified statistical model. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold
These difficulties have also led individual authors such as Farrel Buchinsky and Neil Chadha (Buchinsky and Chadha 2017) to argue specifically for Bayesian statistics:
Instead of working backward by calculating the probability of our data if the null hypothesis were true, Bayesian statistics allow us instead to work forward, calculating the probability of our hypothesis given the available data.
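To see what “working forward” looks like in practice, here is a minimal conjugate-prior sketch (my own toy example, not taken from Buchinsky and Chadha): given the observed data, it returns the probability of a hypothesis about the underlying rate.

```python
# A toy "work forward" calculation: observe 14 successes in 20 trials and ask
# for the probability that the underlying success rate exceeds 0.5. With a
# flat Beta(1, 1) prior, conjugacy gives a Beta(1 + successes, 1 + failures)
# posterior.
from scipy import stats

successes, failures = 14, 6
posterior = stats.beta(1 + successes, 1 + failures)

# P(rate > 0.5 | data) -- the probability of the hypothesis given the data.
print(posterior.sf(0.5))  # ~0.96
```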
Reading List
- J. P. A. Ioannidis, “Why Most Published Research Findings Are False,” PLoS Med., vol. 2, no. 8, p. e124, Aug. 2005. DOI
- D. Szucs and J. P. A. Ioannidis, “Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature,” PLoS Biol., vol. 15, no. 3, 2017. DOI
- R. Nuzzo, “Fooling ourselves,” Nature, vol. 526, no. 7572, pp. 182–185, 2015. DOI
- V. Amrhein, D. Trafimow, and S. Greenland, “Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication,” Am. Stat., vol. 73, no. sup1, pp. 262–270, 2019. DOI
- C. Drummond, “Replicability is not Reproducibility: Nor is it Good Science,” in Proc. Evaluation Methods for Machine Learning Workshop, 26th ICML, Montreal, Canada, 2009. LINK
- R. Nuzzo, “Statistical errors: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume,” Nature, vol. 506, no. 7487, pp. 150–152, 2014. DOI
- J. P. A. Ioannidis, “What Have We (Not) Learnt from Millions of Scientific Papers with P Values?,” Am. Stat., vol. 73, no. sup1, pp. 20–25, 2019. DOI
- V. Amrhein, S. Greenland, and B. McShane, “Scientists rise up against statistical significance,” Nature, vol. 567, no. 7748, pp. 305–307, Mar. 2019. DOI
- D. Trafimow and M. Marks, “Editorial,” Basic Appl. Soc. Psych., vol. 37, no. 1, pp. 1–2, 2015. DOI
- R. L. Wasserstein and N. A. Lazar, “The ASA’s statement on p-values: Context, process, and purpose,” Am. Stat., vol. 70, no. 2, pp. 129–133, 2016. DOI
- F. J. Buchinsky and N. K. Chadha, “To P or Not to P: Backing Bayesian Statistics,” Otolaryngol. Head Neck Surg., vol. 157, no. 6, pp. 915–918, 2017. DOI