Multa Verba: New "p < 0.005" Standard Considered Harmful

Several days ago, an article preprint was posted on PsyArXiv (to appear in Nature Human Behaviour), arguing that a significant step towards solving the reproducibility crisis in many fields will be to reduce the standard significance threshold to require p < 0.005 (rather than the current standard of 0.05) for a result to be termed "statistically significant". Results with higher p values could only be termed "suggestive" under this proposal.

I, along with a number of other scientists, was recently interviewed for a brief article in Nature | News which clearly and concisely sets out arguments for and against this proposal as a way to address the reproducibility crisis. I'd like to expand here on my very compressed comments as quoted in that article.

On the one hand, the proposal to reduce the "significance threshold" to 0.005 will indeed be helpful, in that the change will reduce the occurrence of false positives, and thus enhance the likelihood that effects termed "significant" will be reproducible, and so reduce the number of non-existent effects reported in the literature as real.

However, the proposal does nothing whatsoever to change the fundamental problem, which is the entire framework of judging experimental results by a binary criterion of "significance". There are two key problems with the regnant view of "statistical significance":

(i) It does not account for effect size - a large effect with less-significant p may have much more "real-life significance" than a tiny effect with more-significant p, and
(ii) It is very easy to (even inadvertently) "p-hack" and find putatively significant effects due to the very large number of possible parameter settings for any experiment in a sufficiently rich and theoretically-underdetermined domain.

Given that, this new recommendation may even do significant harm, in that improving research outcomes somewhat (by lowering false positive rates) without addressing the fundamental problem could lead to a false sense of security and entrench bad methodological attitudes.

To expand on this, the problem I have with the new recommendation is that it maintains the exact same conceptual framework of statistical significance testing that has been a major driver of the replicability crisis.

In statistical significance testing, a significance level (previously 0.05, per the new proposal 0.005) is fixed in advance, and then the question asked is: If the null hypothesis were true, would the probability of obtaining results at least as extreme as those actually observed (the p-value) be less than that level? If it is, the null hypothesis is rejected, and the result is considered statistically significant. Note that the p-value is not the probability that the null hypothesis is true, a common misreading.
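To make the procedure concrete, here is a minimal sketch of a two-sided z-test using only the Python standard library. The data (a sample mean of 0.25 from 100 observations with known standard deviation 1) and the function names are hypothetical, chosen purely for illustration; real analyses would usually estimate the standard deviation and use a t-test instead.

```python
import math

def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """Probability, assuming the null is true, of a result at least this extreme."""
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

def verdict(p_value, alpha):
    """The binary label that significance testing attaches to a result."""
    return "significant" if p_value < alpha else "not significant"

# Hypothetical data: mean 0.25 from 100 observations, known sigma = 1.
p = two_sided_p_value(sample_mean=0.25, null_mean=0.0, sigma=1.0, n=100)
print(f"p = {p:.4f}")
for alpha in (0.05, 0.005):
    print(f"threshold {alpha}: {verdict(p, alpha)}")
```

The same data earn a different label under each threshold, though nothing about the evidence itself has changed. Only the cutoff has moved.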

This framework does not account at all for the question of effect size - an effect may be statistically significant, but practically speaking negligible and irrelevant. For example, often the null hypothesis is of the form "x = 0" for some model parameter x. Any such null hypothesis is almost certainly false, since the actual value of x is probably at least slightly different from 0, and so some statistically significant result can nearly always be extracted with enough data and ingenuity, even though the result may not be significant in any meaningful way in the real world.
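A rough numerical illustration of this point, assuming a one-sample z-test on standardized data (standard deviation 1), so that the z statistic is simply the effect size times the square root of the sample size. The two scenarios are invented for illustration.

```python
import math

def p_value_for_effect(effect_size, n):
    """Two-sided p-value when the observed standardized effect equals effect_size."""
    z = effect_size * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

# A negligible effect measured on a huge sample (hypothetical numbers).
p_tiny = p_value_for_effect(effect_size=0.01, n=1_000_000)

# A substantial effect measured on a small sample (hypothetical numbers).
p_large = p_value_for_effect(effect_size=0.5, n=20)

print(f"tiny effect,  huge sample:  p = {p_tiny:.2e}")
print(f"large effect, small sample: p = {p_large:.4f}")
```

Under these assumptions the negligible effect clears the 0.005 threshold easily, while the effect fifty times larger would be demoted to merely "suggestive".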

More fundamentally, no matter what significance level you choose, if there are enough different ways to design your experiment, it becomes highly likely that at least one of them will give a statistically significant result just by chance. And unless every single detail of the experiment is decided and registered ahead of time, it is virtually impossible to ensure, except by care and good judgement by the researchers, that no choices are made along the way that could (unwittingly) enhance the significance of the final experimental results, since there are so many different choices. This phenomenon was termed the "garden of forking paths" by Gelman and Loken, and illustrated nicely by XKCD. (This problem is inherent in any metric-based decision criterion for identifying effects, and is not specific to statistical significance testing.)
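The arithmetic behind this can be sketched as follows, under the simplifying assumption that the alternative analyses are independent and that there is no real effect at all. Real forking paths are correlated, so these numbers are only indicative; the counts of 20 and 200 analyses are arbitrary.

```python
def chance_of_spurious_hit(alpha, num_analyses):
    """Probability that at least one of several independent null analyses has p < alpha."""
    return 1.0 - (1.0 - alpha) ** num_analyses

hit_old = chance_of_spurious_hit(alpha=0.05, num_analyses=20)
hit_new_same = chance_of_spurious_hit(alpha=0.005, num_analyses=20)
hit_new_more = chance_of_spurious_hit(alpha=0.005, num_analyses=200)

print(f"alpha=0.05,  20 analyses:  {hit_old:.3f}")
print(f"alpha=0.005, 20 analyses:  {hit_new_same:.3f}")
print(f"alpha=0.005, 200 analyses: {hit_new_more:.3f}")
```

In this toy model, a tenfold stricter threshold is offset by roughly ten times as many forks, which a sufficiently flexible experimental design can readily supply.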

Finally, the goal and the strength of the "p < 0.005" proposal is that it will reduce the number of false positive results in the literature. However, by the same token, there will be fewer reports of "less statistically significant" results, which will cause overall problems for science, due to the "file-drawer" effect. The problem is that the vast majority of negative results are never reported, which gives a biased picture of the state of knowledge. Lowering the p threshold will make the file-drawer bias even greater, since fewer results, and only those with lower p-values, will be published.
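A small simulation can illustrate the bias, under strong simplifying assumptions: every study measures the same true standardized effect (0.2), every study has 50 observations, and a study is "published" only if its p-value falls below the threshold. All of these numbers are hypothetical, and each study's estimate is drawn directly from its sampling distribution rather than from simulated raw data.

```python
import math
import random

def simulate_published_effects(true_effect, n, alpha, studies, seed=0):
    """Return the effect estimates of those simulated studies that reach p < alpha."""
    rng = random.Random(seed)
    standard_error = 1.0 / math.sqrt(n)  # standardized data, sigma = 1
    published = []
    for _ in range(studies):
        estimate = rng.gauss(true_effect, standard_error)
        z = estimate / standard_error
        p = math.erfc(abs(z) / math.sqrt(2))
        if p < alpha:
            published.append(estimate)
    return published

TRUE_EFFECT = 0.2
pub_05 = simulate_published_effects(TRUE_EFFECT, n=50, alpha=0.05, studies=20000)
pub_005 = simulate_published_effects(TRUE_EFFECT, n=50, alpha=0.005, studies=20000)
mean_05 = sum(pub_05) / len(pub_05)
mean_005 = sum(pub_005) / len(pub_005)

print(f"true effect: {TRUE_EFFECT}")
print(f"alpha=0.05:  {len(pub_05)} published, mean estimate {mean_05:.3f}")
print(f"alpha=0.005: {len(pub_005)} published, mean estimate {mean_005:.3f}")
```

In this toy setting the stricter threshold publishes fewer studies, and the ones that survive overstate the true effect by more, because only the luckiest estimates clear the bar.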

So how should we proceed?

There is no simple answer to this question, as there are several interrelated problems: p-hacking, effect sizes, and literature bias, as I've mentioned above, but also the question of how the incentive structures of modern science may bias the creation and evaluation of new knowledge (the file-drawer problem is an example of this). All of these factors contribute to the "reproducibility crisis", and all need to be addressed. Any simple answer to one of these, such as changing the threshold p-value, will not solve the problem, and may in fact make things worse by making it seem solved when it is not.

Any meaningful solution must involve a combination of new methodological standards, institutional structures, and incentives. Reducing the p threshold is simple and could be one useful methodological standard, though it would perpetuate problematic concepts, as I've said. Another would be introducing some notions from Bayesian statistics, such as (i) using Bayes factors, which are more easily interpreted in most situations than a p-value, as a standard reported metric, or (ii) the use of explicit priors, to make assumptions clearer, or (iii) explicit loss functions, to account explicitly for the importance of effect size and the varying weights of different kinds of errors.
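As a hedged illustration of option (i), consider a hypothetical experiment with 61 successes in 100 trials, testing the null hypothesis that the success probability is 0.5. The sketch below compares the exact two-sided p-value with a Bayes factor against an alternative in which the success probability has a uniform prior on [0, 1]. The uniform prior is an assumption made here for simplicity, and the Bayes factor would change under a different prior, which is exactly the kind of assumption that explicit priors bring into the open.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """Probability of exactly k successes in n trials with success probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def two_sided_p_value(k, n):
    """Exact p-value under theta = 0.5: outcomes at least as far from n/2 as k."""
    distance = abs(k - n / 2)
    return sum(binomial_pmf(j, n, 0.5) for j in range(n + 1)
               if abs(j - n / 2) >= distance)

def bayes_factor_10(k, n):
    """Evidence for H1 (theta uniform on [0, 1]) relative to H0 (theta = 0.5)."""
    # Integrating the binomial likelihood over a uniform prior gives 1 / (n + 1).
    marginal_h1 = 1.0 / (n + 1)
    marginal_h0 = binomial_pmf(k, n, 0.5)
    return marginal_h1 / marginal_h0

# Hypothetical data: 61 successes in 100 trials.
p = two_sided_p_value(61, 100)
bf10 = bayes_factor_10(61, 100)
print(f"two-sided p = {p:.4f}")
print(f"Bayes factor (H1 vs H0) = {bf10:.2f}")
```

For these made-up data the p-value falls below 0.05, yet the Bayes factor is between 1 and 2, meaning the data favor the alternative only slightly. A Bayes factor reads directly as "how many times better one hypothesis predicts the data than the other", which is what makes it easier to interpret.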

It is essential that the scientific community have good incentives for publishing replications, especially those that fail, and for publishing negative results. (And there cannot be penalties for early-stage researchers for doing so!) There need to be more and better institutional structures for preregistering studies, like clinicaltrials.gov, but set up with the appropriate community standards and consequences, so that they actually work.

Just as during the early and laborious development of effective scientific practices and institutions several hundred years ago, there is no royal road to solving the crisis. A great deal of hard work lies ahead of us, but past experience has shown that it will (eventually) pay off in more solid understanding of our world.

Wednesday, July 26, 2017