Friday, July 28, 2017

Perverse Incentives from the New "p < 0.005" Proposal

Here is a short addendum to yesterday's post, commenting on why I think the recently mooted proposal to require p < 0.005 for "statistical significance" is misguided and likely to be problematic.

There is another problem with that (admittedly well-intentioned) proposal: widespread implementation would create a perverse incentive that could easily degrade the quality of the scientific literature by more than it improves it.

As the authors acknowledge, lowering the alpha needed to declare "significance" would require larger sample sizes, and so data collection would entail greater difficulty and expense. They estimate that samples would need to be about 70% larger - that may well hold for studies of a single variable, though research involving interaction effects would likely need an even greater increase in the amount of data. Regardless, the manpower and expense would increase notably. (The authors see this as a potential benefit, in that "considerable resources would be saved by not performing future studies based on false premises.")
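To get a rough sense of where that figure comes from, here is a minimal power calculation comparing the per-group sample sizes needed for 80% power at the two thresholds. This is a sketch, not the authors' own calculation; the effect size d = 0.5 is an arbitrary illustrative choice.

```python
# Rough per-group sample sizes for 80% power at alpha = 0.05 vs 0.005,
# using a two-sided two-sample z approximation. The effect size d = 0.5
# is an arbitrary illustrative choice, not a figure from the paper.
from statistics import NormalDist

def n_per_group(alpha, power, d):
    """Approximate per-group n for a two-sided two-sample z test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_power) / d) ** 2

d, power = 0.5, 0.80
n_05 = n_per_group(0.05, power, d)    # about 63 per group
n_005 = n_per_group(0.005, power, d)  # about 107 per group
print(f"n per group at 0.05:  {n_05:.0f}")
print(f"n per group at 0.005: {n_005:.0f}")
print(f"increase: {100 * (n_005 / n_05 - 1):.0f}%")  # roughly 70%
```

In this simple approximation the ratio of the two sample sizes does not depend on the assumed effect size (d cancels out), which is why the roughly 70% figure applies across simple single-effect designs.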

However, this extra work and expense creates a perverse incentive for researchers. The more you pay (in time, effort, money, etc.), the more you want to recoup your costs by producing publishable research. The harder such results are to produce, the harder you will look for them, after you have sunk the costs of getting the data.

Now, this is not a bad thing if there is no way to cheat unwittingly. (Let's set aside outright fraud entirely.) But as long as it remains possible - as it is today - to unknowingly p-hack, traipse through the garden of forking paths, HARK (hypothesize after the results are known), and so on, raising the cost of gathering a dataset will inevitably lead to a rise in p-hacked, forking-pathed, and HARKed results.

The very change that was meant to improve the quality and integrity of the research literature will rather act to degrade it.

It seems to me that this underscores the fundamental need to address how incentives interact with methodologies and standards. Otherwise we are, at best, spitting in the wind, and at worst, using gasoline to put out the fire.


Wednesday, July 26, 2017

New "p < 0.005" Standard Considered Harmful

Several days ago, an article preprint was posted on PsyArXiv (to appear in Nature Human Behaviour), arguing that a significant step towards solving the reproducibility crisis in many fields would be to lower the standard significance threshold from 0.05 to 0.005 - that is, to require p < 0.005 before a result may be termed "statistically significant". Results with higher p-values could only be termed "suggestive" under this proposal.

I was recently interviewed, along with a number of other scientists, for a brief article in Nature | News that clearly and concisely sets out the arguments for and against this proposal as a way to address the reproducibility crisis. I'd like to expand here on my very compressed comments as quoted in that article.

On the one hand, the proposal to reduce the "significance threshold" to 0.005 will indeed be helpful: it will reduce the occurrence of false positives, make effects termed "significant" more likely to be reproducible, and so reduce the number of non-existent effects reported in the literature as real.

However, the proposal does nothing whatsoever to address the fundamental problem, which is the entire framework of judging experimental results by a binary criterion of "significance". There are two key problems with the regnant view of "statistical significance":
  1. It does not account for effect size - a large effect with less-significant p may have much more "real-life significance" than a tiny effect with more-significant p, and
  2. It is very easy to (even inadvertently) "p-hack" and find putatively significant effects due to the very large number of possible parameter settings for any experiment in a sufficiently rich and theoretically-underdetermined domain. 
Given that, this new recommendation may even do significant harm, in that improving research outcomes somewhat (by lowering false positive rates) without addressing the fundamental problem could lead to a false sense of security and entrench bad methodological attitudes. 

To expand on this, the problem I have with the new recommendation is that it maintains the exact same conceptual framework of statistical significance testing that has been a major driver of the replicability crisis. 

In statistical significance testing, a significance level (previously 0.05, under the new proposal 0.005) is fixed, and the question asked is: assuming the null hypothesis is true, is the probability of obtaining results at least as extreme as those observed less than that level? If it is, the null hypothesis is rejected, and the result is declared statistically significant.
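As a toy illustration of that decision rule - a sketch with simulated data, not a model of any real study - here is what the framework looks like in code:

```python
# Minimal illustration of the significance-testing decision rule at the
# old and new thresholds. The data are simulated noise around a true
# mean of 0.3 - arbitrary numbers chosen purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=80)

# Null hypothesis: the population mean is 0.
result = stats.ttest_1samp(sample, popmean=0.0)

for alpha in (0.05, 0.005):
    verdict = "reject the null" if result.pvalue < alpha else "fail to reject"
    print(f"p = {result.pvalue:.4f}, alpha = {alpha}: {verdict}")
```

Note that the only thing the new proposal changes is the constant in the final comparison; everything upstream of it, including all the analysis choices that produced the p-value, is untouched.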

This framework does not account at all for the question of effect size - an effect may be statistically significant, but practically speaking negligible and irrelevant. For example, often the null hypothesis is of the form "x = 0" for some model parameter x. Any such null hypothesis is almost certainly false, since the actual value of x is probably at least slightly different from 0, and so some statistically significant result can nearly always be extracted with enough data and ingenuity, even though the result may not be significant in any meaningful way in the real world.
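A quick simulation makes the point concrete. The true effect of 0.01 standard deviations and the sample of one million observations are arbitrary, deliberately extreme choices:

```python
# A negligible true effect becomes "highly significant" given enough data.
# The effect (0.01 standard deviations) and sample size (one million) are
# deliberately extreme, arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.01, scale=1.0, size=1_000_000)

result = stats.ttest_1samp(sample, popmean=0.0)
print(f"estimated effect: {sample.mean():.4f} standard deviations")
print(f"p-value: {result.pvalue:.1e}")  # comfortably below 0.005
```

The result clears even the stricter 0.005 bar by a huge margin, yet a shift of a hundredth of a standard deviation is unlikely to matter for anything in the real world.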

More fundamentally, no matter what significance level you choose, if there are enough different ways to design your experiment, it becomes highly likely that at least one of them will give a statistically significant result just by chance. And unless every single detail of the experiment is decided and registered ahead of time, it is virtually impossible to ensure - except through the care and good judgement of the researchers - that no choices are made along the way that could (unwittingly) enhance the significance of the final results, since there are so many such choices. This phenomenon was termed the "garden of forking paths" by Gelman and Loken, and illustrated nicely by XKCD. (The problem is inherent in any metric-based decision criterion for identifying effects; it is not specific to statistical significance testing.)
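Here is a small simulation of that effect - a sketch under simplifying assumptions: the twenty "analysis variants" are modelled as independent looks at pure noise, whereas real forking paths share data and are correlated, which weakens but does not remove the phenomenon:

```python
# Pure-noise data analysed twenty different "reasonable" ways: how often
# does at least one variant come out "significant"? The variants are
# modelled as independent tests for simplicity.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments, n_subjects, n_variants = 2000, 50, 20
hits_05 = hits_005 = 0

for _ in range(n_experiments):
    noise = rng.normal(size=(n_subjects, n_variants))  # no real effect anywhere
    p_values = stats.ttest_1samp(noise, popmean=0.0, axis=0).pvalue
    hits_05 += int(p_values.min() < 0.05)
    hits_005 += int(p_values.min() < 0.005)

print(f"fluke 'significant' finding at 0.05:  {hits_05 / n_experiments:.2f}")
print(f"fluke 'significant' finding at 0.005: {hits_005 / n_experiments:.2f}")
# Analytic check for independent tests: 1 - (1 - alpha)**20
print(f"expected: {1 - 0.95**20:.2f} and {1 - 0.995**20:.2f}")
```

Lowering the threshold shrinks these numbers (from roughly 0.64 to roughly 0.10 here), but the underlying logic is unchanged: with enough forks, something will eventually clear any fixed bar.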

Finally, the goal, and the strength, of the "p < 0.005" proposal is that it will reduce the number of false-positive results in the literature. By the same token, however, fewer "less statistically significant" results will be reported at all, which will worsen a problem the field already has: the "file-drawer" effect. The vast majority of negative results are never reported, which gives a biased picture of the state of knowledge. Lowering the p-threshold will make the file-drawer bias even greater, since only results clearing the stricter bar will be published, and there will be fewer of them.
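To see how much bigger the drawer gets, here is a small back-of-the-envelope calculation, holding sample size fixed; the modest effect size of d = 0.3 and n = 50 per group are arbitrary illustrative choices:

```python
# With a real but modest effect and a fixed, typical sample size, what
# fraction of studies clears each bar and so escapes the file drawer?
# The effect size (d = 0.3) and n = 50 per group are arbitrary choices.
from math import sqrt
from statistics import NormalDist

def power(alpha, d, n_per_group):
    """Approximate power of a two-sided two-sample z test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)
    return 1 - NormalDist().cdf(z_alpha - noncentrality)

d, n = 0.3, 50
print(f"clears p < 0.05:  {power(0.05, d, n):.0%} of studies")
print(f"clears p < 0.005: {power(0.005, d, n):.0%} of studies")
```

In this sketch roughly a third of the studies of a real effect would have cleared the old bar, but only about a tenth clear the new one; unless sample sizes (and budgets) rise accordingly, the rest join the file drawer.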

So how should we proceed?

There is no simple answer to this question, as there are several interrelated problems: p-hacking, effect sizes, and literature bias, as I've mentioned above, but also the question of how the incentive structures of modern science may bias the creation and evaluation of new knowledge (the file-drawer problem is an example of this). All of these factors contribute to the "reproducibility crisis", and all need to be addressed. Any simple answer to one of these, such as changing the threshold p-value, will not solve the problem, and may in fact make things worse by making it seem solved when it is not.

Any meaningful solution must involve a combination of new methodological standards, institutional structures, and incentives. Reducing the p-threshold is simple and could be one useful methodological standard, though, as I've said, it would perpetuate problematic concepts. Another would be introducing some notions from Bayesian statistics, such as (i) using Bayes factors, which in most situations are more easily interpreted than a p-value, as a standard reported metric, or (ii) using explicit priors, to make assumptions clearer, or (iii) using explicit loss functions, to account for the importance of effect size and the varying weights of different kinds of errors.
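As a concrete (and deliberately simple) example of point (i), here is the Bayes factor for the most elementary case - a binomial experiment comparing a point null, theta = 0.5, against a flat prior on theta. The counts are made up for illustration:

```python
# Bayes factor BF_10 for a binomial experiment: H0: theta = 0.5 versus
# H1: theta ~ Uniform(0, 1). The counts below are made up for illustration.
from math import comb

def bayes_factor_10(successes, trials):
    """BF_10 = P(data | H1) / P(data | H0) for the model above."""
    # Marginal likelihood under H1: the binomial likelihood integrated
    # over a uniform prior on theta, which works out to 1 / (trials + 1).
    marginal_h1 = 1 / (trials + 1)
    likelihood_h0 = comb(trials, successes) * 0.5 ** trials
    return marginal_h1 / likelihood_h0

print(f"BF_10 for 60 successes in 100 trials: {bayes_factor_10(60, 100):.2f}")
print(f"BF_10 for 70 successes in 100 trials: {bayes_factor_10(70, 100):.2f}")
```

The reading is direct: a Bayes factor near 1 (as for 60/100) says the data barely discriminate between the hypotheses, while a large one (as for 70/100) says the data strongly favour the alternative - arguably a more natural summary than a p-value, though it comes with its own modelling choices (here, the flat prior).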

It is essential that the scientific community have good incentives for publishing replications, especially failed ones, and for publishing negative results. (And there must be no penalties for early-stage researchers who do so!) There need to be more and better institutional structures for preregistering studies, like clinicaltrials.gov, but set up with the appropriate community standards and consequences so that they actually work.

Just as during the early and laborious development of effective scientific practices and institutions several hundred years ago, there is no royal road to solving the crisis. A great deal of hard work lies ahead of us, but past experience has shown that it will (eventually) pay off in more solid understanding of our world.