Tuesday, October 17, 2017
Calls to Jettison "Statistical Significance"
I recently wrote an opinion piece for American Scientist's Macroscope blog, arguing (inter alia) that the notion of "statistical significance" harms the epistemology of science, and should therefore be jettisoned. After I'd submitted the piece, I saw other articles, letters, and posts saying the same thing, arguing from a number of technical and sociological bases. I will gather here, to the best of my ability, a list of these articles as a resource for the community. If you know of others that I've missed, please add them in the comments.
- Don’t Strengthen Statistical Significance—Abolish It, by Shlomo Engelson Argamon
- Abandon Statistical Significance, by Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer L. Tackett
- Justify Your Alpha: A Response to “Redefine Statistical Significance”, by Daniel Lakens et al.
- Abandon Statistical Significance (blog post), by Andrew Gelman
- Remove, rather than redefine, statistical significance, by Valentin Amrhein and Sander Greenland
- Time to Abandon Statistical Significance, by Lars P. Syll
Friday, July 28, 2017
Perverse Incentives from the New "p < 0.005" Proposal
Here is a short addendum to yesterday's post, commenting on why I think the recently mooted proposal to require p < 0.005 for "statistical significance" is misguided and likely to be problematic.
There is another problem with that (admittedly well-intentioned) proposal: widespread implementation would create a perverse incentive that could easily degrade the quality of the scientific literature by much more than the lower threshold improves it.
As the authors acknowledge, lowering the alpha needed to declare "significance" would require larger sample sizes, so data collection would entail greater difficulty and expense. They estimate that samples would need to be about 70% larger; that may hold for studies of a single variable, but research involving interaction effects would likely need an even greater increase in data. Regardless, the labor and expense would increase notably. (The authors see this as a potential benefit, in that "considerable resources would be saved by not performing future studies based on false premises.")
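As a rough sanity check on that 70% figure, here is a back-of-the-envelope sketch of my own (not from the paper), assuming a two-sided z-test of a fixed effect size at 80% power; the function name and defaults are illustrative.

```python
from scipy.stats import norm

def sample_size_multiplier(alpha_old=0.05, alpha_new=0.005, power=0.80):
    """Factor by which a two-sided z-test's sample size must grow to keep
    the same power for the same effect size when alpha is tightened.
    For a z-test, n is proportional to (z_{1-alpha/2} + z_{power})**2."""
    z_power = norm.ppf(power)
    n_factor = lambda alpha: (norm.ppf(1 - alpha / 2) + z_power) ** 2
    return n_factor(alpha_new) / n_factor(alpha_old)

print(round(sample_size_multiplier(), 2))  # ~1.70, i.e. roughly 70% more data
```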
However, this extra work and expense creates a perverse incentive for researchers. The more you pay (in time, effort, money, etc.), the more you want to recoup your costs by producing publishable research. The harder such results are to produce, the harder you will look for them, after you have sunk the costs of getting the data.
Now, this would not be a bad thing if there were no way to cheat unwittingly. (Let's ignore the possibility of outright fraud entirely.) But as long as it remains possible, as it is today, to p-hack unknowingly, traipse through the garden of forking paths, HARK, and so on, raising the cost of gathering a dataset will inevitably lead to a rise in p-hacked, forking-pathed, and HARKed results.
The very change that was meant to improve the quality and integrity of the research literature will rather act to degrade it.
It seems to me that this underscores the fundamental need to address how incentives interact with methodologies and standards. Otherwise we are, at best, spitting in the wind, and at worst, using gasoline to put out the fire.
Wednesday, July 26, 2017
New "p < 0.005" Standard Considered Harmful
Several days ago, an article preprint was posted on PsyArXiv (to appear in Nature Human Behaviour), arguing that a significant step towards solving the reproducibility crisis in many fields would be to tighten the standard significance threshold to require p < 0.005 (rather than the current 0.05) for a result to be termed "statistically significant". Under the proposal, results with higher p-values could only be termed "suggestive".
I, along with a number of other scientists, was recently interviewed for a brief article in Nature | News which clearly, and concisely, sets out arguments for and against this proposal as a way to address the reproducibility crisis. I'd like to expand here on my very compressed comments as quoted in that article.
On the one hand, the proposal to reduce the "significance threshold" to 0.005 would indeed be helpful: it would reduce the occurrence of false positives, making effects termed "significant" more likely to be reproducible, and so reducing the number of non-existent effects reported in the literature as real.
However, the proposal does nothing whatsoever to change the fundamental problem, which is the entire framework of judging experimental results by a binary criterion of "significance". There are two key problems with the regnant view of "statistical significance":
- It does not account for effect size: a large effect with a less significant p-value may have much more "real-life significance" than a tiny effect with a more significant one, and
- It is very easy to "p-hack", even inadvertently, and find putatively significant effects, because any experiment in a sufficiently rich and theoretically underdetermined domain admits a very large number of possible parameter settings.
Given that, this new recommendation may even do significant harm, in that improving research outcomes somewhat (by lowering false positive rates) without addressing the fundamental problem could lead to a false sense of security and entrench bad methodological attitudes.
To expand on this, the problem I have with the new recommendation is that it maintains the exact same conceptual framework of statistical significance testing that has been a major driver of the replicability crisis.
In statistical significance testing, a significance level alpha (conventionally 0.05; 0.005 under the new proposal) is fixed in advance, and the question asked is: assuming the null hypothesis is true, is the probability of obtaining results at least as extreme as those observed (the p-value) less than alpha? If it is, the null hypothesis is rejected and the result is declared statistically significant.
This framework does not account at all for effect size: an effect may be statistically significant yet practically negligible and irrelevant. For example, the null hypothesis often takes the form "x = 0" for some model parameter x. Any such null hypothesis is almost certainly false, since the actual value of x is probably at least slightly different from 0, and so some statistically significant result can nearly always be extracted with enough data and ingenuity, even though the result may not be significant in any meaningful way in the real world.
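To make this concrete, here is a small simulation of my own (the numbers are arbitrary): a true effect of 0.01 standard deviations is negligible by almost any practical standard, yet with a large enough sample a one-sample t-test against "x = 0" clears even the stricter 0.005 bar with ease.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
n = 2_000_000                                  # a very large sample
x = rng.normal(loc=0.01, scale=1.0, size=n)    # true effect: 0.01 SD, trivial in practice

t_stat, p_value = ttest_1samp(x, popmean=0.0)
print(f"estimated effect = {x.mean():.4f} SD, p = {p_value:.1e}")
# Typical output: an effect of about 0.01 SD with p far below 0.005 --
# "statistically significant", but of no practical importance.
```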
More fundamentally, no matter what significance level you choose, if there are enough different ways to design and analyze your experiment, it becomes highly likely that at least one of them will give a statistically significant result just by chance. And unless every single detail of the experiment is decided and registered ahead of time, it is virtually impossible to ensure, except through the care and good judgment of the researchers, that no choices are made along the way that could (unwittingly) enhance the significance of the final results, because there are simply so many choices to make. Gelman and Loken termed this phenomenon the "garden of forking paths", and XKCD illustrated it nicely. (The problem is inherent in any metric-based decision criterion for identifying effects, and is not specific to statistical significance testing.)
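A crude illustration of the scale of the problem (my own sketch, with arbitrary numbers, and simplified by treating each analysis variant as an independent comparison): even with no real effect anywhere, a researcher with a few dozen defensible-looking analysis choices will often find at least one p below threshold, and the stricter threshold shrinks the problem without eliminating it.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_experiments, n_variants = 1000, 30           # 30 "reasonable" analysis choices per study
hits = {0.05: 0, 0.005: 0}

for _ in range(n_experiments):
    # Two groups with NO true difference; each variant stands in for an
    # analysis choice (subgroup, outlier rule, covariate set, ...), here
    # simplified to an independent null comparison.
    best_p = min(ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
                 for _ in range(n_variants))
    for alpha in hits:
        hits[alpha] += best_p < alpha

for alpha, count in hits.items():
    print(f"alpha={alpha}: spurious 'finding' in {count / n_experiments:.0%} of null studies")
# Expected roughly 1 - (1 - alpha)**30: about 79% at alpha = 0.05 and 14% at 0.005.
```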
Finally, the goal and the strength of the "p < 0.005" proposal is that it would reduce the number of false positive results in the literature. By the same token, however, there would be fewer reports of "less statistically significant" results, which would cause problems for science overall, due to the "file-drawer" effect: the vast majority of negative results are never reported, which gives a biased picture of the state of knowledge. Lowering the p-threshold would make the file-drawer bias even greater, since fewer results - only those with the lowest p-values - would be published.
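The file-drawer distortion is easy to simulate (again my own sketch, with arbitrary numbers): if only studies crossing the threshold are published, the published estimates of a small true effect are systematically exaggerated, and a stricter threshold publishes fewer, even more exaggerated, estimates.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(2)
true_effect, n, n_studies = 0.1, 50, 5000      # small true effect, modest per-study samples

for alpha in (0.05, 0.005):
    published = []                             # effect estimates that clear the bar
    for _ in range(n_studies):
        x = rng.normal(loc=true_effect, scale=1.0, size=n)
        if ttest_1samp(x, 0.0).pvalue < alpha:  # only "significant" studies get written up
            published.append(x.mean())
    print(f"alpha={alpha}: {len(published)}/{n_studies} studies published, "
          f"mean published effect = {np.mean(published):.2f} (true effect = {true_effect})")
# Only "significant" studies escape the file drawer, so the surviving estimates
# overstate the true effect -- more severely at the stricter threshold.
```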
So how should we proceed?
There is no simple answer to this question, as there are several interrelated problems: p-hacking, effect sizes, and literature bias, as I've mentioned above, but also the question of how the incentive structures of modern science may bias the creation and evaluation of new knowledge (the file-drawer problem is an example of this). All of these factors contribute to the "reproducibility crisis", and all need to be addressed. Any simple answer to one of these, such as changing the threshold p-value, will not solve the problem, and may in fact make things worse by making it seem solved when it is not.
Any meaningful solution must involve a combination of new methodological standards, institutional structures, and incentives. Reducing the p-threshold is simple and could be one useful methodological standard, though, as I've said, it would perpetuate problematic concepts. Another would be to introduce some notions from Bayesian statistics, such as (i) reporting Bayes factors, which in most situations are more easily interpreted than p-values, as a standard metric; (ii) using explicit priors, to make assumptions clearer; and (iii) using explicit loss functions, to account explicitly for the importance of effect size and the differing weights of different kinds of errors.
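As a minimal illustration of point (i), here is a sketch of a Bayes factor for "the mean is zero" versus "the mean is nonzero", assuming normal data with known variance and a normal prior on the mean under the alternative; the prior scale tau is an assumption that has to be argued for, which is exactly the kind of explicitness point (ii) is after.

```python
import numpy as np
from scipy.stats import norm

def bayes_factor_10(xbar, n, sigma=1.0, tau=0.5):
    """Bayes factor for H1: mu ~ Normal(0, tau^2) against H0: mu = 0,
    given a sample mean xbar of n observations from Normal(mu, sigma^2)
    with known sigma.  BF10 > 1 favours H1; BF10 < 1 favours H0."""
    se = sigma / np.sqrt(n)
    marginal_h1 = norm.pdf(xbar, loc=0.0, scale=np.sqrt(tau**2 + se**2))
    likelihood_h0 = norm.pdf(xbar, loc=0.0, scale=se)
    return marginal_h1 / likelihood_h0

# A sample mean of 0.2 with n = 100 and sigma = 1 has p ~ 0.046 ("significant"
# at 0.05), yet under this prior the Bayes factor is only ~1.3: the evidence is
# graded and, here, barely favours H1 at all.
print(f"BF10 = {bayes_factor_10(xbar=0.2, n=100):.2f}")
```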
It is essential that the scientific community create good incentives for publishing replications, especially failed ones, and for publishing negative results. (And there must be no penalty for early-career researchers who do so!) We also need more and better institutional structures for preregistering studies, like clinicaltrials.gov, set up with the appropriate community standards and consequences so that they actually work.
Just as during the early and laborious development of effective scientific practices and institutions several hundred years ago, there is no royal road to solving the crisis. A great deal of hard work lies ahead of us, but past experience has shown that it will (eventually) pay off in more solid understanding of our world.
Saturday, April 8, 2017
Initial Linguistic Analysis of the Shadowbrokers Texts
This is a repost of my forensic linguistic analysis of “The Shadowbrokers texts”, as posted on the Taia Global blog on August 18, 2016.
This is an initial linguistic analysis of the texts from “The Shadowbrokers”, as posted on Pastebin, taken from the tumblr account. This is a qualitative analysis, looking at patterns of grammatical and orthographic errors, to examine the question of whether the author of the text is a native speaker/writer of US English. A quantitative analysis would help firm up and estimate the reliability of our conclusions. This analysis assumes that all the texts were written by a single individual.
There are a number of grammatical errors that are not typical of native speakers of US English:
1. Omission of definite and indefinite articles (“a” and “the”)
2. Omission of infinitive “to” (e.g., “I want get” instead of “I want to get”)
3. Omission of modal verbs “should” and “must” and auxiliary verb “will”
4. Elision of “it” in “it is ...”
5. Use of progressive form “is Xing” instead of present or past tense form “X” (e.g., “He is breaking” instead of “he breaks” or “he broke”)
6. Use of “are X” instead of “are Xing” or “X” (“they are go” instead of “they are going” or “they go”)
7. Tense confusion – use of base verb form instead of past tense
Evidence that the author is a native speaker trying to appear non-native:
- Spelling. The spelling is entirely correct throughout, including some long and complex words such as “dictatorship”, “prostitutes”, and “consolation”. If this had been achieved through the use of spell-checking software, we would have expected to see at least one “Cupertino” (choice of a correctly-spelled but contextually wrong word).
- Inconsistent errors. Grammatical errors such as omitting the infinitive “to” or using “is breaking” to mean “breaks” result, in a non-native writer, from deeply held intuitions about how grammar works. The fact that errors 2, 3, 5, and 6 all occur inconsistently (they occur a majority of the time, but not by much) indicates that someone was inserting errors, rather than making them naturally.
- Mutually inconsistent errors. Errors 5 and 6 are odd together – if the writer knows about the progressive (-ing) form, then why do they use it only sometimes, when using the auxiliary “is” or “are” with the verb?
- Grammatical errors in idioms. There are a number of idioms that would be surprising for a low-skilled non-native speaker to use, and some of them are used with grammatical errors that a skilled English speaker would be unlikely to make. The most reasonable explanation, then, is that the errors were inserted by a native speaker after writing the idioms. Examples include:
- “or [the] bid pump[s] [the] price up”
- “bidding war”
- “top friends”
- “go bye bye”
- “where [does that] leave Wealthy Elites”
In the (unlikely) event that the writer is, in fact, not a native English speaker, their native tongue is much more likely to be Slavic (e.g., Russian or Polish) than either Germanic or Romance.
Sunday, January 29, 2017
Stop the Overheated Rhetoric; the Actual Executive Order is Bad Enough
I just read the text of the recent executive order that supposedly "Bans Muslims" and some of the relevant statutes. Unsurprisingly, much of the overheated rhetoric and clickbait headlines (even from reputable news outlets) are inaccurate. I believe it is essential, especially if you oppose the order, to be as precisely accurate as possible when writing about it. This is my analysis based on reading the original documents - I would welcome any corrections if I got anything wrong.
First, we must distinguish between the two main elements of the order. One comprises restrictions on entry to the US for holders of certain visas, both temporary visas and immigrant visas ("green cards"). The other comprises restrictions on admission of refugees. These two sets of restrictions should not be confused, as they are not the same and have different effects.
- The order does not "ban Muslims". It bars entry by citizens of a specific set of countries, most of them Muslim-majority, and does not affect entry for most of the world's Muslims.
- The order does not itself list a specific set of countries to be affected, but rather refers to the list of countries that are not eligible for the visa waiver program, per 8 U.S. Code § 1187. As far as I can tell, this list came from the Obama administration, and there is no actual evidence that Trump picked a specific list of countries whose citizens he wanted to ban, contra many memes out there. The supposed link to his business interests is pure speculation; indeed, there are other Muslim-majority countries that are not on the list even though he has no business interests there.
- The order is being applied to green card holders, that is, legal permanent residents (LPRs). (Evidently, DHS initially did not apply it to LPRs, but the White House overruled them.) This is unconscionable, and perhaps illegal. Indeed, the US insists that green card holders maintain continuous residence in the US - they cannot be out of the country for more than six months without potentially losing their status. It is an odious policy, and bad for the US in many ways, to keep out someone whose home, family, and livelihood are all legally in the US, perhaps for many years.
- A great many talented and productive academics and graduate students here will now be unable to travel abroad, for fear of not being able to return - indeed, many have already been caught abroad by this order and cannot return to their jobs or their studies. This is real human suffering, and it also has a highly negative impact on US science and technology. Harming US research capabilities will certainly not help increase American greatness.
- It is also entirely unclear to me on what legal basis people holding valid visas can simply be barred from entry for no reason other than their country of citizenship, without revoking their visas (this may depend on the visa type). Issuance of such visas and the legality of entry to the US is governed by Congressional statute. It would seem therefore to fall within their purview. (Alas, Congress has allowed the Executive to appropriate rather too much of its power in recent years and I don't expect them to do much now.)
- The immediate application of this order within hours to people who had no warning, who were en route to the US with valid visas, who live in the US (whether as legal residents, as students, etc.), does nothing to improve the security of the country. It does, however, create an enormous amount of human suffering.
- The limit of 50,000 refugees per year is well within historical norms before 2016. It is therefore not outrageous, though it can certainly be argued with on the merits.
- The prioritization of people persecuted for being members of minority religions is entirely proper and long overdue.
- The suspension of the US Refugee Admissions Program (USRAP) is extraordinarily overbroad for its purpose, and will cause great human suffering, especially as applied to people already in transit. The clause enabling case-by-case refugee admission (jointly by the Secretaries of State and Homeland Security) is vague, and does not provide any sort of meaningful allowance for refugee admission.
- The idea of reviewing policies and procedures for immigration and refugee screening is excellent. However, to suddenly and very broadly stop all entry to the country for huge numbers of people is absurd, and clearly not well thought out. The chaos in airports today bears this out.
- Promulgating poorly thought-out orders with little-to-no interagency or legal review, and no pre-implementation planning, that then require multiple iterations of executive clarification and changes in implementation, is a recipe for instability and chaos. Such instability is, in my opinion, likely to reduce the security of the United States, possibly dramatically so.