I suspect if you take away tenure being based upon publication, you will find that many statistical measures become more honest. You can abolish statistical significance, but it won't stop the abuse of knowledge, which is the real problem here anyway.
Pre-registration is becoming more popular, and I think it will have a massive positive effect w.r.t. abuse of low-grade statistical tests and thresholds.
Could you expand on what you mean by the 'abuse of knowledge'?
I agree that the focus on this metric negatively influences research outcomes, which extends to university structuring, but I'd like to hear your thoughts on how this extends to abuse of knowledge in general.
> Could you expand on what you mean by the 'abuse of knowledge'?
A nice way of saying "lying". Learning how to game the statistics (e.g. publish the 20th experiment that showed significance but fail to mention the other 19).
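To make that concrete, here's a quick simulation (my own sketch, assuming numpy and scipy, not something from the article): run 20 experiments on pure noise, report only the "best" one, and you'll find a "significant" result roughly two times out of three.

    # Sketch: 20 experiments where the null is true; how often does at
    # least one of them come out "significant" at 0.05?
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_runs, n_experiments, n_samples = 10_000, 20, 30

    hits = 0
    for _ in range(n_runs):
        pvals = [stats.ttest_ind(rng.normal(size=n_samples),
                                 rng.normal(size=n_samples)).pvalue
                 for _ in range(n_experiments)]
        hits += min(pvals) < 0.05

    print(f"at least one p < 0.05 in {hits / n_runs:.0%} of runs")
    # Expect roughly 1 - 0.95**20 ~ 64%, even though every true effect is zero.

Publishing only that 20th experiment is exactly the kind of lying being described.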
Sometimes I wonder if the discussion about p < 0.05 has diverged a bit from practical considerations. In my field (population genetics and bioinformatics), for instance, I'm not sure any current journal would reject a paper where the primary result has p=0.051 but accept an (almost) identical paper with p=0.049. Most papers seem to involve many separate analyses that together tell some story, and that story may even be an interesting negative result (where p >> 0.05).
Whether or not statistical significance is a useful concept at all is a separate question, but I suspect the discussion of whether the threshold of 0.05 is useful might be out of touch with actual practice.
I don't think the article suggests dispensing with the threshold; it suggests mainly that the language used to describe results with p < 0.05 is disingenuous and should be reconsidered:
"We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term “statistically significant” entirely."
From the discussion I've read on this, I think a good direction would be to consider statistical tests like this as simply not "publishable" on their own, at least not in the way we currently think of publication.
That is, if you have a theory about how a gene relates to height in tomatoes and you run a test, a result that misses the threshold can tell you you're likely on the wrong track, but the only thing a result that clears it tells you is that "there may be something here."
I think this is true for many fields with a replication crisis. The problem isn't statistical; the problem is the lack of theory. If you have a functional theory, there are all kinds of things you can do to gain confidence in it, and mostly those will contribute to the ability to predict statistical results, but that is completely different in kind from sending out a survey and noting that questions 2 and 6 are statistically correlated.
When a field thinks early suggestive work like this is worth talking about, it should probably be discussed at conferences and similar venues, rather than "published" where journalists will pick it up in a "science shows" story that 95% (lol) of the time turns out to be wrong.
In other words, I think it is fine that fields talk about early non-theory results -- that can be interesting for specialists to advance faster. "Publishing" this mostly-going-to-be-wrong stuff is leading to confusion among the public about what the scientific process demands and how trustworthy it is. That is not a good outcome in my opinion.
Here's a good example: take a look at the recently published articles in PloS Computational Biology: https://journals.plos.org/ploscompbiol/ Just scanning through them, there aren't really any that are a simple "we made a single hypothesis, performed a significance test, and because p < 0.05 we attempted to publish it". In my experience that's just not the usual way science is actually performed (but my experience is limited to certain branches of biology). I don't mean to say that p-values aren't used at all, just that their application seems to be limited and used mostly to bolster very specific sub-arguments buried in a larger story.
I guess the point is that the unit of work a particular journal article represents is often the union of many statistical tests / hypotheses / models / simulations / etc. that together form a possibly-compelling story about how something works. Not really sure if that's better or worse from a statistical sanity perspective...
One very important thing to know about the 0.05 threshold, and which I did not find in this thread, is that the ideal p-value threshold for a problem is a function of the number of samples (and of the effect size, though that has a lesser impact).
0.05 is way too stringent if you have 10 samples and way too lenient if you have 1 million samples.
But, by force of convention, everyone uses 0.05 (a value suggested by Fisher when basically all datasets were small) independently of their sample size, in a world where we sometimes reach dataset sizes that would have been inconceivable when the threshold was suggested.
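To illustrate (my own numbers and code, assuming numpy and scipy): the same practically negligible effect is nowhere near significant at n=10 but comfortably clears 0.05 at a million samples.

    # Sketch: a tiny effect (a 0.01-standard-deviation shift in the mean)
    # tested at different sample sizes against the fixed 0.05 threshold.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    effect = 0.01  # practically negligible shift

    for n in (10, 1_000, 1_000_000):
        a = rng.normal(loc=0.0, size=n)
        b = rng.normal(loc=effect, size=n)
        print(f"n = {n:>9,}: p = {stats.ttest_ind(a, b).pvalue:.3g}")
    # Typically: p is large for the small samples, but far below 0.05 at n = 1e6,
    # even though a 0.01-sigma difference is rarely of any practical importance.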
It is not a function of the number of samples alone. It is also a function of how costly false positives (type I errors) and false negatives (type II errors) are. That is (from my understanding) the paper's main point.
But then you have to be able to calculate how costly a type I and a type II error actually are! That seems a relatively straightforward question for a business (for example in A/B testing), but how do you measure that cost in academia?
I think this would only introduce confusion and another variable for p-hacking.
You can also use the a priori probability of a positive, instead of a cost, which can be roughly deduced from the existing literature.
The strength here is that you get rid of an arbitrary decision (the p-value threshold) and instead use quantities that can be measured and critiqued by a skeptical reviewer.
But, in my experience with small data, the impact of the size of the dataset dwarfed the impact of the cost/probability.
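For what it's worth, the back-of-the-envelope arithmetic behind the "prior probability of a positive" idea looks something like this (my own sketch, standard positive-predictive-value math, not necessarily exactly what the parent has in mind):

    # Sketch: given a prior probability that the tested hypothesis is true,
    # the threshold alpha and the study's power, what fraction of
    # "significant" findings are actually true?
    def ppv(prior: float, alpha: float, power: float) -> float:
        true_hits = prior * power          # real effects that get detected
        false_hits = (1 - prior) * alpha   # nulls that cross the threshold anyway
        return true_hits / (true_hits + false_hits)

    for prior in (0.5, 0.1, 0.01):
        print(f"prior = {prior:<4}: PPV at alpha=0.05, power=0.8 -> "
              f"{ppv(prior, 0.05, 0.8):.2f}")
    # With a 1-in-100 prior, even a well-powered p < 0.05 result is usually false.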
However, "a 1-in-20 chance that, if the hypothesis is false, our research confirms it anyway" is not a good place to draw the line for any important research that can affect people's lives.
This debate isn't that simple. The other side of this isn't that everything magically just gets more accurate. The cost will be withholding true discoveries from the medical community for longer.
Imagine you've done a study of something else and you accidentally discover a (properly FDR controlled) correlation between some drug and heart attacks, significant at 0.03 and with no big contrary prior.
Should you publish or wait 5 years for a follow-up study to complete?
What if three different groups stumble across this same finding? Should they all publish and maybe someone will realize 'shit this drug is killing people' or should they all wait to hit some other higher standard of significance?
The point is there's always a balance between specificity and sensitivity, a tradeoff in terms of costs.
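To put a rough number on the three-groups scenario (my own sketch; Fisher's method is just one standard way to combine independent p-values, not something the comment specifies): three independent studies each landing at p = 0.03 are jointly far stronger evidence than any one of them.

    # Sketch: combine three independent p-values of 0.03 with Fisher's method.
    from scipy import stats

    stat, p_combined = stats.combine_pvalues([0.03, 0.03, 0.03], method="fisher")
    print(f"combined p-value: {p_combined:.4f}")  # roughly 0.002

Which is one argument for letting all three groups publish rather than each sitting on a borderline result.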
I'd personally be happy with keeping 0.05 as the threshold for 'probably something here'. The real issues are publication bias, incentives, and naive interpretation of published work ('here's one study in Nature so it must be true'). I don't see any purely statistical change, Bayesian (which essentially has all the same problems, aside from taking priors into account) or otherwise, that will solve these without an unacceptable sensitivity cost.
There have been some recent proposals to lower it getting lots of traction. Sure if you want to decide whether to publish based on p-value alone you have to draw a line somewhere, but 0.05 isn't necessarily the right place. And you don't make publishing decisions based on p-values alone.
I think you might be conflating two related-but-different arguments.
There is consensus that a threshold of 0.05 is reasonable. No one is arguing against that parameter in particular.
The argument is rather against the idea of statistical significance itself, because it's relatively easy to cheat if you are dishonest. Setting the threshold to 0.01 won't change that.
Finally, and this is my personal opinion, there is "consensus" for abolishing statistical significance the same way there's "consensus" that Python is the best programming language. Scientists are more or less in agreement that it could be better, but "mildly displeased scientists call for unclear improvement" is not as good a headline.
There is not consensus that the current threshold is reasonable. There are many big names arguing against that parameter in particular.
See e.g. "Redefine statistical significance". The author list is practically a who's who: https://psyarxiv.com/mky9j
>One Sentence Summary: We propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005.
Victor Coscrato, Luís Gustavo Esteves, Rafael Izbicki, Rafael Bassi Stern — Interpretable hypothesis tests (2019)
Abstract:
Although hypothesis tests play a prominent role in Science, their interpretation can be challenging. Three issues are (i) the difficulty in making an assertive decision based on the output of an hypothesis test, (ii) the logical contradictions that occur in multiple hypothesis testing, and (iii) the possible lack of practical importance when rejecting a precise hypothesis. These issues can be addressed through the use of agnostic tests and pragmatic hypotheses.
Note that this enables one of the Holy Grails of Statistics, namely controlling Type I and Type II errors simultaneously.
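As a very rough sketch of the idea (my own simplification, not the paper's actual construction): the test is allowed a third, "remain agnostic" outcome, so the accept and reject regions can each be sized to control their own error rate. Something like:

    # Sketch (a simplification, not the paper's construction): a three-valued
    # decision for H0: mu = 0 based on a confidence interval. Reject when the
    # interval excludes 0, accept when it sits inside a "pragmatic" band
    # around 0, and stay agnostic otherwise.
    import numpy as np
    from scipy import stats

    def agnostic_test(sample, pragmatic_margin=0.1, confidence=0.95):
        n = len(sample)
        lo, hi = stats.t.interval(confidence, df=n - 1,
                                  loc=np.mean(sample), scale=stats.sem(sample))
        if lo > 0 or hi < 0:
            return "reject H0"
        if -pragmatic_margin <= lo and hi <= pragmatic_margin:
            return "accept H0 (pragmatically)"
        return "remain agnostic"

    rng = np.random.default_rng(2)
    print(agnostic_test(rng.normal(loc=0.5, size=200)))    # likely: reject H0
    print(agnostic_test(rng.normal(loc=0.0, size=2000)))   # likely: accept H0 (pragmatically)
    print(agnostic_test(rng.normal(loc=0.0, size=10)))     # likely: remain agnostic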
In particle physics there is the concept of the "look elsewhere" effect, precisely to take into account that if you look for a signal, for example a particle of any mass in some range, there is the possibility that just by chance you find some statistical deviation at some mass value.
Confirming a prediction (i.e. looking for a particle with a precisely predicted mass) is very different from fishing for some unexpected signal in your data.
In some cases economics could do the same: looking for an effect in any age range could be post-processed to take into account that you are looking at many age groups.
That's called the multiple comparisons problem in statistics, and it is both well known and compensated for in most studies (the hard part being not to produce too many false negatives in the effort to keep the number of false positives constant):
https://en.wikipedia.org/wiki/Multiple_comparisons_problem
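For anyone who hasn't seen what that compensation looks like in practice, here's a minimal sketch using statsmodels (the p-values are made up for illustration):

    # Sketch: applying two standard multiple-comparison corrections to a
    # batch of p-values (the values themselves are made up).
    from statsmodels.stats.multitest import multipletests

    pvals = [0.001, 0.008, 0.012, 0.03, 0.2, 0.5]

    for method in ("bonferroni", "fdr_bh"):  # family-wise vs. false-discovery-rate control
        rejected, adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
        print(method, [f"{p:.3f}" for p in adjusted], rejected.tolist())
    # Bonferroni keeps only the two strongest results; Benjamini-Hochberg keeps
    # four, at the cost of a controlled fraction of false positives, which is
    # exactly the false-negative / false-positive tradeoff mentioned above.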
Who is Timothy Taylor? (The 'about me' on his Blogger page is blank, and Googling doesn't turn up results that are obviously about him.) This refrain is so well worn by now that it seems that one really ought to have something fundamentally new to say, or be so much of a heavy weight that one's own opinion might be enough significantly to swing the pendulum, before hoping to have another repetition of it make any significant difference. (On the other hand, I guess it's just a blog post, so I shouldn't spend too much energy ranting about how someone uses his own blog.)
This is throwing the baby out with the bath water. Let's redefine it so it's useful again and reform academia. I'm running into more and more people who use headlines like this to ride bikes with no helmets.
There are a few more listed in the article and elsewhere:
* pre-registered experiments
* listing the number of regression models
* P-values with no significance declaration
Maybe it would be a good idea to have an exception where an academic paper with a p-value that's right on the borderline can get published if the authors can find someone at another institution to replicate their results. At the very least it would ensure that someone else is reproducing their methodology.
There's a potential for abuse (as always), but perhaps you could curb it by mandating that you can only use the same co-submitter once. The other (likely bigger) problem would be figuring out how to get funding for it.
Full disclosure: I've never worked in academia, so please take this all with a grain of salt.