In a huge effort, the Open Science Collaboration, headed by Brian Nosek, attempted to replicate more than 100 studies that were published in 2008 in three psychology journals (Open Science Collaboration 2015). 100 replications were completed and made it into the report. Of those, a crushingly low number reproduced the effects of the original publication.
There's some debate about the methodology of the study. See, for example, the excellent post by Alexander Etz, who suggests a Bayesian approach instead of classifying each replication attempt as either a success or a failure. But, as Etz concludes, any way you look at it, a lot of studies didn't replicate: somewhere between one and two thirds.
It's worth noting that surprising results and difficult studies replicated less often. Thus, if we generalize the result to all of science, we might expect a higher nonsense rate from higher impact journals, which often publish unexpected findings. It might also mean that we should expect higher nonsense rates for more complex methods, such as fMRI experiments. This is, of course, just a hunch. But estimate the cost, in both time and money, of a project that tried to replicate 100 fMRI studies…
What are the reasons for the failure to replicate so many studies? There are at least 3 problems:
Publication bias and underpowered studies
Studies are usually published if they have a positive result, that is, a significant p-value. This is because:
- in classical statistics, a non-significant result does not allow us to conclude that there is no effect
- not finding what you attempted to find is usually not very interesting
- not finding what you attempted to find might just mean you did sloppy work
As a result, lots of studies are conducted, but never written up — the so-called file drawer problem. What gets into the journals are those attempts that were successful — but often by chance. The rest goes in the trash.
Conversely, the studies that do get published often give a wrong impression of how strong an effect really is. Because being successful in science means publishing as many papers as possible, studies often test only a few subjects and therefore have low power. With low power, an effect reaches significance only when the sample happens to overestimate it, and those are the results that get published. Accordingly, most replications will come out with smaller effects, or none at all (see Button et al. 2013).
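To make the inflation argument concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not taken from any of the cited studies: a true effect of 0.3 standard deviations, 20 subjects per study, a known standard deviation of 1, and a simple two-sided z-test. Only the "significant" studies count as published.

```python
import random
from statistics import NormalDist, mean

def simulate_published_effects(true_effect=0.3, n=20, n_studies=5000,
                               alpha=0.05, seed=1):
    """Run many small studies and keep only the significant ones.

    Returns (power, mean effect among the significant studies).
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    se = 1 / n ** 0.5                             # standard error, known sd = 1
    significant = []
    for _ in range(n_studies):
        # One study: n subjects drawn around the true effect.
        sample_mean = mean(rng.gauss(true_effect, 1) for _ in range(n))
        if abs(sample_mean / se) > z_crit:
            significant.append(sample_mean)       # this study gets "published"
    power = len(significant) / n_studies
    published = mean(significant) if significant else float("nan")
    return power, published

power, published = simulate_published_effects()
print(f"share of studies reaching significance: {power:.2f}")
print(f"true effect: 0.30, mean published effect: {published:.2f}")
```

Under these assumptions only a minority of the studies reach significance, and the average effect among them is larger than the true effect, because a sample mean has to exceed the significance cutoff to be counted at all.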
Selective reporting of measurements
Because we're so keen on finding a positive result, we often measure a lot of things at once. But given the problem of chance results, the more measures we take, the more probable it is that we find a spurious significant result. The problem becomes worse if a researcher measures many things, then picks the significant result for publication, but does not disclose that there were many other measures that did not show a significant result (see Simmons et al. 2011). The published result then looks convincing, although it was really just chance.
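How quickly the chance of a spurious result grows can be sketched in a few lines. The sketch assumes that the measures are independent and that no real effect exists for any of them, so each p-value is uniformly distributed between 0 and 1. Real measures are usually correlated, so treat the numbers as a rough guide only; the function names are made up for this example.

```python
import random

def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive among k independent tests."""
    return 1 - (1 - alpha) ** k

def simulated_familywise_error(k, alpha=0.05, n_sims=20000, seed=1):
    """Same quantity by simulation: under the null, p-values are uniform."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(k))  # any measure "significant"?
        for _ in range(n_sims)
    )
    return hits / n_sims

for k in (1, 5, 10, 20):
    print(k, round(familywise_error(k), 3),
          round(simulated_familywise_error(k), 3))
```

With ten independent measures, the chance of at least one significant result is already about 40%, even though nothing real is going on.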
P-hacking
There are more papers that report p-values just below 0.05 than there should be (see Lakens 2015). This suggests that authors work on their data until the result becomes significant. There are several ways to do that, for example:
- cleaning data by eliminating outlier data points and outlier subjects. There are many degrees of freedom in cleaning, and no hard rules as to what should (not) be done.
- acquiring data subject by subject and stopping when the p-value is low enough (see Simmons et al. 2011). This strategy will make you stop data acquisition after you have, by chance, sampled one or several subjects with a strong effect, who push your p-value just under the 0.05 mark.
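The effect of stopping as soon as the p-value is low enough can be checked with a small simulation. This is a sketch under assumptions of my own choosing: there is no true effect at all, the standard deviation is known to be 1, a z-test is run after every new subject from subject 10 onward, and acquisition stops at 50 subjects at the latest.

```python
import random
from statistics import NormalDist

def false_positive_rate(peek, n_min=10, n_max=50, n_sims=2000,
                        alpha=0.05, seed=1):
    """Share of null experiments that end with a 'significant' result.

    peek=True:  test after every subject from n_min on, stop at the first hit.
    peek=False: test once, after all n_max subjects.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0, 1)          # the null is true: no effect
            if (peek and n >= n_min) or n == n_max:
                z = total / n ** 0.5          # z statistic with known sd = 1
                if abs(z) > z_crit:
                    hits += 1                 # stop and "publish"
                    break
    return hits / n_sims

print("fixed sample size:", false_positive_rate(peek=False))
print("optional stopping:", false_positive_rate(peek=True))
```

With a fixed sample size the false positive rate stays near the nominal 5%. With a look after every subject it ends up several times higher, although no single test was computed incorrectly.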
Unsurprisingly, such results tend not to replicate under more controlled and constrained data acquisition and analysis.
I've never understood why 5% is such a magical mark. But as long as science careers rest on publishing many papers, and getting a paper published rests on passing that magic 5% threshold, I suppose we'll continue seeing p-hacking.
Scientific practices seem to be really hard to change. These problems have been known for a while. Maybe we're gaining some momentum.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716.
Lakens, D. (2015). On the challenges of drawing conclusions from p-values just below 0.05. PeerJ, 3, e1142.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22, 1359–1366.
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376.