The one thing we should NOT conclude from the Open Science Collaboration’s replication study

The authors of the replication study published in Science this week added some careful caveats to their paper, stating:

It is also too easy to conclude that a failure to replicate a result means that the original evidence was a false positive. Replications can fail if the replication methodology differs from the original in ways that interfere with observing the effect.

Some of the responses to the study seem to cling to that passage of the paper, apparently in the hope of explaining the low number of successful replications (roughly 35 of 97).

For instance, the board of the German Society of Psychology (DGPs) stated (translated from German):

Such findings rather show that psychological processes are often context-dependent and that their generalizability must be investigated further. Replicating an American study in Germany or Italy (or vice versa) might yield different results.

A similar context argument is made in this NY Times post.

When a study should replicate

Good scientific conduct requires that the methods section of a paper state all details necessary to redo the study. In other words, the methods section reflects what the authors deemed relevant for observing the reported effect. The Collaboration not only worked from the methods sections of the studies they tried to replicate, but also asked the original authors for equipment and other details. Robust effects therefore had a good chance of being replicated.

One can ask what kind of effects we are producing with our science if we expect the country in which a study is run to affect the significance of the result. (While this might be plausible for some social psychology, it should usually not be true in cognitive science.) More generally, by stating that our effects can be wiped out by small, unknown differences from one lab to another, we are basically saying that we have no clue about the true origin of those effects. It is my sincere hope that most labs do not subscribe to this kind of thinking…

The wrong conclusion

The above-mentioned NY Times post was titled Psychology is not in crisis. One can argue about the word crisis (see this post by Brian Earp). But the title basically suggests that there is nothing to worry about: non-replication is just bad luck, due to odd little quirks of the lab context, and will simply inspire new research. I think that is the one conclusion we should not draw from the Collaboration’s study.

The better conclusion: a kick in the behind

I posted about some of the reasons for non-reproducibility yesterday: underpowered studies, file-drawer publication behavior, selective reporting, and p-hacking. All of these problems have been known for some time, but the scientific community has been slow to adjust its culture to address them.

Rather than blaming the results of the Collaboration’s study on small differences between lab setups and the like, we should face the imperfections of our current research culture: success in science depends on publishing lots of papers; thus, studies are done quickly and results are sent out as fast as possible; and publication is usually possible only with significant results.

Some things that can be done

So, rather than leaning back and doing business as usual, let’s think about what can be done. There are numerous initiatives and possibilities. Most of them require work and are still unpopular. But this week’s replication study could give us some momentum to get going. Here’s what my team came up with in today’s lab meeting:

  • Pre-register your study with a journal, so that it will be published no matter what the result. (We discussed that this is not always feasible, because with complex studies, you might have to try out different analysis strategies. Nevertheless, pre-registration should work well with many studies that use previously established effects and paradigms.)

  • Replicate effects before publishing. (The “but…”s are obvious: costly, time-intensive, not attractive in today’s publish-or-perish world.)

  • Publish non-significant results. (Can be difficult. Very.)

  • Calculate power before doing experiments. (See the power-calculation sketch after this list.)

  • Inspect raw data and single-subject data, not just means. (See the plotting sketch after this list.)

  • Publish which hypotheses were post-hoc, and which ones existed from the start.

  • Publish data sets.

  • Use Bayesian statistics to accumulate evidence over experiments. (See the Bayesian-updating sketch after this list.)
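
To make the power point concrete, here is a minimal sketch of an a priori power calculation in Python using statsmodels; the effect size and the alpha/power targets below are purely illustrative, not values from any particular study.

```python
# Minimal a priori power calculation for a two-sample t-test
# (illustrative numbers: medium effect d = 0.5, alpha = .05, power = .80).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                    alternative='two-sided')
print(f"Participants needed per group: {n_per_group:.0f}")  # roughly 64
```

Re-running the same calculation with a small effect (d = 0.2) shows why so many studies end up underpowered: the required sample size jumps to roughly 400 per group.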
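
For the point about single-subject data, here is a toy sketch of what such an inspection might look like; the reaction times below are simulated, purely hypothetical numbers.

```python
# Toy example: plot each subject's condition difference alongside the group
# mean instead of looking at the mean alone (simulated data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_subjects, n_trials = 20, 40
cond_a = rng.normal(500, 50, size=(n_subjects, n_trials))  # RTs, condition A
cond_b = rng.normal(520, 50, size=(n_subjects, n_trials))  # RTs, condition B

subject_diff = cond_b.mean(axis=1) - cond_a.mean(axis=1)   # per-subject effect

plt.plot(np.arange(n_subjects), subject_diff, 'o', label='single subjects')
plt.axhline(subject_diff.mean(), color='red', label='group mean')
plt.axhline(0, color='grey', linewidth=1)
plt.xlabel('subject')
plt.ylabel('condition difference (ms)')
plt.legend()
plt.show()
```

A group mean can look convincing even when it is carried by a handful of subjects; a plot like this makes that visible immediately.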
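
And for the Bayesian point, here is a minimal sketch of accumulating evidence across experiments, using a simple Beta-Binomial model as a stand-in; the counts are invented, and a real analysis would of course use a model appropriate to the paradigm.

```python
# Sketch of Bayesian evidence accumulation: the posterior after one experiment
# becomes the prior for the next (Beta-Binomial conjugate model, invented counts).
from scipy import stats

a, b = 1, 1  # flat Beta(1, 1) prior on the probability of a "success"

experiments = [(14, 20), (33, 50)]  # (successes, trials) per experiment

for successes, trials in experiments:
    a += successes
    b += trials - successes

posterior = stats.beta(a, b)
print(f"Posterior mean: {posterior.mean():.2f}")
print(f"95% credible interval: ({posterior.ppf(0.025):.2f}, "
      f"{posterior.ppf(0.975):.2f})")
```

The same logic extends to Bayes factors or hierarchical models; the key point is that each new experiment updates, rather than replaces, the evidence from the previous ones.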

If every lab just started with one or two of the points it isn’t yet practicing, we’d probably end up with a better replication success rate next time around.