I failed to reproduce the results of my experiments! Some of us are haunted by this nightmare. The scientific academies, the journals and, in the meantime, the funders themselves are all calling for reproducibility, replicability and robustness of research. A movement for “reproducible science” has developed, and funding programs for the replication of research papers are now in the works. In some branches of science, especially in psychology but also in fields like cancer research, results are now being systematically replicated… or not. Thus we are now, it is said, in the throes of a “reproducibility crisis”.
Now Daniele Fanelli, a scientist who up to now could be expected to side with the supporters of the reproducible-science movement, has raised a warning voice. In the prestigious Proceedings of the National Academy of Sciences he asked rhetorically: “Is science really facing a reproducibility crisis, and do we need it to?” So today, on the eve, perhaps, of a budding oppositional movement, I want to have a look at some of the objections to the “reproducible science” mantra. Is reproducibility of results really the foundation of the scientific method?
Or, as Jason Mitchell asks: did not Thomas Kuhn, in his famous essay “The Structure of Scientific Revolutions”, establish that scientific progress happens not so much through the incremental progression of “normal science” as through periodically recurring “paradigm shifts”? And a paradigm shift is most certainly anything but the reproduction of what already exists! A related argument concerns the “triviality” of reproduced scientific results: precisely those findings that rest on what is already amply known are guaranteed to be the most reproducible. And vice versa, does successful replication mean that the results are the “true” ones? What if the original result and the reproduction both rest on the same systematic error, or are both false positives, simply by coincidence? It gets even more philosophical. Some critics of emphasizing reproducibility as a scientific goal even invoke Karl Popper: according to him, hypotheses cannot be proved, only falsified. Take the famous single black swan that refutes the hypothesis that “all swans are white”: if a study that found only white swans on one lake is successfully reproduced on another lake with only white swans, the hypothesis would nevertheless be wrong. That would become particularly evident if a black swan flew by. This is what Jason Mitchell calls “the emptiness of failed replication”. The marvel of science is the discovery of something new, not dull repetition. Reproducing is not science, so this verdict tells us!
Those critics, on the other hand, who consider replication experiments problematic from the get-go get right down to it without theoretical digression: they doubt the competence of the replicators. They usually point to the hosts of doctoral students and postdocs who have been worn down attempting to establish a particular technique in their own labs. Naturally everything was replicable, the experts would say. But the existence of implicit knowledge that cannot be conveyed in the methods section of articles precludes replicability. The non-replication of results by others proves, so the argument goes, only one thing: their incompetence!
And the critics put forward another consideration: by proclaiming replication the gold standard of good science, scientists whose results cannot be replicated are stigmatized. Quite independently of the details and circumstances of the replication, its result is taken to be the right one. That gives rise to the suspicion that somebody has not been doing a clean job, and might even have violated the rules of good scientific practice. After all, good science MUST be replicable!
Does that mean the critics are right, and that it is an error to elevate the reproducibility of research to a principle, to reward it and even to fund it? Certainly not. Nevertheless, I recommend taking the arguments seriously and considering the complexities of the replication issue.
For starters, concepts of “reproducibility” are often tangled up. Reproducibility of methods, of results, of the conclusions drawn from results (inferential reproducibility), strict replication, etc., must each be addressed separately. Do we mean a repetition of the effect size, of the p value, of statistical significance in general? And naturally, reproducibility is context-dependent. That is where “implicit knowledge” fits in, but far more important is the robustness of results, that is, their external validity. Hanno Würbel has pointed out the paradox of the standardization fallacy: the desire for reproducibility often leads to a call for more standardization. This, however, and herein lies the paradox, is a fallacy, for with higher standardization, reproducibility in other laboratories will diminish! In 1935 Ronald Fisher, the founding father of the frequentist statistics we so revere, put it this way: a highly standardized experiment supplies direct information only in respect of the narrow range of conditions achieved by standardization; standardization therefore weakens rather than strengthens our ground for inferring a like result when, as is invariably the case in practice, these conditions are varied. It is precisely in biomedical research that forgoing standardization in this way, or even deliberately introducing variation, improves external validity: when a result from a mouse in Boston cannot be repeated in a genetically identical mouse in Berlin, that does not contradict the correctness and quality of the findings in Boston. It does, however, cast doubt on their translatability to humans.
The replication of one’s own findings and those of others is science. For one thing, and herein lies the misunderstanding in interpreting Thomas Kuhn, both “normal science” (i.e., what most of us do) and the research that leads to paradigm shifts (i.e., what serendipity and brilliant researchers produce) must rest decisively on results that are replicable. Both reproduction and possible non-reproduction lead to scientifically relevant results. A competent reproduction can strengthen a hypothesis, particularly if carried out successfully with variation of methodological details. If the design of the reproduction is changed so that alternative methodological approaches are consciously chosen (e.g., manipulating the gene of interest by means of RNA interference instead of using a knockout mouse), which is called triangulation, the results promise to be even more robust. On the other hand, a non-reproduction can lead to new knowledge about modifying factors.
In no case should non-reproduction lead to stigmatization. Myriad factors can have caused it, several of which I have sketched above on behalf of the replication critics. And now a warning to those who find the debate irrelevant because they “have of course always replicated their own results”. An effect that was just barely significant at the 0.05 level, and that does in fact reflect a “true” result, will come out significant in a strict replication (same experiment, same sample size, etc.) with only about 50% probability. The replication experiment could just as well be replaced by the toss of a coin, saving animals and resources (see my previous post)!
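That 50% figure can be checked with a quick simulation. A minimal sketch, under assumed conditions: a one-sample z-test with known variance, where the true effect is placed exactly at the two-sided 5% critical value, mimicking an original study that was just barely significant; the sample size, number of runs and seed are arbitrary choices of mine, not from the text.

```python
import math
import random

random.seed(1)

n = 30              # sample size per replication (arbitrary)
sigma = 1.0         # known standard deviation
z_crit = 1.96       # two-sided 5% critical value
# True effect placed exactly at the critical value: this mimics an
# original study whose result was "just barely" significant at 0.05.
delta = z_crit * sigma / math.sqrt(n)

reps = 20000
hits = 0
for _ in range(reps):
    sample_mean = sum(random.gauss(delta, sigma) for _ in range(n)) / n
    z = sample_mean / (sigma / math.sqrt(n))
    if abs(z) > z_crit:
        hits += 1

frac = hits / reps
print(f"significant replications: {frac:.2f}")  # hovers around 0.5
```

The replication z-statistic is centered on the critical value itself, so roughly half of all strict replications land on the significant side: the coin toss of the text.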
The goal of science is not replicability; it is new knowledge. But new knowledge must be reproducible. Karl Popper again: “Non-reproducible single occurrences are of no significance to science.” Testing exciting hypotheses at the forefront of knowledge necessarily generates many false positive findings, even in research of the highest quality. These false positives, however, must be weeded out in subsequent, competent experiments. Reproduction is therefore a noble, highly scientific activity. The dichotomous question “Was the result reproduced?” is unsuitable, because reproducibility does not follow a simple yes/no scheme. It honors a scientist when others attempt to reproduce his or her results, because it means those results are important. And when they are not reproduced, that is when the science really starts, for many questions must then be answered: Is there an effect, and is only the p value wrong? What happens when I combine the original experiment and the replication in a meta-analysis? Is there interesting biology in this? Or a yet undiscovered error? Etc.
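The meta-analysis question above has a standard answer: pool the original and replication effect estimates with inverse-variance weights (a fixed-effect meta-analysis). A minimal sketch, with hypothetical effect sizes and standard errors that are my own illustrative numbers, not from any study discussed here:

```python
import math

def fixed_effect_meta(effects, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis.
    Returns (pooled effect, pooled standard error)."""
    weights = [1.0 / se ** 2 for se in ses]       # precision weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))     # precisions add up
    return pooled, pooled_se

# Hypothetical numbers: original study and one replication
effects = [0.60, 0.25]   # e.g. standardized mean differences
ses     = [0.28, 0.20]   # their standard errors
est, se = fixed_effect_meta(effects, ses)
print(f"pooled effect: {est:.3f} +/- {se:.3f}")  # pooled effect: 0.368 +/- 0.163
```

Note the two properties that matter for the argument in the text: the pooled estimate lies between the two studies, and its standard error is smaller than either one, so an apparent “failed” replication can still leave an overall effect standing.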
The reward should go to those whose results are worth trying to replicate, and to the scientists who perform such experiments. And that only works when the methods and results of studies are described so thoroughly that one can actually cook according to the recipe!
A German version of this post has been published as part of my monthly column in the Laborjournal: