Have you ever wondered what percentage of your scientific hypotheses are actually correct? I do not mean the rate of statistically significant results you get when you dive into new experiments. I mean the rate of hypotheses that were confirmed by others, or that postulated a drug which in fact proved effective in other labs or even in patients. Unfortunately, only very few studies are independently repeated nowadays (more on that later), and even long-established therapies are often withdrawn from the market as ineffective or even harmful. You can only hope to approximate such a rate of “success”, and that is exactly what I will now attempt to do. You may wonder why I am posing this apparently esoteric question. It is because knowing roughly what percentage of hypotheses actually prove to be correct would have wide-reaching consequences for evaluating research results, both your own and those of others. This question has an astonishing but direct relevance to the discussion of the current crisis in biomedical science. Indeed, a ghost is haunting biomedical research!
It is becoming increasingly certain that most study results in biomedicine and psychology cannot be reproduced. According to a recent Nature survey, 90% of scientists believe that we are in the middle of a “reproducibility crisis”. I am certainly convinced of it. But what does reproducibility in biomedical research actually mean? Replication of the p-value, of the effect size, or the subjective assessment of experts as to whether a “replication” has been achieved? How many results can or should really be reproducible?
The “crisis” started with two articles from the pharmaceutical industry. Scientists at Amgen and Bayer were able to duplicate results from university laboratories in only 10-20% of the studies they repeated, most of which had been published in high-ranking journals. Not without reason were the authors criticized for neither having revealed their criteria for a successful replication nor identified the studies that they had investigated. In addition, it was industry that had a problem with results it had received from the universities. To some researchers in academia it soon became clear why these replications had been doomed to failure: postdocs not good enough for a career in academia were ending up in industry, and were incapable there as well. Non-replication as a result of non-competence. In the meantime, however, a number of very well planned, systematic initiatives in academia (e.g. in psychology or cancer research) were able to replicate only a disappointingly small portion of the results of studies published in high-ranking journals. Now we even read in the papers that science is in a crisis. A high-ranking administrator of the DFG commented: of course it is true that 80% of findings cannot be reproduced, but the remaining 20% were all sponsored by us! If only it were that easy.
How many results should actually be reproducible to satisfy us: 80, 90, or even 100%? This is where things start to get interesting, if unfortunately also a bit complicated. To get anywhere at all with this we need some basic statistics and a pinch of epistemology! In 2005 John Ioannidis made the seemingly preposterous (and up to now unrefuted) claim that most published results in biomedicine must be wrong, and therefore also not reproducible. His first argument: low quality in study design, analysis and reporting, and hence distortion (bias) of the results. The list of problems is long, and includes failure to use blinding and randomization, selective choice of data, and failure to publish negative results. His other argument: low statistical power due to an inadequate number of subjects (patients, animals, dishes…). Remember that power describes the probability with which a correct hypothesis can be confirmed in an experiment. That both bias and inadequate power are common, and that they lead to an inflation of false-positive results and to exaggerated effect sizes, has in the meantime been verified by meta-research. I am convinced that this is a very important systematic (and systemic) source of non-reproducibility. And, by the way, a source of great difficulties in translating fantastic new treatment strategies from animal models into therapies effective in humans.
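To get a feeling for how strongly power depends on the number of subjects, here is a minimal sketch. It is my own illustration, not taken from the literature cited here, and it rests on simplifying assumptions: a two-sided comparison of two equally sized groups, a normal approximation with known variance, and a hypothetical standardized effect size of 0.5. The function name `two_sample_power` and the group sizes are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (z-test).

    Assumptions (illustrative only): equal group sizes, known variance,
    and effect_size given in standard deviation units.
    """
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the test statistic if the effect is real.
    shift = effect_size * sqrt(n_per_group / 2.0)
    # Probability that the statistic falls beyond either critical value.
    return std_normal.cdf(shift - z_critical) + std_normal.cdf(-shift - z_critical)


# Hypothetical group sizes: a typical small experiment vs. a larger one.
for n in (10, 64):
    print(f"n = {n} per group: power = {two_sample_power(0.5, n):.2f}")
```

Under these assumptions, 10 subjects per group yield a power of only roughly 20%, whereas about 64 per group are needed to reach roughly 80%.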
But back to our fundamental opening question: How many results should be reproducible? In a scientific utopia where all bias has been removed and statistical power is 100%, would all studies be reproducible? Certainly not! Does knowledge not move forward at least partially through the refutation of what was hitherto accepted? Hans-Jörg Rheinberger cites the “differential replication of experimental systems” as the essential moment of scientific progress. Accordingly, in time, every result will be only “differentially”, i.e. only partially, replicable. Scientists must err; they are not infallible.
But how about you? How many of the hypotheses in your studies prove “correct” and should therefore be replicable? When this question is posed, most colleagues answer after brief hesitation with a percentage far above 50%. After all, we are all good scientists. But would it not be tragic if a high percentage of our hypotheses were to prove correct? That would give cause to suspect that most of the hypotheses you researched were trivial! That so much of it was already accepted knowledge; that the next small step forward you made could have been predicted with high certainty. How boring!
Fortunately, far fewer than 50% of our hypotheses are correct. Formal investigations arrived at a figure closer to 10%. This would have far-reaching consequences. It would mean, for example, that at the common significance level of 5% (p < 0.05) and the statistical power required in clinical studies (80%, which however is hardly ever achieved in preclinical experimental studies), more than one third of all statistically significant findings are false positives! Most experimenters hope that they will be wrong in at most 5% of cases, but this hope is in vain. What they often do not know: a p-value does not express the probability with which the tested hypothesis is true or false. This probability depends not only on the significance level but also on power and, very essentially, on the a priori probability of the hypothesis. Now, we do not know the a priori probability of our hypotheses at all, but it is certainly markedly under 100%, for we are not infallible bores. The number of false-positive results is further increased by the fact that most experiments in biomedicine are carried out with much lower power than 80%. The share of false positives among significant findings is therefore presumably over 50%. Greetings from John Ioannidis! What does this have to do with reproduction? Simply this: you cannot reproduce a false-positive finding, except through another false positive!
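The arithmetic behind the “more than one third” can be checked in a few lines. The sketch below is a minimal illustration that uses the figures from the text (10% of hypotheses true, significance level 5%, power 80%) and assumes no bias at all. The function name `false_positive_share` and the lower power values of 50% and 20% are my own choices for illustration.

```python
def false_positive_share(prior, alpha=0.05, power=0.80):
    """Share of statistically significant findings that are false positives.

    prior: fraction of tested hypotheses that are actually true
    alpha: significance level (chance that a false hypothesis tests positive)
    power: chance that a true hypothesis tests positive
    """
    true_positives = prior * power
    false_positives = (1.0 - prior) * alpha
    return false_positives / (true_positives + false_positives)


# 10% of hypotheses true, as in the text; the lower power values are
# illustrative assumptions for underpowered preclinical experiments.
for power in (0.80, 0.50, 0.20):
    share = false_positive_share(prior=0.10, power=power)
    print(f"power {power:.0%}: {share:.0%} of significant findings are false positives")
```

Out of 1,000 hypotheses, 100 are true and 80 of these are detected, while 45 of the 900 false ones also turn out significant: 45 of 125 significant findings, or 36%, are false positives. At a power of 20% the share rises to about 69%.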
That also makes it clear that, by its very nature, explorative research must necessarily generate non-replicable findings unless it is dealing with trivialities, even when it is entirely free of bias and has sufficient power. Could its accuracy then be significantly improved by repeating the “positive” experiment? Unfortunately not, unless the number of subjects (experiments) is significantly increased. This unpleasant truth is also known to few: if a study based on a true hypothesis yields a result that is significant close to the 5% level (e.g. p = 0.049), and it is repeated with the same experimental setup and number of subjects, the probability of reaching statistical significance again at the same level is only about 50%. Whoever has understood that has to reach the seemingly insane conclusion that under these circumstances it is better to toss a coin to achieve a “reproduction” of a finding than to kill all those rats and mice.
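Where does the 50% come from? A minimal sketch, under assumptions that are mine and not spelled out above: the test statistic is approximately normal (a z-test), the true effect is exactly as large as the one observed in the first study, and the repeat counts as a success only if it is significant in the same direction. The function name `replication_probability` is made up for the example.

```python
from statistics import NormalDist


def replication_probability(p_observed, alpha=0.05):
    """Chance that an identical repeat is significant again.

    Assumes (illustratively) a two-sided z-test and a true effect equal
    to the effect observed in the original study.
    """
    std_normal = NormalDist()
    z_observed = std_normal.inv_cdf(1.0 - p_observed / 2.0)
    z_critical = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # The repeat's z-value scatters around z_observed with standard
    # deviation 1; it "replicates" if it exceeds the critical value.
    return 1.0 - std_normal.cdf(z_critical - z_observed)


for p in (0.049, 0.01, 0.001):
    print(f"original p = {p}: replication probability = {replication_probability(p):.2f}")
```

A result that just barely reached p = 0.049 sits almost exactly on the critical value, so a repeat lands above or below it with about equal probability. Only with a much smaller original p-value, or with more subjects, do the odds improve noticeably.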
Paradoxical as it may seem, could it be that a low replication rate signals particularly “hot”, thrilling new knowledge, especially when experiments are pressing forward into the unknown and even casting doubt on textbook knowledge? That there is such a thing as necessary, or benign, non-reproduction? I think there is, but in the current literature, in which bias and low power are rampant, it is difficult to distinguish it from “malignant non-reproduction”. To change that we must put an end to experiments with insufficient n’s, with no blinding or randomization, with cherry-picking of data, erroneous statistics and no publication of neutral or negative results.
We can now draw a few simple conclusions which, if implemented, would have powerful effects. For one, it is high time to minimize the “malignant non-replication”. That means a lot of work. What is the situation in your area of activity? When reviewing papers, do you carefully scrutinize the measures taken to reduce bias, spectacular results obtained with only a few subjects or animals, or unexplained asymmetric group sizes? Do you publish results that have not confirmed your hypotheses?
We have to live with a certain rate of non-replication. This rate will be highest where science really is at the cutting edge. It also means that we should put more emphasis on independent confirmation, as we need to find the true positives among the many false positives which we have accumulated in exploration. In a recent commentary published in Nature, Jeffrey Mogil and Malcolm Macleod propose that top journals publish preclinical studies with spectacular and clinically important basic findings only if these are accompanied by a confirmation!
For the improvement we need in the quality of our research, the most important conclusion from the above is perhaps that confirmation must not be stigmatized as second-class research but rather supported and rewarded. In the review process, in selection and appointment committees etc., confirmation must not be dismissed as industrious but boring addenda. High-quality confirmation, and only high-quality confirmation, will steer us out of the reproducibility crisis. It is an intellectual challenge, methodologically complex and resource-intensive.
A German version of this post has been published as part of my monthly column in the Laborjournal:
Selected literature related to this post can be found here.