On March 17, 2015, five panelists from cognitive neuroscience and psychology (Sam Schwarzkopf, Chris Chambers, Sophie Scott, Dorothy Bishop, and Neuroskeptic) publicly debated “Is science broken? If so, how can we fix it?”. The event was organized by Experimental Psychology, UCL Division of Psychology and Language Sciences / Faculty of Brain Sciences in London.
The debate revolved around the ‘reproducibility crisis’, and covered false positive rates, replication, faulty statistics, lack of power, publication bias, study preregistration, data sharing, peer review, you name it. Understandably, the event caused a stir in the press, journals, and the blogosphere (Nature, BioMed Central, Aidan’s Aviary, The Psychologist, etc.).
Remarkably, some of the panelists (notably Sam Schwarzkopf) respectfully opposed the current ‘crusade for true science’ (to which I must confess I subscribe) by arguing that science is not broken at all; rather, by trying to fix it we run the risk of wrecking it for good. Already a few days before the official debate, he and Neuroskeptic had started to exchange arguments on Neuroskeptic’s blog. While both parties appear to agree that science can be improved, they completely disagree in their analysis of the current status of the scientific enterprise, and consequently also on the action points.
This predebate exchange directed my attention to a blog which was run by Sam Schwarzkopf, or rather his alter ego, the ‘Devil’s neuroscientist’, for a short but very productive period. Curiously, the Devil’s neuroscientist retired from blogging the night before the debate, announcing that there would be no future posts! This is sad because, albeit somewhat aggressively, the Devil’s neuroscientist argued very much to the point in trying to debunk the theses that there is any reproducibility crisis, that science is not self-correcting, that studies should be preregistered, and so on. In other words, he was arguing against most of the issues raised, and remedies suggested, also on my pages. In passing, he provided a lot of interesting links to proponents on either side of the fence. Although I do not agree with many of his conclusions, his is by far the most thoughtful treatment of the subject. When I discuss the problems of the current model of biomedical research with fellow scientists who dismiss them, I usually get rather unreflective comments: they simply celebrate the status quo as the best of all possible worlds and do not get beyond the statement that there may be a few glitches, but that the model has evolved over centuries of undeniable progress. “If it’s not broken, don’t fix it.”
The Devil’s blog stimulated me to produce a short summary of key arguments of the current debate, to organize my own thoughts and as a courtesy to the busy reader.
(I have italicized arguments against the notion that science is broken or has a problem; my responses are in regular type.)
1. “Replication is not science”
“The likeliest explanation for any failed replication will always be that the replicator bungled something along the way”.
This is an interesting argument (in reality it is nothing but an allegation), because the same opponents of replication usually lament that a major problem with the discussion about non-reproducibility is that it taints scientists as well as science in general.
“In any experiment there is a lot of tacit knowledge. Without it, replication may fail”.
If the description of an experiment does not permit its replication, what is its use anyway?
“Replication efforts reflect a strong expectation that findings are not robust, and are the result of bad science. The psychology of failed replication is such that it suggests that the initial finding was not ‘real’.”
Unfortunately true. We should try to overcome this attitude.
“Non-replication may be a false negative. The chance to replicate a ‘true’ (= non-null) finding at alpha = 0.05 and a power of 80% is only 50% (see earlier post)!”
True, but this is just another illustration of why the current type I and II error levels are flawed.
“There is a strong asymmetry between positive and negative evidence. Negative evidence can never establish the absence of a phenomenon.”
In a very basic (epistemological) way this is true. But if we went by this verdict, no claim could ever be refuted, however unlikely or preposterous it may be. We would have missed the Enlightenment and the scientific revolution…
“Replication efforts are not science”.
It depends on your definition of science. If you define science purely as ‘to boldly go where no man has gone before’, this is true. But science to a large degree inherently is replication. Interestingly, elsewhere in the blog (actually her blog, because she claims to be a female Devil) the Devil’s neuroscientist is a crusader for replication, just not for its own sake. She correctly points out that replication is at the heart of science, and that we should incorporate even more replications in our experiments. For the molecular biologist, science historian, and epistemologist Hans-Jörg Rheinberger, science progresses by the ‘differential reproduction of experimental systems’; he thus locates replication in the DNA of all scientific endeavors.
“As a rule, studies that produce null results, including preregistered studies, should not be published” (Jason Mitchell, ‘On the emptiness of failed replications’) [name corrected].
I fear many scientists think like Jason Mitchell, and this is indeed part of the problem. See the other arguments on this page, in particular the fact that science builds on replication. Null results are the pedestals and building blocks of the non-null results which emerge from them. It is easy to demonstrate that in current biomedicine null results are much more likely to be true than non-null results (see previous post). We need to make them available to other researchers as a foundation for their research, and for their scrutiny.
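Why null results are ‘much more likely true’ follows from a back-of-the-envelope Bayes calculation. The numbers below (10% of tested hypotheses true, alpha = 0.05, power = 0.8) are assumptions chosen for illustration, not measured values:

```python
# Hypothetical but plausible parameters of a biomedical research field
prior = 0.10   # assumed fraction of tested hypotheses that are true
alpha = 0.05   # type I error rate (two-sided significance threshold)
power = 0.80   # 1 - type II error rate

# P(hypothesis true | significant result): positive predictive value
ppv = (prior * power) / (prior * power + (1 - prior) * alpha)

# P(hypothesis false | null result): how trustworthy a negative is
npv = ((1 - prior) * (1 - alpha)) / (
    (1 - prior) * (1 - alpha) + prior * (1 - power))

print(f"P(true | positive) = {ppv:.2f}")   # 0.64
print(f"P(false | negative) = {npv:.2f}")  # 0.98
```

Under these assumptions a significant result is correct only about two times out of three, while a null result is correct about 98% of the time.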
As a side note, and importantly, in the replication (and preregistration) discussion it is critical that we discriminate between exploratory and confirmatory experiments (see previous post).
2. “Preregistration puts science in chains”
Preregistration (in particular of confirmatory studies) has been advocated (and already implemented) to prevent scientists from cherry-picking data, hiding null results, failing to employ appropriate statistical power and methodology, or reinventing the aims of studies after they have been completed to make it look as though unexpected findings were predicted. The basic concept is borrowed from clinical trials, which cannot be published in major journals without preregistration in public databases (e.g. clinicaltrials.gov).
Arguments against preregistration include:
“It limits the interpretation of the data, or makes papers ‘one-dimensional’.”
“In the (conceptual) preregistration phase reviewers have nothing to judge on, hence go for reputation of the submitting team.”
“It might prevent us from correcting mistakes made along the way (because we are bound by the ‘protocol’).”
“It overemphasizes ‘hypothesis testing’ (intervention vs. observation).”
A detailed account of these arguments can be found here and in ‘Pre-registration would put science in chains’ by Sophie Scott.
Most if not all of these arguments only work against the background of a scary scenario in which every study (exploratory or confirmatory) would have to be preregistered to be published, and protocols would be absolutely binding. But this is absurd: nobody is actually arguing for such a model. I think preregistration has its merits when clearly defined hypotheses are tested and the replication of pivotal results is attempted. There are already excellent examples of this approach.
3. “A priori power calculation reveals unachievably large sample sizes”
Most current biomedical research is grossly underpowered. It is fairly easy to demonstrate that false positives must abound, while null results, which are in fact more robust than non-null ones, do not get published (see also above, Jason Mitchell: null results should not be published). Outside the clinical trial realm, many researchers are unfamiliar with the concept of statistical power. In addition, the majority of researchers confuse p-values with false discovery rates (see also David Colquhoun’s recent article). They believe that accepting p-values smaller than 0.05 implies that the risk of erroneously accepting their hypothesis (i.e. rejecting the null) is below 5%, which they find soothing and acceptable. Those lucky few who know about the type II error and actually calculate power a priori often engage in what has been termed the ‘sample size samba’: they set detectable effect sizes or sample sizes not according to biological reasoning (‘What effect size would be biologically meaningful?’), but such that the experiment can be performed within the limits of the available resources.
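The confusion between p-values and false discovery rates can be made concrete with a small simulation. The scenario below (10% of tested hypotheses true, roughly 80% power, two-sided alpha = 0.05) is an assumption for illustration only:

```python
import random
from statistics import NormalDist

random.seed(1)  # reproducible
z_crit = NormalDist().inv_cdf(0.975)  # two-sided alpha = 0.05

# Assumed scenario (illustrative, not measured): 10% of tested
# hypotheses are true, and true effects are studied at ~80% power.
n_experiments = 100_000
prior = 0.10
shift = 2.80  # mean z-statistic under H1; gives roughly 80% power

false_pos = true_pos = 0
for _ in range(n_experiments):
    effect_real = random.random() < prior
    z = random.gauss(shift if effect_real else 0.0, 1.0)
    if abs(z) > z_crit:  # a 'significant' finding, p < 0.05
        if effect_real:
            true_pos += 1
        else:
            false_pos += 1

# Among 'significant' findings, the fraction that are false
fdr = false_pos / (false_pos + true_pos)
print(f"false discovery rate: {fdr:.0%}")  # analytically ≈ 36% here, not 5%
```

Even at respectable power, roughly a third of the ‘discoveries’ in this scenario are false: about seven times the 5% that the significance threshold seems to promise.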
Unfortunately, besides teaching statistics better, the only remedy for underpowered experiments is more power (larger n’s, less variance).
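What ‘more power’ costs can be sketched with a standard a priori sample size calculation for comparing two group means. This uses the normal approximation; an exact t-test calculation requires one or two more subjects per group:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect a standardized
    effect size d in a two-group comparison (normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2)

for d in (0.2, 0.5, 0.8, 1.5):
    print(f"d = {d}: ~{n_per_group(d)} per group")
# → d=0.2: 393, d=0.5: 63, d=0.8: 25, d=1.5: 7 per group
```

Detecting a medium effect (d = 0.5) already requires about 63 subjects per group; a small effect (d = 0.2) requires close to 400, far beyond most preclinical studies.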
The common argument against this is that performing appropriately powered experiments is impractical: more power means more experiments and more resources (which we don’t have), and prolonged projects and PhD theses. The most sophisticated elaboration of this argument I found in the DataColada blog of Uri Simonsohn, under the heading ‘We cannot afford to study effect size in the lab’.
There, the point is made (and quantified with downloadable R code) that realistic sample sizes are too small to discriminate between d’s of 0.2, 0.5, or even 0.8, and that only if power approaches 1 (at n > 3000) can we discriminate between small, medium, and large effect sizes.
This is certainly true. And a corollary of this is that if we work with n > 3000, besides obtaining a fairly robust estimate of the effect size, we will practically always reject the null! But then this argument is as impractical as huge sample sizes. Although it may be desirable to have narrow confidence intervals around effect sizes, we use power analysis to estimate the sample sizes needed to detect an effect of a certain magnitude or greater (at given levels of alpha and beta). It is true that for most experimental setups this means that, for practical reasons, we are stuck with very large effect sizes. In fact, most current preclinical studies have sample sizes that allow the detection only of huge effects (d > 1.5). We need to get better at this, and that means investing resources (funders!) and improving experimental designs. In addition, I would argue that even under current circumstances sample size and power calculation is a useful exercise, as it demonstrates the limitations of our experimental design and curtails overinterpretation of the results.
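The d > 1.5 claim can be checked by inverting the power formula: given a per-group n, what is the smallest standardized effect detectable at alpha = 0.05 and 80% power? A sketch under the usual two-sample normal approximation:

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest standardized effect size detectable with the given
    per-group n in a two-group comparison (normal approximation)."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * sqrt(2 / n_per_group)

for n in (5, 8, 10, 20):
    print(f"n = {n} per group: minimal detectable d ≈ "
          f"{min_detectable_d(n):.2f}")
# → n=5: 1.77, n=8: 1.40, n=10: 1.25, n=20: 0.89
```

With the 5–10 animals or subjects per group common in preclinical work, only effects of roughly d ≈ 1.3–1.8 are reliably detectable.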
4. “Science is always self-correcting”
A common argument against efforts to improve the quality of research is that we do not need them in the first place, as “given sufficient time and resources science will self-correct”.
Best-practice examples are quoted, such as the refutation of phrenology, or, in 2014, the withdrawal of the claimed detection of gravitational waves from the early universe (BICEP2 experiment). Scientists know best and should be patient. As the Devil’s neuroscientist puts it: “[…] it seems somewhat ironic that we as scientists should find it so difficult to understand that science is a gradual, slow process.”
Yes. Science is about questioning the knowledge of the day, and about refining or refuting conclusions and theories. But this cannot be an argument for accepting known and preventable flaws (bias, deficient statistics, incomplete reporting, etc.). In addition, in a previous post (‘Should we trust scientists?’) I have argued that it may be idealistic to believe that scientists exclusively and altruistically strive to explain the world. While this may be an important motive, science is riddled with biases, conflicts, and vested interests, for example because certain types of results help researchers accumulate more of the currency which advances their careers. It is a non sequitur to counter the quest to improve the quality of research with an ideal of science: the argument merely contrasts current practice with that ideal.
5. “Most research findings are false”
This is a pun on the title of a paper by John Ioannidis (‘Why most published research findings are false’), a killer phrase meant to discredit those who argue that current biomedical research is not robust, discuss potential reasons, and propose remedies.
A related argument is to accept that 80% or more of research findings are false, but then to turn this around and celebrate the fact that 20% of our research is correct: isn’t that a remarkable success rate?
Ioannidis and many others have quantitatively demonstrated that due to correctable flaws in the system (bias, low power, etc.) we waste resources and potentially put patients in danger (forget about the exact 80% figure). The claim that this isn’t such a bad track record after all insinuates that we shouldn’t worry, because the history of science tells us that all theories are at some point superseded (and thereby sometimes rejected) by the progress of future generations of scientists. Come on! The trick of this argument is to subliminally change the subject, from exposing questionable research practices to epistemology.
6. “The crusaders are only tinkering with the symptoms”
Here comes another argument against focusing on irreproducibility, bias, flawed statistics, etc.:
‘The crusaders’ are stuck at the symptoms instead of attacking the root of the problem. Interestingly, those who use this argument almost never go on to name such a root cause themselves. It is an unfair argument for two reasons. First, many of the articles concerned with the current state of biomedical research do put their finger on the wound(s): pressure to publish, wrong incentives, deficient training, etc. Second, only by exposing the flaws (the symptoms) will we be able to convince all stakeholders (researchers, university administrators, editors, funders, etc.) to attack the root causes. And this is what is currently happening.
7. “Scientific misconduct is the real problem”
Media attention focuses on spectacular cases of fraud. The teaching of good scientific practice (GSP), and its written rules and regulations, is strongly biased towards the prevention of falsification, fabrication, and plagiarism.
Many researchers who deny that there is a more general problem with the biomedical scientific enterprise argue that we should concentrate on preventing scientific misconduct, as misbehaving researchers taint the public perception of science and scientists.
There is nothing wrong with preventing and exposing misconduct. But while it is clear that published cases of misconduct are on the rise, it is unclear whether this reflects an increased incidence or merely increased scrutiny and media attention. Scientific misconduct is just the tip of an iceberg; if anything, it is a sign or symptom of something bigger. Below the waterline linger the ‘gray area’ of questionable practices as well as an even bigger chunk of clearly faulty practice which is sanctioned because everyone engages in it and journals publish it. In that sense, ‘misconduct’ is a red herring: it distracts us from tackling more widespread practices, and the root causes (see above).