I failed to reproduce the results of my experiments! Some of us are haunted by this horror vision. The scientific academies, the journals and in the meantime the sponsors themselves are all calling for reproducibility, replicability and robustness of research. A movement for “reproducible science” has developed. Sponsorship programs for the replication of research papers are now in the works.In some branches of science, especially in psychology, but also in fields like cancer research, results are now being systematically replicated… or not, thus we are now in the throws of a “reproducibility crisis”.
Now Daniel Fanelli, a scientist who up to now could be expected to side with those who support the reproducible science movement, has raised a warning voice. In the prestigious Proceedings of the National Academy of Sciences he asked rhetorically: “Is science really facing a reproducibility crisis, and if so, do we need it?” So todayon the eve, perhaps, of a budding oppositional movement, I want to have a look at some of the objections to the “reproducible science” mantra. Is reproducibility of results really the fundament of scientific methods? Continue reading
Tuberculosis kills far more than a million people worldwide per year. The situation is particularly problematic in southern Africa, eastern Europe and Central Asia. There is no truely effective vaccination for tuberculosis (TB). In countries with a high incidence, a live vaccination is carried out with the diluted vaccination strain Bacillus Calmette-Guérin (BCG), but BCG gives very little protection against tuberculosis of the lungs, and in all cases the vaccination is highly variable and unpredictable. For years, a worldwide search has been going on for a better TB vaccination.
Recently, the British Medical Journal has published an investigation in which serious charges have been raised against researchers and their universities: conflicts of interest, animal experiments of questionable quality, selective use of data, deception of grant-givers and ethics commissions, all the way up to endangerment of study participants. There was also a whistle blower… who had to pack his bags. It all happened in Oxford, at one of the most prestigious virological institutes on earth, and the study on humans was carried out on infants of the most destitute layers of the population. Let’s have a closer look at this explosive mix in more detail, for we have much to learn from it about
- the ethical dimension of preclinical research and the dire consequences that low quality in animal experiments and selective reporting can have;
- the important role of systematic reviews of preclinical research, and finally also about
- the selective (or non) availability and scrutiny of preclinical evidence when commissions and authorities decide on clinical studies.
Recently, NIH Scientists B. Ian Hutchins and colleagues have (pre)published “The Relative Citation Ratio (RCR). A new metric that uses citation rates to measure influence at the article level”. [Note added 9.9.2016: A peer reviewed version of the article has now appeared in PLOS Biol]. Just as Stefano Bertuzzi, the Executive Director of the American Society for Cell Biology, I am enthusiastic about the RCR. The RCR appears to be a viable alternative to the widely (ab)used Journal Impact Factor (JIF).
The RCR has been recently discussed in several blogs and editorials (e.g. NIH metric that assesses article impact stirs debate; NIH’s new citation metric: A step forward in quantifying scientific impact? ). At a recent workshop organized by the National Library of Medicine (NLM) I learned that the NIH is planning to widely use the RCR in its own grant assessments as an antidote to JIF, raw article citations, h-factors, and other highly problematic or outright flawed metrics. Continue reading
Using metaanalysis and computer simulation we studied the effects of attrition in experimental research on cancer and stroke. The results were published this week in the new meta-research section of PLOS Biology. Not surprisingly, given the small sample sizes of preclinical experimentation, loss of animals in experiments can dramatically alter results. However, effects of attrition on distortion of results were unknown. We used a simulation study to analyze the effects of random and biased attrition. As expected, random loss of samples decreased statistical power, but biased removal, including that of outliers, dramatically increased probability of false positive results. Next, we performed a meta-analysis of animal reporting and attrition in stroke and cancer. Most papers did not adequately report attrition, and extrapolating from the results of the simulation data, we suggest that their effect sizes were likely overestimated.
And these were our recommendations: Attrition of animals is often unforeseen and does not reflect willful bias. However, there are several simple steps that the scientific community can use to diminish inferential threats due to animal attrition. First, we recommend that authors prespecify inclusion and exclusion criteria, as well as reasons for exclusion of animals. For example, the use of flowcharts to track animals from initial allocation until analysis, with attrition noted, improves the transparency of preclinical reporting. An added benefit of this approach lies in the ability to track systemic issues with experimental design or harmful side effects of treatment. Journal referees can also encourage such practices by demanding them in study reports. Finally, many simple statistical tools used in medicine could be adopted to properly impute (and report) missing data. Overall, compliance with ARRIVE guidelines will aid in most, if not all, of the issues inherent to missing data in preclinical research and help structure a better standard for animal use and reporting. (Click here to access the full article, and here for Bob Siegerink’s blog post about it).
Nature ran a feature on the article, which was massively covered by the lay press, in interviews, and in blogs. For example:
here is a Hungarian one
and a Spanish
and an Argentinan:
or a Swiss (Italian)
Biomedicine currently suffers a ‚replication crisis‘: Numerous articles from academia and industry prove John Ioannidis’ prescient theoretical 2005 paper ‘Why most published research findings are false’ (Why most published research findings are false) to be true. On the positive side, however, the academic community appears to have taken up the challenge, and we are witnessing a surge in international collaborations to replicate key findings of biomedical and psychological research. Three important articles appeared over the last weeks which on the one hand further demonstrated that the replication crisis is real, but on the other hand suggested remedies for it:
Two consortia have pioneered the concept of preclinical randomized controlled trials, very much inspired by how clinical trials minimize bias (prespecification of a primary endpoint, randomization, blinding, etc.), and with much improved statistical power compared to single laboratory studies. One of them (Llovera et al.) replicated the effect of a neuroprotectant (CD49 antibody) in one, but not another model of stroke, while the study by Kleikers et al. failed to reproduce previous findings claiming that NOX2 inhibition is neuroprotective in experimental stroke. In Psychology, the Open Science Collaboration conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials when available. Disapointingly but not surprisingly, replication rates were low, and studies that replicated did so with much reduced effect sizes.
In 2005 PLOS Medicine published John Ioannidis’ paper ‘Why most published research findings are false’ . The article was a wake up call for many, and now is probably the most influential publication in biomedicine of the last decade (>1.14 Mio views on the PLOS Med webside, thousands of citations in the scientific and lay press, featured in numerous blog posts, etc.). Its title has never been refuted, if anything, it has been replicated, for examples see some of the posts of this blog. Almost 10 years after, Ioannidis now revisits his paper, and the more constructive title ‘How to make more published research true” (PLoS Med. 2014 Oct 21;11(10):e1001747. doi: 10.1371/journal.pmed.1001747.) already indicates that the thrust this time is more forward looking. The article contains numerous suggestions to improve the research enterprise, some subtle and evolutionary, some disruptive and revolutionary, but all of them make a lot of sense. A must read for scientists, funders, journal editors, university administrators, professionals in the health industry, in other words: all stakeholders within the system!
Statistical power is a rare commodity in experimental biomedicine (see previous post), as most studies have very low n’s and are therefore severly underpowered. The concept of statistical power, although almost embarrassingly simple (for a very nice treatment see Button et al.), is shrouded in ignorance, mysteries and misunderstandings among many researchers. A simple definition states that Power is the probability that, given a specified true difference between two groups, the quantitative results of a study will be deemed statistically significant. The most common misunderstanding may be that power should only be a concern to the researcher if the Null hypothesis could not rejected (p>0.05). I need to deal with this dangerous fallacy in a future post. Another common albeit less perilous misunderstanding is that calculating post-hoc (or ‘retrospective )’ power can explain why an analysis did not achieve significance. Besides proving a severe bias of the researcher towards rejecting the Null hypothesis (‘There must be another reason for not obtaining a significant result than that the hypothesis is incorrect!), this is the equivalent of a statistical tautology. Of course the study was not powerful enough, this is why the result was not significant! To look at this from another standpoint: Provided enough n’s, the Null of every study must be reject. This by the way, is one of the most basic criticisms of Null hypothesis significance testing. Power calculations are useful for the design of studies, but not for their analysis. This was nicely explained by Steven Goodman in his classic article ‘Goodman The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results Ann IntMed 1994‘:
First, [post-hoc Power analysis] will always show that there is low power (< 50%) with respect to a nonsignificant difference, making tautological and uninformative the claim that a study is “underpowered” with respect to an observed nonsignificant result. Second, its rationale has an Alice-in-Wonderland feel, and any attempt to sort it out is guaranteed to confuse. The conundrum is the result of a direct collision between the incompatible pretrial and post-trial perspectives. […] Knowledge of the observed difference naturally shifts our perspective toward estimating differences, rather than deciding between them, and makes equal treatment of all nonsignificant results impossible. Once the data are in, the only way to avoid confusion is to not compress results into dichotomous significance verdicts and to avoid post hoc power estimates entirely.
NB: To avoid misunderstandings: Calculating the n’s needed in future experiments to achieve a certain statistical power based on effect sizes and variance obtained post – hoc from a (pilot) experiment is not called post-hoc power analysis (and the subject of this post), but rather sample size calculation.
For further reading: