In 2005 PLOS Medicine published John Ioannidis’ paper ‘Why most published research findings are false’. The article was a wake-up call for many, and is now probably the most influential publication in biomedicine of the last decade (more than 1.14 million views on the PLOS Medicine website, thousands of citations in the scientific and lay press, features in numerous blog posts, etc.). Its title has never been refuted; if anything, it has been replicated (for examples, see some of the posts on this blog). Almost 10 years later, Ioannidis revisits his paper, and the more constructive title ‘How to make more published research true’ (PLoS Med. 2014 Oct 21;11(10):e1001747. doi: 10.1371/journal.pmed.1001747) already indicates that the thrust this time is more forward looking. The article contains numerous suggestions to improve the research enterprise, some subtle and evolutionary, some disruptive and revolutionary, but all of them make a lot of sense. A must-read for scientists, funders, journal editors, university administrators, and professionals in the health industry; in other words, all stakeholders within the system!
Statistical power is a rare commodity in experimental biomedicine (see previous post), as most studies have very low n’s and are therefore severely underpowered. The concept of statistical power, although almost embarrassingly simple (for a very nice treatment see Button et al.), is shrouded in ignorance, mysteries and misunderstandings among many researchers. A simple definition states that power is the probability that, given a specified true difference between two groups, the quantitative results of a study will be deemed statistically significant. The most common misunderstanding may be that power should only be a concern to the researcher if the Null hypothesis could not be rejected (p>0.05). I need to deal with this dangerous fallacy in a future post. Another common, albeit less perilous, misunderstanding is that calculating post-hoc (or ‘retrospective’) power can explain why an analysis did not achieve significance. Besides revealing a severe bias of the researcher towards rejecting the Null hypothesis (‘There must be another reason for not obtaining a significant result than that the hypothesis is incorrect!’), this amounts to a statistical tautology. Of course the study was not powerful enough; this is why the result was not significant! To look at this from another standpoint: given a large enough n, the Null of virtually every study will be rejected. This, by the way, is one of the most basic criticisms of Null hypothesis significance testing. Power calculations are useful for the design of studies, but not for their analysis. This was nicely explained by Steven Goodman in his classic article ‘The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results’ (Ann Intern Med 1994):
First, [post-hoc Power analysis] will always show that there is low power (< 50%) with respect to a nonsignificant difference, making tautological and uninformative the claim that a study is “underpowered” with respect to an observed nonsignificant result. Second, its rationale has an Alice-in-Wonderland feel, and any attempt to sort it out is guaranteed to confuse. The conundrum is the result of a direct collision between the incompatible pretrial and post-trial perspectives. […] Knowledge of the observed difference naturally shifts our perspective toward estimating differences, rather than deciding between them, and makes equal treatment of all nonsignificant results impossible. Once the data are in, the only way to avoid confusion is to not compress results into dichotomous significance verdicts and to avoid post hoc power estimates entirely.
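The tautology can be made concrete with a few lines of Python. This is only a sketch under simplifying assumptions that are mine, not Goodman’s: a two-sided z-test, and the observed effect taken at face value as the true effect. The function name observed_power is made up for this illustration. Under these assumptions ‘observed power’ is nothing but a re-expression of the p-value: it is almost exactly one half at p = 0.05 and falls below one half for every nonsignificant result.

```python
from statistics import NormalDist

# Standard normal distribution, used for both quantiles and tail areas.
_STD_NORMAL = NormalDist()


def observed_power(p_value, alpha=0.05):
    """Post-hoc ('observed') power of a two-sided z-test.

    Assumption for illustration: the observed effect is treated as if it
    were the true effect. The result then depends on the p-value alone.
    """
    # z statistic that corresponds to the observed two-sided p-value
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)
    # critical z value for the chosen significance level
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # probability of landing beyond either critical value
    upper_tail = 1 - _STD_NORMAL.cdf(z_crit - z_obs)
    lower_tail = _STD_NORMAL.cdf(-z_crit - z_obs)
    return upper_tail + lower_tail


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.10, 0.30, 0.70):
        print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
```

Because the function takes nothing but the p-value (and alpha) as input, it cannot tell us anything that the p-value has not already told us.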
NB: To avoid misunderstandings: calculating the n’s needed in future experiments to achieve a certain statistical power, based on effect sizes and variances obtained post hoc from a (pilot) experiment, is not post-hoc power analysis (the subject of this post) but rather sample size calculation.
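A minimal sketch of such a sample size calculation, and of why a large enough n will reject almost any Null, might look as follows. It assumes a two-sided comparison of two equally sized groups, uses the normal approximation (an exact t-based calculation gives slightly larger n’s), and expresses the effect as a standardized difference (difference in means divided by the common standard deviation). The function names n_per_group and approx_power are invented for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Animals or subjects needed per group (normal approximation).

    effect_size is the standardized difference between the two groups
    and must be non-zero.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / abs(effect_size)) ** 2)


def approx_power(effect_size, n, alpha=0.05):
    """Approximate power for n per group (the negligible opposite tail is ignored)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(abs(effect_size) * sqrt(n / 2) - z_alpha)


if __name__ == "__main__":
    # Planning: group size needed for 80% power at several effect sizes.
    for d in (1.0, 0.5, 0.2):
        print(f"effect size {d}: n per group = {n_per_group(d)}")
    # With enough n, even a tiny effect becomes 'significant'.
    for n in (10, 1_000, 100_000):
        print(f"effect size 0.05, n = {n}: power = {approx_power(0.05, n):.2f}")
```

The effect size fed into such a calculation is itself only an estimate, and estimates from small pilot experiments tend to be noisy, so the resulting n should be read as a rough guide rather than a guarantee.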
For further reading:
Discrepancies in the publication of clinical trials of bone marrow stem cell therapy in cardiology scale linearly with effect size! This is the shocking but not so surprising result of a study in the BMJ that found over 600 discrepancies in 133 reports from 49 trials. Trials without discrepancies (only 5!) reported neutral results (i.e. no effect of therapy on ejection fraction). The most spectacular treatment effects were found in those trials with the highest number of discrepancies (30 or more).
Steve Perrin, in a recent issue of Nature (Vol 507, p.423), summarizes the struggle of the amyotrophic lateral sclerosis (ALS) field to explain the multiple failures of clinical trials testing compounds to improve the symptoms and survival of patients with this disease. He reports the efforts of the ALS Therapy Development Institute (TDI) in Cambridge, Massachusetts, to reproduce the results of around 100 mouse studies which had yielded promising results. As it turned out, most of them, including the ones that led to clinical trials, could not be reproduced, and where an effect was seen, it was dramatically smaller than the one reported initially. He discusses a number of measures that need to be taken to improve this situation, all of which have been emphasized independently in other fields of biomedicine where bench to bedside translation has failed.
Research on animals generally lacks transparent reporting of study design and implementation, as well as of results. As a consequence of poor reporting, we are facing problems in replicating published findings, publication of underpowered studies with excessive false positives or false negatives, publication bias, and, as a result, difficulties in translating promising preclinical results into effective therapies for human disease. To improve the situation, in 2010 the ARRIVE guidelines for the reporting of animal research (www.nc3rs.org.uk/ARRIVEpdf) were formulated, and they have been adopted by over 300 scientific journals, including the Journal of Cerebral Blood Flow and Metabolism (www.nature.com/jcbfm). Four years later, Baker et al. (PLoS Biol 12(1): e1001756. doi:10.1371/journal.pbio.1001756) have systematically investigated the effect of the implementation of the ARRIVE guidelines on the reporting of in vivo research, with a particular focus on the multiple sclerosis field. The results are highly disappointing:
‘86%–87% of experimental articles do not give any indication that the animals in the study were properly randomized, and 95% do not demonstrate that their study had a sample size sufficient to detect an effect of the treatment were there to be one. Moreover, they show that 13% of studies of rodents with experimental autoimmune encephalomyelitis (an animal model of multiple sclerosis) failed to report any statistical analyses at all, and 55% included inappropriate statistics. And while you might expect that publications in ‘‘higher ranked’’ journals would have better reporting and a more rigorous methodology, Baker et al. reveal that higher ranked journals (with an impact factor greater than ten) are twice as likely to report either no or inappropriate statistics’ (Editorial by Eisen et al., PLoS Biol 12(1): e1001757. doi:10.1371/journal.pbio.1001757).
It is highly likely that other fields in biomedicine have a similarly dismal record. Clearly, there is a need for journal editors and publishers to enforce the ARRIVE guidelines and to monitor their implementation!
The Economist reported that John Ioannidis, together with Steven Goodman, will later this month open the Meta-Research Innovation Center at Stanford University (METRICS). Generously supported by the Buck foundation, it will fight bad science, bias, and lack of evidence in all areas of biomedicine. The institute’s motto is to ‘Identify and minimise persistent threats to medical research quality’. Those who have followed the work of Ioannidis and Goodman know that this is good news indeed! A concise overview of Ioannidis’ research can be found in this online article at Maclean’s.
Due to small group sizes and the presence of substantial bias, experimental medicine produces a large number of false positive results (see previous post). It has been claimed that 50–90% of all results may be false (see previous post). In support of these claims is the staggeringly low number of experiments that can be replicated. But what are the chances of reproducing a finding that is actually true?
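Before turning to that question, the claim about false positives can be made concrete with the positive predictive value (PPV) from Ioannidis’ 2005 paper, here in its simplest form without the bias term: PPV = (power × R) / (power × R + alpha), where R is the pre-study odds that a tested hypothesis is true. The sketch below is only an illustration; the function name and the input values (pre-study odds of 1:10, power of 20% or 80%) are assumptions chosen for the example, not estimates for any particular field.

```python
def positive_predictive_value(prior_odds, power, alpha=0.05):
    """Share of 'significant' findings that are true (no bias term).

    prior_odds: pre-study odds R that a tested hypothesis is true
    power:      probability of detecting a true effect (1 - beta)
    alpha:      probability of a false positive under the Null
    """
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Illustrative inputs only: 1 in 11 tested hypotheses is true (R = 0.1).
    for power in (0.2, 0.8):
        ppv = positive_predictive_value(prior_odds=0.1, power=power)
        print(f"power = {power:.0%}: {1 - ppv:.0%} of positive results are false")
```

With low power and long-shot hypotheses, the share of false positives lands within the claimed range even before any bias is taken into account.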