Currently, a worldwide discussion among stakeholders of the biomedical research enterprise revolves around the recent realization that a significant proportion of the resources spent on medical research is wasted, and around potential actions to increase its value. The reproducibility of results in experimental biomedicine is generally low, and the vast majority of medical interventions introduced into clinical testing after successful preclinical development prove unsafe or ineffective. One prominent explanation for these problems is flawed preclinical research. There is consensus that the quality of biomedical research needs to be improved. ‘Quality’ is a broad and generic term, and it is clear that a plethora of factors together determine the robustness and predictiveness of basic and preclinical research results. Against this background, the experimental laboratories of the Center for Stroke Research Berlin (CSB, Dept. of Experimental Neurology) have decided to take a systematic approach and to implement a structured quality management system. In a process involving all members of the department, from students and technicians to postdocs and group leaders, and spanning more than a year, we have established standard operating procedures, defined common goals and indicators, improved communication structures and document management, implemented an error management system, and are developing an electronic laboratory notebook, among other measures. On July 3rd 2014 this quality management system successfully passed an ISO 9001 certification process (Certificate 12 100 48301 TMS). The auditors were impressed by the quality-oriented ‘spirit’ of all members of the department, and by the fact that, to their knowledge, the CSB is the first academic institution worldwide to have established a structured quality management system in experimental research of this standard and reach.
The CSB is fully aware that implementing a certified quality management system does not by itself guarantee translational success. However, we believe that innovation will only improve patient outcomes if it rests on the highest possible standards of quality. Certification of our standards renders them transparent and verifiable to the research community, and serves as a first step towards a preclinical medicine in which research conduct and results can be monitored and audited by peers.
Statistical power is a rare commodity in experimental biomedicine (see previous post), as most studies have very low n’s and are therefore severely underpowered. The concept of statistical power, although almost embarrassingly simple (for a very nice treatment see Button et al.), is shrouded in ignorance, mysteries, and misunderstandings among many researchers. A simple definition states that power is the probability that, given a specified true difference between two groups, the quantitative results of a study will be deemed statistically significant. The most common misunderstanding may be that power should only be a concern to the researcher if the Null hypothesis could not be rejected (p>0.05). I need to deal with this dangerous fallacy in a future post. Another common, albeit less perilous, misunderstanding is that calculating post-hoc (or ‘retrospective’) power can explain why an analysis did not achieve significance. Besides revealing a severe bias of the researcher towards rejecting the Null hypothesis (‘There must be another reason for not obtaining a significant result than that the hypothesis is incorrect!’), this is the equivalent of a statistical tautology: of course the study was not powerful enough, that is why the result was not significant! To look at this from another standpoint: provided enough n’s, the Null of every study must be rejected. This, by the way, is one of the most basic criticisms of Null hypothesis significance testing. Power calculations are useful for the design of studies, but not for their analysis. This was nicely explained by Steven Goodman in his classic article ‘The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results’ (Goodman SN, Berlin JA, Ann Intern Med 1994;121:200-206):
First, [post-hoc Power analysis] will always show that there is low power (< 50%) with respect to a nonsignificant difference, making tautological and uninformative the claim that a study is “underpowered” with respect to an observed nonsignificant result. Second, its rationale has an Alice-in-Wonderland feel, and any attempt to sort it out is guaranteed to confuse. The conundrum is the result of a direct collision between the incompatible pretrial and post-trial perspectives. […] Knowledge of the observed difference naturally shifts our perspective toward estimating differences, rather than deciding between them, and makes equal treatment of all nonsignificant results impossible. Once the data are in, the only way to avoid confusion is to not compress results into dichotomous significance verdicts and to avoid post hoc power estimates entirely.
NB: To avoid misunderstandings: calculating the n’s needed in future experiments to achieve a certain statistical power, based on effect sizes and variance obtained post hoc from a (pilot) experiment, is not called post-hoc power analysis (which is the subject of this post), but rather sample size calculation.
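Goodman’s tautology can be made concrete in a few lines of code. The sketch below (my own simplified illustration, using a two-sided z-test approximation; the function names are mine) computes ‘observed power’, i.e. power recomputed at the observed effect size, and shows that it is below 50% for every nonsignificant result:

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def observed_power(z_obs, z_crit=1.959964):
    """'Post-hoc power': power recomputed at the observed effect size.

    For a two-sided z-test at alpha = 0.05 this is approximately
    1 - Phi(z_crit - |z_obs|); the negligible contribution of the
    opposite tail is ignored here.
    """
    return 1.0 - norm_cdf(z_crit - abs(z_obs))

# Any nonsignificant result (|z_obs| < z_crit) yields observed power < 50%:
for z in (0.5, 1.0, 1.5, 1.95):
    assert observed_power(z) < 0.5

# A result exactly at the significance threshold gives exactly 50%:
print(round(observed_power(1.959964), 3))  # 0.5
```

In other words, ‘this nonsignificant study was underpowered with respect to its observed effect’ is true by construction and carries no information beyond the p-value itself.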
For further reading:
In the current issue of PLOS Biology Kimmelman, Mogil, and Dirnagl argue that distinguishing between exploratory and confirmatory preclinical research will improve translation: ‘Preclinical researchers confront two overarching agendas related to drug development: selecting interventions amid a vast field of candidates, and producing rigorous evidence of clinical promise for a small number of interventions.’ They suggest that each challenge is best met by two different, complementary modes of investigation. In the first (exploratory investigation), researchers should aim at generating robust pathophysiological theories of disease. In the second (confirmatory investigation), researchers should aim at demonstrating strong and reproducible treatment effects in relevant animal models. Each mode entails different study designs, confronts different validity threats, and supports different kinds of inferences. Research policies should seek to disentangle the two modes and leverage their complementarity. In particular, policies should discourage the common use of exploratory studies to support confirmatory inferences, promote a greater volume of confirmatory investigation, and customize design and reporting guidelines for each mode.
In a Neuron View article, Zen Faulkes argues for blogging as a kind of postpublication peer review. He is a veteran blogger and science tweeter, and knows what he is talking about. The article compares social media to the classical forms of scientific discourse (from letter to the editor to talk at a conference) and likens science blogging to an online research conference, although with a much wider reach, even into the lay community. Read it here: Faulkes Neuron View
In the current issue of the Proceedings of the National Academy of Sciences (USA), four heavyweights, Bruce Alberts, Marc W. Kirschner, Shirley Tilghman, and Harold Varmus, provide fundamental criticism of the US biomedical research system, and offer ideas for ‘Rescuing US biomedical research from its systemic flaws’. Their main point is that ‘The long-held but erroneous assumption of never-ending rapid growth in biomedical science has created an unsustainable hypercompetitive system that is discouraging even the most outstanding prospective students from entering our profession—and making it difficult for seasoned investigators to produce their best work. This is a recipe for long-term decline, and the problems cannot be solved with simplistic approaches.’ Most of the issues they raise are equally applicable to European biomedical research. Full article: PNAS-2014-Alberts-5773-7
Steve Perrin in a recent issue of NATURE (Vol 507, p.423) summarizes the struggle of the amyotrophic lateral sclerosis (ALS) field to explain the multiple failures of clinical trials testing compounds to improve the symptoms and survival of patients with this disease. He reports the efforts of the ALS Therapy Development Institute (TDI) in Cambridge, Massachusetts, to reproduce the results of around 100 mouse studies which had yielded promising results. As it turned out, most of them, including the ones that led to clinical trials, could not be reproduced, and in those where an effect was seen, it was dramatically smaller than the one reported initially. He discusses a number of measures that need to be taken to improve this situation, all of which have been emphasized independently in other fields of biomedicine where bench-to-bedside translation has failed.
Research on animals generally lacks transparent reporting of study design and implementation, as well as results. As a consequence of poor reporting, we are facing problems in replicating published findings, publication of underpowered studies with excessive false positives or false negatives, publication bias, and as a result difficulties in translating promising preclinical results into effective therapies for human disease. To improve the situation, in 2010 the ARRIVE guidelines for the reporting of animal research (www.nc3rs.org.uk/ARRIVEpdf) were formulated, which were adopted by over 300 scientific journals, including the Journal of Cerebral Blood Flow and Metabolism (www.nature.com/jcbfm). Four years later, Baker et al. (PLoS Biol 12(1): e1001756. doi:10.1371/journal.pbio.1001756) have systematically investigated the effect of the implementation of the ARRIVE guidelines on reporting of in vivo research, with a particular focus on the multiple sclerosis field. The results are highly disappointing:
‘86%–87% of experimental articles do not give any indication that the animals in the study were properly randomized, and 95% do not demonstrate that their study had a sample size sufficient to detect an effect of the treatment were there to be one. Moreover, they show that 13% of studies of rodents with experimental autoimmune encephalomyelitis (an animal model of multiple sclerosis) failed to report any statistical analyses at all, and 55% included inappropriate statistics. And while you might expect that publications in “higher ranked” journals would have better reporting and a more rigorous methodology, Baker et al. reveal that higher ranked journals (with an impact factor greater than ten) are twice as likely to report either no or inappropriate statistics’ (Editorial by Eisen et al., PLoS Biol 12(1): e1001757. doi:10.1371/journal.pbio.1001757).
It is highly likely that other fields in biomedicine have a similarly dismal record. Clearly, there is a need for journal editors and publishers to enforce the ARRIVE guidelines and to monitor their implementation!
Lost or found in translation? Stroke is a major cause of global morbidity and mortality, yet therapeutic options are very limited. Numerous preclinical studies promised highly effective novel treatments, none of which have made it into practice despite a plethora of clinical trials. This failure to bridge the gap between bench and bedside deeply frustrates researchers, clinicians, the pharmaceutical industry, and patients. Dirnagl and Endres argue that despite the apparent translational failures in neuroprotection research, and counter to current nihilism, basic and preclinical stroke research has in fact been able to predict human pathophysiology, clinical phenotypes, and therapeutic outcomes. The understanding of stroke pathobiology that has been achieved through basic research has led to changes in stroke care whose value can be demonstrated. Preclinical investigations have informed the clinical realm even in the absence of intermediary phase 2 or phase 3 trials. Their arguments rest on examples of successful bench-to-bedside translation in which experimental studies preceded human trials and successfully predicted outcomes or phenotypes, as well as on examples of successful ‘back-translation’, where studies in animals recapitulated what we already knew to be true in human beings. An analysis of the reasons for the apparent (or only perceived) translational failures further strengthens their proposition, and suggests measures to improve the positive predictive value of preclinical stroke research. Researchers, funding agencies, academic institutions, publishers, and professional societies should work together to harness the tremendous potential of basic and preclinical research, in stroke research as well as in other fields of medicine.
Ulrich Dirnagl and Matthias Endres. Found in Translation: Preclinical Stroke Research Predicts Human Pathophysiology, Clinical Phenotypes, and Therapeutic Outcomes. Stroke. 2014; 45: 1510-1518
Due to small group sizes and the presence of substantial bias, experimental medicine produces a large number of false positive results (see previous post). It has been claimed that 50–90% of all results may be false (see previous post). These claims are supported by the staggeringly low number of experiments that can be replicated. But what are the chances of reproducing a finding that is actually true?
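The question can be approached with Bayes’ theorem. The sketch below (my own illustration; the numbers are assumptions for the sake of the example, not data from any study) computes the positive predictive value, i.e. the probability that a significant result reflects a true effect, and then the chance that such a finding would also survive an equally underpowered replication attempt:

```python
def ppv(prior, power, alpha=0.05):
    """Positive predictive value: P(effect is real | significant result).

    By Bayes' theorem: true positives / (true positives + false positives),
    where the prior is the pre-study probability that the hypothesis is true.
    """
    return (power * prior) / (power * prior + alpha * (1.0 - prior))

# Illustrative assumptions: 1 in 10 tested hypotheses is true, power = 20%
prior, power = 0.10, 0.20

print(round(ppv(prior, power), 2))          # 0.31: most 'positives' are false
# Even a finding that IS true replicates only with a probability equal to
# the power of the replication study; with equally underpowered replications:
print(round(ppv(prior, power) * power, 2))  # 0.06
```

Under these assumptions, even a true finding fails an underpowered replication four times out of five, so a low replication rate reflects low power as much as it reflects false claims.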
Scientific publishing should be based on open access and open evaluation. While open access is on its way, open evaluation (OE) is still controversial and only slowly seeping into the system. Kriegeskorte, Walther, and Deca have edited a whole issue of Frontiers in Computational Neuroscience devoted to this topic, with some very scholarly and thoughtful discussions on the pros and cons of OE. I highly recommend the editorial (An emerging consensus for open evaluation), which tries to synthesize the arguments into ’18 visions’. The beauty of their blueprint for the future of scientific publication (which was already published a year ago) is that it is possible to start with the current system and slowly evolve it into a full-blown OE system, while checking on the way whether the different measures deliver on their promises.