Since the 17th century, when gentlemen scientists were typically seen as trustworthy sources for the truth about humankind and the natural order, the tenet that ‘science is based on trust’ has been generally accepted. This refers to trust between scientists, as they build on each other’s data and may question a hypothesis or a conclusion, but not the quality of the scientific method applied or the faithfulness of the report, such as a publication. But it also refers to the trust of the public in the scientists whom societies support via tax-funded academic systems. Consistently, scientists (in particular in biomedicine) score highest among all professions in ‘trustworthiness’ ratings. Despite often questioning the trustworthiness of their competitors when chatting over a beer or two, they publicly and vehemently argue against any measure proposed to underpin confidence in their work by any form of scrutiny (e.g. auditing). They instead swiftly invoke Orwellian visions of a ‘science police’ and point out that scrutiny would undermine trust and jeopardize the creativity and ingenuity inherent to the scientific process. I find this quite remarkable. Why should science be exempt from scrutiny and control, when other areas of public and private life sport numerous checks and balances? Science may indeed be the only domain in society which is funded by the public and gets away with strictly rejecting accountability. So why do we trust scientists, but not bankers?
Amidst what has been termed the ‘reproducibility crisis’ (see also a number of previous posts), the National Institutes of Health and Nature Publishing Group convened a workshop in June 2014 on the rigour and reproducibility of preclinical biomedicine. As a result, last week the NIH published ‘Principles and Guidelines for Reporting Preclinical Research’, and Nature as well as Science ran editorials on it. More than 30 journals, including the Journal of Cerebral Blood Flow and Metabolism, are endorsing the guidelines. The guidelines cover rigour in statistical analysis, transparent reporting and standards (including randomization and blinding as well as sample size justification), and data mining and sharing. This is an important step forward, but implementation has to be enforced and monitored: the ARRIVE guidelines (many items of which reappear in the NIH guidelines) have not been adopted widely yet (see previous post). In this context I highly recommend the article by Henderson et al. in PLOS Medicine, in which they systematically review existing guidelines for in vivo animal experiments. From this the STREAM collaboration distilled a checklist on internal, external, and construct validity which I found more comprehensive and relevant than the one published now by the NIH. In the end, however, it is not so relevant which guideline (ARRIVE, NIH, STREAM, etc.) researchers, reviewers, editors, or funders comply with, but rather whether they use one at all!
Note added 12/15/2014: Check out the PubPeer post-publication discussion on the NIH/Nature/Science initiative!
I have posted earlier on the many virtues of post-publication commenting. Only a few journals presently allow readers to comment on, discuss, or criticise papers online. Since 2013 the National Library of Medicine of the US National Institutes of Health, which provides us with more than 24 million MEDLINE citations, has offered a tool to comment on any of these millions of articles: PubMed COMMONS. Although it had a slow start, and at present fewer than 3000 comments are listed, it is almost doomed to be a success and has the potential to propel biomedical publishing into the 21st century. I believe that the PubMed COMMONS model is superior to the post-publication commenting schemes of individual journals. The main advantage of their model is that every comment an article receives is directly visible and accessible with one click for everyone retrieving the article via PubMed. Comments published in individual journals may be buried on their websites, only a tiny fraction of journals allow commenting, and no other model would allow commenting on papers dating back more than a few years. How does it work?
In 2005 PLOS Medicine published John Ioannidis’ paper ‘Why most published research findings are false’. The article was a wake-up call for many, and is now probably the most influential publication in biomedicine of the last decade (>1.14 million views on the PLOS Med website, thousands of citations in the scientific and lay press, featured in numerous blog posts, etc.). Its title has never been refuted. If anything, it has been replicated; for examples, see some of the posts on this blog. Almost 10 years later, Ioannidis now revisits his paper, and the more constructive title ‘How to make more published research true’ (PLoS Med. 2014 Oct 21;11(10):e1001747. doi: 10.1371/journal.pmed.1001747) already indicates that the thrust this time is more forward-looking. The article contains numerous suggestions to improve the research enterprise, some subtle and evolutionary, some disruptive and revolutionary, but all of them make a lot of sense. A must-read for scientists, funders, journal editors, university administrators, professionals in the health industry, in other words: all stakeholders within the system!
Riddle me this:
What does it mean if a result is reported as significant at p < 0.05?
A If we were to repeat the analysis many times, using new data each time, and if the null hypothesis were really true, then on only 5% of those occasions would we (falsely) reject it.
B Without knowing the statistical power of the experiment, and without knowing the prior probability of the hypothesis, I cannot estimate the probability that a significant research finding (p < 0.05) reflects a true effect.
C The probability that the result is a fluke (the hypothesis was wrong, the drug doesn’t work, etc.), is below 5 %. In other words, there is a less than 5 % chance that my results are due to chance.
(solution at the end of this post)
Be honest: although it doesn’t sound very sophisticated (as opposed to A and B), you were tempted to choose C, since it makes a lot of sense and represents your own interpretation of the p-value when reading and writing papers. You are in good company. But is C really the correct answer?
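Option B names two ingredients, statistical power and prior probability. A minimal sketch of how they combine with the significance threshold via Bayes’ rule may help here. The function name and the example values for power and prior are my own illustrative assumptions, not figures from any particular study:

```python
def positive_predictive_value(alpha: float, power: float, prior: float) -> float:
    """Probability that a significant result reflects a true effect.

    alpha: significance threshold (false positive rate under the null)
    power: probability of detecting an effect that really exists
    prior: assumed fraction of tested hypotheses that are actually true
    """
    true_positives = power * prior            # true effects that reach significance
    false_positives = alpha * (1.0 - prior)   # null effects that reach significance
    return true_positives / (true_positives + false_positives)


# Hypothetical scenarios: the same p < 0.05 threshold, very different reliability.
for prior in (0.5, 0.1, 0.01):
    for power in (0.8, 0.2):
        ppv = positive_predictive_value(alpha=0.05, power=power, prior=prior)
        print(f"prior={prior:<5} power={power:<4} -> PPV={ppv:.2f}")
```

With a prior of 0.1 and a power of 0.8, for example, only about 64 % of the significant findings would reflect true effects, even though every one of them passed p < 0.05.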
In July, Laborjournal (‘LabTimes’), a free German monthly for life scientists (sort of a hybrid between the Economist and the British tabloid The Sun), celebrated its 20th anniversary with a special issue. I was asked to contribute an article. In it I try to answer the question of whether most published research findings are false, as John Ioannidis rhetorically asked in 2005.
To find out, you have to be able to read German, and click here for a pdf of the article (in German).
“Five sigma” is the gold standard for statistical significance in particle discovery in physics. When the New Scientist reported on the putative confirmation of the Higgs boson, they wrote:
‘Five-sigma corresponds to a p-value, or probability, of 3×10⁻⁷, or about 1 in 3.5 million. There’s a 5-in-10 million chance that the Higgs is a fluke.’
Does that mean that p-values can tell us the probability of being correct about our hypotheses? Can we use p-values to decide about the truth (correctness) of hypotheses? Does p<0.05 mean that there is a smaller than 5 % chance that an experimental hypothesis is wrong?
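The conversion from ‘sigma’ to a p-value in the quote can be checked with a few lines. This is only a sketch, assuming the one-sided tail of a standard normal distribution (the convention commonly used for particle discovery); the function name is mine. Note that it only reproduces the arithmetic and says nothing about the probability that the hypothesis itself is right:

```python
import math


def one_sided_p_value(n_sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


p_five_sigma = one_sided_p_value(5.0)
print(f"5 sigma:     p = {p_five_sigma:.2e}, about 1 in {1.0 / p_five_sigma:,.0f}")

# For comparison: the biomedical threshold p < 0.05 sits at only about 1.645 sigma.
print(f"1.645 sigma: p = {one_sided_p_value(1.645):.3f}")
```

The first line should come out at roughly 3×10⁻⁷, or about 1 in 3.5 million, which matches the first half of the quote.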
Amidst a flurry of retractions of research articles from high-level journals and growing concerns about the non-reproducibility of research findings, the time-honored (some say old-fashioned) closed pre-publication mode of peer review has come under criticism. Major issues concern the quality of reviews and the lack of transparency. A number of modifications and alternative models have been proposed (e.g. Front. Comput. Neurosci. 6:94;2012), including open post-publication review. Most publications are no longer read in printed and bound issues of a journal, but rather accessed in digital form via the internet. This allows for novel forms of readership participation, such as post-publication review and online commenting and discussion of articles. Several journals are experimenting with such novel features (e.g. PLOS One), and some are based on them (e.g. F1000Research or eLife). Nevertheless, most established journals are hesitant to give up their time-honored modes of publishing. They argue that closed pre-publication review may not be perfect, but that the alternatives are untested and may actually be worse. Post-publication commenting requires software upgrades to journal websites, as well as monitoring and moderation of content, and there may be legal issues. Another problem relates to the troubling fact that a substantial fraction of the biomedical literature is not read at all (even if cited!), which means that we may not be able to rely solely on processes that take place after publication.
I just stumbled upon a very instructive example which illustrates that p-values should not be misinterpreted as measures of the probability that a research hypothesis is true. In 2011 the OPERA collaboration reported evidence that neutrinos travel faster than light, a finding which violates Einstein’s theory of relativity and, if true, would have shattered physics as we know it! Their analysis was significant at the 6 sigma level, even more stringent than the accepted but already brutal 5 sigma level of particle discovery (p ≈ 3 × 10⁻⁷, or about 1 in 3.5 million). Extraordinary claims require extraordinary evidence! The results were replicated by the same group, published, and hailed by the world’s scientific and lay press. A short while later it turned out that the GPS systems were not properly synchronized, and a cable was loose. Neutrinos are back at the speed of light, and we can learn from this that p-values are ignorant of simple systematic errors!
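That a p-value cannot see a systematic error can be illustrated with a toy simulation. The sketch below is not a model of the OPERA experiment: the noise level, the size of the constant offset, and the sample size are hypothetical values I picked for illustration. The true effect is zero in both data sets, and the only difference is a small constant bias added to every measurement:

```python
import math
import random
import statistics


def z_score_against_zero(samples):
    """How many standard errors the sample mean lies away from zero."""
    standard_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return statistics.mean(samples) / standard_error


random.seed(1)
n = 10_000
noise_sd = 10.0   # hypothetical random measurement noise
bias = 1.0        # hypothetical constant offset, e.g. from a miscalibrated clock

# The true effect is zero; measurements scatter randomly around it.
unbiased = [random.gauss(0.0, noise_sd) for _ in range(n)]
# Same measurements, but every one is shifted by the same systematic error.
biased = [x + bias for x in unbiased]

z_unbiased = z_score_against_zero(unbiased)
z_biased = z_score_against_zero(biased)
print(f"no systematic error:   z = {z_unbiased:.1f} sigma")
print(f"with systematic error: z = {z_biased:.1f} sigma")
```

An offset of only a tenth of the noise level pushes the result far beyond five sigma, simply because averaging over many measurements shrinks the random error but leaves the bias untouched. More data makes the wrong answer look more significant, not less.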
In the current issue of PLOS Biology Kimmelman, Mogil, and Dirnagl argue that distinguishing between exploratory and confirmatory preclinical research will improve translation: ‘Preclinical researchers confront two overarching agendas related to drug development: selecting interventions amid a vast field of candidates, and producing rigorous evidence of clinical promise for a small number of interventions. They suggest that each challenge is best met by two different, complementary modes of investigation. In the first (exploratory investigation), researchers should aim at generating robust pathophysiological theories of disease. In the second (confirmatory investigation), researchers should aim at demonstrating strong and reproducible treatment effects in relevant animal models. Each mode entails different study designs, confronts different validity threats, and supports different kinds of inferences. Research policies should seek to disentangle the two modes and leverage their complementarity. In particular, policies should discourage the common use of exploratory studies to support confirmatory inferences, promote a greater volume of confirmatory investigation, and customize design and reporting guidelines for each mode.’