The MEDLINE currently indexes 5,642 journals. PubMed comprises more than 24 million citations for biomedical literature from MEDLINE. My field is stroke research. Close to 30.000 articles were published in 2014 on the topic ‘Stroke’ (clinical and experimental), more than 20.000 of them were peer reviewed original articles in the English language (Web of Science). That amounts to more than 50 articles every day. In 2014, 1700 of them were rodent studies, a mere 5 per day. Does (can) anyone read them? And should we read them? Do researchers worldwide every day produce knowledge worth publishing in 50 articles?
In 2005 PLOS Medicine published John Ioannidis’ paper ‘Why most published research findings are false’ . The article was a wake up call for many, and now is probably the most influential publication in biomedicine of the last decade (>1.14 Mio views on the PLOS Med webside, thousands of citations in the scientific and lay press, featured in numerous blog posts, etc.). Its title has never been refuted, if anything, it has been replicated, for examples see some of the posts of this blog. Almost 10 years after, Ioannidis now revisits his paper, and the more constructive title ‘How to make more published research true” (PLoS Med. 2014 Oct 21;11(10):e1001747. doi: 10.1371/journal.pmed.1001747.) already indicates that the thrust this time is more forward looking. The article contains numerous suggestions to improve the research enterprise, some subtle and evolutionary, some disruptive and revolutionary, but all of them make a lot of sense. A must read for scientists, funders, journal editors, university administrators, professionals in the health industry, in other words: all stakeholders within the system!
Riddle me this:
What does it mean if a result is reported as significant at p < 0.05?
A If we were to repeat the analysis many times, using new data each time, and if the null hypothesis were really true, then on only 5% of those occasions would we (falsely) reject it.
B Without knowing the statistical power of the experiment, and not knowing the prior probability of the hypothesis, I cannot estimate the probability whether a significant research finding (p < 0.05) reflects a true effect.
C The probability that the result is a fluke (the hypothesis was wrong, the drug doesn’t work, etc.), is below 5 %. In other words, there is a less than 5 % chance that my results are due to chance.
(solution at the end of this post)
Be honest, although it doesn’t sound very sophisticated (as opposed to A and B), you were tempted to chose C, since it makes a lot of sense, and represents your own interpretation of the p-value when reading and writing papers. You are in good company. But is C really the correct answer?
In July, Laborjournal (‘LabTimes’), a free German monthly for life scientists (sort of a hybrid between the Economist and the British Tabloid The Sun), celebrated its 20th anniversary with a special issue. I was asked to contribute an article. In it I try to answer the question whether most published research findings are false, as John Ioannidis rhetorically asked in 2005.
To find out, you have to be able to read German, and click here for a pdf of the article (in German).
“Five sigma,” is the gold standard for statistical significance in physics for particle discovery. When the New Scientist reported about the putative confirmation of the Higgs boson, they wrote:
‘Five-sigma corresponds to a p-value, or probability, of 3×10-7, or about 1 in 3.5 million. There’s a 5-in-10 million chance that the Higgs is a fluke.’
Does that mean that p-values can tell us the probability of being correct about our hypotheses? Can we use p-values to decide about the truth (correctness) of hypotheses? Does p<0.05 mean that there is a smaller than 5 % chance that an experimental hypothesis is wrong?
Statistical power is a rare commodity in experimental biomedicine (see previous post), as most studies have very low n’s and are therefore severly underpowered. The concept of statistical power, although almost embarrassingly simple (for a very nice treatment see Button et al.), is shrouded in ignorance, mysteries and misunderstandings among many researchers. A simple definition states that Power is the probability that, given a specified true difference between two groups, the quantitative results of a study will be deemed statistically significant. The most common misunderstanding may be that power should only be a concern to the researcher if the Null hypothesis could not rejected (p>0.05). I need to deal with this dangerous fallacy in a future post. Another common albeit less perilous misunderstanding is that calculating post-hoc (or ‘retrospective )’ power can explain why an analysis did not achieve significance. Besides proving a severe bias of the researcher towards rejecting the Null hypothesis (‘There must be another reason for not obtaining a significant result than that the hypothesis is incorrect!), this is the equivalent of a statistical tautology. Of course the study was not powerful enough, this is why the result was not significant! To look at this from another standpoint: Provided enough n’s, the Null of every study must be reject. This by the way, is one of the most basic criticisms of Null hypothesis significance testing. Power calculations are useful for the design of studies, but not for their analysis. This was nicely explained by Steven Goodman in his classic article ‘Goodman The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results Ann IntMed 1994‘:
First, [post-hoc Power analysis] will always show that there is low power (< 50%) with respect to a nonsignificant difference, making tautological and uninformative the claim that a study is “underpowered” with respect to an observed nonsignificant result. Second, its rationale has an Alice-in-Wonderland feel, and any attempt to sort it out is guaranteed to confuse. The conundrum is the result of a direct collision between the incompatible pretrial and post-trial perspectives. […] Knowledge of the observed difference naturally shifts our perspective toward estimating differences, rather than deciding between them, and makes equal treatment of all nonsignificant results impossible. Once the data are in, the only way to avoid confusion is to not compress results into dichotomous significance verdicts and to avoid post hoc power estimates entirely.
NB: To avoid misunderstandings: Calculating the n’s needed in future experiments to achieve a certain statistical power based on effect sizes and variance obtained post – hoc from a (pilot) experiment is not called post-hoc power analysis (and the subject of this post), but rather sample size calculation.
For further reading:
I just stumbled into a very instructive example which illustrates that p-values should not be misinterpreted as measures of the probablity with which a research hypothesis is true. In 2011 the OPERA collaboration reported evidence that neutrinos travel faster than light, a finding which violates Einstein’s theory of relativity and if true would have shattered physics as we know it! Their analysis was significant at the 6 sigma level, even more stringent than the accepted but already brutal 5 sigma level of particle discovery (p=3.5 x 10-7). Extraordinary claims require extraordinary evidence ! The results were replicated by the same group, published, and hailed by the world scientific and lay press. A short while later it turned out that the GPS systems were not properly synchronized, and a cable was loose. Neutrinos are back at the speed of light, and we can learn from this that p-values are ignorant of simple systematic errors!