And the Moral of the Story is: Don’t believe your p-values!
In my previous post I took a look at the culture of science in physics and found much that we life scientists might want to copy. Physics itself, and especially particle physics, presents a goldmine of lessons to be learned, two of which I would like to discuss with you today.
Some of you will remember: In 2011 the result of a large international experiment convulsed not only the field of physics; it shook the whole world. On September 22nd the New York Times ran it on the front page: “Einstein Roll Over? Tiny neutrinos may have broken cosmic speed limit”! What had happened?
A very complex experimental setup had been assembled to measure the speed of neutrinos. They were produced by the particle accelerator at CERN in Geneva and sent on a 730 km-long trip. Their arrival was registered by a detector housed in a cavern blasted out of the rock beneath the Gran Sasso massif. And what do you know: the neutrinos arrived faster than photons travelling the same route would have! Even the non-physicist will realize immediately what is at stake here (special relativity) and what possibilities this might open up (e.g. time travel). The physicists, of course, had realized this immediately, which is why they were being so careful: for one thing, they raised the significance level required for the discovery from the already incredible 5 sigma used for new elementary particles (corresponding to p < 3×10⁻⁷!) to 6 sigma. Moreover, they repeated the experiment several times, but the neutrinos took no notice of the speed of light, and the result reached significance at the incredible level of 6.2 sigma! So in no time flat the world press was informed and a paper was written. Mind you, the authors already had doubts about their own findings in spite of the record-breaking p-values, which explains why the article ended: “The potentially great impact of the result motivates the continuation of our studies in order to investigate possible still unknown systematic effects that could explain the observed anomaly.”
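For readers who want to see how such sigma levels translate into p-values, here is a minimal sketch in Python (my own illustration, not part of the OPERA analysis), using the one-sided tail of the standard normal distribution, which is the particle-physics convention:

```python
# Convert "sigma" discovery thresholds into one-sided tail probabilities.
from scipy.stats import norm

for sigma in (5, 6, 6.2):
    p = norm.sf(sigma)  # upper-tail probability of a standard normal
    print(f"{sigma} sigma  ->  p ~ {p:.1e}")

# 5 sigma gives p ~ 2.9e-07, i.e. the p < 3x10^-7 quoted above;
# 6.2 sigma corresponds to p ~ 2.8e-10.
```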
We all know that, when it comes to time travel, we have yet to get past the movie stage. And photons still hold the absolute speed record. So what had happened? In the weeks following the media excitement, the physicists had another good look at their experimental setup and found that the GPS-based clock synchronization was not working correctly and that, if you can believe it, a cable was loose!
And the Moral of the Story is: Don’t believe your p-values!
The physicists were right to demand a radically low significance level for a highly improbable hypothesis. But, trivial as it seems, if the experimental setup contains a systematic error, neither an oh-so-low p-value nor a replication of the same experiment will be of any use whatsoever. What we in the life sciences can glean from this: for answering the question of whether our hypothesis is correct, whether a drug works, etc., a p-value is of little, and often even of no, value. And replication of an experiment in the same laboratory is of very limited value (see also here).
Now the reason this is currently so important is that an all-star team of statisticians, epidemiologists and psychologists has just made a suggestion in Nature Human Behaviour that is causing quite a stir: the significance level introduced in the 1920s should be lowered by an order of magnitude, from the p < 0.05 we revere as if it were a natural constant down to p < 0.005! The authors are right, of course, that this would clearly reduce the rate of false positives under which we all suffer, and with it the number of published studies, since many would not make it over the 0.005 hurdle.
I consider the suggestion to be wrong, also with the OPERA flop in mind. The experts who suggested this measure know perfectly well what a p-value is and what it is not. They know that alpha, the type I error rate, is not the only thing that determines whether a result is a false positive: it also depends on beta (the type II error rate, whose complement is statistical power) and on the probability that the hypothesis is correct in the first place (the base rate). So the experts did not mix up the p-value with the positive predictive value, as so many of us do. But by directing our attention in this way to the p-value, that is, to one particular p-value threshold, they give it a certain place of honor, and they make it appear that the p-value is suitable after all for deciding whether a hypothesis is right or wrong, as long as it falls below a certain threshold. The article contains all the relevant arguments, but media reports on the suggestion center only on the new hurdle and, with that, on “saving” the p-value.
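To make the distinction concrete, here is a small, purely hypothetical calculation (the power and base-rate values are invented for illustration) of the positive predictive value, i.e. the probability that a “significant” result reflects a true hypothesis, under both the old and the proposed threshold:

```python
# Positive predictive value (PPV) as a function of alpha, power and base rate.
# All numbers are hypothetical and chosen only to illustrate the argument.

def ppv(alpha, power, base_rate):
    true_positives = power * base_rate
    false_positives = alpha * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: PPV = {ppv(alpha, power=0.8, base_rate=0.1):.2f}")

# With 80% power and a 10% base rate, alpha = 0.05 yields a PPV of about 0.64,
# alpha = 0.005 about 0.95 -- but a systematic error in the setup invalidates
# both numbers, as OPERA shows.
```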
And that is why we life scientists can draw a valuable lesson from physics. In OPERA the hypothesis was very unlikely, and in the end the result was a false positive despite an exorbitantly low p-value: there was a systematic error in the experimental setup. In the LIGO experiment, with which the long-sought gravitational waves were recently finally detected, the opposite was the case: here, the result was thought to be known beforehand. The gravitational waves predicted by Einstein in 1916 had to be there, because every prediction of general relativity had so far been confirmed by experiment, and no serious argument had been made for why there should be no gravitational waves. The problem was just that scientists had been trying to detect them for decades, with no success. In other words, experimental physicists obtained one NULL result after the other, but they correctly attributed this to the insufficient sensitivity of their experiments rather than giving up, and they kept working on improving them. And the moral of the story is: don’t trust your p-values! Non-significant p-values (i.e. NULL results) do not mean that the phenomenon being examined does not exist. Here, again, the experimental setup of LIGO’s predecessors was systematically “erroneous”.
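The LIGO side of the lesson can also be illustrated with a small simulation (again my own sketch; the effect size, sample size and number of repetitions are arbitrary): when an experiment is too insensitive, a perfectly real effect produces NULL results most of the time.

```python
# Simulate an underpowered two-group experiment with a real but small effect.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
true_effect = 0.3      # real standardized effect
n_per_group = 20       # too few observations for this effect size
n_experiments = 10_000

significant = 0
for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    if ttest_ind(treated, control).pvalue < 0.05:
        significant += 1

print(f"Significant results despite a real effect: {significant / n_experiments:.0%}")
# Roughly 15%; the remaining ~85% are NULL results even though the effect exists.
```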
What do we learn from these seemingly exotic examples from the realm of physics, let us say from the toughest of the natural sciences? Statistical significance, or the absence thereof, is not really helpful when the question is whether our hypotheses are right or wrong. Statistical significance is overrated by us scientists, just as it is by journal editors and reviewers. Instead, we should focus on effect sizes, variance and, above all, on the solidity of the experimental design when judging or publishing scientific results.
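As a final, hypothetical sketch of what “focusing on effect sizes and variance” can look like in practice (the data and the helper function below are invented for illustration), one might report a standardized effect size with a confidence interval rather than a bare p-value:

```python
# Report a standardized mean difference (Cohen's d) with an approximate 95% CI.
import numpy as np
from scipy.stats import norm

def cohens_d_with_ci(a, b, confidence=0.95):
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * np.var(a, ddof=1) + (n2 - 1) * np.var(b, ddof=1)) / (n1 + n2 - 2)
    d = (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))  # approximate SE of d
    z = norm.ppf(0.5 + confidence / 2)
    return d, (d - z * se, d + z * se)

treated = np.array([5.1, 4.8, 5.6, 5.0, 5.3, 4.9])  # made-up measurements
control = np.array([4.6, 4.4, 4.9, 4.7, 4.5, 4.8])
d, (lo, hi) = cohens_d_with_ci(treated, control)
print(f"Cohen's d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```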
References
OPERA Collaboration, Adam T et al. Measurement of the neutrino velocity with the OPERA detector in the CNGS beam. arXiv:1109.4897v2.
Benjamin DJ et al. Redefine statistical significance. Nature Human Behaviour 1: 0189 (2017). DOI: 10.1038/s41562-017-0189-z
Amrhein V, Greenland S. Remove, rather than redefine, statistical significance. Nature Human Behaviour 1: 0224 (2017). DOI: 10.1038/s41562-017-0224-0
A German version of this post has been published as part of my monthly column in the Laborjournal: http://www.laborjournal-archiv.de/epaper/LJ_17_12/20/
A very timely and relevant post indeed. Reification of p-values has proven a double-edged sword, at best! On the other hand, if only the education and training of scientists were the problem, why have we been trained, and why have we trained our colleagues, so inadequately all these decades?
In my experience with biology students (and some colleagues), a well-rehearsed route to help them navigate a complex space of biological ‘messiness’ and counter-intuitive stochastics (there is a reason they studied biology, after all) is like a lifeline in a world where there is always more to learn than one could possibly cope with. A fixed p-value is such a lifeline for all those who struggle with how to deploy stochastic concepts in their particular experiments. For all those who do not face such struggles, p-values have never been much of a god anyway.
Given such diversity in our workforce, I think your post may lead to some misunderstandings (but maybe it’s just me). Should we now stubbornly insist that our effect is there even if we repeatedly fail to find it? Do experiments until something happens that finally shows we were right all along? Or should we, equally stubbornly, disregard all the effects we believe cannot be there because they do not fit our concept of how biology works? Essentially this would amount to Feyerabend’s methodological “anything goes” and render science just another of the humanities, where the consensus is shaped not by data but by the most convincing narrators. I do not sense that this is the message you intend to convey, but one could understand your post as promoting intuition or theory over data.
Alternatively, one could interpret the physicists’ behavior as confirmation bias: we don’t check the measurement of gravitational waves as thoroughly as that of faster-than-light neutrinos, because we expect the one but not the other. Ideally, science would be so diverse in its composition that, for every claim, there would be someone so skeptical they make absolutely certain there is no loose cable.
Or, phrased less diplomatically: I’m not sure if examples of confirmation bias are good examples for how to use statistics 🙂
Agreed on all counts! By learning from physics I did not mean to advocate simply copying their approaches, which wouldn’t make sense anyhow. But I maintain that these two examples carry a stark message for anyone interpreting p-values: they clearly demonstrate the perils of using p-values to decide whether our hypotheses are true or false.
A comment in German: http://scienceblogs.de/gesundheits-check/2017/12/16/der-p-wert-hat-bewaehrung/
Greetings from the south!
Thanks for the great discussion; it is good to see a criticism of the abysmal Benjamin et al. 0.005 proposal (whose ‘all-star’ cast was extremely biased by a plethora of confirmationist Bayesians).
But your headline has the wrong moral – once again it blames P when the fault is the misinterpretation of P.
A very small P only indicates that something is wrong with the model used to calculate it, not what is wrong; and P was right: the model did not include the problems of a loose cable or inadequately synchronized clocks.