In my previous post I had a look at the culture of science in physics, and found much that we life scientists might want to copy. Physics itself, and especially particle physics, presents a goldmine of lessons to be learned, two of which I would like to discuss with you today.
Some of you will remember: in 2011 the results of a large international experiment convulsed not only the field of physics; they shook the whole world. On September 22nd the New York Times ran the story on its front page: “Einstein Roll Over? Tiny neutrinos may have broken cosmic speed limit”! What had happened?
A very complex experimental setup had been assembled to measure the speed of neutrinos. They were produced by the particle accelerator at CERN in Geneva and sent on a 730 km trip; their arrival was registered by a detector buried under more than a kilometer of rock at Gran Sasso in the Apennines. And what do you know: the neutrinos arrived faster than photons travelling the same route would have! Even the non-physicist will realize immediately what is at stake here (special relativity) and what possibilities this might open up (e.g. time travel). The physicists, of course, had realized this immediately, which is why they were being so careful: for one thing, they raised the significance level required for the discovery of new elementary particles from an already incredible 5 sigma (corresponding to p < 3×10⁻⁷!) to 6 sigma. Moreover, they repeated the experiment several times, but the neutrinos didn’t bother about the speed of light, and the result reached significance at the incredible level of 6.2 sigma! So in no time flat the world press was informed and a paper was written. Mind you, the authors already had doubts about their own findings in spite of the record-breaking p-values, which explains why the article ended: “The potentially great impact of the result motivates the continuation of our studies in order to investigate possible still unknown systematic effects that could explain the observed anomaly.”
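For readers curious how such sigma thresholds translate into p-values, here is a minimal sketch (my own illustration, not part of the OPERA analysis) that converts a z-score into the corresponding one-sided tail probability of a standard normal distribution:

```python
import math

def sigma_to_p(sigma: float) -> float:
    """One-sided p-value: probability that a standard normal exceeds `sigma`."""
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))

# The discovery thresholds mentioned above:
for s in (5.0, 6.0, 6.2):
    print(f"{s:.1f} sigma -> p = {sigma_to_p(s):.2e}")
```

At 5 sigma this gives p ≈ 2.9×10⁻⁷, i.e. the “p < 3×10⁻⁷” quoted above; at 6.2 sigma, p is below 10⁻⁹.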
We all know that, as regards time travel, we have yet to get past the movie stage, and photons still hold the absolute speed record. So what had happened? In the weeks following the media excitement, the physicists had another good look at their experimental setup and found that the GPS-based clock synchronization used to time the neutrinos was off and, if you can believe it, a cable was loose!
And the Moral of the Story is: Don’t believe your p-values!
The physicists were right to demand a radically low significance level for a highly improbable hypothesis. But, trivial as it seems: if the experimental setup contains a systematic error, neither an oh-so-low p-value nor a replication of the same experiment will be of any use whatsoever. What we in the life sciences can glean from this: for answering the question of whether our hypothesis is correct, whether a drug works, etc., a p-value is of little, and often even of no, value. And: replication of an experiment in the same laboratory is of very limited value (see also here).
The reason this is currently so important is that an all-star team of statisticians, epidemiologists and psychologists has just made a suggestion in Nature Human Behaviour that is causing quite a stir: the significance level introduced in the 1920s should be lowered by an order of magnitude, from the p < 0.05 we revere as a natural constant down to p < 0.005! The authors are right, of course, that this would clearly reduce the rate of false positives under which we all suffer. And with that, the number of published studies, since many would not make it over the 0.005 hurdle.
I consider the suggestion to be wrong, also with the OPERA flop in mind. The experts who proposed this measure know perfectly well what a p-value is and what it is not. They know that alpha, i.e. the type-I error rate, is not the only thing that determines whether a result is a false positive; it also depends on beta, the type-II error rate (power is 1 − beta), and on the probability that the hypothesis is correct in the first place (the base rate). So the experts did not confuse the p-value with the positive predictive value, as so many of us do. But by directing our attention in this way to the p-value, i.e. to a particular p-value threshold, they give it a certain place of honor. And with that they make it appear that the p-value is suitable after all for deciding whether a hypothesis is right or wrong: it just has to be below a certain threshold. The article contains all the relevant arguments, but media reports on the suggestion center only on the new hurdle and, with that, on “saving” the p-value.
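The relationship between alpha, power and base rate can be made concrete in a few lines. The sketch below (with illustrative numbers of my own choosing, not taken from the paper) computes the positive predictive value, i.e. the probability that a “significant” result reflects a true effect:

```python
def positive_predictive_value(alpha: float, power: float, base_rate: float) -> float:
    """P(hypothesis is true | result is significant), by Bayes' rule."""
    true_positives = power * base_rate          # true hypotheses that test significant
    false_positives = alpha * (1.0 - base_rate)  # false hypotheses that test significant
    return true_positives / (true_positives + false_positives)

# Assume 80% power and that only 1 in 100 tested hypotheses is true:
print(positive_predictive_value(alpha=0.05, power=0.80, base_rate=0.01))   # ~0.14
print(positive_predictive_value(alpha=0.005, power=0.80, base_rate=0.01))  # ~0.62
```

Lowering alpha does raise the PPV, which is the authors' point. But note what the formula cannot capture: a loose cable. A systematic error invalidates the whole calculation, whatever threshold you choose.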
And that is why we life scientists can draw a valuable lesson from physics. In OPERA the hypothesis was very unlikely, and in the end the result was a false positive despite an exorbitantly low p-value: there was a systematic error in the experimental setup. In the LIGO experiment, in which the long-sought gravitational waves were recently detected at last, the opposite was the case: here, the result was thought to be known beforehand. The gravitational waves predicted by Einstein in 1916 had to be there, because every prediction of general relativity had so far been confirmed by experiment, and no serious argument had been made for why gravitational waves should not exist. The problem was just that scientists had been trying to detect them ever since, with no success. In other words, experimental physicists obtained one NULL result after another, but they rightly did not give up, correctly attributing the failures to the insufficient sensitivity of their experiments, and they kept improving them. And the moral of the story is: don’t trust your p-values! Non-significant p-values (i.e. NULL results) do not mean that the phenomenon under examination does not exist. Here, too, the experimental setups of LIGO’s predecessors were systematically “erroneous”.
What do we learn from these seemingly exotic examples from the realm of physics, let us say, from the toughest of the natural sciences? Statistical significance, or the absence thereof, is not really helpful when the question is whether our hypotheses are right or wrong. We scientists overestimate statistical significance, just as journal editors and reviewers do. Instead, we should focus on effect sizes, variance and, above all, on the solidity of the experimental design when judging or publishing scientific results.
OPERA collaboration, T. Adam et al., Measurement of the neutrino velocity with the OPERA detector in the CNGS beam, arXiv:1109.4897v2.
Daniel J. Benjamin et al. Redefine statistical significance. Nature Human Behaviour 1: 0189. DOI: 10.1038/s41562-017-0189-z
Valentin Amrhein, Sander Greenland. Remove, rather than redefine, statistical significance. Nature Human Behaviour 1: 0224. DOI: 10.1038/s41562-017-0224-0
A German version of this post has been published as part of my monthly column in the Laborjournal: http://www.laborjournal-archiv.de/epaper/LJ_17_12/20/