The probability of replicating ‘true’ findings is low…
Due to small group sizes and the presence of substantial bias, experimental medicine produces a large number of false positive results (see previous post). It has been claimed that 50–90% of all results may be false (see previous post). These claims are supported by the staggeringly low number of experiments that can be replicated. But what are the chances of reproducing a finding that is actually true?
Suppose you have just analyzed one of your experiments and found the difference in the group means to be statistically significant (say p=0.045). What are the chances that upon replication of the same experiment you will again find the difference to be significant (p<0.05)? Let us assume identical conditions, the same sample size, and that the difference is real, i.e. the difference in the measured group means reflects the true population difference. The chance of obtaining a significant result and hence ‘reproducing’ your data is just about 0.5, as in flipping a coin! Had you started with a p-value of 0.01, the chance to replicate would still be a disappointingly low 73%!
This is why in the realms of science where it really counts (not in biomedicine, obviously), such as particle physics, p-values smaller than p=0.0000003 (‘5 sigma’) are required for a discovery (e.g. the Higgs boson). Even such an exceedingly low p-value still affords you ‘only’ a 99.9% chance of reproducing your findings, everything else being equal.
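These figures can be checked with a few lines of Python, following the logic of Goodman (1992): treat the observed effect as if it were the true effect, so the probability that an exact replication is again significant is simply the power Φ(z_obs − z_crit), where z_obs is the z-score corresponding to the original p-value and z_crit the significance threshold. A minimal sketch (the function name is mine, not from the post):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal distribution

def replication_probability(p, alpha=0.05):
    """Chance that an exact replication reaches p < alpha, assuming the
    observed effect equals the true effect (two-sided tests throughout)."""
    z_obs = nd.inv_cdf(1 - p / 2)       # z-score of the original p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)  # significance threshold (1.96 for alpha=0.05)
    return nd.cdf(z_obs - z_crit)       # power of the replication

for p in (0.045, 0.01, 3e-7):
    print(f"p = {p:g}: replication chance = {replication_probability(p):.1%}")
```

Running this reproduces the numbers above: roughly a coin flip for p=0.045, about 73% for p=0.01, and about 99.9% for the 5-sigma threshold.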
As a consequence of all this, we should be reminded that ‘non-replication’ of a biomedical finding may be as spurious as many of the results we try to reproduce. Replicating underpowered original experiments with underpowered replication experiments may easily turn true positives into false negatives! Thus, relevant results, such as those that lead to clinical trials in humans or are pivotal to our understanding of biological processes, need to be replicated by experiments that are sufficiently powered. In most cases this means much larger group sizes than in the original experiment!
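To put a rough number on ‘much larger’: under the same normal approximation as above (observed effect taken as the true effect), the group size of a replication powered at, say, 90% must be scaled up by the factor ((z_crit + z_power)/z_obs)². A sketch under those assumptions (function name illustrative):

```python
from statistics import NormalDist

nd = NormalDist()

def replication_n_factor(p_orig, alpha=0.05, power=0.9):
    """Multiplier for the original group size so that a replication reaches
    the desired power, assuming the observed effect equals the true effect."""
    z_obs = nd.inv_cdf(1 - p_orig / 2)   # z-score of the original p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)   # significance threshold
    z_power = nd.inv_cdf(power)          # quantile for the target power
    return ((z_crit + z_power) / z_obs) ** 2

print(f"{replication_n_factor(0.045):.1f}x the original group size")
```

For an original result of p=0.045, this gives roughly 2.6 times the original group size per arm just to have a 90% chance of confirming a true effect.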
For most of us this will be beyond the resources and capacities of our laboratories. Researchers also have little incentive to, at best, end up where they started (a ‘positive’ result) or, worse, to invalidate their previous work by showing that it was a false positive. Even if we wanted to embark on such a mission, we would have a hard time obtaining funding for it and getting regulatory approval. In many countries, including Germany, animal experiments must lead to new knowledge. Replication does not easily fall into that category.
Besides appropriately powering our primary experiments and reducing bias as much as possible, what else can we do? We need international consortia that pool resources and conduct randomized controlled trials with sufficient power. Clinical medicine, for the reasons explained above, came to this conclusion several decades ago. We seem to be a bit slow in experimental medicine, but it is never too late!
Goodman SN. A comment on replication, p-values and evidence. Stat Med 1992
Button et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Neurosci 2013
Ioannidis JPA. Why most published research findings are false. PLoS Med 2005