Why post-hoc power calculation is not helpful
Statistical power is a rare commodity in experimental biomedicine (see previous post): most studies have very low n’s and are therefore severely underpowered. The concept of statistical power, although almost embarrassingly simple (for a very nice treatment see Button et al.), is shrouded in ignorance, mystery and misunderstanding among many researchers. A simple definition: power is the probability that, given a specified true difference between two groups, the quantitative results of a study will be deemed statistically significant.
The most common misunderstanding may be that power should only concern the researcher if the null hypothesis could not be rejected (p > 0.05). I need to deal with this dangerous fallacy in a future post. Another common, albeit less perilous, misunderstanding is that calculating post-hoc (or ‘retrospective’) power can explain why an analysis did not achieve significance. Besides revealing a severe bias of the researcher towards rejecting the null hypothesis (‘There must be another reason for not obtaining a significant result than that the hypothesis is incorrect!’), this amounts to a statistical tautology. Of course the study was not powerful enough; that is why the result was not significant! To look at this from another standpoint: given enough n’s, the null of every study must eventually be rejected, since the true difference is almost never exactly zero. This, by the way, is one of the most basic criticisms of null hypothesis significance testing.
Power calculations are useful for the design of studies, but not for their analysis. This was nicely explained by Steven Goodman in his classic article (Goodman, ‘The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results’, Ann Intern Med 1994):
First, [post-hoc Power analysis] will always show that there is low power (< 50%) with respect to a nonsignificant difference, making tautological and uninformative the claim that a study is “underpowered” with respect to an observed nonsignificant result. Second, its rationale has an Alice-in-Wonderland feel, and any attempt to sort it out is guaranteed to confuse. The conundrum is the result of a direct collision between the incompatible pretrial and post-trial perspectives. […] Knowledge of the observed difference naturally shifts our perspective toward estimating differences, rather than deciding between them, and makes equal treatment of all nonsignificant results impossible. Once the data are in, the only way to avoid confusion is to not compress results into dichotomous significance verdicts and to avoid post hoc power estimates entirely.
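The tautology can be made concrete with a minimal sketch. It assumes the simplest possible setting, a two-sided z-test with known variance, and it plugs the observed effect in as if it were the true effect, which is exactly what a post-hoc power calculation does. The function name `observed_power` is made up for illustration. Under these assumptions the ‘observed power’ is computed from the p-value and alpha alone, so it cannot tell us anything the p-value has not already told us.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def observed_power(p_value: float, alpha: float = 0.05) -> float:
    """Post-hoc power of a two-sided z-test, computed from the p-value alone.

    Assumption: the observed effect is treated as if it were the true effect.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)    # significance threshold for |z|
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)   # |z| implied by the p-value
    # Probability that a replicate z falls in either rejection region
    # if the true standardized effect equals z_obs.
    upper_tail = 1 - _STD_NORMAL.cdf(z_crit - z_obs)
    lower_tail = _STD_NORMAL.cdf(-z_crit - z_obs)
    return upper_tail + lower_tail


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.10, 0.30, 0.70):
        print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

In this sketch the observed power falls steadily as the p-value rises and sits at roughly 50% when p equals alpha. Other tests need not behave in exactly the same way, so the 50% mark should be read as a property of this simple case rather than a general law.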
NB: To avoid misunderstandings: calculating the n’s needed in future experiments to achieve a certain statistical power, based on effect sizes and variance obtained post hoc from a (pilot) experiment, is not post-hoc power analysis (the subject of this post), but sample size calculation.
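As a rough illustration of such a sample size calculation, here is a sketch using the textbook normal approximation for comparing the means of two equally sized groups with a common standard deviation. The function name `n_per_group` and its parameters are hypothetical, and the effect size and standard deviation fed into it would come from the pilot data. A calculation based on the t-distribution would give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Assumptions: normal approximation, equal group sizes, common sd.
    delta is the difference in means to be detected, sd the standard deviation.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # quantile for the significance level
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) * sd / abs(delta)) ** 2
    return ceil(n)  # round up to whole subjects


if __name__ == "__main__":
    # Hypothetical pilot estimate: difference of 0.5 with sd 1.0
    print(n_per_group(delta=0.5, sd=1.0))
```

Halving the effect size roughly quadruples the required n, which is why this calculation belongs in the planning stage and not in the analysis.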
For further reading:
Hoenig and Heisey point out that the observed power is not always less than 50% when the p-value is nonsignificant:
“There is a misconception about the relationship between observed power and p value in the applied literature which is likely to confuse nonstatisticians. Goodman and Berlin (1994), Steidl, Hayes, and Schauber (1997), Hayes and Steidl (1997), and Reed and Blaustein (1997) asserted without proof that observed power will always be less than .5 when the test result is nonsignificant.” (They go on to give a counterexample.)