Power and Sample Size Calculations in Pilot Studies

DTWilson · November 28, 2019, 3:44pm

Resurrecting this thread in response to an addition to the “statistical myths” thread, the gist of which is “don’t test effectiveness in pilot trials, as they will be underpowered”. I’ve had a couple of brief discussions regarding this on Twitter, but would like to explore in a bit more detail here.

While I agree that we should avoid low-powered tests in pilots (since they would lead to many promising interventions being discarded before being properly evaluated in a large trial), the low sample size of a pilot does not in itself imply low power. A test can have as much power as we like, provided we are willing to increase the (one-sided) type I error rate above the conventional 0.025 to 0.05.
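To make the trade-off concrete, here is a minimal sketch of how power changes with the one-sided type I error rate at a fixed, small sample size. It assumes a two-arm trial with equal allocation, a normally distributed outcome with known SD, and a standardised effect size; the function name and the illustrative values (30 per arm, effect size 0.3) are my own assumptions, not taken from any particular trial.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def one_sided_power(n_per_arm, effect_size, alpha):
    """Power of a one-sided two-sample z-test (normal approximation, known SD).

    effect_size is the standardised mean difference (hypothetical input).
    """
    if alpha >= 1:
        return 1.0  # never stopping means never missing a true effect
    if alpha <= 0:
        return 0.0  # never proceeding means no power at all
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    # Expected value of the z-statistic under the alternative
    noncentrality = effect_size * sqrt(n_per_arm / 2)
    return STD_NORMAL.cdf(noncentrality - z_crit)


# Illustrative pilot: 30 per arm, standardised effect of 0.3 (assumed values)
for alpha in (0.025, 0.05, 0.2, 0.5, 1):
    print(alpha, round(one_sided_power(30, 0.3, alpha), 3))
```

Under these assumptions the power rises steadily as alpha is relaxed, and reaches 1 at alpha = 1.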

If we don’t look at effectiveness in a pilot at all, this is equivalent to testing with a type I error rate of 1 and a type II error rate of 0 *****. This implies that we have an absolute preference for minimising type II errors over type I errors. I don’t think this is ever the case.

If we don’t want to test with a type I error rate of 0.025 or 1, but rather somewhere in between, how can we decide what it should be? And how do we make the related decision of the choice of pilot sample size?

All this assumes the pilot trial gives an unbiased estimate of the true treatment effect. In many cases we might expect the pilot estimate to be biased, and this would (in my opinion) provide more justification for not testing than the precision argument.

Thoughts?

***** I am assuming here that not testing effectiveness means we will proceed to the main trial if it is thought to be feasible, regardless of the pilot effect estimate. I don’t think this is actually the case: if the pilot estimate is a very large negative effect, e.g. with a confidence interval entirely below 0 (assuming a positive effect favours the intervention), then of course we would not proceed. But then I would ask exactly how bad the estimate would have to be to make a “stop” decision, and what error rates that decision rule would give. Maybe we are already testing with a very high type I error rate, but doing so in an informal and opaque way?
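As a rough illustration of that last point, the informal rule “stop only if the two-sided confidence interval lies entirely below 0” can be written as a formal test and its error rates computed. This is a sketch under the same simplifying assumptions as before (two equal arms, normal outcome, known SD, standardised effect size); the function name and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def informal_rule_error_rates(n_per_arm, effect_size, ci_level=0.95):
    """Error rates of 'stop only if the two-sided CI is entirely below 0'.

    Returns (type I, type II): P(proceed | no effect) and
    P(stop | true standardised effect = effect_size).
    """
    z = STD_NORMAL.inv_cdf(1 - (1 - ci_level) / 2)
    # Expected value of the z-statistic under the alternative
    noncentrality = effect_size * sqrt(n_per_arm / 2)
    # The upper CI limit is below 0 exactly when the z-statistic is below -z
    type_1 = 1 - STD_NORMAL.cdf(-z)
    type_2 = STD_NORMAL.cdf(-z - noncentrality)
    return type_1, type_2


# Illustrative pilot: 30 per arm, standardised effect of 0.3 (assumed values)
type_1, type_2 = informal_rule_error_rates(30, 0.3)
print(round(type_1, 3), type_2)
```

With a 95% interval the implied one-sided type I error rate is 0.975 whatever the sample size, which is the sense in which we may already be testing, just with a very lenient threshold.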