Power and Sample Size Calculations in Pilot Studies

This happens from time to time: I work with someone who is putting together an application for a pilot study, usually with a funding mechanism / budget that will only support enrollment of a few dozen subjects and that is clearly meant to be a stepping stone. Yet the PI often feels obligated to write something about “Power,” even though I feel a better tack would be simply justifying what we will be able to learn from the proposed sample size. That will almost never be anything related to the primary endpoint; rather, we will typically learn about the feasibility of scaling up to a larger study, enrollment challenges, and things about our forms, questionnaires, and visit schedule. I am simply curious what other people feel about this, or what others have done in similar situations.


I’ve had several such examples recently. One was nominally labelled a ‘pilot study’ and there was no desire for a power calculation. I work in pediatrics, where small samples are not unusual; despite the dozen or so patients, they expected the pilot to be publishable. I think we edged away from statistical testing, so there would have been nothing to base a power calculation on anyway (it was clear in their minds that assessing feasibility was the point).

In the second example I was approached after they had submitted to a journal (or were about to submit). It wasn’t a pilot, but the same issue existed: they were required to say something about power. Their paper reported many Spearman correlations, so I simply determined the CI width obtainable given the sample size and then declared ‘this is reasonable’ (I think I also suggested the term ‘convenience sample’).

In the third case it was also a pilot study, although it wasn’t presented that way; I think we called the results ‘preliminary results’, since we wanted data to inform a power calculation for a subsequent study. The conference organisers were demanding a retrospective power calculation. I told the authors to refuse and how to word the refusal.

There is very little consistency across these examples, and I’m not saying I handled them perfectly :slight_smile: It depends on who’s demanding the words re power. In your case, because it’s the researcher, I’d edge away from it; I’d share with them another proposal or paper you’ve done for a pilot where no power calculation was described. There’s a tendency to do what was done before.
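The CI-width check mentioned for the Spearman correlations can be sketched roughly like this, using the Fisher z approximation (a common approximation whose small-sample accuracy for Spearman correlations is debatable; the function name and the n = 30 example are mine, not from the post):

```python
import math

def spearman_ci_halfwidth(r, n):
    """Approximate 95% CI half-width for a correlation of r at sample
    size n, via the Fisher z transformation (SE ~ 1/sqrt(n - 3))."""
    z = math.atanh(r)                 # Fisher z of the observed correlation
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error in z-space
    lo = math.tanh(z - 1.96 * se)    # back-transform the interval endpoints
    hi = math.tanh(z + 1.96 * se)
    return (hi - lo) / 2

# With n = 30 and a modest observed correlation of 0.3,
# the half-width is about +/- 0.33:
print(round(spearman_ci_halfwidth(0.3, 30), 2))
```

Whether a half-width that wide is “reasonable” is, of course, the judgment call the post describes.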

edit: the other idea is to do the power calculation for them, show them that the power is inadequate, and then ask if they still want to report it in the proposal


In such situations, I specifically focus on feasibility metrics given that the pilot study is being conducted to assess the feasibility of the project and to scale to a bigger study. Some FOAs specifically instruct the reviewers to ‘take off points’ if a power analysis is provided. The important thing is to clearly define the feasibility metric and define success of the study based on those metrics.


I think there is some value in power calculations generated by simulation, even in the setting of a pilot study. The value is NOT the actual power or sample-size numbers (which are often hogwash for larger studies as well). The value of constructing a simulation is that it helps flesh out the details of the analysis plan: it requires the research team to make decisions about the type of model, the form of the covariates, and the method of inference. Constructing a simulation also helps the research team understand the best- and worst-case scenarios.

Should power calculations be included as part of a pilot study grant application? Probably not. However, I think constructing a simulation is a helpful tool for thinking through the details of the analysis plan.
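A minimal version of such a simulation might look like the following. The model here (normal outcome, two-sample t-test, a half-SD effect) is a placeholder of my own choosing, not anything specified in the thread; a real analysis plan would substitute its own model, covariates, and inference method:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

def simulated_power(n_per_arm, effect, sd=1.0, alpha=0.05, n_sims=2000):
    """Estimate power for a two-arm comparison of means by simulation:
    generate many trials under the assumed effect and count how often
    the planned test rejects at the chosen alpha."""
    hits = 0
    for _ in range(n_sims):
        ctrl = rng.normal(0.0, sd, n_per_arm)
        trt = rng.normal(effect, sd, n_per_arm)
        _, p = stats.ttest_ind(trt, ctrl)
        hits += p < alpha
    return hits / n_sims

# A dozen patients per arm with a half-SD effect: power is typically poor.
print(simulated_power(n_per_arm=12, effect=0.5))
```

The exercise of writing even this much forces the decisions described above, which is where the value lies.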

Besides fully emphasizing the feasibility aspects, if statistical details are really required I try to avoid power calculations for pilot studies and just do precision calculations. For example, I compute for the main response variable the likely margin of error (half-width of the 0.95 confidence interval) that would result with the planned sample size. See here for example.
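For a continuous response, that margin-of-error calculation can be sketched as below (the numbers are illustrative and the assumed SD would in practice come from prior data or the literature):

```python
import math
from scipy import stats

def margin_of_error(n, sd, conf=0.95):
    """Half-width of the confidence interval for a single mean, given a
    planned sample size n and an assumed SD."""
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
    return t_crit * sd / math.sqrt(n)

# With n = 25 and an assumed SD of 10, the mean would be estimated
# to within about +/- 4.1:
print(round(margin_of_error(25, 10), 1))
```

Reporting “with n = 25 we can estimate the mean to within roughly ±4.1 units” is often a more honest sample-size justification for a pilot than any power statement.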


Thanks very much to all that have responded.

Good for them! If the PI well and truly understands this, the problem I describe seems fairly moot.

Ah, this old chestnut. I have also seen journals with well-meaning requirements that the authors must make a statement about the “power” that have enforced this in situations where it really ought not to apply. Such as:


Neither do I claim to have handled just about anything perfectly. In many cases, I simply hope that my contribution kept things from being worse than they would have been without it.

The “tendency to do what was done before” is very real, and quite harmful to progress (because it seems to discourage actual thought about what’s appropriate for the current proposal). In some cases I simply don’t have a “what was done before” example to use (I’m only five years post-PhD, and so may not have quite the portfolio that others have amassed).

Rameela, very glad to see you contributing here! I hope you will be around for further discussions.

Completely agreed that this is the most appropriate solution for pilot study applications. It is interesting that some instruct reviewers to penalize applications that provide an inappropriate power analysis.

I cannot disagree with any of that!

I will start pushing a bit harder for this on any future applications.

Many thanks to all discussants.


The journal wants something about power

This may be the time to tell the journal what a pilot study is. The power to detect what it is intended to detect, in the case of a pilot study, is the power to detect errors, omissions, and potential improvements in the protocol. So its power depends on the methods used to find these things.

On the other hand, I have seen investigators try to do a small, underpowered study and excuse it as a ‘pilot’.


Resurrecting this thread in response to an addition to the “statistical myths” thread, the gist of which is “don’t test effectiveness in pilot trials, as they will be underpowered”. I’ve had a couple of brief discussions regarding this on Twitter, but would like to explore in a bit more detail here.

While I agree that we should avoid low-powered tests in pilots (since they would lead to many promising interventions being discarded before being properly evaluated in a large trial), a low pilot sample size does not imply low power. A test can have as much power as we like, provided we are willing to increase the (one-sided) type I error rate above the usual 0.025–0.05 conventions.
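To make that trade-off concrete, here is a quick normal-approximation calculation (the sample size, effect size, and alpha values are illustrative, not from the thread):

```python
import math
from scipy import stats

def one_sided_power(n_per_arm, effect, sd, alpha):
    """Normal-approximation power of a one-sided two-sample test of
    means, showing how power grows as alpha is relaxed."""
    se = sd * math.sqrt(2.0 / n_per_arm)      # SE of the difference in means
    z_crit = stats.norm.ppf(1 - alpha)        # one-sided critical value
    return stats.norm.cdf(effect / se - z_crit)

# Same pilot (n = 15/arm, effect = 0.5 SD): relaxing alpha buys power,
# roughly 0.28 at alpha = 0.025 versus about 0.76 at alpha = 0.25.
for alpha in (0.025, 0.1, 0.25):
    print(alpha, round(one_sided_power(15, 0.5, 1.0, alpha), 2))
```

The question of whether a 0.25 type I error rate is an acceptable price is exactly the decision problem raised below.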

If we don’t look at effectiveness in a pilot at all, this is equivalent to testing with a type I error rate of 1 and a type II error rate of 0 *****. This implies that we have an absolute preference for minimising type II errors over type I errors. I don’t think this is ever the case.

If we don’t want to test with a type I error rate of 0.025 or 1, but rather somewhere in between, how can we decide what it should be? And how do we make the related decision of the choice of pilot sample size?

All this assumes the pilot trial is estimating the true treatment effect. In many cases we might expect the pilot to be biased, and this would (in my opinion) provide more justification for not testing than the precision argument.


***** I am assuming here that not testing effectiveness means we will proceed to the main trial if it is thought to be feasible, regardless of the pilot effect estimate. I don’t think this is actually the case: if the pilot estimate is a very large negative effect, e.g. with a confidence interval entirely below 0 (assuming a +ve effect favours the intervention), then of course we would not proceed. But then I would ask exactly how bad the estimate would have to be to trigger a “stop” decision, and what error rates that decision rule would give. Maybe we are already testing with a very high type I error rate, but doing so in an informal and opaque way?
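The error rates of that informal rule can be made explicit by simulation. Under a placeholder normal model of my own (not anything specified above), the rule “proceed unless the pilot 95% CI lies entirely below 0” behaves like a one-sided test with alpha near 0.975:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def proceed_rate(n_per_arm, true_effect, sd=1.0, n_sims=4000):
    """Simulate the informal rule 'proceed to the main trial unless the
    pilot 95% CI lies entirely below 0' and return how often it proceeds."""
    proceeds = 0
    for _ in range(n_sims):
        ctrl = rng.normal(0.0, sd, n_per_arm)
        trt = rng.normal(true_effect, sd, n_per_arm)
        diff = trt.mean() - ctrl.mean()
        se = np.sqrt(trt.var(ddof=1) / n_per_arm + ctrl.var(ddof=1) / n_per_arm)
        upper = diff + stats.t.ppf(0.975, 2 * n_per_arm - 2) * se
        proceeds += upper > 0          # proceed unless CI entirely below 0
    return proceeds / n_sims

# Under a true null effect the rule proceeds about 97-98% of the time,
# i.e. an implicit one-sided type I error rate near 0.975.
print(proceed_rate(n_per_arm=15, true_effect=0.0))
```

This makes the “informal and opaque” testing explicit: the rule is essentially a one-sided test at a very high alpha.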


I think this particular quote is important. To be transparent, I added this to the wiki and not @ADAlthousePhD as indicated.

Regarding power, I think the paper by Moore explains it better than I can:

Schoenfeld suggested that preliminary hypothesis testing for efficacy could be conducted with a high type I error rate (false positive rate up to 0.25) and that subsequent testing would provide a control over the chances of a false discovery in spite of the initial elevated error rate. He suggested that this approach could ensure that the aims of the pilot study are completed in a timely manner due to the use of a smaller sample size. If such an approach is taken, it is important to remember that a pilot study should not be designed to provide definitive evidence for a treatment effect. It should, on the contrary, be designed to inform the design of a larger more comprehensive study and to provide guidance as to whether the larger study should be conducted.

My take-aways from this are 1) to place lower confidence in our results and 2) that a pilot study should not be designed to provide definitive evidence. The gist is that pilots are underpowered to provide definitive evidence of a treatment effect and should not be taken as such.

Anyway, I think the last point you make about bias is important, but I tend to think that bias and precision are related and that both argue against doing inferential hypothesis testing in a pilot study. One of the purposes, in my view, of doing pilot studies in the first place is to try to identify sources of bias or challenges (for example in the screening/recruitment/enrollment/selection process of participants) that can affect estimates and their confidence intervals. Two or three outlying observations can be wildly influential in this situation, lead to a p-value < or > 0.05, and consequently bias decisions with regard to future trials.

My (relatively limited) experience with this is that although I would have liked to present data on outcomes descriptively, journal editors rarely if ever accept this and require some form of outcome-based testing, regardless of the CONSORT extension for pilot studies. My pragmatic way of approaching this is to do some tests, report p-values and CIs, and then make it abundantly clear that we will not be making decisions for a future trial based on these tests alone, but will interpret them with care and in light of the literature in general.

Thanks for flagging the Moore paper on pilot studies. One comment and one question:

  1. The link to this article by Gelman/Carlin is found in another thread but is also relevant to a discussion of pilot studies. It talks about problems in estimating plausible effect size from small-sample studies:


  2. This statement from the Moore paper confuses me:

“Because pilot clinical trials are usually small, they have a high likelihood of imbalances at baseline, leading to unreliable estimates of treatment effects.”

Is this statement correct in its attribution of treatment effect estimate “unreliability” to baseline (covariate?) imbalance? My “take-away” message from the “Table 1” thread in this forum is that inferences remain valid provided that the randomization process itself was sound (regardless of any resulting between-arm “imbalance” we might see in the distribution of any particular set of covariates we might choose to measure). My simplistic interpretation is that “random processes affect random error “ (i.e, width of the confidence interval) and “systematic processes affect systematic error/bias” (i.e., location of the point estimate).

Only after repeated reading of threads in this forum have I finally trained my brain to reject my medical school critical appraisal teaching and to stop looking for between-arm covariate imbalances as a way to “explain-away” an RCT result that seems counterintuitive. But published statements like the one in bold above seem to highlight how fairly important misconceptions get propagated through medical education programs, leading later to erroneous interpretation of certain RCTs by physicians…Maybe there’s a nuance here that I’m not understanding- happy to be corrected…


Regarding your second point I think your intuition is right here. Smaller studies will be more prone to random error even if the randomization process itself was sound. The quote from the article may be imprecise, but the impact of random error will probably be more severe in smaller trials.

Regarding systematic bias, it depends. An additive bias that affects both arms equally will generally cancel in the treatment-effect estimate, but a combination of various systematic biases might shift it, for example person-specific biases in memory-based measurements.

The idea of pre-setting the type I error probability is not required in the Bayesian paradigm. Bayes simplifies this process because one can rationally “play the odds” in decision making that rests on the posterior probability of an effect. The same posterior probability calculation could be used as in non-pilot studies, but the value of the posterior probability could be interpreted more leniently. Formal decision making such as that sometimes used in going from one stage in drug development to the next stage might also be considered.
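As a deliberately simple sketch of what “playing the odds” might look like, here is a conjugate normal-prior / normal-likelihood posterior probability of a positive effect. The numbers and the skeptical prior are my own illustrative assumptions; a real analysis would justify its prior:

```python
from scipy import stats

def posterior_prob_positive(obs_effect, obs_se, prior_mean=0.0, prior_sd=1.0):
    """Posterior probability that the effect exceeds 0, combining a
    normal prior with a normal likelihood by precision weighting."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / obs_se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * obs_effect)
    return 1 - stats.norm.cdf(0.0, loc=post_mean, scale=post_var ** 0.5)

# A noisy pilot estimate (effect 0.4, SE 0.35) under a skeptical prior
# gives a posterior probability of benefit around 0.86:
print(round(posterior_prob_positive(0.4, 0.35), 2))
```

A probability like 0.86 might be plenty to justify a larger trial even though it would never meet a conventional significance threshold, which is the leniency described above.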

I think this is a great question / line of discussion, and I’ll have to get back to it at some point (busy given the post-Thanksgiving crush of emails and tasks). Thanks to all of those that have chimed in, and I’ll try to get back to this soon…


Having the need to engage in a sample size justification for a pilot study recently, and consistent with Frank’s prior Aug 2018 comment regarding precision, I came across the following March 2018 reference that others may find helpful:

C.O.S. Sorzano, D. Tabas-Madrid, F. Núñez, C. Fernández-Criado, A. Naranjo.
Sample Size for Pilot Studies and Precision Driven Experiments.

There is also an online set of tools/calculators that implement the approaches taken in the above paper that are available here: