Power analysis in observational studies

Does anyone have a good reference about power analysis in observational studies, mostly retrospective ones?
I have read conflicting opinions on this topic.

Sorry, but that makes almost no sense. Why exactly do you want to do a power analysis on retrospective observational data? Power analyses can be useful in certain contexts, such as designing an experiment or trial where there is random assignment and the ability to replicate studies, but I don’t understand why anyone would do one for retrospective data. That just sounds like calculating observed power; for that discussion, and why it’s a bad idea, please see here.



I can tell you, from the perspective of one who has been involved in observational studies for the better part of 40 years, I have never conducted an a priori power/sample size calculation for any of them.

There have been subsets of them, but a minority, where we pre-defined the precision of the estimates that we will likely achieve, at given prevalences, using the width of 95% confidence intervals for binary variables. Those are easy calculations to make, once you have a sense of the likely sample size that you can reasonably obtain.
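As a rough illustration of the kind of calculation described above, here is a minimal Python sketch. It assumes the simple Wald (normal approximation) interval for a proportion; the post does not say which interval was used, and the function names, prevalences, and sample sizes are hypothetical. The Wald interval behaves poorly for small samples or prevalences near 0 or 1, where a Wilson or exact interval would be a better choice.

```python
import math

def wald_half_width(prevalence, n, z=1.96):
    """Approximate half-width of a 95% Wald CI for a proportion (assumed method)."""
    return z * math.sqrt(prevalence * (1 - prevalence) / n)

def n_for_half_width(prevalence, half_width, z=1.96):
    """Smallest n giving at most the requested half-width at the assumed prevalence."""
    return math.ceil(z ** 2 * prevalence * (1 - prevalence) / half_width ** 2)

# Hypothetical planning grid: precision achievable at a few plausible sample sizes.
for n in (100, 400, 1600):
    for prevalence in (0.10, 0.30, 0.50):
        hw = wald_half_width(prevalence, n)
        print(f"n={n:5d}  prevalence={prevalence:.2f}  half-width=+/-{hw:.3f}")
```

Running the grid for the sample sizes you can realistically obtain shows quickly whether the resulting intervals would be narrow enough to be useful.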

Sample sizes in observational studies are typically based upon “convenience samples”. That is, given the various constraints of the prevalence of the disease and/or treatment that you are focused upon, the inclusion/exclusion criteria, the timeline for enrollment (in the prospective setting), whether you are restricted to a single study site or multiple sites, and the budget for the study, how many patients can be reasonably included?

In the retrospective setting, you have the additional challenge of dealing with the quality and completeness of the data that you can obtain, since you are entirely dependent upon what was or was not recorded in your available data sources in the past. There may be other per-patient variations in terms of diagnostic testing that was or was not done, based upon clinical practice at the time. Further, you may also have to deal with inter-observer variability in how qualitative data may have been recorded, in the absence of clear standards and definitions for these, as would typically be the case in a prospective study.

Each of those factors can impact your available sample size, if your inclusion criteria require that certain data be available for inclusion and analysis.

You may wish to consider conducting a small pilot study, to get a sense for those issues and how they may impact your study and any relevant design changes, before engaging in the full study.


The terminology here may be the problem. The original post really isn’t very clear about the intention so it’s hard to give useful advice about such generic terms. But I can think of one place where some degree of power calculation or sample size planning could be performed but is often not.

Suppose that a cardiology fellow is about to embark on a chart review project. Most would consider this “retrospective” even if the data collection process will be prospective, e.g. we will sit down and outline what to collect and then the person may go and collect the data. In many cases, as alluded to by another poster on the thread, this just turns into some form of a convenience sample, e.g. “we abstracted data for all the cases we had available” or “we abstracted the cases from the last 5 years”.

In this setting, I often advocate for some thought about the question(s), effect size(s) of interest, and number of cases that would be needed to detect those effects before even setting off on the project, for two reasons. The first is that if it’s completely hopeless to find anything useful (e.g. would need 500 cases and there are only 50 available) they can try to redirect their efforts to another, potentially better project; the second is that they might be able to efficiently allocate their efforts (e.g. if we determine that 500 is a good sample size for the project, they can review exactly 500 charts and no more rather than trying to do all 4,000 cases available).
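A feasibility check of this kind might be sketched as follows. This is only an illustration under stated assumptions: it uses the standard normal-approximation formula for comparing two proportions with equal group sizes, and the event rates (30% vs. 20%), the function name, and the count of available charts are hypothetical rather than taken from any real project.

```python
import math
from statistics import NormalDist

def n_per_group_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate cases per group to detect p1 vs. p2 (two-sided test, equal groups)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)

# Hypothetical scenario: outcome rate of 30% in one group vs. 20% in the other.
needed_total = 2 * n_per_group_two_proportions(0.30, 0.20)
charts_available = 4000  # assumed number of charts that could be reviewed

if needed_total > charts_available:
    print(f"Need about {needed_total} charts but only {charts_available} exist.")
else:
    print(f"Reviewing about {needed_total} of {charts_available} charts may suffice.")
```

The result is only as good as the assumed effect size, so it is worth repeating the calculation over a range of plausible rates before committing to a number.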

Again, hard to answer the initial query since it is so vague, but I don’t entirely dismiss the idea of power in “retrospective studies” because it’s dependent on, well, this kind of stuff.


I see your point. I just read the attached post and I’ve got to say… what a fruitful discussion! I learned a lot, thank you.

Those are great expert insights. I’m currently on the other side: preparing for my first clinical research project, which will probably be a retrospective cohort. Thank you so much!

Sorry, I was indeed very vague. Your example is interesting, but I see it as a fully prospective study. In that case, a power analysis is essential, because as you said, investigators need to allocate time and money in a more efficient way.

If you are planning your first observational study, the EMA recently issued a very detailed draft guidance on registry studies, which is worth reading (https://www.ema.europa.eu/en/guideline-registry-based-studies#draft-under-public-consultation-section). AHRQ also publishes a periodically updated guideline for registries (https://effectivehealthcare.ahrq.gov/products/registries-guide-4th-edition/users-guide). It is important to understand that a prospective, observational registry study doesn’t generally have a sample size based on power calculations. Many registries are descriptive, reporting on “what are the outcomes of this type of patients”, “how are patients with this disease treated,” or “what are the demographics of people diagnosed with X.” You still might want to do some sample size calculations in order to understand how many patients will be required and how long the study will need to run.


These look great! I’ll definitely have a look, thank you

I am transitioning from epidemiology to biostatistics and this is my take: All epidemiologic investigation proposals must have power analysis. Period.

I have observed unfortunately that this is not the usual practice in the pharma/industry arena where I sense the “observational study bucket” is meant to broadly capture general descriptive studies. Most often, these studies are conducted without thorough considerations for methodologic issues and best practices.

It will be of limited value to execute an observational/ epidemiologic study without assessing the sample or power needed to appropriately observe the direction and size of the effect of interest before sinking in resources.

I’ll have to respectfully disagree. Power is problematic even for randomized clinical trials because it is dependent on (1) null hypothesis tests with artificial thresholds and (2) only one possible effect size to detect. And that is the best situation for power. In observational studies power is even more problematic. In either type of study it is more relevant to either (1) quantify the total information content in the data, e.g. effective sample size, or (2) estimate the margin of error in estimating the main quantity of interest (e.g. an exposure odds ratio).

Before a study is done or the sample is completed we can estimate an expected margin of error. After the sample is complete we just compute the actual margin of error. I take the term margin of error a bit loosely to be half of the width of a 0.95 compatibility interval or Bayesian highest posterior density interval.
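To make the idea of an expected margin of error concrete, here is a minimal sketch for an odds ratio from a 2x2 table. It assumes the usual large-sample standard error of the log odds ratio (the square root of the summed reciprocals of the four cell counts) and plugs in expected cell counts; the group sizes, outcome probabilities, and function name are hypothetical. It is a frequentist approximation and does not cover the Bayesian interval mentioned above.

```python
import math

def expected_or_margin(n_group1, n_group2, p_group1, p_group2, z=1.96):
    """Expected margin of error for an odds ratio, computed on the log scale.

    Returns (half-width of the 0.95 interval for log OR, multiplicative margin).
    """
    # Expected counts in the four cells of the 2x2 table.
    a = n_group1 * p_group1
    b = n_group1 * (1 - p_group1)
    c = n_group2 * p_group2
    d = n_group2 * (1 - p_group2)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    margin_log = z * se_log_or
    return margin_log, math.exp(margin_log)

# Hypothetical planning values: 200 per group, proportions of 0.30 and 0.20.
margin_log, fold = expected_or_margin(200, 200, 0.30, 0.20)
print(f"Margin on the log scale: {margin_log:.2f}")
print(f"Interval runs from OR / {fold:.2f} to OR * {fold:.2f}")
```

Once the data are in hand, the same formula applied to the observed counts gives the actual margin of error, with no hypothesis test or single target effect size required.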


Hi Marc,

I need to estimate the precision of a predictive algorithm (95% CIs) using retrospective data, as you seem to describe here:

"I can tell you, from the perspective of one who has been involved in observational studies for the better part of 40 years, I have never conducted an a priori power/sample size calculation for any of them.

There have been subsets of them, but a minority, where we pre-defined the precision of the estimates that we will likely achieve, at given prevalences, using the width of 95% confidence intervals for binary variables. Those are easy calculations to make, once you have a sense of the likely sample size that you can reasonably obtain."

The algorithm predicts a binary outcome (success or failure) of a procedure/treatment (P/T) based on six patient-level biometric inputs (age, weight, etc.). It produces a point estimate of the probability of success, expressed as a percentage, such as:

Patient 21 has a 35% chance of success if they undergo the (P/T)
Patient 34 has a 76% chance of success if they undergo the (P/T)

Etc, with chances of success ranging from a theoretical 0% to 100%.

I want to do a validation study of this algorithm using retrospective data.

Clinically, it is important to understand the precision of these predictions (sensitivity/specificity/PPV/NPV) at different predictive point estimates. Prior validation studies of the algorithm demonstrate much greater precision when predicted success is high and diminishing precision when predicted success is low (using deciles as in the calibration curve above). It is not clear whether the lack of precision at lower estimates of predicted success is due to smaller numbers of subjects with lower scores or to imprecision in the algorithm itself (e.g., unmeasured variables affecting some subjects more than others).

Unfortunately, detailed findings by stata are not provided, only AUC and calibration curve graphs.

I want to test this algorithm using a sample large enough to report statistically significant differences in predicted and actual success of the P/T at each decile of predicted success. I want to assure potential supporters that my sample will likely be large enough to measure precision at lower levels of predicted success.

For example, my final sample MAY have 50/150/250 people at each decile (data are still being extracted).

Is this a question you can help with?


Hi Patrick,

For clarification, the text that you quote from my prior comments was in the context of providing for an estimate of the required sample size to achieve a desired precision (half width) for 95% confidence intervals for discrete observed data in isolation, not for the estimated precision of predictions from a potentially complex multivariable model as part of model development and validation.

On the latter subject, at least initially, I would refer you to, first, the second edition of Frank’s book, “Regression Modeling Strategies”, where in the index, there are several references to estimating sample sizes for various models that you can review, along with associated validation methods and other relevant content.

I would also refer you to Ewout Steyerberg’s book, “Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating”, where there are references to relevant topics on sample sizes in chapter 19 of his book, and as with Frank’s book, other relevant content throughout as well.

I do not use Stata, but do use R, and Frank has a number of relevant functions in his “rms” and “Hmisc” packages for engaging in model development and validation processes.

There is also a discussion thread here that you might find of interest:

My recommendation, given some of your comments, would be to seek local statistical expertise to aid you in this endeavor. You will likely want to take advantage of that expertise in a collaborative environment, taking into account relevant prior work, and how that may reasonably influence what you do moving forward, not only with respect to the overall sample size that may be apropos, but also any considerations for the characteristics of that sample, and the implications of the intended decision making that should be part of the overall process.


Deciles should play no role.