Sample size and Power analysis for propensity score methods for causal inference in non-randomized interventional studies

Hello everyone. I am looking for validated procedures for sample size estimation (for nonrandomized intervention studies) and power analysis (for electronic health record-based studies with a limited number of records) when using propensity score methods (matching and weighting).

Are there recommended procedures?

I found these two:

However, the first requires information on the C statistic of the model that estimates the propensity scores, which is not usually reported in studies (should it be reported more often?). The second simplifies the problem to a weighted Mantel-Haenszel test but gives no practical suggestions.


I believe your question is not complete for several reasons.

  1. Sample size and power estimation depend on several factors besides the type I error rate, power, and effect size of interest. They also depend on the type of study: is it time-fixed, time-varying, etc.?
  2. The use of propensity score methods or matching indicates that you are trying to make your covariates independent of exposure. You are trying to decouple the effects of covariates on exposure and mimic an RCT. This has nothing to do with power and sample size estimation.

So based on these factors I think your question raises more questions.


Minor comment: α is not a rate and it is not an error probability. See Biostatistics for Biomedical Research - 22 Controlling α vs. Probability of a Decision Error


I think the question is good. I am inexperienced with PSM, and have also wondered how one should think about power when using PSM methods. IMO there are at least three slightly different but related questions:

(1) What is the needed sample size (for a given effect etc), assuming the sample will be weighted/matched?

Simplest case: Suppose your PSM achieves the ideal, perfect 1:1 matching, comparable to observing the counterfactual outcome Y_i^{untreated} alongside Y_i^{treated} for each subject i. Then it kind of makes sense to consider e.g. a paired t-test instead of an unpaired t-test. Consequently, prior to running the experiment, one should consider calculating power for the paired test, too. Counterargument: maybe, despite the similar propensity scores, your matches are not similar enough to warrant a paired test.
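To make the paired vs. unpaired comparison concrete, here is a minimal sketch using a normal approximation (not an exact t-test calculation). The function name `approx_power` and the parameter `rho` (an assumed correlation of outcomes within matched pairs) are my own illustrative choices, not something taken from the linked references. With `rho = 0` the formula reduces to the usual two-sample case with `n_pairs` subjects per arm; a positive `rho` shrinks the standard error and raises power.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta, sigma, n_pairs, rho=0.0, alpha=0.05):
    """Approximate power of a two-sided z-test for a mean difference.

    delta   : true mean difference (treated minus untreated)
    sigma   : outcome standard deviation, assumed equal in both arms
    n_pairs : number of matched pairs (or subjects per arm if rho == 0)
    rho     : ASSUMED within-pair outcome correlation; 0 means the pairing
              carries no information and the test is effectively unpaired
    """
    std_normal = NormalDist()
    # Variance of a within-pair difference is 2 * sigma^2 * (1 - rho).
    se = sigma * sqrt(2.0 * (1.0 - rho) / n_pairs)
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = delta / se
    # Probability of rejecting in either tail under the alternative.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.6):
        print(rho, round(approx_power(delta=0.5, sigma=1.0, n_pairs=64, rho=rho), 3))
```

The catch is exactly the counterargument above: `rho` is unknown before the study, and if the matches are not very similar it may be close to zero, in which case the paired calculation buys little.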

The second question: (2) Is there any specific sample size required to estimate an adequate PSM model itself?

The third question: (3) What is the practical sample size needed so that I retain as many data points as (1)+(2) suggest after I have run all the propensity score procedures? (Data points can be lost as unmatched samples after matching, or through trimming of weights.)
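For (3), a rough back-of-the-envelope sketch is to inflate the target number of matched pairs by the fraction of treated subjects you expect to find a match for. The helper name `treated_needed` and the `match_rate` input are hypothetical; in practice the match rate would have to be guessed from pilot data or similar studies, and this ignores the separate requirement of having enough controls to draw matches from.

```python
from math import ceil


def treated_needed(target_pairs, match_rate):
    """Number of treated subjects to collect so that roughly `target_pairs`
    survive 1:1 matching, given an ASSUMED fraction `match_rate` of treated
    subjects that end up matched."""
    if not 0.0 < match_rate <= 1.0:
        raise ValueError("match_rate must be in (0, 1]")
    return ceil(target_pairs / match_rate)


if __name__ == "__main__":
    # Example: 100 pairs wanted, half of the treated subjects expected to match.
    print(treated_needed(target_pairs=100, match_rate=0.5))
```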

So, what about answers? I am too inexperienced to dare to say anything with much confidence, but in my very limited experience, I think it is quite common for people to take (1) into account when computing the post-matching error estimates. (See e.g. the MatchIt vignette "Estimating Effects After Matching", where they recommend cluster-robust standard errors or block bootstrap as per Abadie and Spiess 2019.) So I think practitioners are definitely thinking about it (matching affecting the error estimates). But I think it is a bit less common to calculate power/sample size assuming paired/clustered versions of tests?

Concerning (2), I have no idea, but intuitively it kinda makes sense (the more data points I have for the logistic model, the better it should be, right?). I think there is a lot of literature on balance diagnostics, but I don't know if it says anything about this. And finally for (3), I found this blog entry and the comments therein helpful: "Power Calculations for Propensity Score Matching?".

There is something mysterious here. The original sample size is often MUCH higher than twice the number of surviving pairs after matching. That casts doubt on the matching approach IMHO.