Bootstrapping CIs in multiply imputed data

Background: I plan to estimate the average causal effect of a continuous exposure on my outcome using the parametric g-formula (i.e., standardized mean difference), and I need to bootstrap the 95% CIs. I’d like to use 1000 (or some other sufficiently large number of) bootstrap samples, if possible.

I also have missing data in my confounders and will impute it via MICE. I’m reasonably confident in my imputation models. Despite only a moderate degree of missingness, I’d like to create something on the order of 50 imputed datasets. These models take a lot of time to run.

Problem: Little and Rubin (2002) suggest generating bootstrap samples before imputation, then imputing in each bootstrap sample, then running MI analyses on each sample, then combining. This is to say I’d have to impute missing data 1000 different times. I know we’re all math types here, but just to be clear, that’s 50,000 imputed datasets. :man_facepalming: Absent massive compute, this just isn’t feasible.

Question: Is generating bootstrap samples after imputation unwise / likely to lead to bias / or otherwise unsound from a statistical perspective? I ask because the obvious workaround to this problem is to simply impute the data once (and create 50 imputed datasets in the process), then generate the bootstrap samples and run the analyses 1000 times. Doing so would save me a ton of time and make this thing possible.
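To make the computational trade-off concrete, here is a minimal counting sketch. The function names are hypothetical, and it assumes that under the impute-first workaround each of the 50 imputed datasets would be bootstrapped 1000 times. The post does not spell that detail out.

```python
def imputed_datasets_needed(n_boot, n_imputations, bootstrap_first):
    """Number of imputed datasets that must be generated.

    bootstrap_first=True : impute inside every bootstrap sample.
    bootstrap_first=False: impute once on the original data.
    """
    if bootstrap_first:
        return n_boot * n_imputations
    return n_imputations


def analysis_fits_needed(n_boot, n_imputations):
    """Analysis-model fits, assuming one fit per (bootstrap sample, imputed dataset) pair."""
    return n_boot * n_imputations


print(imputed_datasets_needed(1000, 50, bootstrap_first=True))   # 50000
print(imputed_datasets_needed(1000, 50, bootstrap_first=False))  # 50
print(analysis_fits_needed(1000, 50))                            # 50000
```

Under that assumption the number of analysis fits is the same either way, so the saving comes entirely from the imputation step, which is the slow part here.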

I can find no literature on point and am hoping y’all can help. Thanks very much for your thoughts and advice.


Very thought-provoking question!

I’m just testing my intuition here.

We have the following scenarios:

  1. Random missing data
  2. Non-Random missing data

We have two possible orderings of resampling and imputation:

  1. Impute then Resample
  2. Resample then Impute

I’d bet that imputing before bootstrapping would be OK, at least in terms of bias, only if we can safely assume the data are missing at random. The variance might be higher than under the Little and Rubin approach because we are effectively bootstrapping with a smaller sample (i.e., less precise imputation).

If the data are not missing at random, then all bets are off, and I suspect the recommendation of Little and Rubin is the most correct approach.

Are my intuitions sound here, or am I totally off base?


Scott, I stumbled upon a nice blog post and thread by Jonathan Bartlett on combining bootstrapping with multiple imputation:

The blog post mentions an article which I believe will be of direct interest to you: "Bootstrap Inference When Using Multiple Imputation" by Michael Schomaker and Christian Heumann. A preprint of the article is available, and the article itself was published in Statistics in Medicine in 2018.

The article uses both theoretical considerations and Monte Carlo simulations to investigate and compare 4 methods of combining bootstrapping with multiple imputation. The goal is to see which of these methods yield valid inference when used in the context of confidence interval construction, i.e., whether the actual coverage level of the confidence interval equals the nominal coverage level.

The first two methods in the article are of the Impute first, Bootstrap next variety: perform multiple imputation first to resolve data missingness, then apply bootstrapping to each imputed data set. These are referred to in the article as MI Boot (PS) and MI Boot, where PS stands for Pooled Sample.

The last two methods are of the Bootstrap first, Impute next variety: apply bootstrap first (without resolving data missingness), then perform multiple imputation for each bootstrapped data set. These are referred to as Boot MI (PS) and Boot MI in the article.
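The two orderings can be sketched roughly as follows. This is a toy illustration, not the authors' code: `impute_once`, `mi_boot`, and `boot_mi` are hypothetical names, the imputation step is a crude normal draw standing in for MICE, and the estimator is a simple mean rather than a g-formula estimate.

```python
import numpy as np


def impute_once(x, rng):
    """Toy stochastic imputation: fill NaNs with draws from a normal
    fitted to the observed values. A stand-in for MICE, not a proper
    imputation model. Assumes at least two observed values."""
    x = x.copy()
    miss = np.isnan(x)
    obs = x[~miss]
    x[miss] = rng.normal(obs.mean(), obs.std(ddof=1), size=miss.sum())
    return x


def mi_boot(x, estimator, M, B, rng):
    """Impute first (M imputed datasets), then draw B bootstrap samples
    from each imputed dataset. Returns an (M, B) array of estimates."""
    n = len(x)
    out = np.empty((M, B))
    for m in range(M):
        x_imp = impute_once(x, rng)            # imputation runs M times in total
        for b in range(B):
            idx = rng.integers(0, n, size=n)   # resample rows with replacement
            out[m, b] = estimator(x_imp[idx])
    return out


def boot_mi(x, estimator, M, B, rng):
    """Bootstrap first (missing values kept), then impute each bootstrap
    sample M times. Returns a (B, M) array of estimates."""
    n = len(x)
    out = np.empty((B, M))
    for b in range(B):
        x_boot = x[rng.integers(0, n, size=n)]
        for m in range(M):
            out[b, m] = estimator(impute_once(x_boot, rng))  # runs B * M times
    return out


# Toy data: 200 values with roughly 20% set to missing at random.
rng = np.random.default_rng(0)
x = rng.normal(10, 2, size=200)
x[rng.random(200) < 0.2] = np.nan

est_mi_boot = mi_boot(x, np.mean, M=5, B=50, rng=rng)
est_boot_mi = boot_mi(x, np.mean, M=5, B=50, rng=rng)
```

The only structural difference is which loop is on the outside, but it determines how often the imputation model runs: M times for the impute-first ordering versus B times M for the bootstrap-first ordering.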

For their simulations, the authors note that “the computation time for Boot MI was always greater than for MI Boot”.

Furthermore, the authors suggest that:

"General comparisons between MI Boot and Boot MI are difficult because the within and between imputation uncertainty, as well as the within and between bootstrap sampling uncertainty, will determine the actual width of a confidence interval."

In the article conclusion, the authors state the following:

"The current statistical literature is not clear on how to combine bootstrap with multiple imputation inference. We have proposed that a number of approaches are intuitively appealing and three of them are correct: Boot MI, MI Boot, MI Boot (PS). Using Boot MI (PS) can lead to too large and invalid confidence intervals and is therefore not recommended.

Both Boot MI and MI Boot are probably the best options to calculate randomization valid confidence intervals when combining bootstrapping with multiple imputation. As a rule of thumb, our analyses suggest that the former may be preferred for small M or large imputation uncertainty and the latter for normal M and little/normal imputation uncertainty.

There are however other considerations when deciding between MI Boot and Boot MI. The latter is computationally much more intensive. This matters particularly when estimating the analysis model is simple in relation to creating the imputations. In fact, in our first simulation this affected the computation time by a factor of 13. However, MI Boot naturally provides symmetrical confidence intervals. These intervals may not be wanted if an estimator’s distribution is suspected to be non-normal."
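The remark about symmetry can be illustrated with a rough sketch of how the two intervals might be formed, as I read the descriptions above: MI Boot plugs a bootstrap variance from each imputed dataset into Rubin's rules, while Boot MI takes percentiles of the per-bootstrap-sample averages. The function names are hypothetical, and the normal quantile is a simplification (Rubin's rules ordinarily use a t reference distribution).

```python
from statistics import NormalDist

import numpy as np


def mi_boot_interval(est_by_imputation, boot_est, level=0.95):
    """Symmetric interval from Rubin's rules with bootstrap variances.

    est_by_imputation: shape (M,), point estimate from each imputed dataset.
    boot_est: shape (M, B), bootstrap estimates within each imputed dataset.
    """
    est_by_imputation = np.asarray(est_by_imputation, dtype=float)
    boot_est = np.asarray(boot_est, dtype=float)
    M = boot_est.shape[0]
    pooled = est_by_imputation.mean()
    within = boot_est.var(axis=1, ddof=1).mean()   # average bootstrap variance
    between = est_by_imputation.var(ddof=1)        # between-imputation variance
    total = within + (1 + 1 / M) * between         # Rubin's total variance
    z = NormalDist().inv_cdf(0.5 + level / 2)      # normal approximation (assumption)
    half_width = z * np.sqrt(total)
    return pooled - half_width, pooled + half_width


def boot_mi_interval(boot_est, level=0.95):
    """Percentile interval: average the M estimates within each bootstrap
    sample, then take quantiles over the B averages.

    boot_est: shape (B, M).
    """
    per_boot = np.asarray(boot_est, dtype=float).mean(axis=1)
    alpha = 1 - level
    lo, hi = np.quantile(per_boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

The first interval is always centred on the pooled estimate, whereas the percentile interval can be asymmetric if the bootstrap distribution is skewed.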

The article also presents a real-world data analysis example where g-formula inference is involved (a setting similar to yours).

I hope this article will provide you with the guidance you need to proceed in your situation.


I agree that the Schomaker and Heumann paper is helpful; it informs my usual choice of taking a singly imputed set and bootstrapping for inference.

A note that Daniel et al.'s implementation of g-computation in Stata also uses the single impute / bootstrap approach, and justifies it by saying that an “improper” imputation (taking the ML estimate rather than sampling from the posterior) is acceptable if one is not using Rubin’s rules for variance estimation. There is a citation there, but I haven’t followed up on it.


Robert, Isabella, and Jon:

I owe all three of you my deepest gratitude for your time and thoughts on this. By the time I posted my question (i.e., after a relatively extensive search), I assumed there probably wasn’t a well-informed answer to it. And yet, these resources are directly on point–so on point, in fact, that I’m not sure how I missed them. (In my limited defense, I wasn’t aware of the gformula command in Stata, but I should have found the 2018 paper in Statistics in Medicine!)

Regardless, this is outstanding. Thank you all again. Very much. You have no idea how much time you just saved me! I’ll do my best to pay the favor forward on this board.



Hi Scott / all

In case it is of further interest, I recently posted a pre-print paper on arXiv with more about this topic.

If there are concerns about congeniality and/or model misspecification, we recommend bootstrapping followed by imputation. To overcome the large computational burden you noted, you can use a faster version of this, which is implemented in the R package bootImpute.