Hopefully this isn’t a woefully ignorant question.
I have been tasked with estimating whether an exposure X is associated with an outcome Y. The number of outcome events will likely support estimating only the intercept plus 3-4 covariates. There are several known strong confounders (of which age is the most important) and a multitude of reasonably likely confounders. I am leaning towards using a propensity score as my data reduction method.
However, data are also missing for several confounders that range from moderately to very important (height, weight and cigarette pack-years). The missingness pattern lends itself well to multiple imputation by chained equations (MICE): on the surface, at least, it appears to satisfy the missing completely at random (MCAR) criterion.
My question is the following. I already intend to include the variables with missing data in the propensity score. Should I first estimate the propensity score for individuals with complete data, and then use the MICE procedure from Hmisc’s aregImpute() function to impute the propensity score for those with missing data? Or should I impute the missing data first, and then estimate the propensity score for all individuals? If the latter, how would I incorporate the variance inflation into the final estimate?
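To make the variance question concrete: as far as I understand it, the usual way to pool results across imputations is Rubin's rules. The notation below is my own and is only a sketch. It assumes that the propensity score is re-estimated within each of the $m$ completed datasets, that $\hat\beta_j$ is the exposure effect estimate from the outcome model fitted to the $j$-th completed dataset, and that $U_j$ is its estimated variance.

$$
\bar\beta = \frac{1}{m}\sum_{j=1}^{m}\hat\beta_j, \qquad
\bar U = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat\beta_j - \bar\beta\right)^2
$$

$$
T = \bar U + \left(1 + \frac{1}{m}\right) B
$$

Here $\bar U$ is the within-imputation variance, $B$ is the between-imputation variance, and $T$ is the total variance of the pooled estimate $\bar\beta$. The $(1 + 1/m)\,B$ term is the inflation I am referring to. What I am unsure of is whether this pooling remains appropriate when the propensity score itself is estimated from the imputed covariates.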