I’m a statistician but not a missing-data expert. I’m advising some collaborators on employing multiple imputation to analyze a clinical neuroimaging study.
As an illustrative example, consider the EMBARC study https://www.ncbi.nlm.nih.gov/pubmed/27038550
a) Neuroimaging features from EEG or fMRI data are available at baseline
b) Clinical symptoms are collected pre- and post-intervention.
c) Question: Scientists are interested in using neuroimaging features to predict change in symptom scores.
d) The caveat is that some post-intervention outcomes are missing due to drop-out, and we plan to impute them.
e) My suggestion for incorporating imputation is a variation of the following:
- i) Use multiple imputation to impute outcomes several times
- ii) For each imputed dataset, conduct machine learning/predictive modeling as usual (super learning/stacked regression of multiple learners, high-dimensional regression, etc.) to predict outcomes per imputation
- iii) Combine predicted outcomes per imputation.
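To make the proposal concrete, here is a minimal sketch of steps i)–iii) on toy data. It assumes scikit-learn's `IterativeImputer` with `sample_posterior=True` as the multiple-imputation engine and a plain `Ridge` regression standing in for whatever learner (super learner, high-dimensional regression, etc.) would actually be used; the data, model choice, and number of imputations are all illustrative, not part of the actual study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy stand-in for the real data: baseline imaging features X and a
# post-intervention change score y with drop-out (missing outcomes).
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)
y_obs = y.copy()
y_obs[rng.random(n) < 0.3] = np.nan  # roughly 30% drop-out

M = 5  # number of imputations
preds = []
for m in range(M):
    # i) Impute the outcome jointly with the features; sample_posterior=True
    # draws stochastic imputations so the M completed datasets differ,
    # as multiple imputation requires.
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    completed = imputer.fit_transform(np.column_stack([X, y_obs]))
    y_m = completed[:, -1]

    # ii) Fit the predictive model on the m-th completed dataset.
    model = Ridge().fit(X, y_m)

    # In-sample predictions for illustration only; a real analysis would
    # predict on held-out data or use cross-validation.
    preds.append(model.predict(X))

# iii) Pool by averaging predictions across imputations.
y_hat = np.mean(preds, axis=0)
```

Averaging the per-imputation predictions is the direct analogue of Rubin's point-estimate rule applied to predictions rather than coefficients; between-imputation variability of `preds` would additionally be needed to reflect imputation uncertainty in any interval estimates.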
I can’t seem to find any discussion in the missing-data/medical-statistics literature of combining predictions (as opposed to coefficient estimates) in this way. It strikes me as easier to pool predictions than to pool model coefficients. Are there any suitable alternatives to pooling predictions? Do you see any problems?