CART and Multiple Imputation

Hoping for guidance on the appropriate way to apply bagged classification trees after multiple imputation:

Here is the example:

Suppose I am using 10-fold CV to estimate the performance of a bagged model (i.e., bagging 1,000 classification trees), with missing data in both the outcome and the predictors that are MAR (MNAR cannot definitively be excluded). This is what I am thinking:

  1. Using the training data only (90%), perform multiple imputation to obtain 10 imputed training datasets.
  2. Using the test data only (10%), perform multiple imputation to obtain 10 imputed test datasets.
  3. Fit a bagged model to each imputed training dataset.

Here is where I am really questioning what to do next:

  1. Apply each of the 10 bagged models to predict the outcome in each of the 10 imputed test sets, which gives 10 predictions per subject per imputed test dataset (100 predictions per subject in total). My rationale is that, unlike a logistic regression, with trees we cannot simply average model coefficients across imputed datasets.

  2. Do either of the following (I am not sure which is correct):

     a. Using the 10 predictions per subject per imputed test dataset, select the most frequently predicted class (majority vote) within each imputed test dataset to obtain one prediction per subject per imputed dataset, calculate the error within each imputed dataset, then average the error across the 10 imputed datasets.

     b. Using the 100 predictions per subject across all of the imputed test datasets, select the most frequently predicted class (majority vote), then use this single prediction to calculate the error within each imputed dataset, then average the error across the 10 imputed datasets.

  3. Repeat this process 10 times (i.e., 10-fold CV) and average the 10 errors to obtain a final estimate of model performance.
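As a concrete sketch of the two error-pooling options, here is a toy example with simulated predictions (all numbers are hypothetical; the shapes follow the 10-models-by-10-imputed-test-sets setup). Note that with 10 voters the vote can tie; the `> 0.5` threshold below silently breaks ties toward class 0, one reason to prefer an odd number of voters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_testsets, n_subjects = 10, 10, 50

# Hypothetical binary predictions: preds[m, d, s] is model m's predicted
# class for subject s in imputed test dataset d.
preds = rng.integers(0, 2, size=(n_models, n_testsets, n_subjects))
# Outcomes can differ across imputed test sets when the outcome itself
# was imputed.
y = rng.integers(0, 2, size=(n_testsets, n_subjects))

# Option (a): majority vote across the 10 models *within* each imputed
# test set, then error per dataset, then average across datasets.
vote_a = (preds.mean(axis=0) > 0.5).astype(int)       # (n_testsets, n_subjects)
err_a = (vote_a != y).mean(axis=1).mean()

# Option (b): one vote across all 100 predictions per subject, then
# score that single prediction against each imputed test set.
vote_b = (preds.mean(axis=(0, 1)) > 0.5).astype(int)  # (n_subjects,)
err_b = (vote_b != y).mean(axis=1).mean()
```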

  • Some other thoughts:
    • Can we follow this same process for a logistic regression or support vector machine as well, such that we don't average model coefficients but instead apply each model developed on an imputed training dataset to the imputed test datasets?
    • An alternative thought: within each fold, stack the 10 imputed training sets into one dataset, develop a single bagged model, and then test this single model on the 10 imputed test sets stacked the same way.
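The stacking alternative could be sketched like this, with toy simulated data; `BaggingClassifier` over decision trees stands in for the 1,000-tree bagged CART model, and dataset sizes are kept small purely for illustration:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
m, n, p = 10, 40, 3  # imputations, subjects, predictors (toy sizes)

# Hypothetical completed datasets: X_imp[d] is the d-th imputed copy.
X_imp = rng.normal(size=(m, n, p))
y = rng.integers(0, 2, size=n)

# Stack the m imputed training sets into one tall dataset and fit a
# single bagged model (each subject appears m times).
X_stacked = X_imp.reshape(m * n, p)
y_stacked = np.tile(y, m)
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                          random_state=0).fit(X_stacked, y_stacked)

# Score a subject by averaging predicted probabilities over its m
# imputed test rows, then thresholding once.
Xt_imp = rng.normal(size=(m, n, p))  # imputed test copies
proba = model.predict_proba(Xt_imp.reshape(m * n, p))[:, 1]
avg_proba = proba.reshape(m, n).mean(axis=0)
pred = (avg_proba > 0.5).astype(int)
```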

Of note, I know that CART can handle missing data via surrogate splits; however, models other than CART will also be used, including SVMs and penalized regressions, and I would like to apply a consistent approach to handling missing data across models.

A few thoughts:

(1) I don’t think you should do Multiple Imputation (MI) separately, within each pass of the CV loop, on the 90% training subset and on the 10% test subset. MI is a prediction process in its own right, and doing it separately will create systematic differences between the fill-in-the-holes predictions developed from the 90% random sample and those from the much smaller 10% random sample. In a sense, this adds superfluous noise to the robustness/stability assessment that the CV process is trying to make of model performance.

So for each CV iteration, build the MI on the 90%, then apply it to both the 90% and the 10%.
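One way to implement this within a fold, sketched with scikit-learn's `IterativeImputer` as a MICE-style stand-in (the data are simulated; a real analysis would also need to handle the missing outcome, which this feature-only sketch does not):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
X_train = rng.normal(size=(90, 4))   # the 90% split
X_test = rng.normal(size=(10, 4))    # the 10% split
# Punch holes in both splits (toy example).
X_train[rng.random(X_train.shape) < 0.15] = np.nan
X_test[rng.random(X_test.shape) < 0.15] = np.nan

# Ten stochastic imputations, each fit on the 90% training split only
# and then applied to both splits, as suggested above.
train_imps, test_imps = [], []
for i in range(10):
    imp = IterativeImputer(sample_posterior=True, random_state=i)
    train_imps.append(imp.fit_transform(X_train))  # learn on the 90%
    test_imps.append(imp.transform(X_test))        # apply to the 10%
```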

(2) If you’re going to bag models to “select the most frequently predicted class (majority vote),” it’s better to avoid setting up situations where ties are possible. Go with an odd number of voters, e.g., 9 or 11 rather than 10.

(3) If you’re going to try multiple algorithms besides CART, you might want to add CART without imputation to the roster, since one of the advantages of CART is its ability to handle data with holes. But how well that works, for any given holey dataset, is an empirical question. Finding out if doing MI as a pre-step provides additional ROI, for the extra time involved, would be a useful thing to be able to say, even if those results might not be clearly generalizable to other work.



Very good points

  1. Will add CART without imputation.
  2. Regarding majority vote, I've come across an alternative procedure that reduces variance: rather than taking a majority vote across trees, average the class probability from the terminal node each subject lands in across trees, and assign the class based on this averaged probability.
  3. Considering your suggestion to “build the MI on the 90%, then apply it to both the 90% and the 10%”: this implies that if, for example, I had a dataset for developing a model and an investigator at another organization had a dataset for validating it, the validating investigator would need to apply my MICE model to their data. Applying MICE separately to the training and test sets would instead imply that this investigator could apply their own imputation procedure (e.g., their own MICE model). I have been thinking about this for a while, and I believe the former makes more sense: if in the future we wanted to apply our model to a single test case, we could not develop an imputation model on that single subject.
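The probability-averaging idea in point 2 ("soft voting") can be contrasted with hard majority voting in a minimal numpy sketch; the leaf probabilities here are simulated, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n_trees, n_subjects = 1000, 20

# Hypothetical terminal-node probabilities: proba[t, s] is the class-1
# proportion in the leaf that subject s reaches in tree t.
proba = rng.random(size=(n_trees, n_subjects))

# Hard majority vote: threshold each tree first, then vote across trees.
vote = (proba > 0.5).mean(axis=0) > 0.5

# Soft voting: average the leaf probabilities across trees, then
# threshold once -- a lower-variance class estimate.
soft = proba.mean(axis=0) > 0.5
```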

re (3), it wasn’t my thought that the MI “results” are or should be baked into your final model. Only that the CV is trying to estimate the robustness/stability of the model predictions, which should include the added uncertainty that MI introduces because not all of the data points are known with certainty. Bear in mind the CV is an INTERNAL validation of expected generalizability and/or measure of optimism. So each CV iteration needs to build and apply MI, which will vary somewhat from iteration to iteration.

My experience with observational healthcare data makes me believe that data holes are almost always MNAR … i.e., a material portion of missing data reflects systematic issues in data collection or processing, and by implication, many of those causes, whether known or unknown, will be institution-specific.

So an external validation site/investigator should apply a “local” MI process to impute values for missings in its own data, before applying your predictive/classification model. Or perhaps validate only on observations that don’t have holes, as a pure test of your model, as opposed to a combined test of model+MI.

For predicting individual test subjects, the “MI predictions” have to be included in the “data preparation recipe.” Basically, you have a two-stage prediction model: first you generate all the MI predictions necessary, and second you run the scoring routine from your model.
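A fitted pipeline is one way to package that two-stage recipe; here is a minimal scikit-learn sketch on simulated data, with a single mean imputer standing in for the full MI step purely for illustration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan   # holes in the training data
y = rng.integers(0, 2, size=100)

# Two-stage recipe: stage 1 fills holes, stage 2 scores. Shipping the
# fitted pipeline ships the imputation model alongside the classifier.
recipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # stand-in for the MI step
    ("score", LogisticRegression()),
]).fit(X, y)

# A single new subject with a missing predictor can still be scored.
x_new = np.array([[0.2, np.nan, -1.0]])
risk = recipe.predict_proba(x_new)[0, 1]
```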

One of the advantages of CART by itself, compared to any algorithm that needs MI to work on all observations, is that it simplifies & streamlines the data pre-processing if your model ever goes into some mode of production or regular usage. That could be a good trade-off, if the performance is close to your “best model.”