I am hoping for guidance on the appropriate way to apply bagged classification trees after multiple imputation:
Here is the example:
Using 10-fold CV for performance estimation of a bagged model (i.e., bagging 1,000 classification trees), with missing data in the outcome and predictors that are MAR (MNAR cannot definitively be excluded), this is what I am thinking:
1. Using the training data only (90%), perform multiple imputation to obtain 10 imputed training datasets.
2. Using the test data only (10%), perform multiple imputation to obtain 10 imputed test datasets.
3. Fit a bagged model to each imputed training dataset.
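To make steps 1-3 concrete, here is a minimal sketch using scikit-learn. All names are illustrative, and for simplicity it imputes the predictors only; imputing the outcome as well (as described above) would typically be done with a dedicated MI package such as mice in R or a joint-modelling approach:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
n, p, M = 200, 4, 10                  # subjects, predictors, imputations

# Toy training fold with ~10% of predictor values set missing
# (MAR is assumed here, not simulated)
X = rng.normal(size=(n, p))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
X[rng.random((n, p)) < 0.10] = np.nan

models, imputed_Xs = [], []
for m in range(M):
    # sample_posterior=True draws a different completion each run,
    # giving M distinct imputed training sets
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    X_m = imp.fit_transform(X)
    imputed_Xs.append(X_m)
    # Bagged classification trees (scikit-learn's default base
    # estimator for BaggingClassifier is a decision tree)
    bag = BaggingClassifier(n_estimators=50, random_state=m).fit(X_m, y)
    models.append(bag)

print(len(models))  # one bagged model per imputed training set
```

The same 10 fitted `models` would then be carried forward to the imputed test sets below.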
Here is where I am really questioning what to do next:

Apply each of the 10 fitted bagged models to predict the outcome in each of the 10 imputed test sets, which would give me 10 predictions per subject per imputed test dataset (100 predictions per subject in total). My rationale is that, unlike with a logistic regression, we cannot simply average tree-based model "coefficients" across imputed datasets.

Do either of the following (not sure which is correct):

Option 1: Within each imputed test dataset, take the majority vote across the 10 model predictions per subject to obtain one prediction per subject per imputed dataset; calculate the error within each imputed dataset; then average the error across the 10 imputed datasets.

Option 2: Take the majority vote across all 100 predictions per subject (pooling over models and imputed test datasets) to obtain a single prediction per subject; use this single prediction to calculate the error within each imputed dataset; then average the error across the 10 imputed datasets.
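The two pooling rules can be written out in a few lines of NumPy. This is a self-contained sketch with simulated binary predictions standing in for the real model outputs; only the pooling logic is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 10, 50                     # imputations, test-fold subjects
# preds[i, d, j]: prediction of the model fitted on imputed training
# set i, for subject j in imputed test set d. y_true[d, j] is the
# (possibly imputed) outcome for subject j in test set d.
preds = rng.integers(0, 2, size=(M, M, n))
y_true = rng.integers(0, 2, size=(M, n))

def majority(a, axis):
    """Majority vote for binary labels along `axis` (ties go to 1)."""
    return (a.mean(axis=axis) >= 0.5).astype(int)

# Option 1: vote over the 10 models within each imputed test set,
# then error per test set, then average across test sets.
vote_within = majority(preds, axis=0)                 # shape (M, n)
err1 = np.mean([(vote_within[d] != y_true[d]).mean() for d in range(M)])

# Option 2: one vote over all 100 predictions per subject, then that
# single prediction scored against each imputed test set.
vote_all = majority(preds.reshape(M * M, n), axis=0)  # shape (n,)
err2 = np.mean([(vote_all != y_true[d]).mean() for d in range(M)])

print(round(float(err1), 3), round(float(err2), 3))
```

Note that both options average the error over the imputed test sets; they differ only in where the majority vote is taken.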

Repeat this process for each of the 10 folds (i.e., 10-fold CV) and average the 10 fold errors to obtain a final estimate of model performance.
 Some other thoughts:
 Can we follow these same processes above even for a logistic regression or support vector machine, such that we don't actually average model coefficients, but rather apply each model developed on an imputed training dataset to the imputed test datasets?
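For models like logistic regression, one commonly suggested variant of this idea is to pool on the prediction scale rather than the coefficient scale: fit one model per imputed training set, then average the predicted probabilities across imputations. A minimal sketch, with random stand-in data and a single shared outcome vector for simplicity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
M, n_tr, n_te, p = 10, 100, 30, 3
# stand-ins for M imputed training sets and M imputed test sets
X_tr = [rng.normal(size=(n_tr, p)) for _ in range(M)]
y_tr = rng.integers(0, 2, size=n_tr)
X_te = [rng.normal(size=(n_te, p)) for _ in range(M)]

# average predicted probabilities across imputations
probs = np.mean(
    [LogisticRegression().fit(X_tr[m], y_tr).predict_proba(X_te[m])[:, 1]
     for m in range(M)],
    axis=0,
)
pred = (probs >= 0.5).astype(int)   # one pooled prediction per subject
print(probs.shape)
```

The same model-wise prediction-then-pool scheme applies unchanged to an SVM (using decision values or calibrated probabilities), which is why it generalizes across the model classes mentioned here.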
 An alternative thought I had was, within each fold, to stack the 10 imputed training sets, develop a single bagged model, and then test this single model on the 10 imputed test sets.
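The stacking alternative can be sketched as follows (again with random stand-in data; variable names are illustrative). One caveat worth noting: stacking makes each subject appear 10 times in the training data, which inflates the apparent sample size seen by the learner:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(3)
M, n_tr, n_te, p = 10, 100, 30, 3
X_tr_imp = [rng.normal(size=(n_tr, p)) for _ in range(M)]
y_tr_imp = [rng.integers(0, 2, size=n_tr) for _ in range(M)]
X_te_imp = [rng.normal(size=(n_te, p)) for _ in range(M)]
y_te_imp = [rng.integers(0, 2, size=n_te) for _ in range(M)]

# stack the imputed training sets into one long dataset
X_stacked = np.vstack(X_tr_imp)        # shape (M * n_tr, p)
y_stacked = np.concatenate(y_tr_imp)
bag = BaggingClassifier(n_estimators=50, random_state=0)
bag.fit(X_stacked, y_stacked)

# score the single model per imputed test set, then average
errs = [(bag.predict(X_te_imp[d]) != y_te_imp[d]).mean() for d in range(M)]
print(round(float(np.mean(errs)), 3))
```

Scoring per imputed test set and averaging (rather than stacking the test sets) gives the same mean error but keeps the per-imputation errors available for inspection.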
Of note, I do know that CART can handle missing data via surrogate splits; however, models other than CART will be utilized, including SVMs and penalized regressions, and I would like to apply a consistent approach to handling missing data for each model.