Proper way to get an averaged calibration curve from nested resampling

Dear all,

I am working on some projects where the overarching goal is to assess whether a selection of machine learning algorithms perform better than “traditional” statistical methods such as logistic regression (with and without regularization) in terms of (binary) predictive performance, similar to this 2021 study by Austin, Harrell, and Steyerberg.

In my case, most relevant datasets range in sample size from ~500 to ~10 000 and in number of predictors from ~10 to ~50. I work with the Tidymodels meta-package and tune the models via Bayesian optimization. To obtain estimates of predictive performance, I use a nested resampling approach: k-fold cross-validation repeated n times for both the inner and outer loops (although possibly with different k and n for inner and outer). For single-value estimates, I can just populate a matrix and calculate an average, like so:

log_loss   outer_repeat   outer_fold
0.43       1              1
0.40       1              2
0.39       1              3
...        n              k

And then:

mean_log <- mean(c(0.43, 0.40, 0.39 ...))
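For context, the nested resampling itself is set up along these lines (just a sketch; my_data and the specific v/repeats values are placeholders, not my actual settings):

library(rsample)

# Nested resampling: repeated k-fold CV for both the outer and the inner loop
# (the fold/repeat counts below are illustrative only)
nested_folds <- nested_cv(
  my_data,
  outside = vfold_cv(v = 10, repeats = 5),
  inside  = vfold_cv(v = 5, repeats = 2)
)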

But, and herein lies my main concern, I also want to plot some kind of averaged calibration curve. My current solution is as follows. For each held-out set in the outer loop, I first get model predictions:

test_pred <- predict(final_mod_fit, data_test, type = "prob") %>%
  bind_cols(data_test) %>%                               # attach the held-out data
  rename(prob_yes = .pred_yes) %>%                       # predicted P(outcome = "yes")
  select(prob_yes, outcome) %>%
  mutate(outcome_num = ifelse(outcome == "yes", 1, 0))   # numeric 0/1 version of the outcome

test_pred looks like this:

prob_yes   outcome   outcome_num
0.03       no        0
0.45       yes       1
0.67       yes       1

Then I use stats::lowess to get the x and y coordinates of the smoothed calibration curve, with the predicted probabilities and observed outcomes in test_pred as input (iter = 0 turns off lowess's robustness iterations, which are not appropriate for a 0/1 outcome):

cal <- data.frame(with(test_pred, lowess(prob_yes, outcome_num, iter = 0)))

I thus get one cal dataframe for each k in each n. After some processing to merge them all, I get a dataframe structured like this:

x      y      outer_repeat   outer_fold
0.01   0.03   1              1
0.01   0.03   1              2
0.02   0.05   1              3
...    ...    n              k
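For concreteness, the merging step looks roughly like this (a sketch; get_cal_curve and cal_curves are made-up names, and the actual loop lives inside my outer resampling code):

library(dplyr)

# Hypothetical helper: smoothed calibration curve for one outer fold,
# tagged with its repeat and fold indices
get_cal_curve <- function(test_pred, rep_id, fold_id) {
  data.frame(with(test_pred, lowess(prob_yes, outcome_num, iter = 0))) %>%
    mutate(outer_repeat = rep_id, outer_fold = fold_id)
}

# cal_curves: list of per-fold curves produced inside the outer loop
results <- bind_rows(cal_curves)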

I can plot this, using the interaction of outer_repeat and outer_fold for the line colors, like so:

p1 <- ggplot(results, aes(x, y)) +
    geom_abline(slope = 1, linetype = "dashed", color = "#606060") +
    geom_line(aes(color = interaction(outer_fold, outer_repeat)), linewidth = 1)

Now, finally, for the averaged calibration I just add a loess-smoothed curve:

p2 <- ggplot(results, aes(x, y)) +
    geom_abline(slope = 1, linetype = "dashed", color = "#606060") +
    geom_line(aes(color = interaction(outer_fold, outer_repeat)), linewidth = 1) +
    geom_smooth(method = "loess")
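As an alternative to the second loess fit, I suppose the average could also be computed as an explicit pointwise mean by first interpolating each fold's curve onto a common grid of predicted probabilities (a sketch, not part of my current pipeline; reframe() needs dplyr >= 1.1):

library(dplyr)

grid <- seq(0.01, 0.99, by = 0.01)   # common grid of predicted probabilities

avg_cal <- results %>%
  group_by(outer_repeat, outer_fold) %>%
  # interpolate each fold's (x, y) curve onto the common grid
  reframe(prob = grid,
          obs  = approx(x, y, xout = grid)$y) %>%
  group_by(prob) %>%
  summarise(obs = mean(obs, na.rm = TRUE), .groups = "drop")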

The resulting plots p1 and p2 (using a small example dataset) are attached.

So my question boils down to: is this a suitable approach?

My thinking is that with a large enough number of repeats n for the outer loop, just showing the averaged/smoothed curve for each model would suffice, perhaps with the averaged GLM curve included as a reference for the ML algorithms. Still, I am very much open to suggestions if there are better approaches!
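For what it's worth, the plot I have in mind would look something like this (a sketch; all_results and its model column are hypothetical names for the per-fold curves of all algorithms stacked together, which the code above does not produce):

library(ggplot2)

# One averaged calibration curve per model, with the GLM as a reference
p3 <- ggplot(all_results, aes(x, y, color = model)) +
  geom_abline(slope = 1, linetype = "dashed", color = "#606060") +
  geom_smooth(method = "loess", se = FALSE) +
  labs(x = "Predicted probability", y = "Observed proportion")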

Finally, a bonus question:

The hyperparameter optimization stage takes a long time (~10+ hours for some model/dataset combinations). What would you deem an absolute minimum number of repeats for the outer loop? This will help me determine whether I can run this on a local workstation or whether I need HPC resources.

If you’ve read this far, thank you!

This doesn’t address your questions, but I like your approach. Besides averaging the loess calibration curves, it's always good to plot all the raw curves going into the average, to see whether there are systematic calibration errors that the average covers up. But just thinking out loud.