Choosing a Classification probability threshold when using nested cross-validation

With nested CV: Inner loop for model selection, outer loop for performance evaluation. At what level can we optimize a threshold probability (vs. 0.5) to maximize sensitivity or specificity of the classifier?

Do we need to run nested cross-validation within the inner loop, i.e., model selection in the inner loop of this inner loop, followed by threshold probability selection in the outer loop of this inner loop, and then application of the model and probability threshold on the outer loop?

Unfortunately this approach is in every way inconsistent with optimum decision making. See https://www.fharrell.com/tags/classification

I completely agree and expected this response, but what do you recommend as the best way to use this model-based predicted probability when it is not being used to make a decision, but rather to stratify or select patients for clinical trials? Specifically, our team would like to use the model-based predicted probability as a stratification factor in a randomized trial, or potentially to design a trial around the model-based predicted probability, such as selecting subjects for inclusion based on a threshold of the score.

If you had a batch of patients available all at one time, you could sort them in descending order of risk for inclusion, to enrich the study with high-risk patients. This is better than crude stratification. Or you could do weighted sample selection of patients, using continuous risks (properly transformed) as weights.
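A minimal sketch of both options, assuming predicted risks for the candidate pool are already available from a fitted model; the pool size, trial size, and the squared-risk transform are purely illustrative choices, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate pool: predicted risks from an already-fitted model
risk = rng.beta(2, 5, size=1000)   # predicted probabilities in (0, 1)
n_enroll = 200                     # number of patients to include

# Option 1: deterministic enrichment: take the highest-risk patients
chosen_sorted = np.argsort(risk)[::-1][:n_enroll]

# Option 2: weighted sampling: selection probability proportional to a
# transformed risk (the square here is only an illustrative transform)
weights = risk ** 2
weights /= weights.sum()
chosen_weighted = rng.choice(len(risk), size=n_enroll, replace=False, p=weights)
```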

Take the word ‘classifier’ out of the discussion and good things happen …

I see your points. But for a randomized clinical trial in which patients are prospectively enrolled, we do not have them all at one time to sort in descending order of risk for inclusion. Are these approaches practical for prospective enrollment into clinical trials?

To that end, what are your thoughts then on the advent of immunotherapy cancer trials among patients described as “PD-L1 positive” based on hard thresholds of PD-L1 expression at 1% or 5% or 10% levels for inclusion in the trial? What would you recommend as an alternative to this approach?

There are not many examples where a hard cutoff ends up being a good idea. First of all, it is highly likely that the cutoff is wrong in a deep sense. Then there is the problem of heterogeneity within intervals around the cutpoint. Think instead about doing weighted sampling of patients, where the chance of selection is higher in regions of the marker that you want to enrich the trial for. You can view this process as “which patients do we want to turn away?”.
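For prospective enrollment, one way to make this concrete is to enroll each presenting patient with a probability that rises smoothly with the marker (or with the model-based risk), rather than applying a hard cutoff. The logistic-shaped acceptance curve below is only a hypothetical illustration; its center and steepness would be tuned to the enrichment you want:

```python
import numpy as np

rng = np.random.default_rng(1)

def acceptance_probability(marker, center=0.05, steepness=60.0):
    """Smooth (logistic) chance of enrolling a patient as a function of the
    marker value: higher marker means higher chance, with no hard cutoff."""
    return 1.0 / (1.0 + np.exp(-steepness * (marker - center)))

def maybe_enroll(marker_value):
    """Decide on a single prospectively presenting patient."""
    return rng.random() < acceptance_probability(marker_value)

# Example: with these illustrative settings, patients with PD-L1 expression
# near 5% are enrolled about half the time, well above 5% almost always,
# well below 5% rarely; but never with probability exactly 0 or 1.
for m in (0.01, 0.05, 0.10):
    print(m, round(acceptance_probability(m), 3))
```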

I would not use nested CV in the inner loop. Nested CV is, as you noted, a method for performance assessment, not model selection. For that reason, it should not be used in the inner loop of nested CV, because the purpose of that loop is model selection. Plus, your proposed procedure is very complicated and difficult to test/debug.

Instead, to select a decision threshold in the nested CV context, I suggest the following (which, in my experience, produces reasonable results):

  1. Choose a statistic of interest. For example, if your model should have > 90% sensitivity, the desired statistic could be “maximum specificity among models with > 90% sensitivity”.

  2. Using all data, learn all models (presumably this is done using cross-validation), treating the decision threshold as a hyperparameter. For example, vary the threshold from 0 to 1 in steps of 0.01 (this is a cheap operation because you don’t have to retrain the models). Then remove all models which have <= 90% sensitivity and rank the remaining ones by specificity. This gives you a single best model, including its decision threshold.

  3. Now run nested CV as usual: repeat the selection procedure from step 2 inside each inner fold. That gives you a single best model for that inner fold, along with its decision threshold. Apply that to the outer fold data, which gives you sensitivity/specificity for that inner/outer pair. Repeat this for each pair and, by definition, you have the nested CV performance point estimate of the model selected in step 2.

The key is that by treating the decision threshold as a hyperparameter, nested CV gives you a performance assessment of the model together with its decision threshold. I believe this is what you asked for.
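As a rough sketch of the selection routine in step 2, assuming out-of-fold predicted probabilities for one candidate model are already in hand (the function name select_threshold and the 0.90 requirement are illustrative, not part of the original post); the same routine would be re-run inside each inner fold for step 3:

```python
import numpy as np

def select_threshold(y_true, y_prob, min_sensitivity=0.90, step=0.01):
    """Scan thresholds and return the one with the highest specificity
    among thresholds whose sensitivity exceeds min_sensitivity."""
    best_t, best_spec = None, -1.0
    for t in np.arange(0.0, 1.0 + step, step):
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        if sens > min_sensitivity and spec > best_spec:
            best_t, best_spec = t, spec
    return best_t, best_spec
```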

This is inconsistent with decision theory.

Noted. But what to do if product requirements dictate sensitivity and specificity (a common scenario)? If you optimize something like the Brier score or calibration instead, the test may not meet the requirements even if it is theoretically optimal.

Not being an expert on such requirements, I would just offer that you want to get optimum predictions by optimizing the log-likelihood, and then study sensitivity and specificity after the predictive instrument is fixed.
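A small sketch of that suggestion, assuming scikit-learn-style data (the synthetic dataset and the single held-out split are placeholders for the real setup): fit the probability model by maximizing the (penalized) log-likelihood, freeze it, and only then tabulate sensitivity and specificity across thresholds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real problem
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: get optimum predictions by maximizing the (penalized) log-likelihood
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

# Step 2: with the predictive instrument fixed, study sensitivity/specificity
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    pred = (prob >= t).astype(int)
    sens = np.mean(pred[y_te == 1] == 1)
    spec = np.mean(pred[y_te == 0] == 0)
    print(f"threshold={t:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```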