I have a question about appropriate methods for obtaining uncertainty in the c-index (AUROC) with clustered data. I am working with a medical study in which we follow subjects over time and examine the relationship between EMA (ecological momentary assessment) data and a primary binary outcome. We are employing lasso regression that includes subject-specific intercepts and slopes (as well as overall averages). I am interested in using the AUROC for interpreting model performance (not for model comparison).

In assessing the uncertainty in AUROC, I have several aims:

1. Account for the impact of uncertainty in model parameters
2. Account for the impact of uncertainty in Y (in the validation data)
3. Account for the effect of clustering on uncertainty
4. Account for optimism due to overfitting

All of this needs to be done in the context of a small number of individuals (N=15). I have seen the following methods.

Stata has a somersd package that computes a CI for Somers' d (which is linearly related to the c-index: d = 2c − 1) when there is clustering, but this wouldn't account for (1) and (4). Just for clarity, we are using R for the analysis.

Anyway, we are considering leave-one-subject-out CV to construct predictions for each individual across all of their repeated observations, followed by jackknife-style estimation of the SE of the AUROC (i.e., compute the AUROC using all but one validation subject, repeat for each subject, then apply the jackknife variance estimator). I wanted to get thoughts on whether this is a good idea, and on alternatives that might be recommended.
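For concreteness, here is a minimal sketch of what I mean by the jackknife step (in Python rather than R, purely for illustration; the function names are mine, and the out-of-sample predictions `p` are assumed to come from the leave-one-subject-out CV already described):

```python
import math

def auc(y, p):
    """Rank-based (Mann-Whitney) AUROC, ties counted as 0.5."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def jackknife_se_auc(y, p, subj):
    """Leave-one-subject-out jackknife SE of the AUROC.

    Deletion happens at the subject (cluster) level, not the
    observation level, to respect the clustering.
    """
    ids = sorted(set(subj))
    thetas = []
    for s in ids:
        keep = [i for i, g in enumerate(subj) if g != s]
        thetas.append(auc([y[i] for i in keep], [p[i] for i in keep]))
    n = len(ids)
    mean = sum(thetas) / n
    var = (n - 1) / n * sum((t - mean) ** 2 for t in thetas)
    return math.sqrt(var)
```

The key design point is that the jackknife deletes whole subjects, so the variance estimate reflects between-subject sampling rather than treating repeated observations as independent.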

Unfortunately this is a moot point. The smallest sample size for estimating only the intercept in a binary logistic model is N=96. For a continuous response N=70 is required just to estimate the residual variance. There is no way to have predictors with N=15.

Minor note: c-index is not an accuracy measure; it is a measure of pure predictive discrimination.

Bayesian modeling with random effects for clusters will give uncertainty intervals for c. These do not account for sampling variability but just for the uncertainty in regression parameter estimates.

The cluster bootstrap can be used to get an overfitting-corrected c-index.

Random intercepts and slopes are not that likely to fit the serial correlation patterns we see in longitudinal studies. See this.

I agree with your assessment. The goal right now is to use pilot data to design the study and map out methods. The eventual goal is to make accurate risk predictions for prioritizing interventions. I think the discrimination measure is helpful in determining how much benefit we could potentially gain by prioritizing interventions. Thanks for the note on accuracy vs. discrimination!

Given this goal, would you say that designing the study based on relationships in the literature is a better approach than using the pilot (noting that we would have to use a different outcome that we thought had similar characteristics)?

When you mention that the Bayesian modeling does not account for sampling variability, I assume you are referring to sampling variability in the binomial distribution (i.e., error variability) rather than sampling variability in the model predictions (p-hat), correct?

For the cluster bootstrap, I assume you are thinking of computing the optimism in each bootstrap sample as the difference between the AUC in the bootstrap sample and the AUC in the full sample, using only the model fit to the bootstrap sample. Does that seem reasonable?
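To make sure we are describing the same procedure, here is a minimal sketch of a subject-level (cluster) bootstrap optimism correction. Python is used only for illustration, `fit` and `metric` are generic placeholders supplied by the caller, and the data layout is an assumption of mine:

```python
import random

def auc(y, p):
    """Rank-based AUROC used as the discrimination metric."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def cluster_bootstrap_optimism(data, fit, metric, n_boot=200, seed=0):
    """Optimism-corrected apparent metric via the cluster bootstrap.

    data:   dict subject_id -> list of (x, y) observations
    fit:    function(data_dict) -> predict(x), returning a risk score
    metric: function(y_list, p_list) -> discrimination index
    """
    rng = random.Random(seed)
    ids = sorted(data)

    def evaluate(model, d):
        ys, ps = [], []
        for s in d:
            for x, y in d[s]:
                ys.append(y)
                ps.append(model(x))
        return metric(ys, ps)

    apparent = evaluate(fit(data), data)
    optimisms = []
    for _ in range(n_boot):
        # Resample whole subjects (clusters) with replacement.
        boot = {i: data[rng.choice(ids)] for i in range(len(ids))}
        m = fit(boot)
        # Optimism: bootstrap-model performance on the bootstrap sample
        # minus the same model's performance on the full sample.
        optimisms.append(evaluate(m, boot) - evaluate(m, data))
    return apparent - sum(optimisms) / n_boot
```

The essential point is that the model refit in each replicate is the one fit to the bootstrap sample, and the difference in its performance on the bootstrap sample versus the full sample is averaged and subtracted from the apparent metric.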

Thanks for the comment on random effects. A bit more detail: we are using lag-1 data, and the slopes are based on lag-1 times, while the intercepts reflect the overall average. We are also planning to add a lag-1 predictor of the primary outcome. Does this change your thinking? If not, would you suggest a Markov model?

We are planning to compare models with the Brier score and bootstrap uncertainty (e.g., to suggest whether the individualized model improves predictions or not). Does that seem reasonable?
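For reference, here is a rough sketch of the Brier-score comparison we have in mind, with a cluster-bootstrap percentile interval on the difference between two models. Python is for illustration only; the per-subject out-of-sample predictions are assumed to be precomputed, and the function names are hypothetical:

```python
import random

def brier(y, p):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def cluster_boot_brier_diff(preds_a, preds_b, n_boot=1000, seed=0):
    """Observed Brier difference (model A - model B) with a 95% percentile
    CI from resampling whole subjects with replacement.

    preds_a, preds_b: dict subject_id -> list of (p, y) predictions
    """
    rng = random.Random(seed)
    ids = sorted(preds_a)

    def score(preds, chosen):
        ys, ps = [], []
        for s in chosen:
            for p_, y_ in preds[s]:
                ps.append(p_)
                ys.append(y_)
        return brier(ys, ps)

    obs = score(preds_a, ids) - score(preds_b, ids)
    diffs = []
    for _ in range(n_boot):
        chosen = [rng.choice(ids) for _ in ids]  # resample subjects
        diffs.append(score(preds_a, chosen) - score(preds_b, chosen))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    return obs, (lo, hi)
```

A negative difference with an interval excluding zero would suggest model A's predictions are better calibrated/sharper than model B's under subject-level resampling.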

"The smallest sample size for estimating only the intercept in a binary logistic model is N=96." For binary logistic regression, do I need more than 96 observations if I have 1 predictor? Is there a reason for 96?

One more question related to this, regarding validation in this context using the cluster bootstrap and subtracting the optimism. How would you recommend properly assessing accuracy if training only uses data up to the point prior to the observation time (i.e., respecting temporal ordering for each subject)? I think the optimism adjustment accounts for too much overfitting if we just extract the fitted values from the model fit to the bootstrap sample.

I'm thinking that within each bootstrap sample, models are trained by removing each subject individually and then systematically adding their data back in using an expanding forecast window. Anyway, what we are looking for is a methodology for estimating what the accuracy will likely be in the real-world setting, in this case averaging over the amount of prior data available for each individual in the sample.

Is this making any sense? And if so - any thoughts?
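In case it helps clarify the question, here is a toy sketch of the expanding-window scheme I have in mind (Python for illustration only; `fit` is a generic placeholder, and this shows only the inner loop, not the surrounding bootstrap):

```python
def expanding_window_predictions(data, fit):
    """For each subject and each time point t, train on all other
    subjects' full series plus this subject's own history before t,
    then predict the observation at t (respecting temporal order).

    data: dict subject_id -> time-ordered list of (x, y)
    fit:  function(training_rows) -> predict(x)
    """
    out = {}
    for s, series in data.items():
        preds = []
        for t, (x, y) in enumerate(series):
            # Other subjects contribute their complete series.
            train = [row for s2, ser2 in data.items() if s2 != s
                     for row in ser2]
            # The target subject contributes only data observed before t.
            train += series[:t]
            model = fit(train)
            preds.append((model(x), y))
        out[s] = preds
    return out
```

The intent is that each prediction is made with exactly the information that would be available in deployment: everyone else's data plus the subject's own past, averaged over how much past they happen to have.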

See here. N=96 gives a margin of error of 0.1 (not great!) for estimating a probability when there are no covariates. Adding covariates requires having more than an intercept in the model, so the needed sample size increases.
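The 0.1 figure is the usual normal-approximation half-width for a proportion at the worst case p = 0.5; a quick check (Python just for the arithmetic, function name hypothetical):

```python
import math

def moe_half_width(n, p=0.5, z=1.96):
    """95% margin of error (half-width) for estimating a proportion,
    at the worst case p = 0.5 unless specified otherwise."""
    return z * math.sqrt(p * (1 - p) / n)
```

With n = 96 this gives 1.96 × sqrt(0.25/96) ≈ 0.100; conversely, halving the margin of error to 0.05 would require roughly four times the sample size.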

I'm not familiar with that procedure. Perhaps others can comment. The bootstrap is not meant to involve individual subject-level exclusions other than in the sampling-with-replacement phase.

I'm not either. I think the more general question here is how to obtain optimism-adjusted discrimination metrics with uncertainty when the model is subject-tailored and updated continuously over time. I think this is relatively straightforward with leave-one-out cross-validation for direct accuracy measures, but I don't think it works for discrimination measures, because the interpretation differs depending on whether you apply it to one subject (a within-subject AUC) or to 10 subjects (weighted more toward a between-subject AUC).