Model Performance with Repeat Predictions

Hello, this is my first post but I have been enjoying learning from others on this site!

I am hoping to find resources on evaluating model performance for models that are meant to provide serial predictions over time, particularly in hospitalized patients. For example, Epic Systems has made sepsis prediction models that output a risk score every 30 minutes while the patient is hospitalized. I’ve seen two ways of looking at discrimination in this case (taken from the JAMA IM study linked below), but I’m not sure either represents a meaningful measure.

  1. Hospitalization-level AUC: using maximum score prior to an event for each patient.

  2. Time-horizon-based AUC: treating each prediction as independent and looking only at events within a given time interval.
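
To make the two approaches concrete, here is a rough sketch of how I understand them, assuming a long-format table of predictions (the column names, file name, and the 6-hour window are my own illustration, not taken from the study):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Assumed long format: one row per 30-minute prediction.
# Columns: patient_id, pred_time, risk_score,
#          event (0/1 for the whole stay), event_time (NaT if no event).
preds = pd.read_csv("predictions.csv", parse_dates=["pred_time", "event_time"])

# 1. Hospitalization-level AUC: one record per stay, using the maximum score
#    observed before the event (or over the whole stay for non-events).
pre_event = preds[preds["event_time"].isna() | (preds["pred_time"] < preds["event_time"])]
per_stay = pre_event.groupby("patient_id").agg(max_score=("risk_score", "max"),
                                               event=("event", "max"))
auc_stay = roc_auc_score(per_stay["event"], per_stay["max_score"])

# 2. Time-horizon-based AUC: every prediction is its own record, labeled
#    positive if the event occurs within the next 6 hours (an arbitrary horizon).
horizon = pd.Timedelta(hours=6)
label = ((preds["event_time"].notna())
         & (preds["event_time"] > preds["pred_time"])
         & (preds["event_time"] <= preds["pred_time"] + horizon)).astype(int)
auc_horizon = roc_auc_score(label, preds["risk_score"])

print(f"hospitalization-level AUC: {auc_stay:.3f}")
print(f"6-hour-horizon AUC:        {auc_horizon:.3f}")
```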

I imagine there is a way to calculate time-horizon-based AUCs that doesn't treat each prediction as independent and therefore provides a more informative measure of discrimination, but I haven't found many examples. Any help would be appreciated!

AUROC is not a very sensitive measure. For other measures see this. Also it’s not an accuracy index; it is a predictive discrimination index. I’d start with a series of calibration curves with different time horizons.

Thank you for your response! It seems my original post wasn't clear; I am not asking about the added value of a new biomarker or other measure. I am interested in how to measure model performance in the context of repeat predictions (i.e., predictions made every 30 minutes during hospitalization).

Calibration curves with different time horizons wouldn't account for the fact that repeat predictions are not independent for the same individual (i.e., repeat predictions during a patient's hospitalization). I was thinking of some kind of correction for discrimination and calibration measures that accounts for correlated observations.

Do you have a clear sense of the practical implications of the predictions?

For example:
Patients predicted as positives will be treated with medications.

Follow-up question:
Do you have several decision points? Or is it just time zero?

Such a correction is needed for p-values and confidence intervals, not for descriptive “keep moving the time horizon” calibration curves.
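
For illustration, such a correction could be a patient-level (cluster) bootstrap that resamples whole hospitalizations rather than individual predictions, so the within-stay correlation is carried into the interval. A rough sketch, reusing the hypothetical table and 6-hour labels from the first post (the same resampling works for any summary index; the AUC is used here only because it was computed above):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def cluster_bootstrap_auc(df, n_boot=2000):
    """Bootstrap the time-horizon AUC by resampling patients, not rows."""
    groups = dict(tuple(df.groupby("patient_id")))
    ids = np.array(list(groups))
    stats = []
    for _ in range(n_boot):
        sample_ids = rng.choice(ids, size=len(ids), replace=True)
        boot = pd.concat([groups[pid] for pid in sample_ids])
        if boot["label"].nunique() == 2:          # an AUC needs both classes
            stats.append(roc_auc_score(boot["label"], boot["risk_score"]))
    return np.percentile(stats, [2.5, 97.5])

# `preds` and `label` are the hypothetical table and 6-hour labels sketched above
lo, hi = cluster_bootstrap_auc(preds.assign(label=label))
print(f"95% patient-level bootstrap interval: ({lo:.3f}, {hi:.3f})")
```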

You had mentioned AUC in the original post. It’s not a great choice in that context.

@Uriah, the context I was thinking about is the linked example of a model meant to predict escalations in care (intensive care unit transfer, emergency response team evaluation, cardiac arrest, or death during hospitalization). It is typically implemented with some kind of medical evaluation once a patient's score reaches a certain arbitrary threshold (65/100). The model runs every 30 minutes, so each prediction could trigger an alert if the patient's score reaches the threshold.

@f2harrell, I would think even calibration plots would need some kind of correction, not just confidence intervals and p-values. For example, say patient 1 is very stable, is hospitalized for a week, and has no events; they would contribute >300 observations (assuming the prediction model is run every 30 minutes as in the given example). Say patient 2 is hospitalized for one day, becomes critically ill, and transfers to the intensive care unit (the outcome of interest), so they contribute <50 observations. Creating calibration curves that treat each prediction as independent would be misleading, since patient 1 contributes many more observations than patient 2, who actually had the event of interest.
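
As a crude illustration of what I mean (my own toy adjustment, not something from the linked study), one could at least down-weight each row by the number of predictions its patient contributes, so every hospitalization carries equal total weight:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# `preds` and `label` are the hypothetical 30-minute rows and 6-hour labels
# from the earlier sketch.
n_rows = preds.groupby("patient_id")["risk_score"].transform("size")
weights = 1.0 / n_rows                     # each stay now carries total weight 1

auc_rows = roc_auc_score(label, preds["risk_score"])
auc_weighted = roc_auc_score(label, preds["risk_score"], sample_weight=weights)
print(f"row-level AUC: {auc_rows:.3f}   patient-weighted AUC: {auc_weighted:.3f}")
```

This only equalizes each stay's influence; it doesn't make the rows independent, so interval estimates would still need something like the patient-level bootstrap sketched above.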

Thank you both for engaging, I appreciate your time!

The problem is that the probability threshold mustn't be arbitrary: the threshold implies the prior preference of the patient/clinician regarding the tradeoff between treatment-harm and outcome-harm.

If the outcome is death during hospitalization and the estimated probability is 0.6, you don't care about performance at a probability threshold of 0.65, because no one will forgo treatment under these circumstances.
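
To spell that tradeoff out (this is just the standard expected-utility argument, nothing specific to this model): you act when $p \cdot B > (1 - p) \cdot H$, which gives the threshold

$$
p_t = \frac{H}{H + B}, \qquad \frac{p_t}{1 - p_t} = \frac{H}{B},
$$

where $H$ is the harm of acting on a patient who would not have had the outcome and $B$ is the benefit of acting on one who would. An alert threshold of 0.65 therefore implies $H/B \approx 1.9$, which is hard to justify when the outcome includes death during hospitalization.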

I’m much less confident about my answer here, but I think it depends on how you frame the problem and what you really care about. Because it’s neither a classic prediction (prognosis) problem nor a classic detection (diagnosis) problem, I see it more as a monitoring problem.

One way to look at it is to take discrimination measures for each patient over their hospitalization period. Discrimination measures are best for intuition (the c-index/AUROC function as helpers for narrative thinking) and/or for dealing with a resource constraint (say, a limited number of hospital beds).
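
A minimal sketch of one way this per-patient view could be operationalized (my own illustrative reading, reusing the hypothetical long-format table from earlier in the thread): within each stay that ends in an event, check whether scores closer to the event rank higher, e.g. with a within-patient rank correlation, and then summarize the per-patient values. Stays without an event are left out because there is no within-stay contrast to rank against.

```python
import pandas as pd
from scipy.stats import kendalltau

# `preds` is the hypothetical long-format table from earlier in the thread.
events = preds[preds["event_time"].notna() & (preds["pred_time"] < preds["event_time"])].copy()
events["hours_to_event"] = (events["event_time"] - events["pred_time"]).dt.total_seconds() / 3600

def within_patient_tau(g):
    # Positive tau = the risk score tends to rise as the event gets closer.
    tau, _ = kendalltau(-g["hours_to_event"], g["risk_score"])
    return tau

per_patient = events.groupby("patient_id").apply(within_patient_tau)
print(per_patient.describe())               # distribution of per-stay concordance
```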

Another way to look at it is to consider each prediction as a separate prediction with what I assume will be an extremely low prevalence. You’ll then have a huge number of models to evaluate, but you’ll be able to do so properly if the model outputs estimated probabilities and you use a reasonable probability threshold.

The 65/100 threshold is bogus and will lead to (1) suboptimal decisions and (2) alarm fatigue.

Thank you both again for engaging. I am not concerned with any particular threshold, since thresholds are arbitrarily chosen by those implementing the models (usually a hospital operations team) based on judgements about resource constraints. For what it’s worth, I made up 65/100 just as an example, not because of a judgement that it would be an appropriate threshold.

I am more interested in the evaluation of model performance, and my concern is that current evaluations (see the linked JAMA IM example below of a sepsis prediction model) don’t provide the most meaningful measures. Using the maximum risk score during hospitalization ignores the information from individual predictions, while treating each prediction as independent overrepresents patients with long hospitalizations and under-represents patients who actually experience the event of interest.


My pleasure! 🙂

If you are facing a resource-constraint problem, you shouldn’t care much about calibration or probability thresholds, as long as you assume no treatment harm.

Trying to force a treatment-harm/outcome-harm framework onto a resource-constraint problem will get you into a conceptual mess, because you are applying a heuristic that might be good enough for individual decision making (should I take the treatment?) but definitely not for organizational decision making (which combination of patients provides the optimal allocation?).

You can’t arbitrarily choose a probability threshold. What you should do is explore the PPV at different PPCR values (risk percentiles) and compare them with a competing baseline model, which might simply be the prevalence (a random choice).

This is my take on the topic
https://predictionperformance.netlify.app/performance_evaluation_of_prediction_models.html#/lift-curve
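
As a rough sketch of what the PPV-by-PPCR comparison looks like in code (reusing the hypothetical per-prediction scores and labels from earlier in the thread; the PPCR grid is arbitrary):

```python
import numpy as np
import pandas as pd

# `preds` and `label` are the hypothetical per-prediction scores and 6-hour
# labels from the earlier sketches. PPCR here = the fraction of predictions
# flagged when you alert on the top q% of risk scores.
def ppv_by_ppcr(scores, labels, ppcr_grid=(0.01, 0.02, 0.05, 0.10, 0.20)):
    scores, labels = np.asarray(scores), np.asarray(labels)
    prevalence = labels.mean()
    rows = []
    for q in ppcr_grid:
        cutoff = np.quantile(scores, 1 - q)        # alert on the top q% of rows
        flagged = scores >= cutoff
        ppv = labels[flagged].mean() if flagged.any() else np.nan
        rows.append({"PPCR": q, "PPV": ppv, "lift_vs_prevalence": ppv / prevalence})
    return pd.DataFrame(rows), prevalence

table, prev = ppv_by_ppcr(preds["risk_score"], label)
print(f"prevalence (PPV of a random alert policy): {prev:.4f}")
print(table)
```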

And don’t validate the maximum risk. Validate the risk predictions using the dynamic process, with updated estimates and time-horizon-specific calibration curves. If absolute calibration fails, you don’t need to look at any other measure. Epic is kind of famous for not having calibrated models.
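
A minimal sketch of one such curve, assuming the 0-100 score can be read as a probability scaled by 100 and reusing the hypothetical 6-hour labels from earlier in the thread:

```python
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# `preds` and `label` are the hypothetical 30-minute rows and 6-hour labels
# from the earlier sketches; the score is assumed to be a probability x100.
risk_prob = preds["risk_score"] / 100.0
smoothed = lowess(label.to_numpy(), risk_prob.to_numpy(), frac=0.3)
# columns of `smoothed`: sorted predicted risk, smoothed observed event fraction

plt.plot(smoothed[:, 0], smoothed[:, 1], label="observed (lowess)")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("predicted 6-hour risk")
plt.ylabel("observed event fraction")
plt.title("Time-horizon-specific calibration (6-hour horizon)")
plt.legend()
plt.show()
```

Repeating this with different horizons gives the series of curves suggested above, and the patient-level bootstrap sketched earlier can supply confidence bands around each curve.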