I’ve been working on a problem where the task is to risk-stratify patients for further testing (https://discourse.datamethods.org/t/auc-for-different-decision-thresholds-in-an-ordinal-model/20429). I’ve performed internal validation and decision-curve analysis, and model performance is quite reasonable. The model will primarily be deployed as a simple web application that users can access to obtain risk estimates from 5 predictors of interest (2 continuous variables modelled using splines, and 3 variables that are genuinely binary; I did not dichotomize anything). There are no interactions.
However, I’m also tasked with converting my model’s predictions into a simple point-based scoring tool that can be used to assign risk scores on the go (i.e., without access to the web app). I understand this will inherently produce worse predictions than the full-fledged model, since some simplification is necessary. I’m trying to figure out how to come up with the least-bad scheme/heuristic, i.e., the one that does the least damage/distortion to the model’s output.
For one of my continuous predictors, the log-OR is approximately a straight line across most of its range, so a simple additive rule (e.g., “add 1 point for every 5-unit increase”) should approximate it reasonably well.
For the other continuous predictor, there is clear non-linearity, with a change in slope (a weakening of the association) at roughly the 60th percentile. I could probably approximate its shape with 2 straight lines (like a linear spline), so a scoring rule such as “add 1 point for every 5-unit increase below X, then 0.5 points for every 5-unit increase thereafter” should approximate it.
For the binary variables, I assume something like “add 2 points if Yes” would be okay.
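To make the piecewise idea concrete, here is a minimal sketch of the kind of conversion I have in mind. Every coefficient, knot, and variable name below is a hypothetical placeholder, not taken from the actual model; the reference effect that defines “1 point” is an arbitrary choice:

```python
# All coefficients are hypothetical placeholders, not the fitted model.
beta_x1 = 0.08          # linear continuous predictor: log-OR per unit
beta_x2_low = 0.10      # non-linear predictor: slope below the knot
beta_x2_high = 0.05     # slope above the knot
knot_x2 = 60.0
betas_binary = {"b1": 0.40, "b2": 0.25, "b3": 0.55}

# Reference effect: the log-OR of a 5-unit increase in x1 defines 1 point
ref = beta_x1 * 5

def points_x2(x):
    """Piecewise-linear log-OR contribution of x2, converted to points."""
    log_or = beta_x2_low * min(x, knot_x2) + beta_x2_high * max(x - knot_x2, 0.0)
    return log_or / ref

def score(x1, x2, binaries):
    """Total (rounded) point score for one patient."""
    total = beta_x1 * x1 / ref
    total += points_x2(x2)
    total += sum(betas_binary[k] / ref for k, present in binaries.items() if present)
    return round(total)
```

Under these made-up numbers, `score(50, 60, {"b1": False, "b2": False, "b3": False})` gives 25, and setting `b1` to `True` adds exactly 1 point because its log-OR equals the reference effect. Rounding the total once (rather than each component separately) loses a little less information, at the cost of a slightly less hand-computable rule.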
The problems I’m trying to figure out are:
Is what I said above the right way to go about this?
How do I figure out the least-bad X in “add X points for every 5-unit increase” for each continuous variable (and likewise a point value for each binary variable)? I assume the main idea is to convert the log-ORs to a common scale of interest (say 0 to 100).
When quantifying the loss in predictive discrimination that this simplification entails, what would the best approach be? Assessing the decrement in the C-statistic? I know that comparing C-statistics is not the gold standard for comparing models, so I assume it isn’t the best tool here either. Maybe visualizing the simplified scores on a decision curve against the model’s original predictions?
Would greatly appreciate any feedback on this. I understand it seems awkward that one would want to kind of degrade their output in this way, but this is more or less a form of damage control.
I once read an article documenting a debate among researchers: some argued that nomograms are too complex and that simple score tables are better suited to clinical use, since many doctors and healthcare professionals don’t know how to use complex predictive models and are unfamiliar with nomograms. My memory is poor and I can’t recall where I saw it; if I remember, I’ll update with the link. Personally, I really like nomograms, even from an aesthetic perspective. However, almost no one around me has heard of them, so whether they’re appropriate seems to depend on the situation.
Additionally, the likelihood ratio test and AIC appear to be acceptable methods for model comparison. For nested models, Professor Harrell strongly recommends the likelihood ratio test. I suppose it could also be used to quantify the information lost to the simplification.
Update: I found this article, which describes in detail how to present a prediction model. In it, the authors compare the predicted probabilities from the simple score table with those from the full model at different time points when evaluating the loss from simplification.
Tables 3a and b illustrate a points score system for the primary biliary cirrhosis example. In table 3a, scores are assigned to categories of each predictor, which have been scaled according to a 15 year increase in age (that is, the regression coefficient for age multiplied by 15); table 3b presents probabilities of the outcome that correspond to the points total. For example, an individual aged 55 (0 points), with cirrhosis (3 points), albumin of 34.4 g/dL (0 points), and central cholestasis (5 points) has a points total of 8; this corresponds to a probability of death of 0.46 at one year and 0.90 at three years. These figures are similar to the equivalent estimates of 0.44 and 0.89 for the risk of death at one and three years, respectively, directly obtained from the full model equation for this individual. These data suggest that the simplification led to only small changes in risk predictions from the points score system for this individual.
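As a toy illustration of the second half of such a table (points total → probability), here is a sketch assuming a binary-outcome logistic model; the intercept and per-point log-OR are invented, and for a time-to-event model like the PBC example one would instead apply the baseline survival at each time point rather than an inverse-logit:

```python
import math

intercept = -3.0   # hypothetical model intercept (log-odds)
ref = 0.35         # hypothetical log-OR represented by one point

def prob_from_points(points):
    """Predicted probability implied by a points total."""
    lp = intercept + ref * points
    return 1 / (1 + math.exp(-lp))

# Lookup table to print alongside the score sheet
lookup = {p: round(prob_from_points(p), 2) for p in range(0, 13)}
```

Under these made-up numbers, a total of 8 points maps to a probability of about 0.45; the printed lookup table is what makes the score usable without any computation at the bedside.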
I wasn’t aware the nomogram function could take blrm/rmsb objects as inputs. It’s been super helpful!
The BMJ article was very helpful as well. The approach they suggest (for a point-based scoring system) is to categorize continuous variables and score each category at its midpoint, while taking liberty with the number of, and spacing between, categories to preserve as much information as possible (particularly for non-linear associations, which may call for more categories near one end of the spectrum).
For the scaling, the suggested approach is to scale the score according to the log-OR of a reference variable, so that each 1-point increase in the score corresponds to the log-OR of an X-unit increase in that variable.
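A small sketch of that midpoint-plus-scaling recipe; the log-OR function, cutpoints, and reference effect here are all invented for illustration, not taken from any fitted model:

```python
ref = 0.32  # hypothetical log-OR assigned to 1 point

def log_or(x):
    """Hypothetical non-linear log-OR: steeper below 60, flatter above."""
    return 0.04 * min(x, 60) + 0.015 * max(x - 60, 0)

# Uneven categories: finer spacing where the association changes fastest
cuts = [(0, 30), (30, 50), (50, 60), (60, 80), (80, 120)]

points = {}
for lo, hi in cuts:
    mid = (lo + hi) / 2              # score the category at its midpoint
    points[(lo, hi)] = round(log_or(mid) / ref)
```

Under these made-up numbers the five categories earn 2, 5, 7, 8, and 9 points; note how the narrow 50–60 category captures the steep part of the curve, which is exactly the “more categories near one end” idea.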
A nomogram seems like it would do the best job of preserving model accuracy (though perhaps with the downside that some users might find them a bit too complex for their taste).
I don’t know if the clinical problem you’re dealing with bears any resemblance to risk-stratifying patients with MGUS to decide who should get a bone marrow biopsy. If so, you might want to have a look at this thread:
That thread was very helpful. Thank you for pointing it out! The clinical problem has a similar essence in that we’re trying to see whether a small number of relatively easy/cheap predictors can be used to defer a substantially more costly and time-consuming test.
Completely agreed that I’d rather people not use coarsened predictions unnecessarily. Unfortunately, there is still a lingering desire among practitioners for these simplified versions (probably because their ubiquity breeds a sense of familiarity). I will make sure to note these limitations on the coarsened version and link to the actual tool.
Steyerberg covers this in detail in Clinical Prediction Models.
Remember that simplifications should come at the back end, not the front end; i.e., don’t make the regression relationships simpler just for presentation. Point scores, nomograms, and line graphs can handle continuous predictors of any shape.
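One way to see how a back-end point score can preserve a non-linear fit: read points off the full fitted contribution functions, in the spirit of what `nomogram` in the rms package does, with the widest-ranging predictor spanning 0–100 points. The contribution functions below are made-up stand-ins for fitted spline terms, not any real model:

```python
import math

def f_age(x):
    """Hypothetical fitted spline contribution (log-odds scale)."""
    return 0.05 * x + 0.3 * math.log1p(x)

def f_marker(x):
    """Hypothetical piecewise contribution of a second predictor."""
    return 0.10 * min(x, 60) + 0.04 * max(x - 60, 0)

# Evaluate each contribution over its plausible range
grids = {"age": [f_age(x) for x in range(30, 91)],
         "marker": [f_marker(x) for x in range(0, 121)]}

# The predictor with the widest log-odds range spans 0-100 points
scale = 100 / max(max(g) - min(g) for g in grids.values())

def points(name, value, f):
    """Points for one predictor value, read off the full non-linear fit."""
    return (f(value) - min(grids[name])) * scale
```

Because the points are a rescaling of the fitted log-odds contributions themselves, no curvature is thrown away; only the final rounding for presentation coarsens anything.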