Goodness of fit for probit model (Hosmer-Lemeshow)

First of all, sorry for the perhaps novice question, but it has been bothering me for a couple of weeks now.

Basically, I have some data that I want to run through a few (probit-based) models to see whether they “work” on my data. My data has a binary outcome, and when I plug in the values each model requires, I get a predicted risk. My question is: how do I evaluate the goodness of fit of each model? Several people, including a reviewer, have suggested the Hosmer-Lemeshow test, but from what I have read it seems it should be avoided. I know it is a common test in my field (I’ve seen it used many times), but that doesn’t make it a good tool if it actually isn’t one.

So how would I go about this?

Also, can it be visualized in some way? Since the outcome is binary, there is no good way to compare a predicted probability directly with a 1 or a 0, unless I form groups, compute the observed proportion in each group, and then check whether the regression line through those points is close to the identity line. But as you are probably all aware, the grouping can be a problem: one can make the fit look pretty much as good or as bad as one likes by changing the group size. I am not sure whether this can be done properly at all?

Thanks in advance.

There are many issues going on at once:

  • Hosmer-Lemeshow is obsolete due to low power, arbitrariness, and not penalizing sufficiently for overfitting
  • The Hosmer-Le Cessie one d.f. test is better, as implemented in the R rms package residuals.lrm function (but this only works for logit link)
  • Directed tests (e.g., nonlinearity, non-additivity (interaction), omitted variables) are generally better and are certainly more actionable

A main reason that Hosmer and others felt that H-L is obsolete is that it depends on arbitrary binning of predicted probabilities, with the most typical choice being decile intervals. They showed that tiny differences in how different software packages define deciles result in noticeable differences in the H-L $\chi^2$ goodness of fit statistic.
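To make the binning dependence concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data are simulated, the "model" is a deliberately miscalibrated probit with made-up coefficients, and `hosmer_lemeshow` is a hypothetical helper that uses equal-size groups of sorted predictions (one of several possible grouping rules). It only shows that the statistic moves when the number of groups changes.

```python
import random
from statistics import NormalDist

def hosmer_lemeshow(p, y, groups=10):
    """H-L chi-square statistic with equal-size groups of sorted predictions."""
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        expected = sum(pi for pi, _ in chunk)   # sum of predicted risks
        observed = sum(yi for _, yi in chunk)   # number of events
        mean_p = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1 - mean_p))
    return stat

# Simulated data (assumption): true risk follows a probit in x,
# while the model being checked uses a slope that is too shallow.
rng = random.Random(1)
phi = NormalDist().cdf
x = [rng.gauss(0, 1) for _ in range(500)]
y = [1 if rng.random() < phi(-0.5 + 1.0 * xi) else 0 for xi in x]
p_model = [phi(-0.5 + 0.6 * xi) for xi in x]

# Same data, same predictions, different number of groups.
hl_by_groups = {g: hosmer_lemeshow(p_model, y, g) for g in (5, 8, 10, 12, 20)}
for g, s in hl_by_groups.items():
    print(f"{g:2d} groups: H-L chi-square = {s:.2f}")
```

The statistic (and hence the p-value you would report) depends on a choice that has nothing to do with the model or the data.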

Also consider making a calibration plot without binning. My RMS book and course notes show how to get bootstrap overfitting-corrected smooth calibration curves for the logit link. You can probably use the rms orm function for probit.

Out of curiosity, what made you choose probit? It is very hard to interpret its coefficients.
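As a rough illustration of what "without binning" means, the sketch below smooths the 0/1 outcomes against the predicted probabilities and compares the result with the identity line. It is not the rms approach: it swaps in a simple Gaussian-kernel (Nadaraya-Watson) smoother, does no bootstrap overfitting correction, and runs on simulated data. The function name `smooth_calibration`, the bandwidth, and the coefficients are all assumptions.

```python
import math
import random
from statistics import NormalDist

def smooth_calibration(p, y, grid, bandwidth=0.08):
    """Kernel-weighted observed event rate at each predicted risk in grid."""
    curve = []
    for g in grid:
        # Gaussian weights: observations with predictions near g count most.
        w = [math.exp(-0.5 * ((pi - g) / bandwidth) ** 2) for pi in p]
        curve.append(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    return curve

# Simulated data (assumption): true risk is a probit in x,
# and the model under evaluation has too shallow a slope.
rng = random.Random(2)
phi = NormalDist().cdf
x = [rng.gauss(0, 1) for _ in range(1000)]
y = [1 if rng.random() < phi(-0.5 + 1.0 * xi) else 0 for xi in x]
p_model = [phi(-0.5 + 0.6 * xi) for xi in x]

grid = [i / 10 for i in range(1, 10)]   # predicted risks 0.1 ... 0.9
curve = smooth_calibration(p_model, y, grid)

# Perfect calibration would give observed == predicted on every row.
for g, obs in zip(grid, curve):
    print(f"predicted {g:.1f}  smoothed observed {obs:.2f}")
```

Plotting `curve` against `grid` together with the 45-degree line gives the calibration plot. The bandwidth still has to be chosen, but there are no group boundaries to move around.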


Thank you for taking the time to answer.
I will try to look into your suggestions.

The reason for the probit model is basically that this particular model is common within my field. It was one of the first, the idea was good, and it’s easy for others to use. However, its predictive power IS limited, and it has only one input variable, which is just one of the drivers explaining the toxicity in this case. But again, it’s like a standard model that you are expected to use and derive parameters for, and then you can move on to logistic models with several predictors after that.
It’s called the Lyman-Kutcher-Burman NTCP model, in case you find yourself a bit interested.
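For readers who have not met it, this is a minimal sketch of the LKB model as it is usually written: a probit in a single effective dose, NTCP = Φ((D_eff − TD50) / (m · TD50)), where D_eff is commonly the generalized equivalent uniform dose computed from the dose-volume histogram. The function names and all numeric values below are hypothetical and chosen only to make the example run; they are not fitted parameters for any organ.

```python
from statistics import NormalDist

def geud(doses, volumes, n):
    """Generalized EUD: (sum_i v_i * D_i**(1/n))**n, fractional volumes summing to 1."""
    return sum(v * d ** (1.0 / n) for d, v in zip(doses, volumes)) ** n

def lkb_ntcp(d_eff, td50, m):
    """LKB probit: standard normal CDF of (d_eff - td50) / (m * td50)."""
    t = (d_eff - td50) / (m * td50)
    return NormalDist().cdf(t)

# Hypothetical two-bin dose-volume histogram and hypothetical parameters.
doses = [20.0, 40.0]      # dose per bin (Gy)
volumes = [0.5, 0.5]      # fraction of organ volume per bin
td50, m, n = 30.0, 0.4, 1.0

d_eff = geud(doses, volumes, n)       # with n = 1 this is the mean dose
ntcp = lkb_ntcp(d_eff, td50, m)
print(f"effective dose = {d_eff:.1f} Gy, NTCP = {ntcp:.2f}")
```

By construction the predicted risk is 0.5 when the effective dose equals TD50, and m controls how steeply the risk rises around that point.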