Re “arbitrary binning,” perhaps we should do multiple calibration plots with varying numbers of bins and eyeball the consistency (or lack of it) across the various choices. Most of the advice I’ve seen recently on Hosmer-Lemeshow seems to suggest it’s more suitable as a soft eyeball diagnostic than as a formal hard test. (Or, more precisely, that’s been my interpretation.)
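Something like the sketch below is what I have in mind — nothing official, just the same data plotted at several bin counts side by side. The y_true/y_prob arrays here are simulated stand-ins; swap in your own outcomes and predicted probabilities.

```python
# Sketch of the "vary the bins and eyeball it" idea. y_true and y_prob below
# are simulated placeholders (perfectly calibrated by construction); replace
# them with your actual 0/1 outcomes and predicted probabilities.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 5000)      # stand-in predicted probabilities
y_true = rng.binomial(1, y_prob)      # stand-in observed outcomes

fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharex=True, sharey=True)
for ax, n_bins in zip(axes, [5, 10, 20, 50]):
    # quantile binning keeps group sizes roughly equal as the bin count grows
    prob_true, prob_pred = calibration_curve(
        y_true, y_prob, n_bins=n_bins, strategy="quantile"
    )
    ax.plot([0, 1], [0, 1], "k--", lw=1)   # reference line: perfect calibration
    ax.plot(prob_pred, prob_true, "o-")
    ax.set_title(f"{n_bins} bins")
    ax.set_xlabel("mean predicted probability")
axes[0].set_ylabel("observed event rate")
plt.tight_layout()
plt.show()
```

If the curves wobble around the diagonal at every bin count in roughly the same places, that’s more convincing than any single binning.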
The recent US News Best Hospitals analysis apparently used bootstrapping of H-L statistics on small subsamples to overcome the problem that H-L is almost always significant once the sample size is large enough. As far as I can tell, they didn’t cite a source for that idea, so the RTI team on the project may have cooked it up themselves.
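I don’t know what RTI actually did, so the resampling scheme below is my guess at the idea: compute the H-L p-value on many random subsamples of modest, fixed size and look at the distribution of p-values (or the rejection rate) instead of one all-or-nothing test on the full data. The function names and the decile grouping are my own choices, not theirs.

```python
# Guessed-at sketch of "H-L on small subsamples": repeatedly draw subsamples
# of a fixed size, compute the H-L p-value on each, and summarize. Not a
# reconstruction of the US News / RTI method -- just one plausible reading.
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow_p(y_true, y_prob, n_groups=10):
    """Standard H-L chi-square p-value, groups formed from deciles of predicted risk."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    order = np.argsort(y_prob)
    stat = 0.0
    for g in np.array_split(order, n_groups):
        n = len(g)
        obs = y_true[g].sum()               # observed events in the group
        exp = y_prob[g].sum()               # expected events in the group
        pbar = np.clip(exp / n, 1e-10, 1 - 1e-10)  # guard against degenerate groups
        stat += (obs - exp) ** 2 / (n * pbar * (1 - pbar))
    return chi2.sf(stat, df=n_groups - 2)

def subsample_hl(y_true, y_prob, subsample_size=1000, n_reps=500, seed=0):
    """Distribution of H-L p-values over random subsamples of fixed size."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    idx_all = np.arange(len(y_true))
    pvals = []
    for _ in range(n_reps):
        idx = rng.choice(idx_all, size=subsample_size, replace=False)
        pvals.append(hosmer_lemeshow_p(y_true[idx], y_prob[idx]))
    return np.array(pvals)

# e.g. (subsample_hl(y_true, y_prob) < 0.05).mean() gives the fraction of
# subsamples where H-L rejects -- a rough severity measure at a fixed
# effective sample size, rather than a single yes/no verdict.
```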
Do you all think that’s a worthwhile step forward for calibration assessment?
Ignoring calibration because our measurements/tests are fuzzy doesn’t seem like a good approach. We’re kinda between a wet squishy place and a swamp on this.