Converting a continuous output to risk score category and selecting the optimal number of bins

I am trying to convert a continuous measurement of a patient’s bone mineral density to a risk score which I will display to the user with the corresponding observed prevalence (observed probability of having low BMD) in a test population. I am wondering if there is a scientific way of determining the optimal number of bins (N).

  1. Am I required to perform some kind of calibration on the validation set to help determine this, or is it as simple as “I want 10 bins”? If calibration is required, what metric helps determine the optimal number of bins?
  2. Should I make sure each bin has the same number of samples or should it be equal ranges of the continuous score?
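To make question 2 concrete, here is a minimal sketch of the two binning schemes I am choosing between (synthetic data; the variable names are illustrative, not from my actual pipeline):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
bmd = rng.normal(0.0, 1.0, 1000)  # hypothetical continuous BMD measurements

# Equal-frequency bins: each bin holds roughly the same number of patients,
# so the cut points are the deciles of the observed distribution.
eq_freq = pd.qcut(bmd, q=10, labels=False) + 1   # scores 1..10

# Equal-width bins: each bin spans the same range of the measurement,
# so bins in the tails may contain very few patients.
eq_width = pd.cut(bmd, bins=10, labels=False) + 1  # scores 1..10
```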


Binning should play no role in this context, and is counterproductive.

Can you elaborate or point me to any resource? I have a bone mineral density measurement but for regulatory reasons want to convert it to a score between 1 and N. I want to display the probability of having low BMD (i.e. the observed prevalence in my test set) by score.

by the way - you were right on this. The threshold didn’t generalize which is why I am now trying to provide probabilities in the form of a risk score. If I were to calibrate the model to provide real world probabilities wouldn’t I need to bin them in some manner?
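For what it's worth, one way to check calibration without binning at all is a smooth calibration curve, e.g. a lowess regression of the binary outcome on the predicted probability. A minimal sketch with synthetic data (lowess is just one smoother choice; the names here are illustrative):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
p_pred = rng.uniform(0.05, 0.95, 500)   # hypothetical predicted probabilities
y = rng.binomial(1, p_pred)             # observed binary outcomes

# Smooth calibration curve: observed outcome regressed on predicted
# probability via lowess, instead of averaging within bins.
smoothed = lowess(y, p_pred, frac=0.6, return_sorted=True)
# smoothed[:, 0] = sorted predictions, smoothed[:, 1] = smoothed outcome rate;
# a well-calibrated model tracks the 45-degree line.
```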


I’m not thinking of a single best reference right now, but the RMS course notes cover this in the chapter on ordinal models for continuous Y. It shows graphs and a nomogram for computing the probability that Y \geq y for many values of y. You can also do that with linear models if the residuals are normally distributed with constant variance. A key message is not to assume which values of y a practitioner will want to compute Pr(Y \geq y) for.
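The linear-model version mentioned above is easy to sketch: under normal residuals with constant variance, Pr(Y >= y | x) = 1 - Phi((y - x'beta) / sigma), so one fit yields the full probability curve over any thresholds y. A hedged sketch on synthetic data (not the RMS code; variable names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.8, size=n)  # synthetic continuous Y

# Ordinary least squares fit and residual standard deviation
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
sigma = np.sqrt(np.sum((y - X @ beta) ** 2) / (n - X.shape[1]))

def prob_y_ge(y0, x_new):
    """Pr(Y >= y0 | x_new) assuming normal residuals, constant variance."""
    mu = beta[0] + beta[1] * x_new
    return 1.0 - stats.norm.cdf((y0 - mu) / sigma)

# Probability curve over many thresholds y0, for one covariate value:
ys = np.linspace(-2.0, 6.0, 9)
probs = prob_y_ge(ys, x_new=0.0)
```

This avoids choosing bins: the user (or regulator) can ask for Pr(Y >= y) at any y after the fact.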