There is a reasonable literature about Calibration:
Unfortunately, there is much less about utility (net benefit, NB) or discrimination.
You might be interested in the thread I opened, which poses some questions but offers few answers.
The main obstacle is that you need to reflect binary decisions implied by the state-occupancy predictions, while being explicit about your assumptions.
It’s quite trivial for the binary case: you want to treat the “True-Positives”, under the implied assumption that they are compliers.
What about the “True-Negatives”? Assuming monotonicity of the treatment effect and possible harm from treatment, you definitely don’t want to treat them.
How should you handle the “True-Competing” in a competing-risks multi-state model?
You don’t want to treat them either; they are definitely not compliers.
And you can go on for each state-occupancy, or maybe a specific path.
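To make the reasoning above concrete, here is a minimal sketch of the standard decision-curve net benefit, extended in the way described: anyone observed in a state other than the target one (including a competing state) counts as a false positive if treated. The function name `net_benefit`, the state labels, and the toy data are all hypothetical, and the sketch assumes fully observed states at the horizon, with no censoring.

```python
def net_benefit(p_target, observed_state, target_state, threshold):
    """Net benefit of treating everyone whose predicted probability of
    occupying `target_state` is at least `threshold`.

    Assumption: states are fully observed (no censoring), and treating
    anyone who ends up outside `target_state` (event-free or in a
    competing state) is a false positive.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(p_target)
    tp = fp = 0
    for p, state in zip(p_target, observed_state):
        if p >= threshold:            # the binary decision implied by the prediction
            if state == target_state:
                tp += 1               # treated and would have had the target event
            else:
                fp += 1               # treated needlessly (event-free or competing)
    # False positives are weighted by the odds of the threshold.
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Toy data: predicted probability of the event of interest and observed state.
p_event = [0.9, 0.8, 0.3, 0.6, 0.1]
states = ["event", "competing", "none", "event", "competing"]
nb = net_benefit(p_event, states, target_state="event", threshold=0.5)
```

The same function can be called once per state-occupancy of interest by changing `target_state` and passing the matching predicted probabilities; whether that is the right utility for a given path is exactly the assumption you need to state explicitly.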
In my opinion this is what you should do, but the easiest approach would be to use calibration.