I used ‘rms’ to fit a first-order Markov proportional odds model (the outcome is the severity of medication side effects observed at 10 time points) with 3 baseline predictors, then used soprobMarkovOrd to calculate state occupancy probabilities (SOPs). My question is how to evaluate predictive performance (some form of internal validation). I thought about visually comparing the observed distribution of states at each time point vs. the predicted distribution (maybe by assigning each individual to the state with the highest predicted probability at each time point). Are there any references or suggestions about this?
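For concreteness, here is a minimal sketch of the kind of comparison I have in mind, with placeholder variable names: long-format data d with columns id, time, y (ordinal severity), yprev (state at the previous time), and baseline covariates x1, x2, x3. Note that aggregating the fitted probabilities below conditions on the observed previous state; averaging per-subject SOPs from soprobMarkovOrd would give the unconditional comparison instead.

```r
require(rms)

# First-order Markov proportional odds model: condition on the previous state
f <- lrm(y ~ yprev + time + x1 + x2 + x3, data=d, x=TRUE, y=TRUE)

# Predicted probability of each state for every record (one-step transition probabilities)
p <- predict(f, type='fitted.ind')    # matrix: rows = records, columns = states

# Observed state distribution at each time point
obs <- prop.table(table(d$time, d$y), margin=1)

# Mean predicted state distribution at each time point
pred <- apply(p, 2, function(pk) tapply(pk, d$time, mean))

round(obs, 3)
round(pred, 3)
```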
There is a reasonable literature on Calibration.
Unfortunately, there is not so much on Utility (net benefit) or Discrimination.
You might be interested in a thread I opened, which poses some questions but not many answers.
The main obstacle is that you need to reflect binary decisions implied by the state-occupancy predictions, while being explicit about your assumptions.
It’s quite trivial for the binary case: you want to take care of the “True-Positives,” with the implied assumption being that they are compliers.
What about the “True-Negatives”? Assuming monotonicity of the treatment and possible harm, you definitely don’t want to treat them.
How do you go about the “True-Competing” events in a competing-risks multi-state model?
You don’t want to treat them either; they are definitely not compliers.
And you can go on like this for each state occupancy, or maybe for a specific path.
In my opinion this is what you should do, but the easiest approach would be to use calibration.
For discrimination I’d recommend something like these methods, where it’s easy to compute measures on the repeated measurements by first focusing on transition probabilities. Then you can put it all together with the SOPs and show the width of the SOP distribution.
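A rough sketch of what that could look like, under the same placeholder setup as above (the fit f), plus a hypothetical array sop of per-subject SOPs with dimensions subject x time x state, e.g. built by running soprobMarkovOrd once per subject:

```r
# Discrimination at the transition level: rank correlation between the
# model's predictions and the ordinal outcome, pooled over all records
f$stats[c('C', 'Dxy')]    # c-index and Somers' Dxy for the one-step transitions

# Spread (width) of the subject-level SOPs for one state across time,
# assuming sop[subject, time, state]; state 3 is used as an example
apply(sop[, , 3], 2, quantile, probs=c(.1, .25, .5, .75, .9))
```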
Focusing on transition probabilities might take off the cognitive burden of the L (longitudinal) part of OLM models, so basically it’s just a lot of performance metrics for ordinal outcomes?
Yes, and one very relevant metric measures how much of the predictive information comes from things other than time and previous state: a partial pseudo R^2 based on the likelihood ratio \chi^2 test for all the regular predictors.
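A sketch of one way to compute that, assuming the same placeholder fit as above; the reduced model contains only time and previous state:

```r
f.full <- lrm(y ~ yprev + time + x1 + x2 + x3, data=d, x=TRUE, y=TRUE)
f.red  <- lrm(y ~ yprev + time,                data=d, x=TRUE, y=TRUE)

# Partial likelihood ratio chi-square test for the regular predictors x1, x2, x3
lrtest(f.full, f.red)

# Fraction of the model LR chi-square attributable to the regular predictors
lr.full <- unname(f.full$stats['Model L.R.'])
lr.red  <- unname(f.red$stats['Model L.R.'])
(lr.full - lr.red) / lr.full
```

The last ratio is just one simple way to summarize the relative contribution; any adjusted pseudo R^2 formula could be applied to the partial LR \chi^2 instead.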
Thank you. But wouldn’t this pseudo R^2 be too conservative for evaluating the contribution of variables other than time and yprev, since yprev will probably explain most of the variability?
Previous state and elapsed time explain most of the transitions, so that’s perhaps too easy a task. You can definitely compute a more comprehensive pseudo R^2 if you prefer, though.
Sounds reasonable, but these types of metrics are not considered a form of “Discrimination.” Maybe they are worthy of a category of their own, “Explained Variation,” or are just another case of “Overall Performance.”
I understand your deep dislike of discrimination performance metrics (even the one you created, the c-index), but I think we should at least make an effort to find a good generalization that connects to Lift, PPV, and NPV.
This requires enforcing flexible resource constraints that turn any probability into a binary choice.
This can also be done in the OLM framework, but it requires deep thought from a clinical point of view about what is avoidable, in both longitudinal and ordinal terms.
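To make that concrete, here is a toy sketch of a resource constraint applied to an SOP-derived risk, where risk is a hypothetical per-subject probability of reaching a severe state by a chosen horizon and event is the corresponding observed indicator (both are placeholders, and dropout is ignored):

```r
budget <- 0.20                                  # resources allow treating 20% of subjects
thr    <- unname(quantile(risk, 1 - budget))    # implied risk threshold
treat  <- risk >= thr                           # binary decision forced by the constraint

ppv  <- mean(event[treat])                      # precision among those selected
lift <- ppv / mean(event)                       # lift relative to selecting at random
c(threshold=thr, PPV=ppv, lift=lift)
```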
We are already doing this for simpler models, but OLM requires deeper thinking about heuristics:
L perspective: The c-index prioritizes early outcomes over late outcomes, but early outcomes might be more of a lost cause. We can extract a path probability from the OLM (the chance of the primary outcome with a grace period of 10 days), assuming such a grace period might be helpful; see the sketch after this list.
O perspective: We use exclusion criteria to omit cases assumed in advance to be lost causes; we can do the same for unavoidable paths according to prior knowledge, but without throwing away the data.
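A hypothetical sketch of the grace-period idea, assuming sop is a times x states matrix of SOPs for one subject (e.g., from soprobMarkovOrd), the rows correspond to days 1 through the horizon, and the last column is an absorbing primary-outcome state (so its SOP is a cumulative incidence):

```r
grace   <- 10           # grace period in days
horizon <- nrow(sop)    # last follow-up day
k       <- ncol(sop)    # absorbing primary-outcome state

# Probability that the primary outcome occurs after the grace period
p.path <- sop[horizon, k] - sop[grace, k]
p.path
```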
The thought that gives me is to create a derived, subject-matter-relevant measure and show the predictiveness of that measure. For example, we can sum SOPs over time to get predicted means (such as expected time in a state) and relate those to some kind of ‘observed’ means.
I’m not familiar with “aj-estimate.” But if Y is discrete, then doing traditional assessments of calibration accuracy for selected SOPs at selected times is a good idea.
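A minimal sketch of such an assessment, assuming the same hypothetical structures as above (sop: subject x time x state array of SOPs; ystate: subject x time matrix of observed states) and ignoring dropout; the selected time point and ordinal cutoff are arbitrary choices here:

```r
require(rms)
t0 <- 5     # selected time point
s0 <- 3     # selected ordinal cutoff: event = state >= 3

# Predicted P(Y >= s0 at time t0) from the SOPs, and the observed indicator
p.hat <- apply(sop[, t0, s0:dim(sop)[3], drop=FALSE], 1, sum)
y.obs <- as.numeric(ystate[, t0] >= s0)

# Traditional calibration assessment for a binary event probability
val.prob(p.hat, y.obs)
```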
It might be a good idea to have a flexible cutoff for the ordinal state, just like we do with a flexible p.threshold (for Net Benefit), a flexible resource constraint (for lift), and a flexible time horizon (for time-to-event models in prognosis).
That will also allow you to make head-to-head comparisons with conventional prediction models (speaking from my own experience!).
Thanks. I think that with many states the AJ estimators use the data too “categorically” and may lack power/precision. And the approach you’ve outlined requires too many decisions, I think.
To be sure I got that right: one option is to compare expected mean time in state vs. observed time in state for each subject and each state, and report the median error or the correlation between expected and observed?
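If it helps, a sketch of that comparison under the same assumed structures as above (sop: subject x time x state array of SOPs with unit time spacing; ystate: subject x time matrix of observed states):

```r
k <- 3                                        # state of interest

# Expected time in state k: sum the SOPs for state k over the time points
exp.time <- apply(sop[, , k], 1, sum)

# 'Observed' time in state k: number of time points the subject spent in state k
obs.time <- apply(ystate == k, 1, sum, na.rm=TRUE)

# Agreement summaries
median(abs(exp.time - obs.time))              # median absolute error
cor(exp.time, obs.time, method='spearman')    # rank correlation
```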