Feedback on BayeScores: A Bayesian AFT Mixture Cure Model for Quantifying Clinical Benefit

Hello datamethods community,

My colleague @Pavlos_Msaouel and I have been developing a general framework for quantifying the magnitude of clinical benefit in randomized controlled trials (RCTs), aiming to move beyond some of the limitations of hazard ratios and of rigid, threshold-based value frameworks.

Our approach, “BayeScores,” is a Bayesian AFT (accelerated failure time) mixture-cure model. It decomposes efficacy into two interpretable components: the odds of achieving long-term survival (an odds ratio, OR, for cure) and the gain in survival time among the uncured (a time ratio, TR). This seems particularly useful for adjuvant or curative-intent studies, where non-proportional hazards are common and a “cure” fraction is a key outcome.
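For readers less familiar with mixture-cure models, here is a minimal sketch of the decomposition in Python (a log-normal AFT is assumed purely for concreteness; the function and argument names are illustrative, not from our package):

```python
import math

def mixture_cure_survival(t, pi_cure, mu, sigma, or_cure=1.0, tr=1.0, treated=False):
    """Marginal survival S(t) = pi + (1 - pi) * S_u(t) for a mixture-cure model.

    pi_cure : cure probability in the control arm
    mu, sigma : location/scale of a log-normal AFT model for the uncured
    or_cure : odds ratio for cure (treatment vs. control)
    tr : time ratio (treatment multiplies uncured survival times by tr)
    """
    if treated:
        # The treatment acts on the cure probability through an odds ratio...
        odds = pi_cure / (1.0 - pi_cure) * or_cure
        pi = odds / (1.0 + odds)
        # ...and on the uncured through the AFT time scale (a log-time shift).
        mu = mu + math.log(tr)
    else:
        pi = pi_cure
    # Log-normal survival for the uncured: S_u(t) = 1 - Phi((log t - mu) / sigma)
    z = (math.log(t) - mu) / sigma
    s_uncured = 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return pi + (1.0 - pi) * s_uncured
```

As t grows, S(t) plateaus at the arm-specific cure fraction, which is exactly what the OR component summarizes, while the TR only stretches the descent toward that plateau.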

A key feature is an identifiability diagnostic (monitoring the posterior correlation of log(OR) and log(TR)). This is coupled with a 3-level prior system (Neutral, Skeptical, Strong Skeptical) on the cure component. This system allows formally encoding clinical belief (e.g., from “cure is plausible” to “very unlikely”) to help regularize the model and stably collapse it to a standard AFT formulation when the data are immature or show no real cure signal.
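Concretely, the diagnostic is just the correlation across posterior draws. A toy Python sketch (the −0.8 cutoff here is illustrative, not a package default):

```python
import numpy as np

def identifiability_flag(log_or_draws, log_tr_draws, threshold=-0.8):
    """Flag potential non-identifiability between the cure (OR) and AFT (TR)
    components via their posterior correlation."""
    rho = np.corrcoef(log_or_draws, log_tr_draws)[0, 1]
    return rho, rho < threshold

# Simulated draws from a nearly degenerate joint posterior, mimicking
# immature follow-up where OR and TR can trade off against each other.
rng = np.random.default_rng(1)
latent = rng.normal(0.0, 1.0, 4000)
log_or_draws = 0.5 + latent + rng.normal(0.0, 0.1, 4000)
log_tr_draws = 0.2 - latent + rng.normal(0.0, 0.1, 4000)
rho, flagged = identifiability_flag(log_or_draws, log_tr_draws)
```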

We’ve attached an example figure that shows this in action. Panels A-F visualize how non-identifiability (a strong negative correlation) at 2 years resolves as follow-up increases to 5 years. Panels G-J show how the skeptical prior helps resolve a mis-specified model, collapsing it to a stable AFT-only fit.

Finally, we map these components (OR and TR) to a continuous 0-100 utility score using a concave function to avoid the “cliff effects” of categorical systems and properly encode diminishing returns.
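The sketch below conveys the idea of the concave mapping only; the weights and steepness constant are invented for illustration and are not the published scoring:

```python
import math

def bayescore_sketch(or_cure, tr, w_or=0.6, w_tr=0.4, k=1.5):
    """Illustrative concave mapping of (OR, TR) to [0, 100]; the weights
    w_or/w_tr and steepness k are made up for this sketch."""
    benefit = max(0.0, w_or * math.log(or_cure) + w_tr * math.log(tr))
    # 1 - exp(-k*x) is concave and saturates: each extra unit of benefit adds
    # less score, avoiding the cliff effects of categorical thresholds.
    return 100.0 * (1.0 - math.exp(-k * benefit))
```

For example, moving the cure OR from 2 to 4 adds less to the score than moving it from 1 to 2, which is the diminishing-returns behavior we want.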

We’ve implemented this in an R package (using Stan) and have put the code and a detailed worked example on GitHub: https://github.com/albertocarm/bayescores

We would be extremely grateful for any feedback, critiques, or advice this community might have on the statistical approach, the model specification, or potential pitfalls we may have overlooked.


This is impressive. Congratulations on taking this to a high level. My only suggestion is to couple the assessment of identifiability to the effective sample size of the posterior samples.


Thank you so much, Dr. Harrell. We really appreciate you taking the time to look at this and for the encouraging words.

That’s a fantastic suggestion. Coupling the posterior correlation assessment with the ESS makes sense. I imagine that severe non-identifiability would manifest as very poor chain mixing for those parameters. This would be a much stronger warning sign than the correlation value alone. We will look into incorporating that.
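To sketch how we imagine combining the two signals (a crude initial-positive-sequence ESS estimator in Python; all cutoffs are illustrative, not validated defaults):

```python
import numpy as np

def ess_sketch(x):
    """Crude effective sample size from the initial positive sequence of
    sample autocorrelations (a simplified version of standard MCMC ESS)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    # Autocorrelations at lags 0..n-1, each normalized by its (n - k) count
    acf = np.correlate(xc, xc, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    tau = 1.0
    for k in range(1, n):
        if acf[k] <= 0.0:  # stop at the first nonpositive autocorrelation
            break
        tau += 2.0 * acf[k]
    return n / tau

def identifiability_check(log_or_draws, log_tr_draws, rho_cut=-0.8, ess_frac=0.1):
    """Couple the posterior correlation of log(OR) and log(TR) with their
    per-parameter ESS; poor mixing and a strongly negative correlation
    together give a stronger warning than either alone."""
    rho = np.corrcoef(log_or_draws, log_tr_draws)[0, 1]
    n = len(log_or_draws)
    ess_min = min(ess_sketch(log_or_draws), ess_sketch(log_tr_draws))
    return {"rho": rho, "ess_min": ess_min,
            "flag": rho < rho_cut or ess_min < ess_frac * n}
```

In practice we would use Stan’s own ESS estimates rather than this toy estimator, but the coupling logic would be the same.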

On that topic of the identifiability check, we had a follow-up question, if you don’t mind.

We initially focused on the posterior correlation as a clear diagnostic for data immaturity, for instance, when follow-up is too short to distinguish a true cure plateau from a long-surviving uncured patient.

However, we found it was also extremely useful for diagnosing model mis-specification. For example, when you apply a mixture-cure model to data where no real “cure” exists, the model struggles to distinguish a non-existent cured fraction from the tail of the survival distribution for the uncured. This also seems to produce a strong negative posterior correlation, signaling the model is ill-suited to the data.

Our solution was to treat this correlation as a flag indicating that a regularizing skeptical prior is needed on the cure component to stabilize the model. When applied, the prior effectively collapses the model into a stable AFT formulation.
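To make “collapses into a stable AFT formulation” concrete, here is a small prior simulation (the Normal(−4, 1) prior on the logit cure fraction is invented for illustration and is not one of the package’s Neutral/Skeptical/Strong Skeptical settings):

```python
import math
import random

# Illustrative "skeptical" prior on the logit of the cure fraction.
random.seed(7)
logit_pi = [random.gauss(-4.0, 1.0) for _ in range(100_000)]
pi_draws = [1.0 / (1.0 + math.exp(-z)) for z in logit_pi]

# Nearly all prior mass sits on a negligible cured fraction.
prior_mean_cure = sum(pi_draws) / len(pi_draws)
p_cure_above_10pct = sum(p > 0.10 for p in pi_draws) / len(pi_draws)
```

Under a prior like this, unless the data show a clear survival plateau, the posterior keeps the cure fraction near zero and the fit behaves like an ordinary AFT model.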

For us, this has been one of the most gratifying discoveries, using the joint model’s posterior geometry to tell us when our assumptions are wrong.

As a clinician, I’m not sure how well studied this specific point is. I suspect it must be, but I haven’t been able to find much literature on using the posterior correlation as a specific diagnostic for model mis-specification in this context.

We were wondering what your thoughts are on this dual use, and whether you know of literature that discusses it. It’s been the most interesting component of the joint model for us.


I’m not experienced in that area, but the idea sounds terrific to me. As an aside, I love the term “immature data” in your context. One other random thought: see if you can relate what you are doing to posterior predictive checks.
