I’m not sure that “is there a better test statistic/measure to use” is really the core issue here. There may be a fundamental mismatch between the model that’s been built and the questions you’re asking of it.
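If a model is built (trained) on a heterogeneous population but then applied to a different, more homogeneous population, its performance is likely to differ from what was seen during development. Discrimination in particular tends to look worse, because there is less variation between patients for the model to pick up on.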
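As a rough illustration of that case-mix point, here is a small simulation sketch. Everything in it is an assumption made for illustration and has nothing to do with your actual model or data: the outcome is binary, the model's score is taken to be the true log-odds of a bad outcome, and the two populations differ only in how spread out that score is (`risk_sd`, a hypothetical parameter). The function names `auc` and `simulate` are likewise made up here.

```python
import math
import random


def auc(scores, labels):
    """Chance that a random positive case outscores a random negative one.

    Ties count as one half. This is the usual c-statistic for a binary outcome.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate(risk_sd, n=1500, seed=0):
    """Simulate one population and return the AUC of the (correct) risk score.

    risk_sd controls heterogeneity: a larger value means patients differ
    more from one another in their underlying risk.
    """
    rng = random.Random(seed)
    scores, labels = [], []
    for _ in range(n):
        x = rng.gauss(0.0, risk_sd)        # model score = true log-odds (assumed)
        p = 1.0 / (1.0 + math.exp(-x))     # probability of a bad outcome
        y = 1 if rng.random() < p else 0   # observed outcome
        scores.append(x)
        labels.append(y)
    return auc(scores, labels)


# Same model, same calibration; only the spread of risk differs.
auc_heterogeneous = simulate(risk_sd=2.0)
auc_homogeneous = simulate(risk_sd=0.5)
print(f"heterogeneous population AUC: {auc_heterogeneous:.2f}")
print(f"homogeneous population AUC:   {auc_homogeneous:.2f}")
```

In a setup like this the AUC should come out clearly lower in the narrower population, even though the model is identical and correctly specified in both. That is the sense in which a drop in discrimination need not mean the model is "broken"; it may just reflect the population it is being applied to.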
The model seemingly predicts good prognosis vs bad prognosis, but you want it to discriminate by survival time, which is a different question from the one it was built to answer.