First-time poster. I am primarily a clinician (medical oncologist), but pursuing a part-time PhD to develop teaching materials on clinical prediction models (CPMs) for clinicians.
I am looking to review a commonly used oncology predictive tool, and in reviewing the published validation/calibration papers I am finding that a reasonable number do not use the true observed endpoint (death), but an endpoint estimated using Kaplan-Meier. I cannot find this approach addressed in the evolving literature on reporting, validating, and calibrating CPMs. While it raises a nonspecific concern for me about the impact of censoring and the potential introduction of bias, I am not knowledgeable enough to delineate this concern explicitly, nor to judge how concerned I should be. Any thoughts on this would be greatly appreciated.
I know that this post is from 2015, but the following articles are great:

Survival Outcomes: Assessing Performance and Clinical Usefulness in Prediction Models With Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models  PubMed

Competing Risks: Validation of prediction models in the presence of competing risks: a guide through modern methods  PubMed

Counterfactual Predictions: [2304.10005] Prediction under hypothetical interventions: evaluation of performance using longitudinal observational data
One of the key concepts that I'm still trying to figure out is IPCW (inverse probability of censoring weighting); does anyone have a good reference?
I'm not clear on why this is a complex problem except when there are competing risks (and for that, state transition models are more logical IMHO). The Kaplan-Meier estimator should play no role here, because it is for the homogeneous case where patient-specific risk factors don't exist, e.g., older patients die at the same rate as younger patients.
By "complex problem" do you mean validation of a prediction model with a time-to-event outcome, or just IPCW?
Maybe I missed something. Censoring gets interesting/challenging when it's not random (KM and Cox assume random censoring), and for that case you might attempt to get debiased estimators by forms of inverse probability weighting.
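Since KM keeps coming up, here is a minimal sketch of the product-limit estimator in plain Python (no dependencies), just to make the random-censoring assumption concrete: censored subjects contribute only by leaving the risk set.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) survival estimate.
    times: observed follow-up times; events: 1 = event, 0 = censored."""
    # distinct event times, in increasing order
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)  # still under observation
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        surv *= 1.0 - deaths / at_risk               # product-limit step
        curve.append((t, surv))
    return curve

# toy data: six patients, two censored (events 3 and 6 are censoring times)
print(kaplan_meier([2, 3, 4, 5, 6, 7], [1, 0, 1, 1, 0, 1]))
```

If the censored patients were systematically sicker than those who stayed under observation, the risk sets above would be biased samples and the estimate would no longer be valid; that is exactly the non-random-censoring problem.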
If I understand correctly, Uno's article on his version of the c-index for time-to-event outcomes suggests the KM estimator for IPCW:
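To make the IPCW idea concrete, here is a minimal sketch (not Uno's full estimator): apply the same product-limit formula to the *censoring* indicator to estimate G(t) = P(C > t), then weight each observed event at time t by 1/G(t−), the inverse probability of having remained uncensored up to just before t.

```python
def censoring_km(times, events):
    """KM estimate of the censoring distribution G(t) = P(C > t):
    the product-limit formula with the event indicator flipped."""
    cens_times = sorted({t for t, e in zip(times, events) if e == 0})
    G, curve = 1.0, []
    for t in cens_times:
        at_risk = sum(1 for ti in times if ti >= t)
        censored = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 0)
        G *= 1.0 - censored / at_risk
        curve.append((t, G))
    return curve

def ipcw_weight(t, G_curve):
    """1 / G(t-): inverse probability of being uncensored just before t."""
    G = 1.0
    for ct, g in G_curve:
        if ct < t:
            G = g
    return 1.0 / G

# toy data: censoring occurs at times 3 and 6
times, events = [2, 3, 4, 5, 6, 7], [1, 0, 1, 1, 0, 1]
G = censoring_km(times, events)
print([ipcw_weight(t, G) for t, e in zip(times, events) if e == 1])
```

Late events get larger weights because they "stand in" for similar patients who were censored earlier; this is what debiases pair counts in Uno's c-statistic under random censoring.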
Sounds reasonable to me, yet isn't it strange to use variables in order to produce performance metrics?
I didn't see anything similar when dealing with performance metrics for binary outcomes (calibration, discrimination, or clinical utility). There is something "clean" about using performance metrics that are agnostic to the original variables used to develop the model (such as your original version of the c-index).
I'm motivated to find a common language that might encourage me (and my colleagues) to avoid information loss in the context of performance metrics. Even if the model was trained only on a binary outcome, it might be useful to use your version of the c-index because there are more usable pairs (event ↔ event and sometimes censored ↔ event).
The simplest and most general-purpose performance measure of predictive discrimination is the new adjusted pseudo R^2 discussed here and implemented in the R rms package. This works any time you have a log-likelihood function and can compute the effective sample size, and it corrects for overfitting. Pseudo R^2 handles censoring quite well. Of the four measures at the bottom of that web page I like R^{2}_{p,m} the best. In OLS this is almost exactly R^{2}_\text{adj}.
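For concreteness, a small sketch of how such a measure could be computed. This assumes the form R^2_{p,m} = 1 − exp(−(LR − p)/m), with LR the likelihood-ratio chi-square, p the number of model parameters, and m the effective sample size; that is my reading of the idea, so check the rms documentation for the exact definition before relying on it.

```python
import math

def r2_pm(lr_chisq, p, m):
    """Overfitting-corrected pseudo R^2 (assumed form, see lead-in).
    lr_chisq: likelihood-ratio chi-square of the fitted model
    p: number of estimated parameters (the overfitting penalty)
    m: effective sample size (e.g. number of events for censored data)"""
    return 1.0 - math.exp(-(lr_chisq - p) / m)

# hypothetical model: LR chi-square 80 on 5 parameters, 200 events
print(round(r2_pm(80, 5, 200), 3))
```

Note how m, not the raw sample size n, appears in the denominator: with heavy censoring the effective sample size is roughly the number of events, so the same LR chi-square yields a larger (more honest) R^2 than a naive n-based version would.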
Thatâ€™s great, thanks!
More questions/points :

Some algorithms do not have a log-likelihood function; what about them?

I can't interpret anything with "log" or "exp", which is why I'm not a big fan of the Nagelkerke version of R^2 even though it is a proper scoring rule. I'm a fan of the c-index because I can tell a story that relates to the number; I use the c-index as a tool for exploration. For choosing candidate models I use the lift curve under a resource constraint, net benefit (NB) for treatment harm, and realized NB for a combination of the two.

Why would you consider R^2 measures a type of discrimination? The Brier score is considered an "overall performance measure" because it captures both calibration and discrimination, and the same is true for R^2. Maybe we should have a separate subcategory for explained variation?
Some good points. I greatly prefer methods that have likelihoods. That way you can move between Bayesian and frequentist attacks on the problem and can handle more types of censoring elegantly. Lift curves are highly useful and work for most settings. R^2 captures both discrimination and calibration, but unlike the Brier score it greatly emphasizes discrimination.
To learn how to better interpret pseudo R^2 we need a compendium of examples where it is compared to the c-index and other measures. It's worth the trouble because likelihood-based methods are the most efficient/least noisy.
I tend to equate explained variation with discrimination, but I'll bet there are formal definitions that would help us.
Just to illustrate my point:

[Figure: pairs of subjects connected by lines; legend symbols lost — one marker denotes Event, the other Non-Event]
The binary c-index compares only the pairs drawn with solid lines, while the time-to-event c-index also compares the additional dashed-line pairs. I think that comparing between events is often more important than comparing between events and non-events, especially when we are considering prioritization for screening, MRI, etc.
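A quick way to see the extra information is simply to count the pair types. A toy sketch (`pair_types` is a hypothetical helper, and censored subjects are crudely lumped with non-events, as a binary analysis would do):

```python
from itertools import combinations

def pair_types(events):
    """Count comparison pairs by type.
    events: 1 = event, 0 = non-event (or censored, in a binary analysis).
    The binary c-index uses only event <-> non-event pairs; a
    time-to-event c-index can also rank event <-> event pairs,
    since the earlier event time gives their ordering."""
    event_nonevent = event_event = 0
    for a, b in combinations(events, 2):
        if a == 1 and b == 1:
            event_event += 1        # extra pairs: usable only with times
        elif a != b:
            event_nonevent += 1     # the pairs a binary c-index can use
    return event_nonevent, event_event

# 3 events and 2 non-events: 3*2 = 6 binary pairs, plus 3 event-event pairs
print(pair_types([1, 1, 1, 0, 0]))
```

As the event rate rises, the event ↔ event pairs grow quadratically while the binary pairs do not, which is why the time-to-event c-index can be noticeably less noisy in high-event-rate settings like prioritization for screening.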