Over the past few years, my field (hematology/oncology) has witnessed a remarkable surge in the popularity of a surrogate endpoint called POD24 (progression of disease within 2 years of chemoimmunotherapy), initially described in patients with follicular lymphoma. Follicular lymphoma is in most cases an indolent type of lymphoma, and patients commonly experience long remissions between relapses. Clinicians are interested in identifying a surrogate endpoint for overall survival (OS), hoping this might expedite clinical trials. Beyond risk prediction, they are also interested in “risk stratifying” patients to guide patient selection in clinical trials (e.g. testing an intervention only in the high-risk group).

This is the first paper describing POD24: https://www.ncbi.nlm.nih.gov/pubmed/26124482/

Since this original description, many have echoed that POD24 is indeed prognostic, and others have also used it in other types of lymphoma.

Here are some criticisms and potential problems I foresee:

- While it makes perfect sense that a shorter time to progression might predict worse OS (I have a feeling that this might be the case not only in follicular lymphoma but in any type of cancer), I doubt that the risk of death magically increases at 24 months.
- Massive loss of information: after dichotomizing time, one cannot make predictions on time anymore…
- POD24 prompts a landmark analysis from the time of first event, with all the associated issues inherent to landmarking (e.g. drops patients censored prior to landmark event). In other words, this creates unnecessary problems.
- Clinicians may be tempted to use POD24 for risk prediction, although we know dichotomization is particularly harmful for individual risk prediction
- Researchers have started building predictive models with POD24 as the outcome variable, using an arbitrary timepoint (why not POD20 or POD36?) and thereby losing the time-to-event information.

Aren’t there better solutions to improve risk prediction and to analyze trial data than the arbitrary dichotomization of an outcome variable?

If someone could direct me to an open-access database of lymphoma patients (including both relapse and survival data), I would be happy to make my case by running some simulations in R.

I’m posting this to hopefully spark an insightful discussion around the use of this surrogate endpoint.

I agree. Similarly in solid malignancies people are starting to become interested in “landmark analyses” using 1- or 2-year survival probabilities and I’ve witnessed discussions to use those as primary endpoints for phase 3 RCTs. I think this is a bad idea for many of the reasons you have listed. The loss of information caused by using 2-year survival probability as the primary endpoint feels like the investigators are trying to shoot themselves in the foot.

I concur too, of course, as to the self-inflicted injury of dichotomization. I note, however, that Ascierto & Long [1] argue in favor of landmark analysis, on the very basis that “the use of median progression-free survival might underestimate the clinical value of a drug.” They seem mainly motivated by this consideration:

“As more drugs and combinations enter clinical development programmes, the landmark progression-free survival will be a more helpful benchmark to understand if a novel therapy has a place in the emerging treatment landscape.”

The figure from [1] also underscores this point:

A full working-out of this issue probably should address the motivations for interest in landmark analysis. Thanks to @drjgauthier for posting; I look forward to following this convo and learning more about this.

1. Ascierto PA, Long GV. Progression-free survival landmark analysis: a critical endpoint in melanoma clinical trials. The Lancet Oncology. 2016;17(8):1037-1039. doi:10.1016/S1470-2045(16)30017-1 PMID 27324281

Good points. I am not aware though of statistical tests commonly used for primary endpoints that compare median PFS values, although we as physicians do place a lot of emphasis on such medians when reading papers. Most often proportional hazards models or log-rank tests are used instead for time-to-event outcomes. In the figure above proportionality is clearly violated so other models would be more appropriate.

A recent paper that proposes a hypothesis test to compare medians is https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4793497/. However, one needs to be aware that comparing just a “slice” of the survival curves (be it horizontal or vertical) often has much less power than a logrank test, as long as the shapes of the curves are not too malign. Note that a logrank test is always valid in the sense that it protects the type I error, irrespective of what the curves look like. It might not have optimal power, though, once you move away from proportional hazards.
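
To make the power argument concrete, here is a rough simulation sketch. Everything in it is an assumption chosen for illustration: exponential survival, a control median of 24 months, a hazard ratio of 0.6, 100 patients per arm, and administrative censoring at 36 months. The function names (`logrank_z`, `landmark_z`, `estimate_power`) are made up for this post. The "vertical slice" is a simple two-proportion test of survival at 24 months, which is only legitimate here because nobody is censored before 24 months.

```python
import math
import random


def logrank_z(times, events, groups):
    """Two-group log-rank Z statistic (assumes no tied event times)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = [groups.count(0), groups.count(1)]
    obs_minus_exp = 0.0
    variance = 0.0
    for i in order:
        n0, n1 = at_risk
        if events[i]:
            total = n0 + n1
            # Observed minus expected events in group 1 at this event time.
            obs_minus_exp += (groups[i] == 1) - n1 / total
            # Hypergeometric variance for a single event.
            variance += n0 * n1 / total ** 2
        at_risk[groups[i]] -= 1
    return obs_minus_exp / math.sqrt(variance)


def landmark_z(times, groups, landmark):
    """Z statistic for the difference in proportions event-free at `landmark`.

    Only valid when no patient is censored before the landmark.
    """
    n = [groups.count(0), groups.count(1)]
    event_free = [0, 0]
    for t, g in zip(times, groups):
        if t > landmark:
            event_free[g] += 1
    p0, p1 = event_free[0] / n[0], event_free[1] / n[1]
    pooled = (event_free[0] + event_free[1]) / (n[0] + n[1])
    se = math.sqrt(pooled * (1 - pooled) * (1 / n[0] + 1 / n[1]))
    return (p1 - p0) / se


def estimate_power(n_sims=1000, n_per_arm=100, hazard_ratio=0.6,
                   control_median=24.0, follow_up=36.0, landmark=24.0, seed=1):
    """Fraction of simulated trials in which each two-sided 5% test rejects."""
    rng = random.Random(seed)
    base_rate = math.log(2) / control_median
    z_crit = 1.959964
    rejections = {"logrank": 0, "landmark": 0}
    for _sim in range(n_sims):
        times, events, groups = [], [], []
        for g in (0, 1):
            rate = base_rate * (hazard_ratio if g == 1 else 1.0)
            for _patient in range(n_per_arm):
                t = rng.expovariate(rate)
                times.append(min(t, follow_up))   # administrative censoring
                events.append(t <= follow_up)
                groups.append(g)
        if abs(logrank_z(times, events, groups)) > z_crit:
            rejections["logrank"] += 1
        if abs(landmark_z(times, groups, landmark)) > z_crit:
            rejections["landmark"] += 1
    return {name: count / n_sims for name, count in rejections.items()}


if __name__ == "__main__":
    print(estimate_power())
```

Under these proportional-hazards assumptions the log-rank test should reject more often than the 24-month slice. With crossing or late-separating curves the ranking could well change, so this is only an illustration of the "not too malign" case.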

Thank you very much for bringing this up! I have fought internally at my company “against” the promotion of this endpoint, for very much the reasons you mention. Beyond the question of whether an endpoint that so dramatically reduces information can be used for any “prediction” (e.g. of an OS effect), I have further methodological concerns about the following points in the Casulo et al. paper:

They define a “risk-defining event” as follows:

- Reference group: patients with no PD or PD > 24 months, OS starting at 24 months.
- Early progressors: OS from PD for those with PD <= 24 months.

Isn't such an analysis subject to immortal time bias? Also, what is the precise scientific question being answered with this analysis? The authors claim this to be conservative, as they shorten OS for the reference group by 24 months. Is this really true? Think about the following scenario:

- Mortality unrelated to PD.
- Mortality very high <= 24 months, negligible afterwards.

In this scenario, their approach could falsely claim that PD within 24 months is predictive of worse survival: those with PD would generally live *longer*?
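
The scenario above can be sketched as a small simulation. All numbers are hypothetical: a death hazard of 0.03 per month up to 24 months and 0.001 per month afterwards, a PD hazard of 0.03 per month, PD drawn independently of death, and complete follow-up with no censoring. The group definitions follow my reading of the paper as summarized above, and the function names are made up for this post.

```python
import random


def draw_death_time(rng, early_hazard, late_hazard, change_point):
    """Death time under a piecewise-constant hazard (inverse cumulative hazard)."""
    cum_hazard = rng.expovariate(1.0)
    if cum_hazard <= early_hazard * change_point:
        return cum_hazard / early_hazard
    return change_point + (cum_hazard - early_hazard * change_point) / late_hazard


def simulate(n=20000, early_hazard=0.03, late_hazard=0.001, pd_hazard=0.03,
             landmark=24.0, horizon=24.0, seed=1):
    """Proportion dying within `horizon` months of each group's time origin."""
    rng = random.Random(seed)
    early_n = early_deaths = ref_n = ref_deaths = 0
    for _ in range(n):
        death = draw_death_time(rng, early_hazard, late_hazard, landmark)
        pd_time = rng.expovariate(pd_hazard)  # independent of the death time
        if pd_time <= landmark and pd_time < death:
            # Early progressor: the OS clock starts at PD.
            early_n += 1
            early_deaths += (death - pd_time) <= horizon
        elif death > landmark:
            # Reference: alive at the landmark without PD; clock starts there.
            ref_n += 1
            ref_deaths += (death - landmark) <= horizon
        # Patients who die before the landmark without PD enter neither group.
    return {"early_pd": early_deaths / early_n, "reference": ref_deaths / ref_n}


if __name__ == "__main__":
    print(simulate())
```

Even though PD has no effect on mortality by construction, the early-PD group shows a much higher death proportion after its time origin. The reason is that its clock starts inside the high-mortality window, while the reference clock starts after that window has closed.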

As a possible alternative, the authors propose “Early PD as time-varying covariate”. However, to me it is not entirely clear how this was done. HRs are even larger than for the primary analysis?

Internally, I put together a multistate model for one of the largest RCTs run in FL from which I could then predict OS functions conditional on the response status at 24 months. This gives results that are comparable to those from a landmark analysis (most notably, the initial randomization is “lost” at the landmark timepoint), as discussed in https://www.ncbi.nlm.nih.gov/pubmed/18836831: “As expected the two approaches give similar results. The landmark methodology does not need complex modeling and leads to easy prediction rules. On the other hand, it does not give the insight in the biological processes as obtained for the multi-state model.” Furthermore, the multistate model gives me more flexibility to get predictions as functions of covariates, e.g.

To conclude, note that other such endpoints are also being explored, e.g. CR at 30 months: https://www.ncbi.nlm.nih.gov/pubmed/28029309

Very good point. When using the log-rank test (and its more general case: the score test for a proportional hazards model with covariates) for the purposes of frequentist null hypothesis testing, the assumption of proportionality is satisfied by the fact that the two treatments are assumed to be identical (the survival curves fully overlap with each other and the HR is constantly 1). In cases where that assumption substantially deviates from the truth, it makes sense that the operating characteristics of these tests would be suboptimal.

Thank you all for your comments.

I think we can isolate here two separate issues:

**1. The use of POD24 as a surrogate for OS**

I found this interesting editorial in the Journal of Clinical Oncology by Fengmin Zhao.

https://ascopubs.org/doi/full/10.1200/JCO.2016.66.4581

Zhao reviews very nicely the evolution of the criteria for surrogacy since the landmark description by Prentice in 1989, and comments on a publication by Zer et al investigating the trial-level surrogacy of PFS and ORR for OS in patients with advanced soft tissue sarcoma.

Zhao seems to have similar concerns regarding the use of an arbitrary timepoint for a time-to-event endpoint:

"In general, time-to-event outcomes assessed at a single time point (eg, 3-month and 6-month PFR) are easily affected by the choice of time point, do not take censoring into account, may be misleading if survival curves are erratic (eg, crossover), and likely do not correspond well with the hazard of the same end point (PFS).[21]

The median value of a time-to-event outcome (eg, median PFS) presents similar problems.[21]

For trial-level surrogacy validation, the HR includes all data from all patients and is the most appropriate measure for time-to-event outcomes."

**2. The use of POD24 for individual risk prediction**

I think we can easily agree that POD24 might not be ideal for individual risk prediction, since dichotomizing at the 24-month timepoint discards the ‘time’ information. As a general principle, categorizing a continuous variable is extremely detrimental to modeling and individual risk prediction.
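
As a toy illustration of what the dichotomy hides, here is a minimal sketch. It assumes a purely hypothetical smooth relation between time to progression and the risk of death (a logistic curve I made up), uniformly distributed progression times up to 48 months, and a binary death outcome with no censoring. None of this is fitted to real lymphoma data.

```python
import math
import random


def assumed_death_risk(ttp_months):
    """Hypothetical smooth risk of death as a function of time to progression."""
    return 1.0 / (1.0 + math.exp(-(1.0 - 0.05 * ttp_months)))


def observed_risk(n=40000, max_ttp=48.0, window=6.0, cutoff=24.0, seed=1):
    """Observed death proportions by 6-month window and by POD24 status."""
    rng = random.Random(seed)
    n_windows = int(max_ttp / window)
    win_n = [0] * n_windows
    win_deaths = [0] * n_windows
    grp_n = {"POD24": 0, "later": 0}
    grp_deaths = {"POD24": 0, "later": 0}
    for _ in range(n):
        ttp = rng.uniform(0.0, max_ttp)
        died = rng.random() < assumed_death_risk(ttp)
        w = min(int(ttp / window), n_windows - 1)
        win_n[w] += 1
        win_deaths[w] += died
        g = "POD24" if ttp <= cutoff else "later"
        grp_n[g] += 1
        grp_deaths[g] += died
    by_window = [d / m for d, m in zip(win_deaths, win_n)]
    by_group = {g: grp_deaths[g] / grp_n[g] for g in grp_n}
    return by_window, by_group


if __name__ == "__main__":
    by_window, by_group = observed_risk()
    print("POD24 vs later:", by_group)
    print("6-month windows:", [round(r, 2) for r in by_window])
```

Under this assumed curve, patients progressing in the first 6 months and those progressing at 18 to 24 months get the same POD24 label despite quite different risks, while the step across the 24-month cutoff is comparatively small. A model that keeps time continuous would not have to pretend otherwise.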
