Some clarification of this statement in the FDA document on PFS will be appreciated:
“Compared with TTP, PFS is the preferred regulatory endpoint… PFS assumes that death events are randomly related to tumor progression” on page 13 of the Clinical Trial Endpoints
for the Approval of Cancer Drugs and Biologics Guidance for Industry document (https://www.fda.gov/media/71195/download). In non-indolent cancers, death is part of the natural sequence of progression, so it is difficult to square this “random relation” statement with the use of PFS as a surrogate endpoint for OS in the first place. I would appreciate any insight into the FDA’s guidance.
This doesn’t answer your question, but PFS, by using the time until the first event among death and progression, counts progression as being just as bad as death. Does that make any sense?
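As a toy sketch of that definition (hypothetical times and an invented function name, not anyone's actual algorithm), PFS takes whichever of progression or death comes first, and the two events are indistinguishable in the result:

```python
# Toy sketch (hypothetical data): PFS is the time to the first of
# progression or death, whichever comes first; both count as the event.

def pfs_time(progression_day, death_day, followup_day):
    """Return (time, event_observed) for progression-free survival.
    None means that event was never observed for this patient."""
    event_days = [d for d in (progression_day, death_day) if d is not None]
    if event_days and min(event_days) <= followup_day:
        return min(event_days), True
    return followup_day, False  # administratively censored

# A progression at day 90 and a death at day 90 are indistinguishable:
print(pfs_time(90, 400, 730))    # (90, True)
print(pfs_time(None, 90, 730))   # (90, True)
print(pfs_time(None, None, 730)) # (730, False)
```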
Prof. Harrell, thanks for your comment; this matches my intuition about the endpoint.
As a follow-up, would you agree with an approach that applies different censoring rules to death vs. progression when developing a PFS algorithm? For example, say that during the course of a trial a biopsy procedure is necessary. Would it be proper to count deaths from the day of the biopsy but count progressions only after 100 days following the procedure?
From my intro survival class, censoring must be completely independent of the outcome of interest, i.e., PFS in this case. My understanding is that the censoring probability should be the same for patients who later die and for those who develop progression during the course of the trial. So a rule where, right after the procedure, all deaths are counted but progressions within that 100-day window are not would seem to contradict the independent-censoring assumption. Am I right, or am I overthinking this? Thanks!
I guess they mean that, conditional on progression at some time t, the time to subsequent death follows some random distribution. I guess the argument is that if progression were perfectly predictive of death then TTP would suffice?
This issue of censoring is a very interesting topic because, e.g., patients switch treatments. I think censoring in the OS and PFS analyses will be different: the former will use the last date known alive and the latter would use the last assessment date(?) There is a PHUSE working group on estimands in oncology currently seeking volunteers: PHUSE — The Global Healthcare Data Science Community
Thanks @pmbrown. Regarding the question, conditional probabilities of death make sense; it would be helpful if you could share some references.
In the case of death censoring, I was referring to rules for censoring death and progression within the same PFS algorithm. I hope my follow-up is clear.
A random effects model is a conditional model, i.e., conditional on the random effects. You will find joint frailty models in the literature, for example, where a random effect forces some correlation between the event and death. Not very useful here though.
I’m not sure I really follow your point about independent censoring. I’m not sure how progression is defined, but it might encapsulate, e.g., treatment failure, switching, etc., and thus avoid censoring.
In oncology, scenarios where intermediate endpoints such as PFS or DFS are more reliable than OS for the analysis of RCTs are becoming increasingly common. See related post here, and here is a scenario where PFS, and not OS, is the gold standard for proper RCT inferences.
The effect of subsequent therapies is a major reason why OS has structural biases that intermediate endpoints lack. It is a good problem to have. In aggressive cancers, or in the past when we had fewer treatments available, OS events occurred earlier, making OS a more reliable endpoint under typical modeling assumptions.
It is also indeed a limitation of PFS and DFS that they treat other events as equal to death. However, in practice nowadays death often happens much later than other events in many cancers so that time to progression (TTP) strongly corresponds to PFS. Nevertheless, no endpoint is perfect and there is certainly much work to be done to address the limitations of OS and of PFS/DFS. Revamping our estimands in oncology, and the models we use for them, is a tremendous open methodological problem.
Thanks, @Pavlos_Msaouel. So methodologically, in the context of developing a PFS algorithm, would you consider a rule that censors some progressions but counts all deaths to be an anomaly, since the whole idea of PFS is to treat death and PD as equally important events?
See here an example of how we addressed progression by imaging as an ordinal outcome, along with clinical progression and death as time-to-event in an early-phase trial design.
Here we used a different approach but still treated progression as an ordinal outcome. And this is just the tip of the iceberg, as we focused more on other aspects and kept that part intentionally simple. There are far more elaborate ways to do this.
Thanks for the resources. I will surely review them and reach out if I need anything.
I’m still not getting why an ordinal state transition model isn’t the most natural way to go. Random effects and competing risks seem to be harder to interpret.
I agree and this is where we are moving towards as well.
PFS assumes that death events are randomly related to tumor progression.
The only way I can make sense of this is that they’re saying it assumes that the relationship between death and progression is the same on both arms (ie not dependent on treatment). That is, the % of deaths occurring before progression is observed won’t be systematically different between the arms. That won’t always be true (as in the contrived example I offer below), but I’m not sure it matters if we can simply agree that death is not a desirable outcome. PFS is not useful solely because it measures time to progression accurately (it may not); it is also useful because events accumulate faster than deaths, and because it is less affected by treatments delivered on progression. (There are also downsides of using a somewhat subjective endpoint in a setting where blinding is usually impossible, of course, especially when trialists and/or funders may care a great deal about the direction of the outcome.)
The rest of this post considers the biases inherent in TTP and the more general problem of censoring when there are competing events.
Death is (often) related to progression and it can happen before anyone has had a chance to observe progression. A treatment that is less successful in preventing progression/death, or which kills people before they have had a chance to progress, would end up looking better than it really is if we censor deaths.
Here’s a contrived example:
Let’s say the intervention is radical surgery and, in truth, it makes no difference at all to the risk or timing of progression. But the radical surgery kills x% on the operating table; they die on the day they were randomised. If we censor them at the time of death, both arms will do equally well according to TTP because we’re pretending the dead people never existed (or at least that their deaths had nothing to do with the treatment that killed them, which amounts to a serious but easily invisibilised violation of ITT for this endpoint).
It gets much worse if the increased risk of early death is related to (poor) prognosis. If the people who would have progressed early die early instead, the intervention responsible for the early deaths will look better according to TTP.
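This bias is easy to demonstrate with a small simulation (invented numbers: a mean progression time of 500 days and 20% operative mortality; the function and variable names are mine). Censoring the day-0 deaths makes the surgery arm look identical to control under TTP, while counting them as events in PFS reveals the harm:

```python
import random

random.seed(1)

def km_at(data, t_star):
    """Kaplan-Meier survival estimate at time t_star.
    data: list of (time, event) pairs; event is 1 if observed, 0 if censored."""
    surv, at_risk = 1.0, len(data)
    for t, e in sorted(data):
        if t > t_star:
            break
        if e:
            surv *= (at_risk - 1) / at_risk
        at_risk -= 1
    return surv

# Both arms draw progression times from the same distribution (mean 500 days)...
control = [(random.expovariate(1 / 500), 1) for _ in range(5000)]
progressed = [(random.expovariate(1 / 500), 1) for _ in range(4000)]

# ...but radical surgery kills 20% (1000 of 5000) on the day of randomisation.
# TTP with those deaths censored at day 0: the dead simply vanish.
ttp_surgery = progressed + [(0.0, 0)] * 1000
# PFS counting those same deaths as events on day 0.
pfs_surgery = progressed + [(0.0, 1)] * 1000

print(km_at(control, 365))      # ~0.48 event-free at 1 year
print(km_at(ttp_surgery, 365))  # ~0.48: TTP hides the operative deaths
print(km_at(pfs_surgery, 365))  # ~0.39: PFS reveals the harm
```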
You’d hope, of course, that a large difference in early, treatment-related (or other pre-progression) deaths would be noticed. But small-ish differences can and do occur and it takes longer for mortality data to mature, especially if there are effective second and third-line treatments available. And, of course, the risk of publication bias increases if TTP makes a treatment look better than it really is.
In survival analysis, censoring people implies that, when censored, they were still at risk of the event at some time in the future. This is not true for dead people. The reason we count death as an event when we’re trying to measure progression is that it is both a competing event and also (like progression) a negative event. There is no need to insist that deaths and progressions are equally important (or independent of each other), only to acknowledge that dead people can’t progress and that both events are ‘bad’.
We get a similar (but trickier) problem when we try to estimate time-to-discharge. If one arm has a higher risk of dying in hospital, censoring at the time of death will make time-to-discharge look better because we’ve removed the dead people from the denominator for the ‘good’ event. A simple solution to this is to regard dead people as permanently hospitalised (that is, censor them at the end of follow-up, or on the date the data were frozen for analysis).
In-hospital follow-up creates another problem for estimating even the simplest of these outcomes: time-to-death. If people are censored on the day they are discharged (as would often happen by default with a naive survival analysis) the denominator for estimating the risk of death is reduced, making deaths look more common than they really are. If the follow-up period is sensible given the condition being treated, this should at least be apparent when the analysis concludes that virtually everybody dies. People who are discharged before the end of in-hospital follow-up need to be assumed (or verified) to be alive at the end of follow-up.
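A minimal sketch of these two recoding rules, assuming a 28-day follow-up window as in RECOVERY (the field and function names are made up for illustration):

```python
# Sketch of the two recoding rules; the 28-day window follows the RECOVERY
# description, but the field names are invented.
FOLLOWUP_DAYS = 28

def discharge_outcome(day_of_death, day_of_discharge):
    """Time to discharge alive. The dead can never be discharged, so carry
    them to the end of follow-up as censored rather than censoring at death."""
    if day_of_discharge is not None and day_of_discharge <= FOLLOWUP_DAYS:
        return day_of_discharge, True
    return FOLLOWUP_DAYS + 1, False  # died, or still in hospital at day 28

def mortality_outcome(day_of_death, day_of_discharge):
    """28-day mortality. Patients discharged alive are assumed alive at day
    28 (unless later data say otherwise), not censored at discharge."""
    if day_of_death is not None and day_of_death <= FOLLOWUP_DAYS:
        return day_of_death, True
    return FOLLOWUP_DAYS, False

print(discharge_outcome(day_of_death=5, day_of_discharge=None))   # (29, False)
print(mortality_outcome(day_of_death=None, day_of_discharge=10))  # (28, False)
```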
The RECOVERY trial of convalescent plasma for Covid-19 used the approaches to censoring described above (broken links because I’m only allowed two links in this post):
We used Kaplan-Meier survival curves to display cumulative mortality over the 28-day period. We used similar methods to analyse time to hospital discharge and successful cessation of invasive mechanical ventilation, with patients who died in hospital right-censored on day 29.
and, from the supplementary materials:
For the primary outcome [all-cause mortality], discharge alive before the relevant time period (28 days after randomisation) will be assumed as absence of the event (unless there is additional data confirming otherwise).
These methods are not perfect but they do have the advantage of being simple. Fine & Gray proposed an alternative approach: A Proportional Hazards Model for the Subdistribution of a Competing Risk. But interpretation is not simple.
Another trial of convalescent plasma for Covid-19, REMAP-CAP, took a different approach to the multistate problem. To assess organ-support-free days (OSFD) they assigned individuals to a category corresponding to the number of OSFD up to 21 days, with people who started the trial on mechanical ventilation (ie organ support) being assigned to a category labelled ‘0’ and people who died at any time during follow-up being assigned the label ‘-1’; people who remained free of organ support for more than 21 days were assigned ‘22’. These ordered category labels (which happen to look like numbers) were then used in a Bayesian cumulative logistic model. This has some merits for comparing the two groups across multiple different states but the resulting OR (and medians) are hard to interpret.
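My reading of that labelling scheme, as a sketch (the function name and details are my own, not REMAP-CAP's code):

```python
# My reading of the OSFD ordered labels (invented function; REMAP-CAP's
# actual derivation may differ in details such as partial days).

def osfd_category(died, support_free_days):
    """Ordered category label: -1 worst (death at any time), 22 best."""
    if died:
        return -1                 # death at any time during follow-up
    if support_free_days > 21:
        return 22                 # free of organ support for > 21 days
    return support_free_days      # 0..21; 0 = on organ support throughout

print(osfd_category(died=True, support_free_days=15))   # -1
print(osfd_category(died=False, support_free_days=30))  # 22
print(osfd_category(died=False, support_free_days=10))  # 10
```

These labels would then enter the cumulative logistic model as ordered categories, not as numeric scores.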
Another broken link to that trial report: (jamanetwork…com/journals/jama/articlepdf/2784914/jama_estcourt_2021_oi_210114_1635806538.94872.pdf).
This is a useful review which covers the ground above, and more: Practical Recommendations on Quantifying and Interpreting Treatment Effects in the Presence of Terminal Competing Risks: A Review.
Bit long, but I hope it’s useful. It’s a very interesting area and the FDA guidance is far too terse.
Wow. Beautiful summary Josie.
In general competing risk results are hard to interpret. I can make more sense of state transition models.
Regarding the very real interruption of mortality risk by rescue therapy upon patient progression, the only other thought I had is to make all post-randomization treatments be outcomes using an ordinal state transition model where death is worst but the needed intervening treatments are ranked according to their degree of radicality.
State transition models are definitely a nice approach but I’d still like to see nice simple methods reported alongside them. Most (and in some cases, all) readers won’t be able to peer inside the “black box” to check what’s going on with more complicated models and they do allow for considerable “researcher degrees of freedom” unless very carefully (and rigidly) pre-specified.
Also, I’m working in systematic review and meta-analysis and I can’t do anything very useful on the meta-analysis side if the relevant trials use different methods that are incompatible with each other. Many trials do not have sufficient statistical support and struggle to do the simple things well.
I really like using concomitant/subsequent medication as an endpoint. But with something as complicated as second-line treatments for cancer (as opposed to, say, use of painkillers), the same considerations apply. We rarely have good enough information about the relative effectiveness of second-line treatments to be able to use them as anything other than a marker of progression (or evidence of pre-progression crossover).
The data we have is often not up to the complicated models we would like to fit in an ideal(ised) world. In the world we actually live in, we may just end up with a “very precise wrong answer” and attempted meta-analyses which can’t pool the evidence or even summarise it on a plot.
Hi Josie - the investigator degrees of freedom is a very real issue (not that the other ones you mentioned aren’t). In many ways a longitudinal model exposes assumptions that simpler methods make but which are not obvious. One example: if you treat death as a competing risk in a time-to-recovery analysis, you are effectively assuming proportional odds in an ordinal (well, sick, dead) analysis. That’s because when the treatment affects death much differently than it affects the other endpoint, the overall treatment effect from the competing risk model is not interpretable.
@pmbrown About my independent-censoring comment, I was referring to the key tenet of analyzing survival data that requires censoring to be non-informative, i.e., not associated with the underlying event time of interest (the Kaplan-Meier estimator and the Cox proportional hazards model rely on this assumption). So, following the scenario I provided above, if patients who die right after the biopsy are instructed to be censored, that censoring is informative and will lead to biased PFS estimates, since death is a component of PFS. Here PFS is our endpoint of interest.
This is very likely true, but I think it boomerangs on the meta-analysis enterprise. The more sharply formulated our theories about new drugs and their mechanisms of action, and the more refined and specifically targeted our scientific investigations are, the less amenable they are to broad-strokes summarization. Any steps we take that advance biomedical investigations beyond the reach of meta-analysis are probably good ones.
Meta-analysis is used in a lot of different ways, of course. But I’m talking about systematic review and meta-analysis of RCTs aimed at informing clinical practice. We’re trying to quantify the benefits and harms for real people in real clinical situations. If it works in practice, maybe the underlying theory is right or maybe there’s a different underlying mechanism. That’s not a question we’re (directly) interested in. If, say, you want to use cause-specific mortality to prove your mechanism of (beneficial) action, that’s fine. But we need all-cause mortality because the individual does not care what killed them; we’re not going to count the potential wins for a treatment while ignoring the potential losses.
Take the simplest example: a lot of continuous outcomes are appropriately reported as medians with IQR or range, instead of means with SD or SE, when the distributions are skewed. But for meta-analysis we need the mean and SD or SE (or for medians to be routinely reported with their SEs). Whatever else they do, we need authors to report a basic set of results that can be compared and, where appropriate, pooled, to allow a better estimate of the underlying effects and/or reliably identify heterogeneity which needs explaining.
It is almost always possible to do more with the data than the majority of authors do. And it is sometimes warranted. But if those results can’t be directly compared with the rest of the literature, they’re very hard to put into context and we end up making clinical decisions based on hand-waving, or have to expend massive resources to obtain individual patient data.
This is something I’d like the COMET initiative to take up. Their core outcomes (which every trial in that area is asked to measure and report) need to be accompanied by recommended methods of analysis and reporting (such as how to deal with competing events in a straightforward, minimally biased way that every author can apply). Not to tie anyone’s hands over whatever other methods of analysis they think are appropriate, but to make it possible to compare across trials so that we can make sense of all the evidence, not just the bits of it we feel like including.