Is it possible to conduct phase III clinical trials in oncology if it is suspected that the survival curves will cross?

I have recently had an interesting experience. As I commented in a previous post on Datamethods, we have recently seen some immunotherapy trials that were declared negative despite the likely existence of benefit. In some cases this was due to crossing survival curves in trials analyzed with Cox models or log-rank tests, which were not able to capture differences that may well be real.

What should be done next in these circumstances? One possibility would be to repeat the trial with a more appropriate statistical approach, e.g. a comparison of milestone survival rates at a late time point. However, when we proposed this approach there was some discussion, as it appears that this type of endpoint (% survival at fixed time points) might not be acceptable to regulatory agencies.

As I understand it, the point is that it would be controversial to accept that the late benefit of some individuals comes as a kind of trade-off, at the expense of early harm to others. For example, the median survival of the whole group could sometimes be shortened so that a few could become long-term survivors.
Therefore, the solution for crossing curves would not be to repeat the trial with a different analytical approach, but to stratify by covariates, thus avoiding, or trying to avoid, that the curves cross. The problem is that in many tumours the predictive and prognostic factors are not well defined. It seems to me that study protocols are not allowed to assume that crossing can happen, even though you can never rule it out…
I can also imagine that if the final benefit (the difference in OS rates) is very large, it might not be so far-fetched to design the study assuming crossing curves from the start. The problem is when the difference is not so large.
In the end all decision making is a trade-off to the extent that uncertainty is inherent in our work.
Since this subject combines ethics, philosophy and statistics, I would like to ask you for a philosophical reflection. Is it appropriate to try to replicate a study with crossing hazards? What do you think?


Maybe my post was too complex, and that is why nobody has responded.
Direct question:

  • Prior phase III trial concludes that OS curves cross, but in the end there is a survival difference of around 15% in favour of the experimental treatment.
  • Non-significant result according to the log-rank test/Cox model.
  • So, is it appropriate to repeat the same study with the same design, eligibility criteria, etc., just by changing the endpoint (e.g., milestone OS % instead of the hazard ratio)?

A quick reply to make a general point. It’s really important that we think about ways that standard statistical methods can fail, and try to use methods that can address the important clinical questions. I don’t think it is ever right for a statistical method to determine the question that a trial addresses (that’s the tail wagging the dog). So I would say, sure, we should do these trials, and it’s the statisticians’ job to find ways of extracting understanding from the numbers. Others can comment more authoritatively than I can on methods applicable to crossing survival curves.

Two other points:

  1. This example also illustrates why pre-specification of analysis plans can be a straitjacket. Sometimes there are good reasons to deviate from the plan.
  2. To address the specific query: I wouldn’t think repeating the trial with a different outcome measure was sensible. If sufficient data exist to give a good answer, just reanalysing seems better to me.

I obviously agree with you. However, that was not exactly the focus of my concern, which arises from a real discussion a few months ago with immuno-oncology experts within a large cooperative group. The problem with the reanalysis you propose is that the agency will not accept it as evidence in a pivotal trial. The consequence is that for years patients will not receive a treatment that improves 2-year survival by 15% compared to the standard. Faced with this, my impression (which may be wrong, because the matter was not explained in these terms) was that the sponsor did not agree to write a protocol with a methodology built around the foreseeable existence of non-proportional hazards, since this would imply assuming that the effect of an experimental drug could fluctuate over time (e.g., hazard ratio >3 in the first six months, benefit only from the tenth month). This could also be contentious at the regulatory level, they said. The only solution they found at that time was to stratify by covariates, to minimize the possibility of crossing hazards once the different prognoses of the patients had been taken into account. This seemed fine to me, but deep down I was left with the feeling that it did not solve the problem at all. Since the predictive factors in immuno-oncology are often uncertain, the proposal to stratify seemed to me more like a way to feel reassured than to face the fundamental problem, which is inherent to all immuno-oncology studies. There are patients called rapid progressors, or who develop immune-mediated hyperprogression, and they cannot currently be identified. Therefore, crossing hazards seem to me to be something people really want to sweep under the carpet, and the sponsors’ statisticians continue to assume that effects are constant even though they know well that they are not.
I understand that this is a thorny and nebulous issue, and it depends on each individual case, but my impression is that long-term survival is something far more valuable to the patients than a narrow median difference.

This shows that you have posed a political problem, Alberto, rather than a statistical one. See §4.1 of Jessica Flanigan’s book, Pharmaceutical Freedom.

The rationale for the study design is unquestionable. If anything, crossing curves support the need for a traditional parallel-design, randomized, double-blinded, placebo-controlled trial. The aim: estimate as many quantiles of the survival curve as possible for two continuous “on treatment” groups in the control and active arms. By all means, admit other arms with cross-over, dose escalation, randomized withdrawal, the sky’s the limit, but don’t forget the hypothesis. Cox models and log-rank tests are in some sense equivalent: the score test of the univariate Cox model is a log-rank test. So I’ll refer to Cox models WLOG.

The analysis is challenging. The Cox model has power and an interpretation that depend on the proportionality of hazards. You can have non-crossing survival curves with non-proportional hazards. It’s an artifact of the natural parametrization of survival in terms of the log-hazard. The Cox model can be seen as weighting earlier survival times more heavily than the eye would when looking at the Kaplan-Meier curve. We look “long” out in “t” (time) on those curves, but the majority of power and inference comes from early survival, where many subjects are at risk and there are many early failures. This is especially the case in oncology studies.

Immunotherapy has spawned a number of biological hypotheses about drug and efficacy: are there subgroups of responders who will improve estimates of the ATE? Is the effect of immunotherapy delayed, so that in the early phases subjects die as quickly as if they were untreated before the drug finally starts working? Etc. Etc. All of these are very interesting, but if there’s a question about who to treat you have to step back to Phase II designs. If there are questions about what the treatment goal is, that boils down to the specific analysis.

For the latter, the G-rho-gamma (Fleming-Harrington) family of weighted log-rank tests is a promising generalization that can weight later survival times more heavily. Suppose, for instance, you are handed a diagnosis and your oncologist wants to plan a treatment strategy with you. People receiving Treatment A have a 2-year median survival. 70% of people receiving Treatment B die within 1 year, but the remaining 30% live for 10 years. Which is the right treatment? The patient can choose. But a G-rho-gamma test is exactly the type of test to use to show superiority of Treatment B, provided we know that’s the actual mechanism. This is intended to dovetail with @davidcnorrismd’s answer that this is in fact a political question.
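
To make the weighting idea concrete, here is a minimal sketch of a Fleming-Harrington weighted log-rank statistic in plain Python. The function name `fh_weighted_logrank` and the toy data are assumptions for illustration only, not data from any trial. With rho = gamma = 0 it reduces to the ordinary log-rank test; with gamma > 0 later event times get more weight.

```python
import math

def fh_weighted_logrank(times, events, groups, rho=0.0, gamma=0.0):
    """Fleming-Harrington G(rho, gamma) weighted log-rank test (illustrative sketch).

    times  : follow-up times
    events : 1 = death, 0 = censored
    groups : 0 = control, 1 = experimental
    The weight at each event time t is S(t-)^rho * (1 - S(t-))^gamma, where S is
    the pooled Kaplan-Meier estimate just before t.
    Returns (z, two-sided p). Negative z means fewer deaths than expected
    in group 1 after weighting.
    """
    data = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in data if e == 1})
    s_prev = 1.0   # pooled KM estimate just before the current event time
    num = 0.0      # weighted sum of observed minus expected deaths in group 1
    var = 0.0      # hypergeometric variance of that sum
    for t in event_times:
        n = sum(1 for ti, _, _ in data if ti >= t)              # at risk, pooled
        n1 = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, group 1
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)   # deaths, pooled
        d1 = sum(1 for ti, e, g in data if ti == t and e == 1 and g == 1)
        w = (s_prev ** rho) * ((1.0 - s_prev) ** gamma)
        num += w * (d1 - d * n1 / n)
        if n > 1:
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        s_prev *= 1.0 - d / n
    z = num / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p

# Toy data (hypothetical): early deaths but some long-term survivors in group 1.
times = [5, 6, 7, 8, 9, 10, 1, 2, 3, 20, 20, 20]
events = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
groups = [0] * 6 + [1] * 6

print(fh_weighted_logrank(times, events, groups))                      # ordinary log-rank
print(fh_weighted_logrank(times, events, groups, rho=0.0, gamma=1.0))  # late-weighted
```

In this toy data set the late-weighted statistic is larger in magnitude than the ordinary log-rank statistic, because the early excess deaths in group 1 receive little weight. The choice of rho and gamma would have to be pre-specified; picking them after seeing the curves is not a valid test.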


What could a “survival difference of 15%” mean?

Some questions are statistical, but not all (learned from my good friend Phil Lavori, Stanford). Your question (as some have said already in other ways) is perhaps less one of statistics and more one of decision making, especially which endpoints to assess and how to weigh risks and benefits when a trial does not collect data on all endpoints that are relevant from a stakeholder’s perspective. The FDA is well aware of these distinctions, and in the most recent annual forums of the Medical Device Innovation Consortium a member facilitated a discussion of how patients could inform such discussions and even help by participating in study design issues like this one. They illustrated with a project in Parkinson’s disease.


Thank you, I found all your comments very interesting, even though I admit that my question was more philosophical than statistical. I thank you for that.
Most of my fellow oncologists understand and accept the possibility of such trials, but the fact is that the sponsors’ statisticians and other team members do not agree. I think people do not want to admit the possibility of transient harm during some time periods. However, this is not pragmatic, because trade-offs happen in oncology all the time. We treat many patients to benefit some of them.
So, by way of summary, I think I have learned the following from you:

  • In the case of crossing curves it may not be a bad idea to take a step back and return to phase II studies in order to identify the population most likely to benefit.
  • This might not solve the problem in all cases. Phase III could again show fluctuations in hazard rates between groups.
  • Ultimately patients should be involved in the design of such studies. For example, I would agree to be randomized in a study in which I could lose two months of median survival but gain a 15% probability of long-term survival. However, this is debatable and would require some sort of discussion with patients.
  • As far as statistical tests are concerned, I am not sure that tests from the Fleming-Harrington family are ideal for crossing curves. The problem I see is that here the effect is not only late, but is reversed in the initial phase. Maybe Renyi versions could be conceptually better?
  • Finally, I believe that if it is impossible to pin down the prognostic and predictive factors of immunotherapy and, in spite of all efforts, the curves stubbornly cross, the only possible endpoint for me is the milestone survival rate at a late point on the KM curve (24 months can be fine for some tumors). One author (Tai Chen) proposes two co-primary endpoints: the length of OS and the milestone OS rate at an intermediate point (as a surrogate). However, it seems to me that a later milestone OS rate may be sufficient as a single endpoint. I think Professor Harrell also suggested using two tests, one of them specific to non-PH. Would the FDA/EMEA accept the 2-year milestone OS rate as the endpoint in a clinical trial if the preconditions are met? I don’t know, because I haven’t really seen many studies that have employed this design. But I don’t see why not.

In any case, thank you all very much for your comments.


For example, 40% vs 25% OS rates at two years… some of them may be long-term survivors…
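
To make those numbers concrete, here is a minimal sketch of a milestone comparison: Kaplan-Meier estimates at a fixed time point with Greenwood variances, compared with a simple z-test. The function names, the 100 patients per arm and the follow-up times are all assumptions for illustration, not data from any trial.

```python
import math

def km_at(times, events, t_star):
    """Kaplan-Meier survival estimate at t_star and its Greenwood variance.

    times  : follow-up times
    events : 1 = death, 0 = censored
    """
    data = list(zip(times, events))
    s = 1.0
    greenwood = 0.0  # running sum of d / (n * (n - d))
    for t in sorted({ti for ti, e in data if e == 1 and ti <= t_star}):
        n = sum(1 for ti, _ in data if ti >= t)                # at risk at t
        d = sum(1 for ti, e in data if ti == t and e == 1)     # deaths at t
        s *= 1.0 - d / n
        if n > d:
            greenwood += d / (n * (n - d))
    return s, s * s * greenwood

def milestone_z_test(arm_a, arm_b, t_star):
    """Difference in KM survival at t_star (arm_a minus arm_b), z and two-sided p."""
    s_a, v_a = km_at(arm_a[0], arm_a[1], t_star)
    s_b, v_b = km_at(arm_b[0], arm_b[1], t_star)
    diff = s_a - s_b
    z = diff / math.sqrt(v_a + v_b)
    p = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p

def toy_arm(n_deaths, n_alive):
    """Hypothetical arm: deaths spread over months 1-20, survivors censored at month 30."""
    times = [(i % 20) + 1 for i in range(n_deaths)] + [30] * n_alive
    events = [1] * n_deaths + [0] * n_alive
    return times, events

experimental = toy_arm(60, 40)  # 40% alive at 24 months
control = toy_arm(75, 25)       # 25% alive at 24 months
print(milestone_z_test(experimental, control, t_star=24))
```

With no censoring before the milestone, the Kaplan-Meier estimates are just the observed proportions (0.40 and 0.25, a difference of 0.15). Note that this comparison ignores everything that happens before and after month 24, which is part of why the endpoint is debated.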

Some immunotherapies in some cancers have a delayed treatment effect. Some recent protocols take that into consideration when calculating the sample size and writing the analysis plan, usually by assuming HR = 1 for a period of time (e.g. 9-12 months) and proportional hazards after that period. My guess as to why this was not considered in some published studies is that a) it was not expected when the trial was designed for that specific drug and cancer (only some have a delayed treatment effect), or b) those trials were designed years ago, before this was known, and maybe the analysis plan could not be changed after seeing the data without putting the FDA submission at risk. It is just a guess.
I raised this question when designing a study recently but the CPI said that it is not expected for the specific immunotherapy and cancer of that study.
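
The delayed-effect assumption described above (HR = 1 for an initial period, proportional hazards afterwards) can be sketched with a piecewise exponential model. This is a minimal illustration; the function names, the 12-month control median, the 9-month delay and the late hazard ratio of 0.5 are all hypothetical values, not taken from any protocol.

```python
import math

def survival_control(t, base_hazard):
    """Survival probability at time t under a constant hazard (exponential model)."""
    return math.exp(-base_hazard * t)

def survival_delayed_effect(t, base_hazard, hr_late, delay):
    """Survival at time t when the experimental hazard equals the control hazard
    up to `delay` (HR = 1) and is base_hazard * hr_late afterwards."""
    early = min(t, delay)        # time spent in the no-effect period
    late = max(t - delay, 0.0)   # time spent after the effect kicks in
    return math.exp(-base_hazard * (early + hr_late * late))

base_hazard = math.log(2) / 12   # hypothetical: control median survival of 12 months
for month in (6, 9, 12, 24, 36):
    s_c = survival_control(month, base_hazard)
    s_e = survival_delayed_effect(month, base_hazard, hr_late=0.5, delay=9)
    print(month, round(s_c, 3), round(s_e, 3))
```

Under these assumptions the two curves are identical for the first 9 months and separate only afterwards, so a standard log-rank test spends many of its events on a period where no difference is expected. This sketch covers a delayed effect only; it does not model the early harm (curves crossing) discussed earlier in the thread.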


Indeed, immune checkpoint inhibitors have a delayed treatment effect that is consistently seen both in preclinical models and in clinical trials. The Breakthrough documentary very vividly describes how this whole immunotherapy field was almost shut down because traditional oncology trial designs did not take this into consideration, and how tough it was to change the statistical analysis paradigm within industry and the FDA to account for what was seen in mouse models and eventually in humans.


It’s part of the explanation, surely, but I’m not convinced that what you’re saying is the whole explanation. I think there are also a number of deeper problems:

  1. The most widely used endpoint is overall survival, and there is no consensus on what value to give to the percentage of survivors at fixed time points. Thus, some authors are reluctant to use the milestone survival rate if it is not clear that it is a good surrogate for overall survival, or they propose two co-primary endpoints, one being overall survival and the other the OS rate at a fixed time point, spending two degrees of freedom in the analysis.
  2. However, when the milestone survival rate is evaluated at time points farther from the start, for example at 2-3 years, one can speak of the percentage of long-term survivors. I believe that not enough thought has been given to using this endpoint in studies, and that at present it is rejected by agencies and other stakeholders without a truly critical analysis. There may be several reasons for this. One is that so far we have been concerned with increasing patients’ survival a little, whereas immunotherapy for the first time offers the opportunity to give long survival to a few. This advance, however, has not yet brought a change of mentality about the use of endpoints. If a trial were proposed today with the 2-year OS rate as the only primary objective, it would possibly be rejected, and that rejection may or may not be well founded.
  3. I think sponsors, and the guys who pay, are horrified to acknowledge the plain truth: drugs do not benefit everyone equally, they are even harmful for some, and their effect is not constant over time. This is so controversial that they don’t even want to see it in a protocol, even though it’s the inescapable truth. What they want is a simple measure, such as the hazard ratio, which is comforting even if it is false in these scenarios.

The issue remains far from resolved. People still don’t grasp the importance of increasing the % of long-term survivors, and designs continue to focus on increasing the length in months of life by a small amount. The documentary seems very interesting.

Yes, it is worth watching. BTW, I believe the correct term is 2-year OS probability and not rate. Hazards are rates and can exceed 1.0, whereas the milestone events you are referring to cannot. Good discussion here.
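
A minimal sketch of that distinction, assuming a constant hazard (exponential survival) purely for illustration; the function name and the hazard value are hypothetical:

```python
import math

def milestone_probability(hazard_per_year, years):
    """Survival probability at `years` under a constant hazard (exponential model)."""
    return math.exp(-hazard_per_year * years)

# A hazard of 1.5 deaths per person-year is a perfectly valid rate (> 1.0) ...
p_two_years = milestone_probability(1.5, 2)
# ... while the corresponding 2-year OS probability, exp(-3), stays between 0 and 1.
print(round(p_two_years, 4))
```

The rate depends on the time unit (1.5 per year is 0.125 per month), whereas the 2-year probability does not.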

There is a curious example in the literature, the ADAPT trial. In April its closure was announced. It fell short, they said.

The only thing that fell short was the follow-up. In September the researchers, begging at every door, got it kept open to evaluate the long-term effect, given that 50% of the patients were still alive.

I’m not sure about the current status of this trial because I’m involved in other tumors. NCT01582672 seems to be a study closed for lack of effectiveness, but I’m not sure, because it is also said to be still open.
It would be interesting to follow this story.

That trial is closed for enrollment but there are still patients alive who received that regimen in our department. It is a different approach than immune checkpoint therapy, and can be further developed in the future using either AGS-003 (rocapuldencel-T) or similar agents.

And why would the OS rate at a remote point, perhaps 2-3 years or more, not have been a good primary endpoint in this study? The Cox model seems to be a bad option there…

There are limitations (previously discussed here and here) with using 2- or 3-year survival probabilities as primary endpoints. Proportional hazards, while traditionally useful in chemotherapy trials, should indeed be a questionable assumption in immunotherapy research. There are multiple factors to consider when analyzing data from such RCTs, including challenges with causal inferences. In most cases, what’s needed is good statistical modeling using reasonable assumptions. I wish it was always as easy as comparing survival probabilities at fixed time points. Overall survival inferences, for example, are affected by subsequent therapies and this can confound inferences. See here for the related concept of dynamic treatment regimes (DTRs). We are currently analyzing a DTR-based renal cell carcinoma trial where patients were randomized for both first-line and subsequent therapies and even that is not an easy task inferentially. The need for good modeling cannot be overemphasized.

In my opinion there is a nuance. In the case of lymphoma, what is proposed is the use of a surrogate variable, the rate of progression at 2 years, for a disease that has a much longer survival (much longer than 2 years). The goal there is to use an intermediate endpoint so as not to follow all patients for so long. This is controversial indeed, probably wrong. But that has absolutely nothing to do with using long-term survival rates that are proportionate to the realistic life expectancy for the disease. Therapies given after the first line may indeed be a problem in some cases, but not in aggressive neoplasms with <10% of survivors at 2-3 years. It is an issue that can carry weight in indolent tumors or those with multiple treatment options, but it is not always important in orphan tumors.
I do not understand why you do not agree with milestone survival rates at a remote time point as a possible valid primary endpoint. Maybe it’s because of my ignorance, but I cannot conceive of a more reasonable endpoint in some specific scenarios.