Progression in cancer trials

I would love to start a conversation regarding analysis of progression in cancer studies. I can see a few issues with the types of measure used and how they are analysed. For a little background, a typical study in my area might randomise participants to treatment or placebo; at fixed points, participants are scheduled to have scans to assess their lesions (e.g., at baseline, and 6 and 12 weeks post-randomisation). If the treating clinician suspects progression of the cancer (e.g., due to worsening symptoms), they may carry out an ad hoc scan as well/instead, as per their routine practice. Lesions on a scan are assessed using RECIST. This involves measuring up to 5 lesions (as a measure of tumour burden, with the argument that more than 5 contributes little extra information). They are measured according to a set of guidelines but basically end up as a sum of uni-dimensional measures of the lesions. In follow-up scans, the process is repeated and response, stable disease, or progression is determined based on certain rules relating to either baseline or the nadir (previous lowest sum) of measurements (e.g., at least a 30% decrease in the sum from baseline counts as partial response).
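To make the categorisation rules concrete, here is a minimal sketch of the target-lesion part of RECIST 1.1 as I understand it: partial response is at least a 30% decrease from baseline, and progression is at least a 20% increase from the nadir together with an absolute increase of at least 5 mm. The function names are my own, and the sketch deliberately ignores non-target lesions, new lesions, the lymph node rules, and any confirmation requirement, so it is not a full RECIST implementation.

```python
def target_response(baseline_sum, nadir_sum, current_sum):
    """Classify one follow-up scan from sums of target-lesion diameters (mm)."""
    if current_sum == 0:
        return "CR"  # all target lesions have disappeared
    growth = current_sum - nadir_sum
    # Progression is judged against the nadir (smallest sum so far), not baseline.
    if growth >= 5 and growth >= 0.20 * nadir_sum:
        return "PD"
    # Partial response is judged against baseline.
    if current_sum <= 0.70 * baseline_sum:
        return "PR"
    return "SD"


def classify_scans(sums):
    """sums[0] is the baseline sum; later entries are follow-up scans in order."""
    baseline = nadir = sums[0]
    calls = []
    for current in sums[1:]:
        calls.append(target_response(baseline, nadir, current))
        nadir = min(nadir, current)  # the nadir includes baseline if that is smallest
    return calls


print(classify_scans([100, 65, 60, 75]))
```

In this made-up series the last scan (75 mm) is called progression because it is 15 mm above the nadir of 60 mm, even though it is still 25% below baseline. That is exactly the kind of information loss the categorisation introduces.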

Issue number 1 (if you’ve not picked up any other issues so far!) is the categorisation of a continuous measurement and the use of percentage change from baseline. Issue 2 is that progression is often modelled using Cox regression (or perhaps other time to event analyses), even though we don’t know the precise time someone progressed (we just know it lies between two scans). This second issue (and I’m sure other related sub-issues) could perhaps be overcome through the use of interval censoring methods (I confess I am not overly familiar with them). But perhaps issue 2 is a distraction anyway, as it is only present due to issue 1. For the first issue, my concern is that clinicians use progression to make treatment decisions, e.g., sufficient growth of a tumour would suggest treatment is not working for the patient and will be stopped at this point (and I think may be related to treatment resistance - but note they generally do not rely on RECIST to make a judgement! Though lesion changes will usually play some role). Clinicians may therefore be hesitant to stray from an outcome that aligns closely with their thinking and understanding of benefit (with apologies to any clinicians who do not think like this!).
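To illustrate what interval censoring means for issue 2, here is a rough sketch of how the scan history translates into an interval for the progression time, and what one patient then contributes to a likelihood. The function names are hypothetical and the exponential model is purely an assumption for illustration; a real analysis would use dedicated interval-censoring software and a more flexible model.

```python
import math


def progression_interval(scans):
    """scans: list of (week, progressed) pairs, sorted by week.

    Returns (lower, upper): progression happened after `lower` and by `upper`.
    upper is infinity if no scan showed progression (right-censored).
    """
    lower = 0.0  # randomisation
    for week, progressed in scans:
        if progressed:
            return (lower, week)
        lower = week  # last scan known to be progression-free
    return (lower, math.inf)


def exponential_loglik(rate, intervals):
    """Log-likelihood under an assumed constant hazard `rate` per week.

    Each patient contributes S(lower) - S(upper), with S(t) = exp(-rate * t).
    """
    total = 0.0
    for lower, upper in intervals:
        total += math.log(math.exp(-rate * lower) - math.exp(-rate * upper))
    return total


patients = [
    [(6, False), (12, True)],   # progressed somewhere between weeks 6 and 12
    [(6, False), (9, True)],    # ad hoc scan at week 9 narrows the interval
    [(6, False), (12, False)],  # no progression seen: censored at week 12
]
intervals = [progression_interval(p) for p in patients]
print(intervals)
print(exponential_loglik(0.05, intervals))
```

The point of the sketch is that the scan date is only an upper bound on the progression time. A standard Cox analysis treats that upper bound as the event time, whereas the interval-censored likelihood uses the whole interval.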

My thinking is modelling the tumour size may have better properties (assuming all along that progression is a useful measure, proxy to overall survival or not). However, I see two obvious potential problems: 1) saying 10% more people responded is probably more meaningful than saying one group had an average lesion sum that was 12mm smaller than the other. The second problem is how death would be handled - currently progression is usually a composite of progression or death, which I am certainly not claiming is perfect (or even generally sensible). But I believe the thinking is that death may be a missed progression (which speaks to interval censoring methods again), and, being honest, is a simple way to handle it…

I would love to hear any thoughts, and particularly from clinicians if I have done you any disservice!

If you want more info on RECIST, please see


I am so glad that you started this topic and raised several issues that have bugged me for some time. The current approach of emphasizing progression-free survival and of dichotomizing progression has major problems including requiring much larger sample sizes to achieve power or precision.

I think that many problems with clinical trial endpoints start with being unfaithful to the raw data. The raw data consist of weekly patient status: vital status and degree of tumor burden. This naturally leads to the use of a multistate transition model. A model that also respects the ordinal nature of the progression from no tumor burden through mild and moderate burden to death is an ordinal longitudinal state transition model. See this video for my thoughts on this, then see this.

Once you make maximum use of raw data, including allowance for a tumor shrinking below some “threshold” then growing again, you can use an ordinal longitudinal model to estimate all kinds of clinical readouts, e.g.

  • Pr(tumor burden < y and alive at time t | Tx, baseline covariates X), for any y
  • Expected time with burden < y and alive | Tx, X
  • Expected time with burden ≥ y or dead
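As a rough sketch of how readouts like these fall out of a state transition model, the code below takes weekly transition probabilities between four ordered states as given and computes state occupancy probabilities over time. Everything here is an assumption for illustration: the states, the two transition matrices, and the time-homogeneous first-order Markov structure are made up. In an actual ordinal longitudinal model the transition probabilities would come from a fitted ordinal regression conditioning on previous state, time, treatment, and covariates.

```python
# Ordered states: 0 = no measurable burden, 1 = mild, 2 = moderate, 3 = dead.
DEAD = 3

# Hypothetical weekly transition probabilities (row = current state).
P_CONTROL = [
    [0.85, 0.10, 0.04, 0.01],
    [0.05, 0.75, 0.17, 0.03],
    [0.01, 0.07, 0.84, 0.08],
    [0.00, 0.00, 0.00, 1.00],  # death is absorbing
]
P_TREATED = [
    [0.90, 0.07, 0.02, 0.01],
    [0.12, 0.75, 0.11, 0.02],
    [0.02, 0.13, 0.79, 0.06],
    [0.00, 0.00, 0.00, 1.00],
]


def step(dist, P):
    """Advance the state distribution by one week."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def occupancy(start, P, weeks):
    """State occupancy probabilities at weeks 0, 1, ..., weeks."""
    dists = [start]
    for _ in range(weeks):
        dists.append(step(dists[-1], P))
    return dists


def prob_below(dist, y):
    """Pr(burden state < y and alive); requires y <= DEAD."""
    return sum(dist[:y])


def expected_weeks_below(dists, y):
    """Expected number of weeks (1..T) spent alive with burden state < y."""
    return sum(prob_below(d, y) for d in dists[1:])


start = [0.0, 0.0, 1.0, 0.0]  # everyone starts with moderate burden
for label, P in [("control", P_CONTROL), ("treated", P_TREATED)]:
    dists = occupancy(start, P, weeks=12)
    print(label, prob_below(dists[12], 2), expected_weeks_below(dists, 2))
```

Because the calculation works on occupancy probabilities rather than first-passage times, a tumor that shrinks below the threshold and later regrows is counted correctly in both directions.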

As a practicing oncologist myself I concur with just about everything noted in the above two posts. See also some additional considerations in other threads in this forum here and here.

Certainly the limitations of RECIST provide plenty of research opportunities for methodologists. We made a small contribution here and have another one in the pipeline addressing more of the points made above by @sme.


Been awhile since I read it, but it might be of interest as a contribution from the Mathematical Oncology community:


Thanks all for your insights! Although a statistician, I am very much at the applied end and can see I have a lot to get my head around!

@Pavlos_Msaouel - I will have to spend some time on your phase I/II study as there are a lot of parallels with the work I am involved in (mostly oesophageal cancer). There are many components of the design that I’ve never used before, but it certainly looks like a really nice way of approaching the problem. I’m intrigued that you still used the RECIST classifications rather than the raw measurements of lesions (accepting RECIST also incorporates non-measurable lesions and new lesions) - do you recall if there was much discussion about this in the design stage? And (apologies, I suspect this is a stupid question) where do the utilities come from? Were they based solely on clinical input?

@f2harrell I also need to spend more time absorbing the information in your links, but it certainly looks like there is great scope for improving what we do. As per Pavlos’s example trial, we tend to have longer gaps between patient assessments (we’re usually talking 6 weeks to 6 months between assessments of lesions) compared to some of your examples, but I imagine the benefits of the methods you propose still hold (especially as, also as captured in Pavlos’s design, there is the possibility of toxicity or worsening symptoms at any point in time). Though I confess I’m not yet sure how ad hoc scans (occurring between the scheduled scans, and that may mean no further scans are required) would be incorporated, but hopefully this will become clearer as I go.

@davidcnorrismd - thanks as well; this paper perfectly captures something that I have (in a far less sophisticated way) been bothered by.

At a basic level, it seems like we could take a step forward by modelling RECIST categories, considered as an ordinal variable that incorporates death (and perhaps symptoms and/or toxicity - even if these are reduced to a simplified binary measure to get me started!). I would still like to explore if the raw measurements of lesions could be used rather than RECIST categories somehow, so would be interested to hear if anyone has tried this.

Thanks again!

Yup. This was all discussed before that project was started and just about every point on this thread was noted plus one more practical point: that using the raw measurements actually requires fewer steps since we need those to estimate RECIST.

We chose to use RECIST as a practical trade-off because the focus of the paper is not to replace them. It is more efficient to focus and explore a few key changes. In fact, readers will notice other topics as well that could be improved, some we typically note in the discussion section.

Good question. Utilities are based on subjective trade-offs and come from clinicians (typically myself in our methodology papers). Choosing a dose is a decision and decisions always require incorporating subjective loss/benefit functions as we describe here.

Here is a practical example where we used such a function in an actual trial. Subsequent designs instead use utilities as a more efficient and transparent approach to encode these trade-offs.
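As a toy illustration of how a utility table can encode such trade-offs, the sketch below assigns a utility to each joint (response, toxicity) outcome and picks the dose with the highest expected utility. All numbers, dose labels, and names are hypothetical, and the outcome probabilities are treated as known. In an actual design they would be posterior quantities updated as the trial accrues data.

```python
# Hypothetical elicited utilities (0 = worst, 100 = best) for each joint outcome.
UTILITY = {
    ("response", "no_toxicity"): 100,
    ("response", "toxicity"): 50,
    ("no_response", "no_toxicity"): 40,
    ("no_response", "toxicity"): 0,
}

# Hypothetical joint outcome probabilities at each dose (each set sums to 1).
OUTCOME_PROBS = {
    "low": {
        ("response", "no_toxicity"): 0.20,
        ("response", "toxicity"): 0.05,
        ("no_response", "no_toxicity"): 0.65,
        ("no_response", "toxicity"): 0.10,
    },
    "mid": {
        ("response", "no_toxicity"): 0.35,
        ("response", "toxicity"): 0.10,
        ("no_response", "no_toxicity"): 0.40,
        ("no_response", "toxicity"): 0.15,
    },
    "high": {
        ("response", "no_toxicity"): 0.35,
        ("response", "toxicity"): 0.25,
        ("no_response", "no_toxicity"): 0.20,
        ("no_response", "toxicity"): 0.20,
    },
}


def expected_utility(probs):
    """Probability-weighted average of the utilities over the joint outcomes."""
    return sum(p * UTILITY[outcome] for outcome, p in probs.items())


expected = {dose: expected_utility(p) for dose, p in OUTCOME_PROBS.items()}
best_dose = max(expected, key=expected.get)
print(expected)
print(best_dose)
```

With these made-up numbers the highest dose adds toxicity without adding response, so its expected utility falls just below that of the middle dose. Changing the utilities changes the decision, which is why making them explicit is the transparent part.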

I think it is a very intuitive and efficient idea. We have a methodology manuscript to be submitted soon that directly uses raw measurements and does not care about RECIST. I don’t think we will focus on comparing the efficacy of the raw measurements versus RECIST. We simply went with the raw data because it makes the most sense (based on everything discussed above) and because for this project we needed to reliably connect longitudinal measurements with survival outcomes.

Thus, I think methodology folks should go for this comparison. In my mind this was always at least one interesting potential PhD project for a statistician graduate student. Probably more than one as there is a lot to say and explore.


I love this whole discussion, which solidifies my belief that analysis should stay as close to raw data as possible in almost all circumstances.

I don’t have much to add to the interesting discussion evolving here, just pointing to an earlier discussion of why we generally use progression-free survival instead of time to progression. It’s not making any assumptions about the cause of death, only that death is also a bad outcome (which can bias assessments of time-to-progression).

I’m a pharmacometrician and I have worked with the raw RECIST data in a number of my studies. I love this discussion. Here are some big challenges.

What we care most about is overall survival and quality of life. I’ll focus on overall survival, since that’s the most straightforward to measure. Often, progression free survival and the RECIST variables are either not good surrogates for overall survival, or we just don’t know whether they are good surrogates or not. Furthermore, in many cases, the therapy the patient receives post-progression greatly impacts overall survival, but this data is usually not collected. Buyse et al [1] cover these issues in their recent review, and I quote below.

“Even in the best-case scenario where a meta-analysis of randomized trials addressing a specific therapeutic question can be conducted to test trial-level surrogacy, the results may not apply in a future trial testing a different question, for instance, the effects of a new drug with a novel mechanism of action, since the direct and indirect effects of such a drug on survival may be substantially different than with historical drugs. The increasing availability of active treatments after observation of the surrogate may also negatively impact trial-level surrogacy. For example, in patients with advanced colorectal cancer, PFS was a reasonable surrogate for survival when fluorouracil-based therapies were the only available second-line treatments: the trial-level coefficient of determination estimated from 10 randomized trials conducted in 1744 patients was R² = 0.98 (95% confidence interval [CI]: 0.88 to 1.00). In contrast, a more recent meta-analysis of 22 trials conducted in 16,762 patients found a much lower trial-level coefficient of determination R² = 0.46 (95% CI: 0.24 to 0.68). Note that the confidence intervals around R² can be wide, which implies that substantial uncertainty will typically affect predictions based on surrogates.”

Another issue to keep in mind is that it’s the appearance of a new lesion that is often the most predictive of overall survival (see some of our work for example [2,3]) but unfortunately this is a binary endpoint with limited power. A joint analysis of all the data as @f2harrell suggests (including vitals, etc.) is appealing, but then to use this model in a predictive fashion, it’s important to understand how these coefficients vary across indication and over time as the therapeutic landscape changes. I believe addressing these questions requires a collaborative effort pooling a lot of data. The FDA would be well positioned to explore these issues, and they scratched the surface in 2009 [4], though much has changed since then. Perhaps with databases like Project Data Sphere [5] it would be possible today to better understand this issue. It’s something I continue to explore in my own research and I’d be happy to discuss further. Thank you for raising this interesting topic!

  1. Buyse, Marc, et al. “Surrogacy beyond prognosis: the importance of “trial-level” surrogacy.” The Oncologist 27.4 (2022): 266-271.

  2. Stein, Andrew, et al. “Survival prediction in everolimus-treated patients with metastatic renal cell carcinoma incorporating tumor burden response in the RECORD-1 trial.” European urology 64.6 (2013): 994-1002.

  3. Mietlowski, William Leonard, et al. “Clinical importance of including new and nontarget lesion assessment of disease progression (PD) to predict overall survival (OS): Implications for randomized phase II study design.” (2012): 2543-2543. Publications/Mietlowski12_ASCOposter_TargNontargNew_OS.pdf at main · iamstein/Publications · GitHub

  4. Wang, Y., et al. “Elucidation of relationship between tumor size and survival in non‐small‐cell lung cancer patients can aid early decision making in clinical drug development.” Clinical Pharmacology & Therapeutics 86.2 (2009): 167-174.

  5. Project Data Sphere


The Buyse et al article is a major step in the right direction. Notice for example how nicely they use graphs to express their assumptions about the data generating processes connecting these endpoints. However, the field still needs to evolve into deeper thinking of what these endpoints estimate and how they connect with the random treatment allocation procedure in RCTs. This can have tremendous implications in oncology and beyond. Various stakeholders are gradually beginning to recognize this and there will be more discussions in the coming years.

Very nice work on the everolimus RCC cohort, and it is definitely related to the original post in this thread. Your team should keep building on these ideas!


Interesting discussion and don’t forget those non-target lesions that are recorded qualitatively. I’m keen on trying out the multi-state modelling approach - state for new lesion and then different states for the non-target lesions and then using tumour size from target lesions as a time-dependent covariate. It would be nice to just stick with the raw data as is and not process it in any way. In a lot of my studies death is usually a rare progression event but clearly it can be a state in the multi-state model. Where I start getting heart palpitations is when we start correlating to overall survival…

There is a plethora of post-progression treatments patients get and nobody knows how the disease you have created in a previous line with a new drug is going to react to post-progression treatments you know even less about. I recall a Genentech study for Atezolizumab in 2nd line NSCLC where Docetaxel was the comparator. They documented the post-progression treatments and it was incredible what people got in the two arms. Interestingly it was possible to get Docetaxel twice - which I was never aware of. So I shy away from OS because of these issues.


Those who get palpitations when thinking about overall survival estimates truly understand the problem. Ignorance is bliss.


There is a lot to explore here, location of lesions etc. and one I’ve wondered about… say you have a lesion in the liver - we collect liver enzymes longitudinally - so are they providing a signal as to how the liver lesions are doing between imaging visits?

I’m super keen to see your new methods paper on handling all the different imaging aspects.

Yup, although the liver metastases would have to be very extensive for liver enzymes to start becoming elevated, at which point often patients do not have many more options and decision-making paths are constrained. But other non-imaging variables measured longitudinally in-between images and/or concurrently with the images that could plausibly influence outcome heterogeneity and thus be included as predictors along with the raw tumor burden data could be: 1) other blood-based markers like cancer serum markers such as CA-125, circulating tumor cells, circulating tumor DNA etc; 2) quality of life / patient reported outcomes.


Building on Dr Msaouel’s response about utilities, a health economist’s perspective is that utilities should be defined according to the perspective of the analysis; not prospective/retrospective as defined in epidemiology; but societal, health systems, patient, or other perspective of analysis, defined according to whose preferences health outcomes reflect. Economic evaluation textbooks provide discussion. Both Drummond et al. and Neumann et al. indicate that preferences from the general public are usually used in analysis, providing contrasting rationale that authors have used to defend general public versus patient-elicited values. Neumann et al. recommend those from the general public, recognizing situations where patient preferences are appropriate.

The Global Burden of Disease (GBD) Study is an example of the logistical feasibility versus theoretical challenges of measuring patient or population perspective utility, to the extent that disability and utility are similar concepts. Airoldi and colleagues show why disability and utility aren’t simply converse numbers, but that is less relevant to this point. Anand and Hanson challenged the process of using experts to define utilities for early versions of GBD estimates, which led to revised data collection methods described by Salomon et al. for subsequent versions. I don’t know what the causal factors were behind the change but think that decisions about what source to elicit utility data from should consider perspective, feasibility, biases, and study aim; recognizing that societal perspective is still advocated where possible.

-Drummond MF, Sculpher MJ, Claxton K, Stoddart GL, Torrance GW. Methods for the economic evaluation of health care programmes. 4th ed. Oxford: Oxford University Press; 2015.
-Neumann PJ, Sanders GD, Russell LB, Siegel JE, Ganiats TG. Cost-effectiveness in health and medicine. 2nd ed. New York: Oxford University Press; 2017.
-Airoldi M. Gains in QALYs vs DALYs averted: the troubling implications of using residual life expectancy. London: London School of Economics; 2007. Report No.: 0853280010.
-Airoldi M, Morton A. Adjusting life for quality or disability: stylistic difference or substantial dispute? Health Econ. 2009;18(11):1237-47.
-Anand S, Hanson K. Disability-adjusted life years: a critical review. J Health Econ. 1997;16(6):685-702.
-Salomon JA, Vos T, Hogan DR, Gagnon M, Naghavi M, Mokdad A, et al. Common values in assessing health outcomes from disease and injury: disability weights measurement study for the Global Burden of Disease Study 2010. Lancet. 2012;380(9859):2129-43.


I have a somewhat overlapping question regarding the prediction of pathological response (as per RECIST) - it may also be related to @Pavlos_Msaouel’s thread here.

Suppose you demonstrate that a biomarker accurately predicts pathological response to cancer treatment in a prospective cohort study.

How do you know if this biomarker is prognostic or predictive?

In other words, how do you tell apart prognosis from (relative) treatment effect?

Any reason to see the prediction of pathological responses “closer” to the prediction of treatment effects?

The resulting paradox is:

(a) if predicting response is mostly predicting (relative) treatment effect, then patients with a high probability of response should undergo the proposed therapy.

(b) if predicting response is mostly predicting prognosis, then patients with a low probability of response should undergo the proposed therapy. (worse prognosis → higher absolute effects)

Perhaps it’s just language, but I can’t help feeling (b) doesn’t make much sense. I think an ideal workflow would include testing an interaction term for relative efficacy in an RCT. In practice, it feels like biomarkers predictive of response found in observational studies are then tested in animal models to get a better idea about relative treatment effects – so the factor separating (a) and (b) would be biological understanding.
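The arithmetic behind the parenthetical in (b) can be sketched as follows, assuming (purely for illustration) that treatment acts through a constant odds ratio on the risk of a bad outcome regardless of baseline prognosis. The function names and the odds ratio of 0.5 are hypothetical.

```python
def treated_risk(baseline_risk, odds_ratio):
    """Risk of the bad outcome under treatment, for a constant odds ratio."""
    odds = baseline_risk / (1 - baseline_risk) * odds_ratio
    return odds / (1 + odds)


def absolute_risk_reduction(baseline_risk, odds_ratio):
    """Untreated risk minus treated risk."""
    return baseline_risk - treated_risk(baseline_risk, odds_ratio)


# Same relative effect, different baseline prognosis.
for baseline in (0.10, 0.30, 0.50):
    print(baseline, round(absolute_risk_reduction(baseline, 0.5), 3))
```

With the relative effect held fixed, moving from a 10% to a 50% baseline risk makes the absolute benefit several times larger, which is the sense in which a marker of poor prognosis alone could favour treating. The pattern is not monotone all the way up: at very high baseline risks the absolute reduction shrinks again.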


Good question. Our recommendation here, motivated by decades of literature on the topic, is to flip the narrative and define “prognostic” and “predictive” effects based on underlying causal considerations. This is how we nowadays model our analyses. The purely “statistical” definition of multiplicative interactions on the linear scale as “predictive” is a red herring and often produces noise.


Thanks! That makes much more sense. I’m halfway through the Cancers paper (I was just reading your related EUO editorial) and might come back with thoughts and questions if you don’t mind. I particularly like the way you framed the relationship between biological knowledge and treatment effect transportability. My biased perspective is that this is one of the important messages to improve the translation of omics/high-D research to clinically useful information.

Exactly. This is a major motivation why we started framing things in these ways.

More than happy to discuss with anyone who has the fortitude to read that monster Cancers paper. There is another similarly long one coming up, hopefully after this grant writing season ends. Never say never, but I do not plan to write such long manuscripts again for the foreseeable future.

We should have insisted on accurate terminology when prognostic and predictive were first used in this context. Predictive should not be used to pertain to effect modification in my view. Its definition goes back much earlier in time. I like the term differential treatment effect.

A side note: We should not speak about “predictive” unless using data from a well-sized clinical trial. Observational data are seldom capable of providing an accurate average treatment effect, much less a differential effect.