Critique of paper on generalizability of oncology trials

Orcutt, Chen, Mamtani, Long, and Parikh recently published Evaluating generalizability of oncology trial results to real-world patients using machine learning-based trial emulations in Nature Medicine, 2025. The paper is motivated by misunderstandings about generalizability in RCTs and incorporates some problematic methodology.

The paper is motivated by the following.

Approximately one in five real-world oncology patients are ineligible for a phase 3 trial. However, restrictive eligibility criteria alone are unlikely to fully explain the generalizability gap. A study simulating various eligibility criteria combinations in advanced non-small cell lung cancer (aNSCLC) trials found little variation in survival hazard ratios (HRs) for treatment. This suggests that other factors may be at play. An alternative explanation is that physicians selectively recruit patients with better prognoses … Consequently, real-world patients likely have more heterogeneous prognoses than RCT participants.

Aside from the high frequency of the problematic term “real-world patients” throughout the paper (why not just say clinical practice population?), this is all OK. But the explanation for this is simpler than the authors believe, the remedy need not be nearly as complex as what the authors did, and it may not even require any data beyond the high-quality standardized data already contained in the RCTs.

The core problems are

  1. RCTs use inclusion/exclusion criteria for a variety of reasons having little to do with generalizability. An RCT is not generalizable if prognostic factors interact with treatment, the interacting factors are not included in the model used to analyze the RCT data, and the distribution of interacting factors found in clinical practice differs from the distribution within the RCT. For detailed treatment interaction examples see this and for more discussion about generalizability see this.
  2. Other than the special case where linear models are used (e.g., when the outcome variable is systolic blood pressure), RCTs are not designed to estimate absolute effects; they are designed to estimate relative effects such as hazard ratios (HRs) and odds ratios. Relative effects are capable of being constant over a wide spectrum of patients’ disease severity, age, and other prognostic factors. Witness the authors’ statement about little variation in HRs.
  3. Absolute effects such as differences in life expectancy, differences in restricted mean survival times (RMST), differences in cumulative incidence curves, and absolute risk reduction are incapable of being constant if there is any within-treatment patient heterogeneity in risk factors. Though commonly presented in papers without covariate adjustment, none of these quantities is suitable for marginal estimation. Absolute treatment benefits are small for very low-risk patients.
  4. Much of the variation in absolute treatment effects is explained by the simple math of risk magnification.

Risk magnification is the simple idea that patients who are at very low risk have little room for movement on an absolute scale (even though relative effects such as HRs can still be extreme for them). For example, absolute risk reduction must “slow down” as baseline risk approaches zero, otherwise the risk in the treated group would reach impossible negative values. Likewise there may be limited absolute movement in very high risk patients who are beyond reasonable hope. See this for the following graphs, which depict risk magnification for a proportional hazards survival model with HRs of 0.1, 0.2, …, 0.9 (top panel) and a nomogram for estimating patient-specific absolute risk reduction from a binary logistic model (bottom panel).
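To make the arithmetic concrete, here is a minimal sketch (mine, not taken from the linked graphs) of risk magnification under a proportional hazards model with a constant HR; the baseline risks and HR values are purely illustrative.

```python
import numpy as np

# Under proportional hazards, S_treated(t) = S_control(t) ** HR, so at a fixed time t:
#   ARR = risk_control - risk_treated = S0**HR - S0,  where S0 = 1 - risk_control.
baseline_risks = np.array([0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])  # illustrative control risks at time t

for hr in (0.5, 0.7):                      # illustrative constant hazard ratios
    s0 = 1.0 - baseline_risks
    arr = s0**hr - s0                      # absolute risk reduction at time t
    print(f"HR = {hr}")
    for r, a in zip(baseline_risks, arr):
        print(f"  baseline risk {r:.2f} -> ARR {a:.3f}")
# A constant HR translates into a negligible ARR near zero baseline risk, a large ARR in
# the middle, and a shrinking ARR again as baseline risk approaches 1.
```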

From an RCT that uses covariate adjustment, with the distribution of important covariates not being very narrow, one can estimate the entire distribution of absolute treatment effects over the enrolled patients, as done here. Given an alternate covariate distribution from the clinical population, one can use the RCT’s risk model to do likewise.

A key problem that could have been avoided, and which probably would have made the paper under discussion (and many others) unnecessary, is how RCT results are presented in the literature. Almost all authors make the serious statistical mistake of showing Kaplan-Meier survival curves by treatment group, without covariate adjustment. Like simple proportions, Kaplan-Meier estimates assume homogeneity of survival time distributions within treatment. Since RCT inclusion criteria never fix all the prognostic factors at a single value, all outcome distributions are heterogeneous and Kaplan-Meier estimates are not meant to apply. RCTs should have been routinely presenting graphs that show how the results translate into absolute patient benefit, for a variety of patient types included in the study. The RCT results can also include estimated absolute benefits for types of patients not included in the trial. These absolute benefit estimates (e.g., differences in RMSTs) will properly have wide confidence intervals due to extrapolation.

Here are two graphs related to these points, both taken from here. For a binary outcome of 30d mortality, a logistic regression model adjusted for clinically-relevant covariates found no evidence for differential treatment effects, i.e., interactions with treatment. For the logistic model, all patients in this randomized GUSTO-I study of t-PA vs. streptokinase for acute myocardial infarction had the treatment variables set to streptokinase, then to t-PA. Both estimates are shown in the graph below, where the number of patients at each tiny bin of (SK, t-PA) mortality risk is color coded. One can see that there was a huge number of low-risk patients (in yellow) in the 30,000 patient comparison, but there was also a high (with respect to typical RCT sample sizes) number of high-risk patients.

When the two treatments’ risk estimates are subtracted, we get the distribution of absolute treatment benefits below.
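The mechanics behind such a graph can be sketched as follows. This is not the GUSTO-I analysis itself; it uses simulated stand-in data with hypothetical covariates (age, severity), but the steps are the same: fit a covariate-adjusted logistic model, predict each patient’s risk with the treatment variable set to each arm, and subtract. The same fitted model could then be applied to a covariate sample from a clinical practice population.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
# Simulated stand-in data (not GUSTO-I): age and a severity score as covariates
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),                # 0 = SK, 1 = t-PA (illustrative coding)
    "age": rng.normal(61, 11, n),
    "severity": rng.normal(0, 1, n),
})
lp = -5 + 0.06 * df.age + 0.8 * df.severity - 0.2 * df.treat
df["death30"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

# Covariate-adjusted logistic model; a real analysis would use flexible terms (e.g., splines)
fit = smf.logit("death30 ~ treat + age + severity", data=df).fit(disp=0)

# Predict every patient's 30d mortality risk under each treatment assignment
risk_sk  = fit.predict(df.assign(treat=0))
risk_tpa = fit.predict(df.assign(treat=1))
benefit = risk_sk - risk_tpa   # patient-specific absolute risk reduction

print(benefit.describe())      # the distribution of absolute benefits over enrolled patients
# Replacing `df` with a covariate sample from the clinical practice population gives the
# corresponding distribution for that population.
```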

A nomogram for computing such differences was already presented above.

When RCTs are analyzed in line with their design, the need for papers such as Orcutt et al is lessened, and the needed results may not require any “real world” data supplementation at all. Imagine as standard results in an RCT paper (1) estimated survival curves by treatment for 5 representative covariate combinations, (2) a graph relating baseline prognosis to increase in RMST or decrease in 5y event incidence.
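As a sketch of item (2), assuming a proportional hazards model with an exponential baseline hazard and a constant HR of 0.7 (both illustrative, not from any trial), the gain in RMST as a function of baseline prognosis is obtained by integrating the two survival curves:

```python
import numpy as np

tau = 5.0                        # restrict mean survival time to 5 years
t = np.linspace(0.0, tau, 501)
hr = 0.7                         # illustrative constant hazard ratio

def rmst(surv):
    # trapezoidal area under the survival curve up to tau
    return float(np.sum(0.5 * (surv[1:] + surv[:-1]) * np.diff(t)))

# Baseline prognosis indexed by the control-arm hazard rate (illustrative exponential baseline)
for control_hazard in (0.02, 0.05, 0.10, 0.30, 0.60):
    s_control = np.exp(-control_hazard * t)
    s_treated = s_control ** hr  # proportional hazards
    gain = rmst(s_treated) - rmst(s_control)
    print(f"control hazard {control_hazard:.2f}: RMST gain {gain:.3f} years")
# The same constant HR yields a small RMST gain for very low-risk patients and a larger
# gain for sicker patients; the graph proposed in (2) is just this relationship, computed
# from the trial's own covariate-adjusted model rather than from an assumed baseline.
```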

Besides problematic motivation and incomplete understanding of generalizability of RCTs, there are problems in Orcutt et al’s analyses. First, consider “trial emulation stratified by phenotype”. As an aside, this is an improper use of the word emulation, since to emulate means to copy, and randomization is not being copied; the proper word is simulation. But more serious is the construction of artificial prognostic phenotypes by grouping patients into tertiles of estimated risk. Tertile grouping means that a researcher believes that a patient’s outcome and treatment benefit are a function of how many other patients are similar to her. This is not at all how risk operates. Risk operates on an individual patient basis (patients do not compete with each other) and should always be analyzed as a continuous variable. The authors’ results are replete with unexplained residual heterogeneity.
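A minimal sketch of the contrast being drawn, using simulated data and hypothetical variable names: entering an estimated risk score continuously (via a spline) versus collapsing it into tertiles.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 4000
df = pd.DataFrame({"treat": rng.integers(0, 2, n),
                   "risk_score": rng.normal(0, 1, n)})   # hypothetical estimated prognostic score
lp = -1.5 + 1.2 * df.risk_score - 0.4 * df.treat
df["event"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))
df["risk_tertile"] = pd.qcut(df.risk_score, 3, labels=["low", "mid", "high"])

# Risk entered continuously via a flexible spline basis (patsy's bs())
cont = smf.logit("event ~ treat + bs(risk_score, df=4)", data=df).fit(disp=0)
# Risk collapsed into tertiles, as in the phenotype stratification being criticized
tert = smf.logit("event ~ treat + C(risk_tertile)", data=df).fit(disp=0)

print("AIC, continuous spline:", round(cont.aic, 1))
print("AIC, tertile grouping :", round(tert.aic, 1))
# Grouping discards within-tertile prognostic information, which shows up as a
# worse-fitting model (higher AIC) and as apparent residual heterogeneity.
```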

Second, time zero in the observational data was taken as the starting day of therapy. This is known to cause serious bias, which is why intent-to-treat analysis is so often used in RCTs. Speaking of bias, an important omission from the paper is an analysis of the tendencies for various treatments to be used in the clinical population. This would have helped us understand the limitations of observational data for estimating treatment effects. Estimating the explained relative variation in treatment selection from a host of patient characteristics would have been helpful. The patient characteristics need to include insurance coverage and socioeconomic status, besides all the usual candidate risk factors such as number of previous treatments and detailed disease severity.
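A minimal sketch of the suggested treatment-selection analysis, again with simulated data and hypothetical variable names: model treatment choice from patient characteristics and report how much of the selection is explained.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 6000
df = pd.DataFrame({
    "n_prior_tx": rng.poisson(1.5, n),          # hypothetical: number of previous treatments
    "severity": rng.normal(0, 1, n),            # hypothetical: detailed disease severity score
    "insured": rng.integers(0, 2, n),           # hypothetical: insurance coverage
    "ses": rng.normal(0, 1, n),                 # hypothetical: socioeconomic status index
})
lp = -0.3 + 0.4 * df.severity - 0.3 * df.n_prior_tx + 0.8 * df.insured + 0.3 * df.ses
df["got_new_tx"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))  # received the newer therapy?

sel = smf.logit("got_new_tx ~ n_prior_tx + severity + insured + ses", data=df).fit(disp=0)
print(sel.summary().tables[1])
print("McFadden pseudo R^2 for treatment selection:", round(sel.prsquared, 3))
# Large explained variation in treatment choice warns that treatment-effect estimates
# from these observational data are heavily confounded by who gets which therapy.
```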

A major analytical problem is the use of AUROC as a primary measure of model accuracy. AUROC has nothing to do with accuracy and is just a measure of predictive discrimination, e.g., the ability to find low- and high-risk patient types. Accuracy entails such things as smooth calibration curves. Machine learning algorithms such as those used by the authors are known to frequently have serious miscalibration due to overfitting. All results presented by the authors rest on having nearly perfect calibration, yet the authors made no attempt at assessing calibration. Also, AUROC is not a very sensitive measure of added predictive value. Other methods should have been considered.
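For readers who want to see what such a check involves, here is a minimal sketch of a smooth (lowess-based) calibration curve and a mean absolute calibration error, assuming one has arrays of predicted probabilities and observed binary outcomes; the names and the simulated miscalibrated predictions are hypothetical.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def smooth_calibration(pred_prob, outcome, frac=0.3):
    """Nonparametric (lowess) estimate of observed event frequency vs predicted risk."""
    pred_prob = np.asarray(pred_prob, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    curve = lowess(outcome, pred_prob, frac=frac)       # sorted (predicted, smoothed observed) pairs
    cal_pred, cal_obs = curve[:, 0], curve[:, 1]
    mace = float(np.mean(np.abs(cal_obs - cal_pred)))   # mean absolute calibration error
    return cal_pred, cal_obs, mace

# Illustrative use with simulated predictions that are deliberately too extreme (overconfident)
rng = np.random.default_rng(5)
true_p = rng.uniform(0.05, 0.6, 3000)
y = rng.binomial(1, true_p)
overfit_p = np.clip(true_p + 0.6 * (true_p - true_p.mean()), 0.001, 0.999)
_, _, mace = smooth_calibration(overfit_p, y)
print("Mean absolute calibration error:", round(mace, 3))
# A well-calibrated model has the smoothed curve close to the 45-degree line (small error);
# AUROC alone cannot reveal this kind of miscalibration.
```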

Finally, there is the big-picture question of how much we can trust the authors’ treatment effect estimates, which are derived from (possibly overfitted) complex modeling of non-randomized data.

Whenever I see a new paper that has the term real-world or emulation in the title, I am prepared to expect hype. This paper did not allay my fears.

13 Likes

This is beautiful - I was literally having a conversation along these lines yesterday with a collaborator. Very useful to have these ideas summarized so nicely here. Great suggestion also to replace “real world patients” with the much more accurate clinical practice population. Will adopt this.

By oncology standards, the Nature Medicine paper is actually a major step in the right direction because oncologists are unreasonably obsessed with higher order interactions termed “predictive effects” as opposed to focusing on main effects via risk modeling. As taught to us also by @Stephen, higher order interactions are progressively less important than lower ones for modeling patient heterogeneity. We have to start first with risk modeling. The Kidney Cancer Association accordingly recently released a consensus statement emphasizing this key point. This datamethods thread now provides excellent recommendations on how to optimize our risk modeling with a few simple but high yield steps.

Oncologists often look for “interactions” to identify “predictive” effects. As taught by @Sander, the term “interaction” gets used for several distinct phenomena:

  • Biologic interaction (synergy, antagonism, coaction): One factor changes the physical mechanism of action of another. This is what we are truly interested in but is typically discerned in the lab, not from clinical data. See for example Section 3.4 here which details the discovery of HER2, one of the most well-established “predictive” (and “prognostic”) biomarkers in oncology.

  • “Statistical interaction”: Change in a measure (of effect or association) upon change in a third factor. This is what oncologists typically are looking for when using the term “predictive”. More specifically they are looking for changes in HRs upon change of the “predictive” covariate.

To make things even more confusing, “Prognostic” effects can be termed in the literature as “risk magnification”, “risk modeling”, “risk score analysis”, “effect measure modification”, “additive effect”, “main effect”, or “heterogeneity of effect”. “Predictive” effects are also known in the literature as “effect modeling”, “treatment interaction”, or “multiplicative effect”.

The reification of statistical higher order “predictive” interactions into “biologic” interactions is a category error that actively hinders progress within oncology and is harmful to data-driven patient care. Biologic interactions are typically discerned in the lab. What oncologists call “prognostic” effects are what should mainly “predict” which patient could get treatment. To avoid confusion, a better term than “predict” may be “forecast” (following Nate Silver’s recommendation), which is meant to be data driven, as opposed to the “predictions” associated with oracles and prophets. Nostradamus did predictions (likely using forest plots and classifiers) whereas the weather service does forecasts. It may thus be simpler for us to start replacing the word “predict” with “forecast” in the oncology literature, to prevent confusion from the erroneous dichotomy between “prognostic” and “predictive” effects.

5 Likes

I often wonder how papers like this get published in high profile journals. What sensible clinician treats the survival probabilities from a randomised controlled trial as reasonable estimates to apply to a broader patient population? As you point out, the primary parameter that is to be estimated by a trial is the hazard ratio, and the hazard ratio is generalisable. If you are interested in a comparison of survival probabilities in treated and untreated - and the clinical parameter of interest is absolute, not relative, risk - then you need to apply the hazard ratio to an estimate of survival probability in the patient group of interest.

2 Likes

@paulpharoah I think the answer to your question is that many (most? Almost all?) physicians use absolute estimates! I suspect most reviews, guidelines, and teaching materials in oncology are more likely to report median survival or response rates than relative measures. This could be a neat topic for a systematic review if it has not been done already.

Question for @Pavlos_Msaouel: would you not agree that biological interactions as you describe them would manifest in an RCT as a statistical interaction? Do you think this is true of PD-1/PDL-1 expression, for example?

I am trying to understand this better as it impacts clinical practice. Physicians commonly rely on a “significant” observed effect in a biomarker subgroup to justify adoption of the treatment for that subgroup. It may not always be wrong, but I tend to be skeptical, especially when the overall trial has a null effect. The recent example of polatuzumab vedotin for diffuse large B-cell lymphoma in the non-GC subgroup is specifically the one I have been thinking about lately. I am happy to provide more context if anyone is interested in discussing it!

2 Likes

Nope, not necessarily. The two concepts of biological and statistical interactions can overlap but are not the same. We elaborate on this here, focusing on HER2 as a prototypical example. See in particular Section 3.4 and Figure 7. More tailored to a general oncology audience would also be Section 17, “Prognostic and Predictive Effects”, here. For data science nerds, here is a cool related paper and datamethods thread on the broader concepts in which this is all embedded.

Note also that many of the statistically “predictive” biomarkers oncologists are seeing in clinical datasets are actually ultra-prognostic biomarkers. A great example of this is the beautiful story of ctDNA in muscle-invasive bladder cancer, culminating with IMvigor011. Luckily, the prognostic effects of ctDNA monitoring were so powerful that just about every ctDNA+ patient would recur and practically all ctDNA- patients would remain disease-free post-surgery. Thus, the prognostic signal persisted even after using information-destroying methods such as looking for “predictive” 2-way interactions on the HR scale in forest plots of subgroup analyses. That may indeed be the indirect value of looking for these interactions: it can be a quick practical filter for “ultra-prognostic” biomarkers such as adjuvant ctDNA in muscle-invasive bladder cancer.

2 Likes

@Pavlos_Msaouel may cover this in his writings, but I want to add that commonly-mistaken-for-heterogeneity-of-treatment-effect inconsistencies of absolute risk or RMST differences seldom represent biological interactions. Statistical interactions on a scale for which such interactions are mathematically allowed to be absent (e.g., HR, OR) may be approximately thought of as necessary but not sufficient conditions for biological interaction. Do you agree with this Pavlos?

2 Likes

Exactly! In theory (and practice) there can be a biological interaction not detected in statistical data (e.g., due to low power / precision) but it is generally fair to say that with enough sample size under most plausible scenarios statistical interaction on the OR and HR scales would be a necessary but not sufficient condition for biological interaction.

3 Likes

Thank you for this beautiful discussion. There are several points on which I would appreciate your perspective.

In the field where I have direct experience as an analyst - namely, liver transplantation as a therapy for colorectal cancer with unresectable liver metastases - there are substantial challenges in obtaining sufficiently large datasets. Stratifications and other attempts to model heterogeneity often resulted in very few events per variable, rendering models such as Cox regression or other defensible approaches too unstable to explore properly. Consider that the first studies on which clinical guidelines were based relied on datasets of around 20 patients.

Moreover, and I think this is a more general point, I’m increasingly convinced that a substantial portion of heterogeneity in treatment effects doesn’t stem from baseline covariates, but from dimensions that are rarely modelled: clinician skill and experience, the specifics of how an intervention is operationalized at a given site, patient adherence to complex peri-operative pathways, available instrumentation, and so on. This echoes Dahabreh and Hernán’s argument that what we call an “isolated treatment” is actually a joint intervention, i.e., an entire bundle of actions and conditions. Randomization does not randomize the quality or nature of this bundle, nor are these components sampled from any underlying “population of implementation strategies”; they simply come as-is with the particular trial, which limits what the trial can transport. Importantly, the joint intervention also includes the choices made by analysts regarding methods for estimating efficacy - and, as @Sander points out, different reasonable “independent” analysts may obtain different results from the same dataset. This seems particularly relevant in a world still very much afflicted by the ritualistic reification of “p < 0.05 = significant result.” Evidence triangulation could be an option (?).

Finally, the notion of an endpoint inherently involves a cost-benefit evaluation with wide margins of subjectivity, often reflecting patient preferences. Is it preferable to survive five years in bed or one year while enjoying daily activities outdoors? Clearly, there is no objectively “correct” answer. Indeed, endpoints implicitly encode value judgments about quality of life, burden of treatment, and what outcomes matter to patients. In my opinion, this unavoidable variability, combined with strong subjectivity, highlights the inadequacy of absolute dichotomous criteria - which assume an idealized regular world that does not exist in practice. I also think all this should instead encourage adoption of frameworks such as GRADE, which aim for greater contextual adaptability - though even here a nascent sub-movement risks concealing unrealistic assumptions under mathematical complexity.

4 Likes

I agree with most of your post, except for the following:

I’ve discussed GRADE in a number of threads, and it falls far short of what reasonable people would do in a community of cooperative skeptics interested in truth.

Empirical assessments (linked to in that thread) indicate that applying GRADE overvalues RCTs (granting them more credibility than they warrant) and undervalues observational evidence, when the benchmark is predicting what a future study would show.

The right way was discussed by @Sander in his paper on modelling bias in observational studies, which is inherently Bayesian in perspective.

Greenland, S. (2005). Multiple-bias modelling for analysis of observational data. Journal of the Royal Statistical Society Series A: Statistics in Society, 168(2), 267-306 https://academic.oup.com/jrsssa/article/168/2/267/7084313

Conventional analytic results do not reflect any source of uncertainty other than random error, and as a result readers must rely on informal judgments regarding the effect of possible biases. (my emphasis)

I pointed out in another thread that this “informal judgement” was actually formalized in a field of AI and computer science, known as Subjective Logic, which is an application of Bayesian hierarchical models.

4 Likes

I don’t agree with this small part of your excellent post. All of these things are randomized; it’s just that the “bundle” makes the final analysis harder to interpret.

2 Likes

Thank you very much. I was referring to differential post-initial randomization issues such as informative censoring, differential adherence, or other site-specific implementation differences. I don’t think we can assume that these components of the joint intervention are truly randomized in practice, as they arise after treatment assignment and may depend on ‘treatment-patient’ or even ‘treatment-clinician’ interactions.

1 Like

Thank you very much for bringing these issues to light.

I have also encountered fundamental problems - starting with the randomized studies on which the methods discussed here are based. For example, one major concern is the determination of utility coefficients, which seem to act as substantial compressors of highly complex multivariate information.

Another issue concerns the surveys used in randomized trials: has anyone assessed the stability of these results through repeated administrations? Has it been examined how responses change depending on factors such as wording or other relevant stimuli? Has anyone designed a questionnaire capable of capturing not single outcomes but truly multivariate ones (e.g., multiple benefits versus multiple adverse effects, each with varying degrees of “intensity”)?

I therefore apologize if my previous message gave the impression that I intended to present GRADE as a solution. What I meant was simply that I share the idea of personalizing therapies and grading evidence - although we are still far from having an adequate method to achieve this. In other words, I merely share its underlying philosophy. In this regard, I would like to ask those more experienced than I am whether they see any viable solutions or promising directions to pursue.

1 Like

Nice points in both replies. Related to ‘compressors’ we need to always analyze the rawest form of the data, then apply utilities. For example if the outcome is a 20-level ordinal variable we analyze it as such before applying what may be < 20 distinct utility values. From the fitted ordinal model we compute expected utility and confidence intervals for same.
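A minimal sketch of that workflow, assuming a 20-level ordinal outcome and using statsmodels’ OrderedModel as a stand-in for whatever ordinal model one prefers; the covariates and utility values are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(11)
n, K = 3000, 20                                   # 20-level ordinal outcome
x = pd.DataFrame({"treat": rng.integers(0, 2, n),
                  "age": rng.normal(60, 10, n)})
latent = 0.03 * x.age - 0.5 * x.treat + rng.logistic(size=n)
y = pd.cut(latent, K, labels=False).to_numpy()    # ordinal levels 0..19

# Fit the ordinal (proportional odds) model on the raw 20-level outcome
fit = OrderedModel(y, x, distr="logit").fit(method="bfgs", disp=0)

# Only afterwards apply utilities (hypothetical here, possibly with ties across levels)
utilities = np.minimum(np.arange(K) * 0.06, 1.0)  # fewer than 20 distinct utility values
probs = np.asarray(fit.predict(x))                # n x K matrix of level probabilities
expected_utility = probs @ utilities              # per-patient expected utility
print("Mean expected utility by arm:")
print(pd.Series(expected_utility).groupby(x.treat).mean())
# Confidence intervals for these expected utilities could come from the bootstrap or
# from the delta method applied to the fitted model.
```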

[Alessandro just notified me of this thread] Regarding “…statistical interaction on the OR and HR scales would be a necessary but not sufficient condition for biological interaction” there is a technical problem because, in analyses stratified or conditioned on other risk factors (e.g., matched studies), additive ratio (“relative risk”) models, such as proportional hazards and conditional logistic models, do not provide the often-assumed correspondence between additivity and absence of causal interaction - see “Additive Risk versus Additive Relative Risk Models”, Epidemiology 1993;4:32-36. This is not a matter of approximation breakdown, but rather of intercept or baseline shift across strata or conditioned factors. One needs to go back to the risk (not ratio) scale to preserve the correspondence. For more discussion of the relations between biologic interactions and additivity see Ch. 5 of Modern Epidemiology 3rd edn. 2008 and VanderWeele’s 2015 book Explanation in Causal Inference.

5 Likes

P.S. Even if we modified the statement to “statistical interaction on the risk scale would be a necessary but not sufficient condition for biological interaction” (allowing that the risks can be approximated by odds or rates when the outcome and censoring are uncommon) it would still be technically wrong in that the condition is not strictly necessary: In theory superadditive (synergistic) and subadditive (antagonistic+competitive) interactions could cancel enough to produce risk additivity. While we would not likely expect perfect cancellation, there might be enough cancellation to make detection of the predominant type difficult, and the data would not inform us that the cancellation occurred, let alone its extent. This problem is just the interaction (2nd order) analog of the main-effect (1st order) fact that a statistically identifiable effect is only the residual left after cancellation of causal and preventive effects, and hence statistical null hypotheses refer only to perfect cancellation within levels of controlled covariates, not to the sharp null hypothesis of absence of effects in all observed units.

3 Likes

Thank you very much for these valuable points, Sander.

The pattern you outline for interaction might be another instance of our natural propensity for reification, followed by an inversion fallacy.

We begin with strong theoretical requirements - for example, assuming data faithfulness so that a null-average result corresponds to a directional separation in a DAG, or assuming sufficient causal knowledge for additivity on the risk scale to connect statistical additivity with the absence of biological interaction. But in practice, these conditions often recede into the background - as (cognitively) subordinate factors - and get replaced by a convenient shortcut. As a result, we behave as if the observed statistical pattern (a null-average result, or additivity on a derived scale like odds or hazards) could directly reveal the underlying causal structure; specifically, we start from one-way implications such as ‘no biological interaction (plus strong assumptions) –> additivity on the risk scale,’ and then quietly slip into using the converse. Much as we are now observing in the GRADE movement, increasing mathematical complexity can bury the critical assumptions within a cognitive substratum that is quickly forgotten - often relegated to a supplemental file - so that, in the end, what survives is merely “the method,” without any further questioning of its structural foundations.

I think your point about baseline shifts and the possibility of cancellations among superadditive and subadditive components underlines this: what we observe may be only the net residue of such cancellations, not a transparent view of the mechanism. In that sense, the inferential difficulty may come from reifying a theoretical implication into an empirical diagnostic and then tacitly reversing it - treating ‘null average result –> null causal effect’ or ‘statistical additivity –> no biological interaction,’ as if they were reliable inference rules. This is precisely where the inversion fallacy can enter unnoticed, even though the original premise pointed in the opposite direction (i.e., from knowledge to plausible statistical behavior, not the reverse).

Of course, once new multi-evidentiary information becomes available, we can actively use those same patterns to challenge our prior assumptions - very much in the spirit of Box’s well-known diagram.

3 Likes

Well said by all. I tend to be very pragmatic about all this. Which link function has yielded the best fitting model and has minimized interactions in similar datasets? That would either lead me to make as my primary model a specific one if using a frequentist approach, or would help me in specifying a prior. There are general families of link functions that have an unknown parameter or two, and a Bayesian prior could emphasize parameter values that approximate best fitting link functions in similar datasets.

3 Likes

Being a pragmatist too, I have to ask: How would I find out what link minimized the number of parameters in “similar datasets” (beyond the parameters I am forced to include based on the background context and the analysis goals, such as confounder terms)? I don’t have most such datasets and rarely see posted research reports mention how they chose their link - and if they did, I doubt I would find their criterion for choosing that link satisfactory. [I note that if there are products in the model then I would insist the model also contain all the main effects composing those products (absent strong evidence to omit them).]

Then too, doesn’t the purpose of the model matter for which link is satisfactory? In all the data I’ve encountered, staying with the canonical link and then using fitted risks or rates from the model seemed to be the most efficient use of the data and my time, avoiding the technical difficulties of other links (such as the identity link for risks), absent strong evidence for some other link. The model is then only a stabilization device or smoother in the face of the sparsity of data across all covariate combinations, not a representation of the complex underlying reality (a view that recognizes how all models are wrong even if some are useful).

As Box (1980) noted, omission of terms corresponds to a prior point mass at zero for the coefficients of those terms, which is not a prior I ever have for contextually reasonable candidate terms. A pragmatist could respond that the omission means the resulting model is estimating a weighted average response at each covariate level, with the weights determined by the model form and the data frequencies of each level, as per results for misspecified models in White H. Maximum likelihood estimation in misspecified models. Econometrica 1982;50:1-9 and White H. Estimation, inference, and specification analysis. New York: Cambridge University Press, 1993.

If I am concerned that those averages could be misleading, an alternative to adding terms is to treat the model as an estimated (empirical-Bayes) prior mean for smoothing or shrinkage. For some technical details of that approach (inspired by the “pseudo-Bayes” approach in Bishop, Fienberg and Holland, Discrete Multivariate Analysis, 1975) see Greenland S. Multilevel modeling and model averaging. Scand J Work Environ Health 1999;25 (suppl 4):43-48 and Greenland S. Smoothing observational data: a philosophy and implementation for the health sciences. Int Stat Rev 2006;74:31-46.

3 Likes

I thought I was the only one who had this view, especially after Miguel Hernan said in his book “What If” that “We do not consider effect modification on the odds ratio scale because the odds ratio is rarely, if ever, the parameter of interest for causal inference.”

I, a PhD student, and a post-doc once wrote a paper about this very point related to OR (and by extension HR) that avoided all technical jargon (I have no idea why biomarker people call interactive biomarkers “predictive”, thus confusing everyone) and that contains only intuitive examples and logic here.

Finally, indeed most statistical interactions are artifacts of the sample, and therefore, without any need for scientific interpolation or justification, one can clearly conclude that, while necessary, statistical interactions on the OR or HR scale are not a sufficient condition for biological interaction.

3 Likes

I agree fully with both of you but disagree entirely with Miguel Hernan on the OR point.

To @Sander’s points, these are in line with including a parameter for everything you don’t know. Ideally this would include a parameter that indexes a spectrum of link functions.

I think it is possible to be pragmatic and productive about goodness-of-link in a way that monitors the minimization of interaction effects, e.g. for a given link we can compute a partial adjusted R^2 for all two-way interactions.

As discussed here, deviance and AIC can be used to select links in the context of binary or semi parametric ordinal models, when one does not want to use the more Bayesian-feeling approach of continuously indexed link functions. In that link there is also an example simulating the sample size needed to choose the right link from data. As one would expect, larger sample sizes are needed to discriminate between logit and probit links. For many situations I would take as the universe of links the logit, log-log, and complementary log-log links.
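A minimal sketch of that kind of comparison (not the linked example itself), simulating from a complementary log-log model and comparing deviance and AIC across candidate links with statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 20000                                   # large n is needed to separate similar links
x = rng.normal(size=(n, 2))
X = sm.add_constant(x)
# Simulate from a complementary log-log model (illustrative true link)
eta = -0.5 + 0.8 * x[:, 0] - 0.6 * x[:, 1]
p = 1 - np.exp(-np.exp(eta))
y = rng.binomial(1, p)

links = {"logit": sm.families.links.Logit(),
         "probit": sm.families.links.Probit(),
         "cloglog": sm.families.links.CLogLog()}
for name, link in links.items():
    res = sm.GLM(y, X, family=sm.families.Binomial(link=link)).fit()
    print(f"{name:8s} deviance {res.deviance:10.1f}  AIC {res.aic:10.1f}")
# The true (cloglog) link should win on deviance/AIC, but only clearly so at large sample
# sizes; goodness-of-link could also be judged by how small the two-way interaction
# contribution (e.g., partial adjusted R^2) becomes under each candidate link.
```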

As much as I like odds ratios I’m more drawn to the approach discussed here which was motivated by @Sander. Truly interesting quantities are conditional, and if we like absolute risks for decision making, the link function is just a means to an end. When the link function is not logit or the identity function (I would never use the latter), individual coefficient interpretations are dicey, but that’s not so important when translating results into decisions. I would quantify the effect of a single predictor, or the joint effects of a chunk of predictors, using adjusted partial R^2 and by contrasting histograms of predicted risk distributions with and without the chunk.

4 Likes