Orcutt, Chen, Mamtani, Long, and Parikh recently published *Evaluating generalizability of oncology trial results to real-world patients using machine learning-based trial emulations* in *Nature Medicine*, 2025. The paper is motivated by misunderstandings about generalizability in RCTs and incorporates some problematic methodology.
The paper is motivated by the following.
Approximately one in five real-world oncology patients is ineligible for a phase 3 trial. However, restrictive eligibility criteria alone are unlikely to fully explain the generalizability gap. A study simulating various combinations of eligibility criteria in advanced non-small cell lung cancer (aNSCLC) trials found little variation in survival hazard ratios (HRs) for treatment. This suggests that other factors may be at play. An alternative explanation is that physicians selectively recruit patients with better prognoses. Consequently, real-world patients likely have more heterogeneous prognoses than RCT participants.
Aside from the high frequency of the problematic term *real-world patients* throughout the paper (why not just say *clinical practice population*?), this is all OK. But the explanation is simpler than the authors believe, the remedy need not be nearly as complex as what they did, and the fix may not even require any data beyond the high-quality standardized data already contained in the RCTs.
The core problems are:
- RCTs use inclusion/exclusion criteria for a variety of reasons having little to do with generalizability. An RCT fails to generalize if prognostic factors interact with treatment, the interacting factors are not included in the model used to analyze the RCT data, and the distribution of the interacting factors in clinical practice differs from their distribution within the RCT. For detailed treatment-interaction examples see this, and for more discussion of generalizability see this.
- Other than the special case where linear models are used (e.g., when the outcome variable is systolic blood pressure), RCTs are not designed to estimate absolute effects; they are designed to estimate relative effects such as hazard ratios (HRs) and odds ratios. Relative effects are capable of being constant over a wide spectrum of patients' disease severity, age, and other prognostic factors. Witness the authors' statement above about little variation in HRs.
- Absolute effects such as differences in life expectancy, differences in restricted mean survival time (RMST), differences in cumulative incidence curves, and absolute risk reduction are incapable of being constant if there is any within-treatment patient heterogeneity in risk factors. Though commonly presented in papers without covariate adjustment, none of these quantities is suitable for marginal (unadjusted) estimation. Absolute treatment benefits are small for very low-risk patients.
- Much of the variation in absolute treatment effects is explained by the simple math of risk magnification.
Risk magnification is the simple idea that patients who are at very low risk have little room for movement on an absolute scale (even though relative effects such as HRs can still be extreme for them). For example, absolute risk reduction must "slow down" as baseline risk approaches zero; otherwise the risk in the treated group would reach impossible negative values. Likewise there may be limited absolute movement in very high-risk patients who are beyond reasonable hope. See this for the following graphs, which depict risk magnification for a proportional hazards survival model with HRs of 0.1, 0.2, …, 0.9 (top panel) and a nomogram for estimating patient-specific absolute risk reduction from a binary logistic model (bottom panel).
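To make the arithmetic behind risk magnification explicit, here is a short derivation (my notation, not the paper's) for risk at a fixed time point under proportional hazards, with $p_0$ the control-arm risk and $p_1$ the treated-arm risk:

```latex
% Under proportional hazards, S_1(t) = S_0(t)^{HR}, so at a fixed time point
p_1 = 1 - (1 - p_0)^{\mathrm{HR}}, \qquad
\mathrm{ARR} = p_0 - p_1 = p_0 - \bigl[ 1 - (1 - p_0)^{\mathrm{HR}} \bigr].
% As p_0 \to 0, (1 - p_0)^{\mathrm{HR}} \approx 1 - \mathrm{HR}\, p_0, hence
\mathrm{ARR} \approx p_0 \, (1 - \mathrm{HR}) \;\to\; 0 .
```

For small baseline risk the absolute benefit is roughly proportional to that risk, which is exactly the "slowing down" the graphs depict, no matter how impressive the HR.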
From an RCT that uses covariate adjustment, provided the distribution of important covariates is not very narrow, one can estimate the entire distribution of absolute treatment effects over the enrolled patients, as done here. Given an alternate covariate distribution from the clinical population, one can use the RCT's risk model to do likewise.
A key problem that could have been avoided, and which probably would have made the paper under discussion (and many others) unnecessary, is how RCT results are presented in the literature. Almost all authors make the serious statistical mistake of showing Kaplan-Meier survival curves by treatment group, without covariate adjustment. Like simple proportions, Kaplan-Meier estimates assume homogeneity of survival time distributions within treatment. Since RCT inclusion criteria never restrict all prognostic factors to single values, all outcome distributions are heterogeneous and Kaplan-Meier estimates are not meant to apply. RCTs should have been routinely presenting graphs that show how the results translate into absolute patient benefit for a variety of patient types included in the study. The RCT results can also include estimated absolute benefits for types of patients not included in the trial. These absolute benefit estimates (e.g., differences in RMSTs) will properly have wide confidence intervals due to extrapolation.
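To make the advocated presentation concrete, here is a minimal sketch using simulated data and the Python lifelines package (the variable names `tx`, `age`, and `severity` are hypothetical stand-ins, not from any trial discussed here):

```python
# Sketch: covariate-adjusted survival curves for representative patient types,
# in place of unadjusted Kaplan-Meier curves by treatment group.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

# Simulated stand-in for trial data (hypothetical covariates)
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "tx":       rng.integers(0, 2, n),           # randomized treatment
    "age":      rng.normal(65, 10, n),
    "severity": rng.integers(1, 4, n),
})
lp = 0.04 * (df["age"] - 65) + 0.5 * (df["severity"] - 2) - 0.4 * df["tx"]
df["time"] = rng.exponential(np.exp(2 - lp))     # exponential survival times
df["event"] = 1                                  # no censoring, for simplicity

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")

# Representative covariate combinations, each under both treatments
patients = pd.DataFrame({
    "tx":       [0, 1, 0, 1, 0, 1],
    "age":      [50, 50, 65, 65, 80, 80],
    "severity": [1, 1, 2, 2, 3, 3],
})
cph.predict_survival_function(patients).plot()   # adjusted curves, not KM
```

The absolute gap between the two curves within a patient type is the clinically relevant quantity, and it varies with baseline prognosis even when the HR does not.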
Here are two graphs related to these points, both taken from here. For a binary outcome of 30-day mortality, a logistic regression model adjusted for clinically relevant covariates found no evidence for differential treatment effects, i.e., interactions with treatment. Using the logistic model, all patients in this randomized GUSTO-I study of t-PA vs. streptokinase for acute myocardial infarction had the treatment variable set first to streptokinase, then to t-PA. Both sets of risk estimates are shown in the graph below, where the number of patients at each tiny bin of (SK, t-PA) mortality risk is color coded. One can see that there was a huge number of low-risk patients (in yellow) in the 30,000-patient comparison, but there was also a high (with respect to typical RCT sample sizes) number of high-risk patients.
When the two treatments' risk estimates are subtracted, we get the distribution of absolute treatment benefits below.
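For readers who want the mechanics, here is a minimal sketch of that calculation on simulated data (statsmodels; this is not the GUSTO analysis itself, and the variable names are hypothetical):

```python
# Sketch: distribution of patient-specific absolute risk differences from a
# covariate-adjusted logistic model, in the spirit of the graphs above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 5000
df = pd.DataFrame({
    "tx":       rng.integers(0, 2, n),           # 0 = SK, 1 = t-PA (stand-ins)
    "age":      rng.normal(62, 12, n),
    "severity": rng.normal(0, 1, n),
})
logit = -3 + 0.05 * (df["age"] - 62) + 0.9 * df["severity"] - 0.3 * df["tx"]
df["death30"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

fit = smf.logit("death30 ~ tx + age + severity", data=df).fit(disp=0)

# Predict every patient's risk with treatment set to each arm, then subtract
risk0 = fit.predict(df.assign(tx=0))
risk1 = fit.predict(df.assign(tx=1))
arr = risk0 - risk1          # patient-specific absolute risk reduction

print(arr.describe())        # the whole distribution, not just one average
```

The same fitted model could then be applied to a covariate table drawn from the clinical population to estimate the corresponding distribution there, as argued above.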
A nomogram for computing such differences was already presented above.
When RCTs are analyzed in line with their design, the need for papers such as Orcutt et al. is lessened, and the needed results may not require any "real-world" data supplementation at all. Imagine as standard results in an RCT paper (1) estimated survival curves by treatment for five representative covariate combinations and (2) a graph relating baseline prognosis to the increase in RMST or the decrease in 5-year event incidence.
Besides the problematic motivation and incomplete understanding of the generalizability of RCTs, there are problems in Orcutt et al.'s analyses. First, consider "trial emulation stratified by phenotype". As an aside, this is an improper use of the word emulation, since to emulate means to copy, and randomization is not being copied. The proper word is simulation. But more serious is the construction of artificial prognostic phenotypes by grouping patients into tertiles of estimated risk. The use of tertile grouping means that a researcher believes that a patient's outcome and treatment benefit are a function of how many other patients are similar to her. This is not at all how risk operates. Risk operates on an individual patient basis (patients do not compete with each other) and should always be analyzed as a continuous variable. The authors' results are replete with unexplained residual heterogeneity.
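A minimal sketch of the contrast (simulated data, hypothetical column names) is below; the tertile model throws away all within-third variation in risk, while a spline model lets estimated benefit vary smoothly with each patient's own risk:

```python
# Sketch: baseline risk analyzed as a continuous variable (regression spline)
# versus grouped into tertiles.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 4000
df = pd.DataFrame({
    "tx":   rng.integers(0, 2, n),
    "risk": rng.uniform(0.02, 0.6, n),       # pre-treatment risk estimate
})
lp = np.log(df["risk"] / (1 - df["risk"])) - 0.5 * df["tx"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

# Tertile grouping (as in the paper): treats all patients in a third alike
df["third"] = pd.qcut(df["risk"], 3, labels=["low", "mid", "high"])
tertile_fit = smf.logit("y ~ tx * C(third)", data=df).fit(disp=0)

# Continuous alternative: a spline in risk, interacted with treatment
spline_fit = smf.logit("y ~ tx * bs(risk, df=4)", data=df).fit(disp=0)

print(tertile_fit.aic, spline_fit.aic)       # compare model adequacy
```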
Second, time zero in the observational data was taken as the starting day of therapy. This is known to cause serious bias, which is why intent-to-treat analysis is so often used in RCTs. Speaking of bias, an important omission from the paper is an analysis of the tendencies for various treatments to be used in the clinical population. This would have helped us understand the limitations of observational data for estimating treatment effects. Estimating the explained relative variation in treatment selection from a host of patient characteristics would have been helpful. The patient characteristics need to include insurance coverage and socioeconomic status, besides all the usual candidate risk factors such as number of previous treatments and detailed disease severity.
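Here is a minimal sketch (simulated data; the characteristics shown are hypothetical stand-ins for the ones named above) of the kind of treatment-selection model that would quantify this:

```python
# Sketch: how predictable is treatment choice from patient characteristics?
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 3000
obs = pd.DataFrame({
    "age":        rng.normal(68, 10, n),
    "n_prior_tx": rng.poisson(1.5, n),       # number of previous treatments
    "severity":   rng.normal(0, 1, n),
    "insured":    rng.integers(0, 2, n),     # insurance coverage indicator
})
lp = 0.03 * (obs["age"] - 68) + 0.4 * obs["severity"] + 0.5 * obs["insured"]
obs["treated"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

sel = smf.logit("treated ~ age + n_prior_tx + severity + insured",
                data=obs).fit(disp=0)
print(sel.prsquared)   # McFadden pseudo-R^2: explained variation in selection
```

A high value would signal strong confounding by indication and correspondingly shaky observational treatment-effect estimates.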
A major analytical problem is the use of AUROC as a primary measure of model accuracy. AUROC has nothing to do with accuracy and is just a measure of predictive discrimination, e.g., the ability to find low- and high-risk patient types. Accuracy entails such things as smooth calibration curves. Machine learning algorithms such as those used by the authors are known to frequently have serious miscalibration due to overfitting. All results presented by the authors rest on having nearly perfect calibration, yet the authors made no attempt at assessing calibration. Also, AUROC is not a very sensitive measure of added predictive value. Other methods should have been considered.
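A minimal sketch of what such a check could look like (simulated predictions, not the authors' model) is below; note that the deliberately miscalibrated risks have exactly the same AUROC as the true ones, because AUROC is unchanged by any monotonic distortion of the predictions:

```python
# Sketch: identical AUROC, very different calibration.
import numpy as np
from sklearn.metrics import roc_auc_score
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(5)
p_true = rng.uniform(0.05, 0.9, 4000)     # true risks
y = rng.binomial(1, p_true)               # observed 0/1 outcomes
p_model = p_true ** 1.4                   # miscalibrated (but monotone) model

print(roc_auc_score(y, p_true), roc_auc_score(y, p_model))   # identical

# Smooth (loess) calibration curve: observed outcome vs. predicted risk;
# the ideal curve is the 45-degree line
smooth = lowess(y, p_model, frac=0.3)
print(smooth[:5])   # columns: predicted risk, smoothed observed proportion
```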
Finally, there is the big-picture question of how much we can trust the authors' treatment effect estimates, which are derived from (possibly overfitted) complex modeling of non-randomized data.
Whenever I see a new paper that has the term *real-world* or *emulation* in the title, I am prepared to expect hype. This paper did not allay my fears.



