Heterogeneity in RCTs with continuous outcome

Hello all,

I’m writing with a plea for literature recommendations or theoretical guidance. My research group works on RCTs with continuous outcomes (e.g. symptoms of depression), where the typical presentation of results is a mean difference (average treatment effect) with a 95% CI. But something I haven’t seen reported (at least not in my field) is a measure of treatment-effect heterogeneity.

When you identify substantial heterogeneity in a random-effects meta-analysis, it’s typically more valuable to explore the sources of that heterogeneity rather than calculating a pooled mean difference. This is because the latter ‘averages over’ important information and is of limited clinical utility. I cannot see how the same logic would not apply in an RCT with a highly variable treatment effect.

I can imagine the value of presenting heterogeneity in an RCT would be that, as in meta-analysis, it would shift the goals of your analysis. If your trial found low heterogeneity, this would support the validity of reporting an ‘average treatment effect’ - roughly what clinicians can expect to see in future patients. Whereas if there were high heterogeneity, the study could shift focus and give greater emphasis to exploring potential effect modification. A natural way to measure this heterogeneity might be something like a 95% prediction interval, as in meta-analysis.
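For concreteness, the meta-analytic 95% prediction interval I have in mind is the usual Higgins-Thompson-Spiegelhalter one, mu-hat ± t(k−2) × sqrt(tau² + SE(mu-hat)²). A minimal sketch in Python - all the study estimates below are invented for illustration:

```python
import numpy as np
from scipy import stats

def prediction_interval(yi, vi, level=0.95):
    """95% prediction interval for the true effect in a new study,
    from a DerSimonian-Laird random-effects meta-analysis.
    yi: study mean differences; vi: their squared standard errors."""
    yi, vi = np.asarray(yi, float), np.asarray(vi, float)
    k = len(yi)
    w = 1.0 / vi                               # fixed-effect weights
    mu_fe = np.sum(w * yi) / np.sum(w)
    Q = np.sum(w * (yi - mu_fe) ** 2)          # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (k - 1)) / c)         # between-study variance
    w_re = 1.0 / (vi + tau2)
    mu = np.sum(w_re * yi) / np.sum(w_re)      # pooled random-effects mean
    se_mu = np.sqrt(1.0 / np.sum(w_re))
    t = stats.t.ppf(1 - (1 - level) / 2, df=k - 2)
    half = t * np.sqrt(tau2 + se_mu ** 2)
    return mu, (mu - half, mu + half)

# invented mean differences on a depression scale, from 5 studies
mu, (lo, hi) = prediction_interval([-3.0, -1.2, -4.5, -0.4, -2.8],
                                   [0.5, 0.4, 0.8, 0.3, 0.6])
```

When tau² is large relative to the standard errors, the prediction interval is much wider than the CI for the pooled mean, which is exactly the signal that an ‘average’ effect hides a lot.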

But I’m not sure whether this makes sense, as I don’t see it used routinely. If anyone is able to provide guidance on this issue, and any potentially helpful literature, I would be very grateful. If it is sensible, it’s something I would like to adopt in my own work. Thanks!


It might help to clarify what type of meta-analyses (MA) you are referring to. Retrospective MAs do not give enough control over the data-generation process to put much credibility in low heterogeneity statistics. Low Q or I² statistics could always be a symptom of publication bias.

More persuasive would be to control for sources of heterogeneity at the design phase, i.e., synthesize future experiments to be performed by the same team or agent, with specification of which sources of heterogeneity most likely need to be accounted for.

I’ve collected a number of journal articles in this thread. Start with Senn’s articles, and then look up what seems relevant to your problem.

Also study a few posts by @f2harrell on the correct analyses of this type of data. I’ve reluctantly come to the sad conclusion that most of the published work using this type of data is not reliable, given how it is analyzed.

These three comments are also related to the issues surrounding analysis of PRO data:


The meta-analyses I’ve seen of continuous outcomes (and I’m not sure yours is really continuous) have demonstrated anti-heterogeneity of treatment effect. That is they have found lower patient-to-patient response variability in treated groups than in placebo groups. So be careful. A couple of other points:

  • the typical clinical trial has an insufficient sample size for precisely estimating the average treatment effect, so is hopeless for estimating differential treatment effects
  • a good statistical analysis plan for interaction modeling is a must
  • Bayesian modeling may be necessary because without somewhat skeptical prior distributions for interaction effects (i.e., some borrowing of information) the sample size will be inadequate
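To make the last bullet concrete, here is a toy conjugate-normal calculation (all numbers invented, not from any trial) showing how a skeptical prior borrows information for an interaction contrast:

```python
import math

def shrink_interaction(est, se, prior_sd):
    """Posterior mean/SD for an interaction contrast under a skeptical
    Normal(0, prior_sd**2) prior with a Normal likelihood (conjugate)."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / se ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * data_prec * est     # prior mean is zero
    return post_mean, math.sqrt(post_var)

# invented numbers: a 4-point subgroup-by-treatment interaction
# estimated very imprecisely (SE = 3 scale points)
post_mean, post_sd = shrink_interaction(est=4.0, se=3.0, prior_sd=2.0)
# the skeptical prior pulls the noisy 4-point estimate most of the way to 0
```

The noisier the interaction estimate relative to the prior SD, the more it is pulled toward zero, which is precisely the protection an under-sized trial needs.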

Thanks @R_cubed for the links. The Senn articles seem relevant and a great place to start. I will read through.

Oh yes, poor wording on my part - clearly a discrete outcome in the example used. I suppose this will vary by discipline, but I have some reason to expect meaningful heterogeneity. Clinicians will often trial 3 or 4 medications before they find something they think ‘works’ for a patient with an affective disorder; most patients don’t respond to their first medication. The troubling fact about these RCTs is that 10% of treated patients might have full remission while 80% don’t respond at all, and that might amount to a small average treatment effect. Alternatively, 80% might have a small but consistent reduction in their symptoms, which corresponds to the very same average treatment effect. If I read a published trial I usually can’t discern which is more likely - and a numerical index of treatment-effect heterogeneity, I imagine, could be useful for this. Something like a 95% prediction interval?

Very true. I suppose I’m asking though - in these contexts of highly variable response would a very under-powered analysis of prospectively defined (psychological/genetic/social) subgroups be more valuable than a slightly under-powered average treatment effect - given the latter has little clear value in this context? I don’t ask rhetorically - I think you’re probably right that most studies can’t even begin to answer interaction questions.

But should heterogeneity still be assessed in some way routinely? If not, do we risk signaling too much confidence in an ‘average treatment effect’ we might not believe a clinician will ever see in a given patient?

This isn’t very clear from most sources, but the consensus in this forum seems to be that population “average” treatment effect estimates aren’t exceptionally useful (especially the “standardized” mean difference), and odds ratios are a good default, as they have an interpretation for individuals.

Senn’s article U is for Unease (in the links above) goes into the issues about effect size measures and how they can increase or decrease heterogeneity.

If the variable of interest has natural units, an average treatment effect could make sense. PROs don’t have them.

This thread also goes into the challenges of aggregating estimates from studies.

As @Stephen has pointed out, this is completely compatible with every patient having the same probability of treatment response.

I don’t think so. I am not swayed by any subgroup analysis unless you provide evidence that the subgroup is “special”, i.e., that the treatment effectiveness is different in that subgroup than in the remaining patients (i.e., interaction).


Questions around patient heterogeneity seem to arise predictably in fields with frustrating clinical trial track records (e.g., research on major depression, chronic pain, sepsis, cancer). The recurring hypothesis seems to be that there are probably many different “phenotypes” for these conditions that have not yet been disentangled/elucidated and it’s folly to think that a given therapy should work for all of them. And if this is the case, then an inconsistent phenotype mix between trials might underlie the variable trial results we see.

I’m not qualified to discuss heterogeneity in the statistical sense, but thought this paper was interesting. It summarizes problems that can affect research into “complex syndromal illnesses” like depression.


I don’t know if that paper covers this, but depression trials are often analyzed in a way that makes the result somewhat irrelevant and possibly misses important treatment effects. The tendency is to analyze change from baseline, which assumes that baseline is linearly related to the, say, 12-week depression score, and that the slope is 1.0. The problem is that neither of these assumptions holds; in fact the baseline can be quite nonlinear in predicting the 12w score, because patients with severe initial depression get more benefit than those with milder baseline depression. Here’s a case where an apparent heterogeneity of treatment effect is induced by a bad analysis and is sitting right under your nose. Sponsors are looking for genes that explain results and don’t even look at the simplest data.

A proper analysis would be to use ordinal regression with Y = 12-week score and a cubic spline function of the baseline score as covariate. You may find from this proper analysis (I boldly say “proper” because it actually fits the data) that the treatment effect is constant, but the treatment effect is the shift in the 12w score for a patient with a given baseline.


Whenever I read about common statistical errors like “change from baseline” analyses or the pitfalls of clinical scoring systems, I feel despair. Entire fields of medical research have been using these approaches for decades, seemingly unquestioningly. This fact suggests that very few researchers are aware that they are problematic.

When trial results are consistently disappointing, the impulse of researchers seems to be to search for hidden complexity in disease phenotypes as an explanation, rather than considering the possible contribution of recurring bad/misleading statistical techniques. Many diseases likely will turn out to be much more complex than we realize. But forging ahead with ever-more complex/granular research into disease phenotypes, before attempting to gauge the impact of poor analytic/measurement decisions, seems unwise.


Thanks for drawing my attention to Senn’s U is for Unease article - very clarifying on several points!

This is an excellent paper, thanks for bringing it to my attention. I hadn’t really given so much consideration to the noise inherent in sum scales that combine so many distinct symptom domains (which are themselves complex and could be even further subdivided!).

This is an excellent point. In pre-post designs this kind of ANCOVA is still only sometimes used, and even then I’ve never seen a study that has allowed nonlinearity for the baseline score.

In recent years, a somewhat more common approach than the pre-post design is the repeated-measures design (BL, week 4, week 8, week 12). Usually this is modeled with linear mixed-effects models with a random intercept and slope. Time is usually treated as continuous or log-transformed (to account for the fish-hook response pattern common in depression studies).
A big controversy is whether baseline Y should be included in the outcome vector or as a covariate; I’ve seen arguments for each. For including BL Y in the outcome: you capture the differential slope between groups between BL and week 4. For including BL Y as a covariate: you reduce mean squared error (improving efficiency) and can allow treatment response to vary with baseline severity (BL Y × Group × Time interaction). I would be interested to hear anyone’s opinion.

I’ve delved into this question in detail in Chapter 7 of RMS. I am at a loss as to why so many statisticians believe that a pre-randomization measurement should be considered an outcome. The two simplest reasons this is incorrect are:

  • Often the baseline value is an inclusion criterion, e.g. depression score > x. That creates a truncated distribution that cannot be modeled the same way as the untruncated post-randomization measurements.
  • As you’ll see @Stephen quoted in the notes, baseline is just not a response to treatment.

By modeling baseline as a covariate using a flexible nonlinear function you can get the “right” answer. Then there is the problem of correctly modeling correlation. Random effects are routinely used without adding additional correlation structure; this induces an unlikely-to-fit compound-symmetric correlation structure. Markov longitudinal models allow for much more generality.
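As a crude illustration of the Markov idea: a first-order model just conditions each visit’s score on the previous visit’s score. A sketch on simulated long-format data, with cluster-robust SEs as a simple guard; the data-generating numbers are all invented:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
ids = np.arange(n)
treat = rng.integers(0, 2, size=n)
baseline = rng.normal(22, 4, size=n)

# simulate weeks 4/8/12 with first-order (Markov) dependence on the
# previous score and a per-visit treatment shift (all numbers invented)
rows, prev = [], baseline.copy()
for wk in (4, 8, 12):
    y = 5 + 0.7 * prev - 1.0 * treat + rng.normal(0, 2, size=n)
    rows.append(pd.DataFrame({"id": ids, "week": wk, "treat": treat,
                              "yprev": prev, "y": y}))
    prev = y
long = pd.concat(rows, ignore_index=True)

# first-order Markov model: condition on the previous measurement;
# cluster-robust SEs guard against residual within-patient correlation
m = smf.ols("y ~ yprev + treat + C(week)", data=long).fit(
    cov_type="cluster", cov_kwds={"groups": long["id"]})
```

Conditioning on the lagged response lets the correlation structure be far more general than the compound symmetry induced by a lone random intercept; a serious analysis would use an ordinal state-transition model rather than OLS, but the conditioning idea is the same.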


Marketing/medical writers know the argument is coming, though; they accept it and swat it away. That’s why Senn’s unequivocal statements, such as “barely a statistician of repute can be found to defend the practice …”, can be kept in the back pocket. Definitive statements from authorities like this, and guidelines, are so valuable for the practicing statistician (in certain settings, I should add).

I read RMS perhaps around two years ago and I have been a proponent of this strategy (baseline on RHS), but part of a minority I believe. I must have been reading an old edition because this section has been expanded a lot since. A great bit of text to reference when this argument inevitably comes up again! Thanks a lot.


Hello all, to resurrect an older thread: I was hoping I might ask an additional question inspired by the very good Arif et al. paper cited by @ESMD.

To quote from the paper:

“All of these effects, going on just below the surface, are glossed over in a composite scale score. When a patient receives a HAM-D total score of 25 at the start of trial and later drops to 15, we have absolutely no idea what happened and in what ways that patient improved. With all of the underlying elements pulling on one another and moving in opposite ways, the overall readout from the total score is generally that little movement has occurred at all.”

I’m wondering whether there’s a good hierarchical solution to this problem? For instance, the HAM-D has 17 scored items. Could a distinct treatment effect be modeled for each item with some ‘partial pooling’ towards the overall mean treatment effect? As we would imagine that the treatment effects would be quite correlated across items.

The advantage of a hierarchical model like this might be that you could identify which scale items show the most/least improvement (or worsening), providing much more information about what the treatment is actually doing at a granular level (compared to a summary scale of dubious validity). I would love to hear your thoughts on this as it is a bit beyond my depth.
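To make the partial-pooling idea concrete, here’s a cheap empirical-Bayes sketch of what a hierarchical model would roughly do: shrink per-item effect estimates toward their precision-weighted mean, using a method-of-moments (DerSimonian-Laird-style) estimate of the between-item variance. The item estimates below are invented:

```python
import numpy as np

def partial_pool(est, se):
    """Shrink per-item treatment-effect estimates toward their
    precision-weighted mean, with a method-of-moments estimate of the
    between-item variance (as in random-effects meta-analysis)."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    w = 1.0 / se ** 2
    grand = np.sum(w * est) / np.sum(w)        # pooled 'global' effect
    Q = np.sum(w * (est - grand) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(est) - 1)) / c)  # between-item variance
    shrink = tau2 / (tau2 + se ** 2)           # 0 = full pooling, 1 = none
    return grand + shrink * (est - grand), grand

# invented per-item mean differences for five scale items
est = np.array([-0.6, -0.1, -0.9, 0.2, -0.4])
se = np.array([0.25, 0.25, 0.30, 0.25, 0.20])
pooled, grand = partial_pool(est, se)
# each pooled estimate sits between its raw value and the global mean
```

A full Bayesian hierarchical model would do this jointly (and could model the correlation between items), but even this simple version shows which items stand out after shrinkage rather than by raw ranking.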

EDIT2: Tied myself in a knot trying to figure out how the model might work. Deleted that for your sanity.


I like the multivariate approach, and a weighted average could be obtained from this if desired. GEE modelling was suggested for the case of dichotomous outcomes and contrasted with the composite approach by Mascha, for example, just as you describe.


I would need to see some evidence that the HAM-D was improperly constructed, i.e., that its summary score leaves out important information. The only way I can think of to demonstrate that is to have a large database with excellent follow-up where an ultimate outcome was assessed. Then one could contrast prediction of that ultimate outcome using the earlier HAM-D total with prediction using all of its constituent items.

I’ve had analogous situations where the clinician says: some components are more sensitive to the intervention, they worry the other components will drown these out, and using this reasoning they select a subset of the scales to form a primary outcome. One problem I have with composites is that ultimately even the people who insisted on the conglomeration of measures want to take it apart to understand where the effect lies, and then there’s no power on the individual components and we drown in speculation.

Edit: I was alluding to the quoted text above: “underlying elements pulling on one another and moving in opposite ways”.


Excellent point, because the HAM-D was actually constructed to optimize sensitivity to pharmacologic effects, i.e., without putting patients first.


Thanks for the reference. Although not exactly the same problem, this translated quite well and was very helpful.

There is one aspect of their approach that doesn’t feel totally satisfying. While it seems to make a lot of sense for estimating a global effect, they still relied on a pretty severe p-value correction to estimate item-specific effects. It seems like a random-effects approach could work here: simultaneously estimating a global treatment effect and item-specific deviations from it (with the latter shrunk towards the global treatment effect). Potentially efficient, and with no harsh multiplicity correction?

I used random effects here: https://www.ahajournals.org/doi/10.1161/CIRCOUTCOMES.116.003382 - I don’t produce an estimate across outcomes; I’m not inclined to in those circumstances. But that was time-to-event outcomes, and a more relevant example for you might be the BLAST-AHF study: Biased ligand of the angiotensin II type 1 receptor in patients with acute heart failure: a randomized, double-blind, placebo-controlled, phase IIB, dose-ranging trial (BLAST-AHF). If you look at Figure 2, they translate outcomes to z-scores and then provide a composite overall z-score. Maybe they do this simply to obviate the multiplicity issue that you point to, without limiting to a single outcome.

I like the multivariate approach that you’re suggesting and prefer not to blend outcomes, and then multiplicity remains an open issue. But maybe someday we will start to think that a single primary outcome is unnecessary, e.g. the Bayesians could persuade us, or the clinicians could. I’m currently re-reading Stigler’s small book The Seven Pillars of Statistical Wisdom, about how foolish people were to think averaging across disparate items was senseless - maybe my instinct is crude in the same way.