Exemplary QOL analyses that avoid change-from-baseline blunders?

The BBR Course Notes §§14.4–5 thoroughly impeach change-from-baseline analysis. I had previously avoided the practice simply on regression-toward-the-mean grounds and because @f2harrell said so. But I find myself really astounded to (re?)discover how comprehensive and multifaceted a case can be made against it!

That said, I think sometimes the best way to discourage bad practices is to provide positive examples of the good practice. This may be especially true when we are reaching out to non-statisticians who may have real, pressing interests in promoting sound statistical analyses. (FYI, this question is prompted by this petition from lymphoma patient advocate Karl Schwartz.)

Can anyone point to exemplary analyses of QOL, especially in the clinical trials literature, that avoid change-from-baseline pitfalls?

10 Dec 2019 Addendum

In case it helps focus this question a bit better, I’d like to articulate formally the aims of Karl Schwartz’s petition linked above, as I’ve come to understand them after some further discussion with him. As I now understand, the problem originates with the use of surrogate endpoints such as progression-free survival (PFS) in regulatory decision-making. The reliability of such surrogates as predictors of meaningful clinical benefit is debated. From a biostatistical perspective, a natural remedy for such a ‘noisy’ univariate outcome would be to augment it to obtain a multivariate outcome that incorporates other information as well. The petition asks specifically that QoL information derived from patient-reported outcomes (PROs) be incorporated together with PFS in regulatory decisions. Interestingly, this placing of PFS and QoL on a similar footing (i.e., as components of a multivariate outcome) further underscores the correctness of comparing QoL endpoints between arms, as opposed to QoL ‘changes’ from baseline in individuals.


The ISCHEMIA study QOL analyses led by John Spertus are role models for how to do this, avoiding change from baseline and using Bayesian models to get more interpretable and accurate results. We’ll have to wait for a paper or preprint to see analysis details.

I have a perfect example of how a proper analysis yields a much different (and more correct) result, but it uses proprietary data so I can't show it. Here's a description instead. The Hamilton-D depression scale is often used in antidepressant drug development. Change from baseline to 8w is a very common outcome measure even though (1) this disrespects the parallel-group design and (2) differences in Hamilton-D were never shown to be valid patient outcome measures that can be interpreted independently of baseline.

I re-analyzed a pharmaceutical industry trial using the best available statistical approach: a distribution-free semiparametric proportional odds ordinal logistic model on the raw 8w Ham-D value, adjusted flexibly for baseline Ham-D by including a restricted cubic spline with 5 default knots. Here are the results and ramifications:

  • Unlike the linearity required for valid use of a change score, the relationship between baseline Ham-D and 8w Ham-D was highly nonlinear. The shape was similar to a logarithmic function, i.e., very high baseline Ham-D can be “knocked down” and become a moderate or low 8w Ham-D. There was a flattening of the curve starting at Ham-D=22.
  • This function shows that an excellent therapeutic effect may be obtained in severely depressed patients.
  • It also shows that an average change score, which assumes not only linearity but a slope of 1.0, is highly misleading. The excellent potential for a high baseline Ham-D to get lower will be averaged in with the smaller changes seen in patients with lower baseline Ham-D. The resulting average change from baseline underestimates the treatment effect in severely depressed patients and overestimates it in mildly depressed ones. The average change from baseline may not apply to any actual patient.
  • The proper ANCOVA respects the goal of the RCT: if patient i and patient j both start with a Ham-D of x but are randomized to different treatments, what 8w Ham-D are they likely to experience?
  • ANCOVA using ordinal regression completely handles floor and ceiling effects in patient outcome scales.

The point that an average change score may not apply to any actual patient should prove immediately accessible and intuitively appealing to the patient-advocate perspective that motivated my question. I can't think of a more devastating challenge to a biostatistical procedure, in fact, than that it bears no relation to the concerns, outcomes, etc., of any individual patient. I conjecture that one could make this the axis on which a comprehensive critique of change-from-baseline turns.


This can be said for any endpoint (PFS, OS, response), right?
To clarify my advocacy:

  • PRO instruments standardized to compare fatigue, physical pain,… overall QoL across studies
    (so the subjective experience is reported directly by the patient (apart from stressful consults),
    not misinterpreted or under-reported by study doctors)

    • in future: included in all studies with ePRO
      to improve safety (timely reporting of changes that require supportive or ER care);
      a study to be referenced later showed improved OS associated with this practice.
  • As a required consideration in studies using surrogates (PFS) as the primary endpoint.
    Especially relevant to RCT tests of continuous-use and maintenance protocols
    (Is the potential gain in PFS (often modest) worth the additional toxicity experienced on a daily basis?)

  • Not typically to be used in isolation as the basis for approval.

  • Cutoffs for clinically significant differences in PRO scores.
    … to inform regulatory and clinical decision-making.

  • Reporting of PROs should be standardized, and done in a way that fosters patient understanding as a decision aid. Example: Do I want to pay for an expensive treatment (in some cases bankrupting my family) to possibly gain x months, when the treatment is unlikely to improve (or may impair) my QoL on a daily basis?


Based on my understanding, all of those things apply, and none of them are satisfactorily addressed by computing change from baseline. Gain in Y needs to come from comparing Y in treatment A patients to Y in treatment B patients, adjusted for baseline to gain power/precision.


@davidcnorrismd, I felt that the ARAMIS trial did a great job analyzing QOL outcomes.


They did this:

The secondary end points were overall survival, time to pain progression (defined as either an increase of ≥2 points from baseline in the score assessed with the BPI-SF questionnaire or initiation of opioid treatment for cancer pain, whichever occurred first)

This is very problematic, although challenging to deal with optimally when the score is combined with other endpoints.

For quality-of-life variables, an analysis of covariance model was used to compare the time-adjusted area under the curve (AUC) between groups, with covariates for baseline scores and randomization stratification factors. The least-squares mean and 95% confidence interval was estimated for each group and for the difference between the groups.

The first part of this is good, provided change from baseline is not computed. But assuming Gaussian residuals in an ordinary linear model for this kind of response variable is tenuous. The variable calls for an ordinal regression.