Recently, I had a discussion with a colleague regarding the use of percentage change as a primary outcome measure in pre-post RCTs. In my experience, it is generally considered a bad idea to use percentage change from baseline as the primary outcome to be analyzed. However, my colleague pushed back with two points:
It is commonly used, especially in trials of weight loss. I was surprised to hear this, but upon looking up the GLP-1 agonist trials, that claim appears to be correct.
It may be more accurate for interpretation purposes. For example, 5 kg of weight loss could mean a lot more to a 100 kg person than to a 150 kg person.
I pushed back by pointing out the loss of statistical power (Vickers), possible interpretation issues (Curran-Everett), and problems related to symmetry (Berry and Ayers, and Cole). I also pointed out that I thought that most of the “advantages” of percentage change could be accomplished with a natural log transformation of either the post score (outcome), the baseline score (covariate), or both (I believe this is mentioned in both Senn’s “statistical issues in drug development” book [section 7.2.6] and Harrell’s BBR section 14.5). In my view, you could interpret the results as “sympercents” or ratios without the symmetry or power issues.
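The symmetry and sympercent points can be illustrated with a little arithmetic (all numbers here are hypothetical, just to show the mechanics):

```python
import math

# Plain percentage change is asymmetric: a doubling and a halving
# do not cancel out.
up = 100 * (2 - 1) / 1      # 1 kg grows to 2 kg: +100%
down = 100 * (1 - 2) / 2    # 2 kg shrinks to 1 kg: -50%, not -100%

# Log-scale changes ("sympercents", per Cole) are exactly symmetric.
log_up = 100 * math.log(2 / 1)    # about +69.3 s%
log_down = 100 * math.log(1 / 2)  # about -69.3 s%

# A treatment coefficient from a model of log(weight) converts
# directly to a ratio (hypothetical coefficient shown):
coef = -0.05
ratio = math.exp(coef)  # roughly 0.951, i.e. about 4.9% lower weight
```

So a log-scale result can still be communicated as a percentage-style ratio, without the asymmetry of raw percentage change.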
My colleague kindly pointed me to Kaiser (1989) and stated that percentage change isn’t that concerning so long as assumptions are checked; my colleague also mentioned that log-transformed results are more confusing to communicate to clinicians. Having read that paper, I’m not sure I come to the same conclusion. My interpretation of Kaiser is that percentage change is often misleading/inefficient, but that in certain scenarios it may not be quite as misleading/inefficient. I’m also not sure I agree that log-transformed variables are more difficult to interpret.
My questions to the forum:
I am left wondering if I am missing something here.
Is there some reason that I am missing that makes percentage change more useful or efficient in pre-post RCT designs? Are there any possible advantages, beyond ease of interpretation, that make percentage change from baseline desirable? Or are there certain edge cases where percentage change is superior to other approaches?
Also, I wonder why large, expensive trials such as those on GLP-1 agonists seem to almost universally report percent change as the primary outcome of interest? Is this out of convenience or do they know something I don’t?
I think you’re spot-on for the issues of percentage change (PCHG) and I haven’t found a valid reason for using it as a primary endpoint.
It might be helpful to demonstrate, with your trial data or other publicly available data, why PCHG won’t work, as described in the post below:
i.e., check model performance (adjusted R-squared values) for an ANCOVA model using the following endpoints: 1) PCHG, 2) a natural-log-transformed endpoint, and 3) the raw data. You might find that approaches 2 and 3 give reasonable R-squared values while approach 1 is close to 0.
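As a rough sketch of that comparison, here is a minimal simulation in Python. Everything is invented (sample size, baseline distribution, treatment effect, noise), so treat the output as illustrative, not as a claim about any real trial:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical pre-post trial: baseline weight, 1:1 randomization,
# post strongly tracks baseline with a modest treatment effect.
baseline = rng.normal(100, 15, n)
treat = rng.integers(0, 2, n)
post = 0.9 * baseline - 5 * treat + rng.normal(0, 8, n)

def adj_r2(y, X):
    """Adjusted R-squared from an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    n_obs, p = X1.shape
    return 1 - (ss_res / (n_obs - p)) / (ss_tot / (n_obs - 1))

pchg = 100 * (post - baseline) / baseline

# Endpoint 1: PCHG; endpoint 2: log post with log baseline covariate;
# endpoint 3: raw post with baseline covariate (ANCOVA).
r2_pchg = adj_r2(pchg, np.column_stack([treat, baseline]))
r2_log = adj_r2(np.log(post), np.column_stack([treat, np.log(baseline)]))
r2_raw = adj_r2(post, np.column_stack([treat, baseline]))

print(f"PCHG: {r2_pchg:.2f}  log: {r2_log:.2f}  raw: {r2_raw:.2f}")
```

Under these assumed parameters the log and raw endpoints explain most of the variance while the PCHG endpoint explains very little, which is the pattern described above.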
Great question! As a statistician who will likely have to design a weight loss trial, I have been wondering about exactly the same issues, and I completely agree with you.
The only answer I could come up with was that sponsors were following the FDA “Guidance for Industry Developing Products for Weight Management” in which percent change from baseline is specifically mentioned as the preferred primary endpoint (see Section IV.B.3.). Would be great with some input from a statistician from Novo Nordisk or Eli Lilly though.
As discussed here, the percent change from baseline used to define “response” in solid tumor oncology trials is clinically inappropriate and unintuitive to many of us physicians. A tumor that grows from 1 cm to 2 cm in max diameter is very different from one that grows from 5 cm to 10 cm. Our clinical response to the latter will often be very different from our response to the former, even though both correspond to a “100% increase from baseline”.
Why is this even an issue? You can do the analysis in whatever way you want (and I hope you’d choose the most statistically powerful method, like ANCOVA) and then convert the results to a percentage change for additional clinical illustration. For instance, if the mean baseline symptom score is 10, the mean posttreatment score in the control group is 8, and the difference between groups (from ANCOVA) is 2, then you could say “These results are equivalent to a 20% reduction of symptoms in the control group compared to a 40% reduction in the treatment arm”.
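For concreteness, the conversion in that example is just simple arithmetic applied to the summary statistics (the numbers below are the ones from the example):

```python
# Summary statistics from the worked example above.
baseline_mean = 10.0
control_post_mean = 8.0
ancova_diff = 2.0  # between-group difference favoring treatment

# Implied treatment-arm mean, then express both arms as
# percent reduction relative to the baseline mean.
treat_post_mean = control_post_mean - ancova_diff
pct_control = 100 * (baseline_mean - control_post_mean) / baseline_mean
pct_treat = 100 * (baseline_mean - treat_post_mean) / baseline_mean

print(f"control: {pct_control:.0f}%  treatment: {pct_treat:.0f}%")  # 20% vs 40%
```

The key point is that the percentage is computed once, from group-level summaries, rather than patient by patient.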
Very well put. This relates to one of the largest quantitative thinking errors I see among clinical researchers. Too many of them think that the way you want to ultimately summarize the results needs to also be the way that individual patient responses are computed.
A corollary issue is that % change should be computed on summary measures, not raw data.
Interestingly, the way results are often summarized in manuscripts is also not the way we typically process information in the clinic. We would revolt if health records changed from showing a patient’s actual age to merely “less than or greater than 65”. And I simply cannot remember the last time I assessed (outside of following clinical trial stipulations) tumor response in a patient based on percent change from baseline as opposed to considering the actual tumor measures.
Note however that using change of tumor burden from baseline is something that we do when caring for a patient and is compatible with how observational studies (but not RCTs) would be analyzed.
As a patient advocate and life-long student of clinical research in cancer trials, I’ve come to appreciate that improved survival and/or quality of life are the endpoints that matter. How might this apply to weight loss trials? Perhaps the study would compare similarly aged groups with similar BMI for living longer and better (the latter determined by comparing ePROs).
Weight loss change from baseline with QoL comparisons seems reasonable as a surrogate, provided QoL outcomes are equivalent or improved for the study drug. FDA licensing could be conditional on the surrogate, requiring some predetermined difference, or no difference, in OS and QoL. Again, for emphasis, a lay-person’s opinion, and I’m not wedded to it.
I agree, Karl! It would be interesting to know what the patient-centered outcomes are, across a panel of participants in a weight-loss trial. I wonder if these might even lend themselves to being treated with the kind of ordinal outcomes that @f2harrell advocates. Perhaps there is some natural sequence of beneficial health outcomes that trial participants might ‘acquire’ as their weight loss proceeds successfully?
Change from baseline is a particularly bad metric for weight loss, since people lose weight in a way that is halfway between relative and absolute. Also, bad clinical outcomes are predicted by current weight, not recent weight change. So weight should be the outcome variable, adjusted for height and initial weight through covariates.
Yeah, my starting point is always to use ANCOVA. I’ve also argued the point about computing percentage change from the summary statistics, but I still get pushback on that front too.
On the point of “why is this even an issue”, I’d guess you’d have to ask my colleague or even the majority of research teams in this area since it is so common.
I’m asking this question on this forum because I assumed there might be some information I’m missing. Based on the responses here, I’d now assume there isn’t really a good statistical argument for using percentage change; it is probably reported as the primary endpoint because of the FDA guidance that was shared here (thank you @jesoje!).
I’ve been thinking about this exact issue for some time now, as my background is initially as a clinician before gaining experience in applied statistics, and I’ve come to the same conclusion. This is simply a matter of precedent (set by both the FDA and clinical practice), and there isn’t a good statistical reason for it as far as I can tell.
Interesting to consider how the Gell-Man Amnesia Effect might apply. We’re all aware how badly FDA can bollux up stuff (several of my own favorite examples here and here) within our field of expertise. But what if everything the gubmint does outside our fields of expertise is equally bonkers?
Are there published references on this approach? I ask because colleagues have questioned its validity, even though I have demonstrated that one can easily convert the estimated treatment effect from an ANCOVA model into a percentage interpretation, and that this approach is reliable when tested with various means and SDs via simulation, whereas a PCHG endpoint repeatedly showed poor performance and incorrect results. The ANCOVA-to-percentage conversion feels like simple math, but surprisingly this may not seem the case, even to some statistician colleagues, unless I can show that the approach is well documented in a textbook or publication. Any advice would be appreciated. Thank you.