Changing the "Status Quo" analyses for Regulatory Submission - Change from Baseline

daszlosek · June 2, 2022, 2:14pm

I have been consulting with a company that plans to use change from baseline for the primary endpoint.

I have shown them multitudes of studies (many from: Common Statistical Myth - Change) that show change from baseline requires a lot of assumptions and have even shown some of my own simulations from their data on the value of using raw scores over change from baseline.

I have gotten some pushback from their lead statistician, who is also a former FDA statistician. Their arguments are as follows:

1. Using change from baseline has been the industry standard for 25+ years and any deviation will be a red flag to the FDA.
2. Not using change from baseline would mean this study is not “as directly comparable” to all the previous studies that have used change from baseline as the primary endpoint.

Now, most of my work is on the non-regulatory front, but both of these arguments seem weak to me. At the same time, I don’t want to create a mountain of additional work during the submission process, and I understand where the lead statistician is coming from.

Has anyone gone through the process of changing the “status quo” of the main analysis used during submission?

What pushback did/have you received from regulatory bodies?

ADAlthousePhD · June 2, 2022, 3:58pm

Sorry to hear this, @daszlosek. I have precious little experience working through the FDA regulatory process, so I’m afraid that I can offer little constructive commentary beyond things you already know, and of course I’m always open to having my mind changed if something I say below is wrong. But…

First, try Frank’s intuitive explanation about the quantity of interest being the difference between the groups after treatment, using adjustment for baseline score, so the statistical model essentially asks: “for 2 patients with the same baseline score, one receiving treatment and the other placebo, what is the expected difference in their follow-up scores?”

Second, send the BMJ Stats Notes paper explaining that raw-score-adjusted-for-baseline has increased power versus using change scores. (EDIT: and this paper: “National Institutes of Health Stroke Scale as an Outcome in Stroke Research: Value of ANCOVA Over Analyzing Change From Baseline” - PubMed (nih.gov).)

Third, use simulations to show performance advantages of this approach over using change scores (this ought to be more convincing than it actually seems to be in practice; unfortunately it seems an appeal to authority, like #2, is often more effective than actually illustrating something yourself).

You have probably done all that, unfortunately, and still meet resistance. My sympathies. I do wonder if other statisticians or consultants with regulatory experience will be able to chime in, though perhaps they cannot always reveal details depending on the nature of their work and agreements with clients.

Both of the arguments you listed are indeed weak - neither is a scientific or statistical justification; both are simply arguing to do it because “this is the way we’ve always done it” (and while sometimes there may be a good scientific or statistical reason for “this is the way we’ve always done it” - then that’s the argument we should hear, not simply “this is the way we have always done it” by itself).

The second point is particularly frustrating to me, perhaps more than the first. Why is it a requirement for a new study to be “comparable” to a previous study? The goal ought to be designing a new trial to achieve its objective(s) as efficiently as possible. Saying that any new study must use a particular statistical approach just to be “comparable” to previous studies is ridiculous (by that standard, every new trial must be designed the same as its predecessor trials, and we may as well cease research into any new statistical methods for clinical trials).

I, too, am sympathetic (if frustrated) that the attitude from a company/company’s statistician would be the conservative “We want our study design to be accepted by FDA and any other reviewers without too many arguments, so we’d rather be less efficient but guarantee that our drug will be approved if it meets the primary endpoint” rather than proposing something even slightly “different” (how amusing that even a fairly simplistic analysis is perceived as too risky). I have not done much consulting for industry sponsors, but in the little experience I have, those vibes are very strong - would rather use a statistical design that they already know FDA is accepting of. The endgame is to guarantee approval. Potentially needing fewer patients (and therefore saving a few dollars on the trial) seems to be viewed as a relative pittance if it puts the approval at risk for any reason.

It’s not a regulatory setting, but in one trial that I am currently working on, our first meeting with the DSMB had several similar comments from the DSMB statistician. We will have a baseline measure and then a 6-month assessment and our statistical analysis plan is, effectively, this:

Follow-Up Outcome = b0 + b1*(Treatment) + b2*(Baseline Measure)

b1 will tell us the treatment effect on follow-up outcome, adjusted for baseline. Simple, right?

The DSMB statistician asked why we were not doing a t-test comparing the change scores between groups, and despite my best efforts to explain the above, seemed unconvinced. This person did not really have authority to mandate changes to our statistical plan, but it was frustrating to encounter a dogmatic, outdated view. Frank has previously noted that perhaps some statisticians are resistant to change because it feels like critiquing their past selves - for which I have some mild sympathy, but it does not seem a good reason to demand others follow your preferred approach if the alternative has a strong statistical justification, as would be the case here.
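To make the contrast concrete, here is a minimal R sketch that fits the baseline-adjusted model above next to the change-score comparison on one simulated data set. Everything in it is hypothetical and chosen for illustration only (the sample size, the effect of 3, the baseline slope of 0.5, the variable names); none of it comes from the actual trial.

```r
# Hypothetical simulated trial; none of these numbers come from a real study.
set.seed(1)
n         <- 2000
treatment <- rep(c(0, 1), each = n / 2)        # 1:1 allocation
baseline  <- rnorm(n, mean = 50, sd = 10)
followup  <- 10 + 0.5 * baseline + 3 * treatment + rnorm(n, sd = 8)

# Planned analysis: follow-up outcome adjusted for baseline (ANCOVA)
ancova    <- lm(followup ~ treatment + baseline)
b1        <- coef(ancova)[["treatment"]]
se_ancova <- sqrt(vcov(ancova)["treatment", "treatment"])

# Alternative: compare change scores between groups.
# lm() with a single binary predictor is equivalent to the
# equal-variance two-sample t-test on the change scores.
change     <- followup - baseline
chg_fit    <- lm(change ~ treatment)
est_change <- coef(chg_fit)[["treatment"]]
se_change  <- sqrt(vcov(chg_fit)["treatment", "treatment"])

print(c(b1 = b1, se_ancova = se_ancova,
        est_change = est_change, se_change = se_change))
```

Under these assumed settings the residual SD is 8 for the adjusted model versus sqrt(0.25 * 100 + 64), roughly 9.4, for the unadjusted change score, so both analyses target the same treatment effect but the adjusted one should show the smaller standard error.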

Anyways, happy to hear more educated opinions from others, and apologies that I could offer mostly just “commiseration” rather than constructive suggestions.

daszlosek · June 2, 2022, 8:34pm

Thank you for the additional reference! I have shared some literature (including Vickers’ excellent 2001 article and Altman’s 2015 article on the topic) and done simulations to show these issues more explicitly, with no luck persuading the team.

It is a little comforting to know there are others who have faced similar difficulties in the regulatory space as I am a bit crestfallen about the situation.

R_cubed · June 3, 2022, 1:33pm

I would have thought the Bland and Altman citation, using elementary algebra to prove that $\alpha$ is inflated, would terminate the dispute in favor of the correct method.

But from a business POV, if a regulator is going to let you get away with a procedure that inflates $\alpha$, why would you use anything more rigorous?

davidcnorrismd · June 3, 2022, 2:20pm

FDA has produced guidance on interacting with them around “complex and innovative designs.” Even if you might be embarrassed to describe eschewing change-from-baseline as “complex & innovative,” still you might find hints here at how to depart from 25+ years of standard practice.
https://www.fda.gov/regulatory-information/search-fda-guidance-documents/interacting-fda-complex-innovative-trial-designs-drugs-and-biological-products

timdisher · June 3, 2022, 6:20pm

Am I misremembering that the sneaky trick here is to use change from baseline adjusted for baseline? I can’t remember where I read this, so I just did a quick sim in R, and I get identical standard errors while maybe still keeping the FDA happy?

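```r
library(dplyr)  # for %>%
set.seed(42)

# 10,000 simulated two-arm trials; each returns one row of estimates
sim <- purrr::map_df(1:10000, ~ {

  n <- 100
  trt <- rbinom(n, 1, 0.5)               # random treatment assignment
  te <- 1                                # true treatment effect
  b.base <- 0.5                          # slope of follow-up on baseline
  y1 <- rnorm(n, 5, 4)                   # baseline
  y2 <- rnorm(n, y1*b.base + trt*te, 4)  # follow-up

  chg <- lm(I(y2-y1) ~ trt)           # change score, unadjusted
  adj <- lm(y2 ~ trt + y1)            # follow-up adjusted for baseline
  chg.adj <- lm(I(y2-y1) ~ trt + y1)  # change score adjusted for baseline

  # Extract the treatment estimate and its standard error from each model
  purrr::map(dplyr::lst(chg, adj, chg.adj), ~ {
    coef <- coef(.)[["trt"]]
    se <- sqrt(vcov(.)["trt","trt"])
    data.frame(
      est = coef,
      se = se
    )
  }) %>% dplyr::bind_rows(.id = "model") %>%
    dplyr::mutate(bias = est - te) %>%
    tidyr::pivot_wider(names_from = model, values_from = c(est:bias))

})

colMeans(sim)                     # mean estimate, SE, and bias per model
sim %>% dplyr::summarise_all(sd)  # empirical SD across simulations
```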

Output:

```
    est_chg      est_adj  est_chg.adj       se_chg       se_adj   se_chg.adj     bias_chg     bias_adj bias_chg.adj 
 1.001520237  1.001227702  1.001227702  0.897612443  0.806282133  0.806282133  0.001520237  0.001227702  0.001227702
```
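The identical estimates and standard errors for `adj` and `chg.adj` follow from standard linear-model algebra. The short derivation below is an added sketch, not part of the original post; the symbols $\beta_0, \beta_1, \beta_2, \sigma^2$ and $T$ (treatment indicator) are notation introduced here, with $\beta_2 = 0.5$, $\operatorname{Var}(Y_1) = 16$ and $\sigma^2 = 16$ taken from the simulation settings.

```latex
% Subtracting Y_1 from both sides only shifts the baseline slope by 1:
\begin{align*}
Y_2       &= \beta_0 + \beta_1 T + \beta_2 Y_1 + \varepsilon \\
Y_2 - Y_1 &= \beta_0 + \beta_1 T + (\beta_2 - 1)\, Y_1 + \varepsilon
\end{align*}
% Same residuals, hence the same \hat\beta_1 and the same standard error.

% Dropping Y_1 from the change-score model pushes (\beta_2 - 1) Y_1
% into the error term:
\begin{align*}
\operatorname{Var}(\text{error, unadjusted change})
  &= (\beta_2 - 1)^2 \operatorname{Var}(Y_1) + \sigma^2
   = 0.25 \cdot 16 + 16 = 20 \\
\frac{\mathrm{SE}_{\text{chg}}}{\mathrm{SE}_{\text{adj}}}
  &\approx \sqrt{20 / 16} \approx 1.12
\end{align*}
```

That is in line with the simulated ratio 0.8976 / 0.8063, which is about 1.11. The equivalence relies on the baseline entering the model linearly with an untransformed outcome.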

f2harrell · June 3, 2022, 8:40pm

That sets a bad precedent IMHO, because it only works in the very special case of a linear model with perfectly transformed Y and a linear relationship between X and Y.

ChristopherTong · June 4, 2022, 5:02am

Two of FDA’s own statisticians studied this issue at length and presented their findings at JSM 2009 and in this 2011 paper:

Entertaining footnote: after the authors’ JSM 2009 presentation, two people from the audience offered comments. Frank @f2harrell pointed them to Stephen Senn’s papers on this topic, and I pointed them to my then-colleagues’ work at Merck (Liu et al, 2009) which was either in press or freshly published at that point. In the authors’ 2011 paper’s acknowledgments they thank “two anonymous audiences at the JSM” for making them aware of the Senn and Merck papers.

pmbrown · June 8, 2022, 10:17am

“Former FDA” doesn’t mean anything; he/she now represents the company. You just need good people at the FDA; that’s where pushback will come from (I believe Frank Harrell is now advising there). Unfortunately, as a consultant you have to heed the advice of your client. It’s a tough position to be in. You might notice that when you raise an issue, the other consultants in the room keep quiet; that’s what I noticed when consulting. They need the work; they’re self-employed. The problem you have is that the very best argument to persuade the client is to say “it’s what the regulators want/expect” (not a technical statistical argument), but the statistician at the company has defused that argument.

f2harrell · June 9, 2022, 12:10pm

It would help to figure out what influences people the most, e.g. to show that change from baseline

- loses power
- is more difficult to interpret
- does not handle ordinal Y
- requires a second baseline if the measurement is used for study qualification

A consultant should at least go down kicking and screaming.

pmbrown · June 11, 2022, 7:33pm

In industry (maybe not elsewhere), my experience suggests that all of these will fall flat; it will be implied that the statistician is missing some bigger picture, that they’re being pedantic or a purist, and that there is a higher-level understanding that acknowledges ‘the real world’ (which is never made too explicit). E.g., 20 years ago people would almost virtue signal when explaining why they offered LOCF in the SAP, i.e. they had a more enlightened, encompassing view that acknowledged the doctor’s inability to comprehend complex stats. Statisticians, when presenting to other company statisticians, would begin with an apology: “I know you won’t like the stats but … it was important for the clinical people”. Statisticians are just as venal as the rest.

But people turn on a dime when I’ve said “the FDA won’t like it”, and I speak from experience: I’ve seen a whole room change direction with that remark. In the more serious cases (when someone has conjured up some idea for an analysis that necessitates the disappearance of some SAEs) we must go for the most efficacious argument, whatever works. I worry that if a statistician begins with technical arguments, the others in the room won’t just stop listening to the arguments, they’ll stop listening to the person. On the other hand, sometimes a statistician will produce a neat turn of phrase, something simple and concise but devastating. When I hear such a phrase I write it down. E.g., I can’t give the context, but I recently heard a statistician declare in conclusion: “the p-value has lost its interpretation”. It was powerful for a statistician to say something so simple, leaving no room for misunderstanding or disagreement.

f2harrell · June 11, 2022, 11:35pm

Paul, I think those are very cogent observations, and the LOCF (last observation carried forward) reference is a great example. It took a long time, but we’ve almost gotten rid of LOCF. But I think it’s also an example of a missed opportunity to use a purely technical angle that, in my humble opinion, would have convinced everyone years ago: the type I assertion probability $\alpha$ is not even close to being preserved when you use LOCF. It’s funny how many things we simulate and learn don’t work, but here’s a case where no one seems to have done a simple simulation a long time ago to study $\alpha$. They would have found that the standard error of the treatment effect was way underestimated when using LOCF.

So, going back to the 4 angles I listed above, aren’t you discounting technical arguments just a bit too much?

ChristopherTong · June 12, 2022, 10:53pm

Two points:

First, to anyone who tries the “FDA won’t like it” argument, show them the paper I posted above. In the paper, FDA statisticians considered 5 ways to handle baseline data. They recommended 2 of them, neither of which corresponds to the naive “change from baseline” analysis. While this does not decisively refute the “FDA won’t like it” argument, it shows that they may be more open-minded and thoughtful than we presume.

Second, I do not think technical statistical arguments should always rule the day. The clinical, regulatory, and business context do matter and must be considered. In some cases technical arguments should prevail and should be pushed very hard; however, in other cases, doing so is simply reckless. Statisticians should seek to avoid one-size-fits-all, context-free thinking. We should certainly be familiar with the technical drawbacks of “change from baseline”, but also understand the context of a specific study and exercise judgment on a case-by-case basis.

pmbrown · June 13, 2022, 8:40pm

Yes, I quite like simulations for persuasion, e.g. Senn’s illustration of responder analysis; the plot is compelling. I guess it seemed too much trouble to write the code back then, but things have become easier.

I was happy consulting in academia because the anticipation of peer review was enough to make them heed the advice of a stats expert; they mostly just accepted my recommendation. But in industry the veracity of some analysis may be considered beside the point: the client says “we want it for our marketing materials”, and your boss wants to appease the client. I’ve refused to produce analyses (in academia and in industry) and refused to sign off on others. But I’m not sure this is discussed much among statisticians. Sometimes I have been a shoulder to cry on for some statistician who has been asked to run a rigged analysis; they whine over lunch and then trudge back to their desk to start on the code. They’re probably told it’s just a “sensitivity analysis”, that they “just want to see what it looks like”.

Edit: I think it’s important to notice the manoeuvre made by the company statistician when he/she says it’s “the industry standard for 25+ years” (from the OP above). They’re not implying that it is likely a correct analysis due to its vintage; they are saying it is a “correct” analysis. It’s a subtle manoeuvre: a “correct” analysis can then mean whatever is expedient, not what is literally correct.

f2harrell · June 15, 2022, 6:57pm

Context-free thinking should certainly be avoided but I have yet to see an example other than a single arm pre-post design in which change from baseline was the right thing to do.

ChristopherTong · June 17, 2022, 1:52am

For many years I would have said the same, and certainly most of the time it remains true. However I’ve learned that this is a topic worthy of rethinking afresh each time it’s encountered. Sorry that I can’t share more (proprietary).

f2harrell · June 18, 2022, 10:51am

If you come across a compelling example arguing for change from baseline in a parallel group study I’d be shocked.

davidcnorrismd · September 1, 2022, 8:34am

Been thinking about this remark lately… What about transforming Y before calculating the change? Has anyone examined searching for an optimal transformation that renders “change from baseline” as harmless as possible? If most of the damage can usually be undone by transforming Y to a scale where ‘change’ is valid, then perhaps people’s understandable desire to talk about “change” (and to use subtraction to calculate this) can be accommodated without major harm?

Looked at another way, maybe this recalcitrant insistence on change-from-baseline amounts to an implicit request to the statisticians to do precisely this — search for a scale for Y that makes ‘change’ a valid concept.

f2harrell · September 1, 2022, 12:06pm

Yes that is covered in the BBR chapter on change. It’s not always successful and no one has derived the variance penalties needed when you do not pre-specify the transformation. Transformation uncertainty certainly will affect inference.

davidcnorrismd · September 1, 2022, 1:30pm

I enjoyed re-reading the chapter! I suppose I’m addressing point #4 listed in §14.4.1, although not (as there) specifically scoped to randomized studies.

Doesn’t transformation uncertainty apply no less to the ‘default’ (identity) transformation, and indeed to all other uncertainties about model specification? Why shouldn’t a search for a nonparametric monotone transformation of Y be a regular part of data-driven modeling? If we confine ourselves to a handful of off-the-shelf model forms, then learning what transformed $\mathcal{Y}$ works best with these might be a legitimate part of communicating with the end-users of statistical analyses.