FDA "Bad Habits"

A February 21, 2023 Wall Street Journal editorial titled “The FDA Returns to Its Bad Habits” discusses a randomized trial of a medication to treat Friedreich’s Ataxia sponsored by Reata Pharmaceuticals Inc. In the WSJ article, the effect of the drug in 103 patients in the trial was described as “a statistically significant improvement” in patients treated with the new drug compared with placebo after 48 weeks of treatment.

The FDA Returns to Its Bad Habits - WSJ (sorry but this is paywalled and I can’t find a link to get around the paywall).

The editorialist decries an FDA staff assessment of the results of the trial as “not exceptionally persuasive” based on the following description of how the FDA makes its decisions.

The agency traditionally requires a statistical measure called the p-value—the probability of obtaining a result by chance—to be less than 0.01 to approve a drug based on a single study. Reata’s p-value was 0.014, which means that there was a 1.4% chance that the positive result was a fluke.

The editorialist points out that the 1.4% chance of the positive result being a fluke was:

…still small.

Comments on this interpretation of how the FDA allegedly interprets a p-value might be of interest.


The FDA would never interpret a p-value that way (the “chance it was a fluke” can only be assessed if you switch to a Bayesian method). WSJ needs to get its act together. But there is often an opinion at FDA favoring α = 0.01 for approval based on a single study. Even then, you’ll see common sense prevail when p is just above 0.01.


This seems to be the study that generated the p-value you cited:


The primary outcome was change from baseline in the mFARS score in those treated with omaveloxolone compared with those on placebo at 48 weeks.

The mFARS score is a 99 point scale.

“One hundred fifty‐five patients were screened, and 103 were randomly assigned to receive omaveloxolone (n = 51) or placebo (n = 52), with 40 omaveloxolone patients and 42 placebo patients analyzed in the full analysis set. Changes from baseline in mFARS scores in omaveloxolone (−1.55 ± 0.69) and placebo (0.85 ± 0.64) patients showed a difference between treatment groups of –2.40 ± 0.96 (p = 0.014) .”

“…the magnitude of observed improvements is equivalent to approximately 2 years of FA disease progression, as judged by similar cohorts based on age in a natural history (FACOMS) study.”

Studying rare, degenerative diseases is hard; I don’t envy the researchers. Families are often desperate for a treatment, yet recruiting a sufficient number of patients to run a trial can be very difficult. Furthermore, it might only be possible to assess the impact of a test drug in patients who are at certain stages of their disease. For example, once patients are no longer ambulatory, assessing the impact of the drug on physical functioning might become more difficult.

Another pitfall of small clinical trials is that important, but uncommon adverse events might not be captured. As a result, it can be hard to estimate any kind of “cap” for the risk of important/life-threatening AEs (e.g., important drug-induced liver injury). And if there is also a question about the clinical (as opposed to statistical) significance of the drug’s effect, the risk/benefit balance can become very difficult to assess.

I don’t have any specific comments about the study, other than to note that a “change from baseline” primary endpoint might not have been optimal (?)

As for the WSJ Editorial board, they need a refresher on the meaning of p-values:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4877414/ (esp. #2 and #3 under Common Misinterpretations of Single P values)

These interesting articles describe the history of evidentiary standards for approval of new drugs by FDA:



This document discusses situations in which the “two trials with p<0.05” standard might be viewed more flexibly:

https://www.fda.gov/media/133660/download (see Statistical Considerations when the disease is rare).

A commentary on the “two trial” rule:


“In Statistical Issues in Drug Development I put the following possible arguments in favour of two trials.

The first is that we consider that the results are ‘unpoolable’: different things are being answered by the two trials and there is no common question which can be answered by both,…The second is that we fear some sort of ‘random gremlin’ which affects trials and occasionally leads to spectacularly low but unrepeatable P‐values which reflect not so much the effect of treatment but that the trial has ‘gone wrong’

Thus using two trials gives us the opportunity to study different aspects of efficacy and is some sort of protection against random gremlins. It provides a measure of robustness. However, this in turn implies that sponsors ought to be varying aspects of design. In many cases, however, they seem to be using very similar designs…”


Thanks to @ESMD for the historical references regarding FDA thinking. Of possible related interest is this note by some FDA authors:

An interesting passage:

The p-value is often defined as the probability of observing data as extreme or more extreme than those observed, assuming that a specified (null) hypothesis is true. A small p-value says only that observing data as or more extreme than those observed is unlikely, assuming the null hypothesis model is valid. Such a piece of information may allow a researcher to reject the null hypothesis, nothing more. Appropriate action upon rejecting a null hypothesis depends on the context of the finding, the study’s design, and especially, the relevance of the null hypothesis to specific research goals. In other words, a p-value may be calculated precisely but clinical relevance calls on other disciplines and other sources of data.
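The quoted definition can be made concrete with a small simulation (a sketch: the per-arm sizes and SD are loosely taken from the trial report, everything else is illustrative). When the null hypothesis is true, p ≤ 0.014 occurs in about 1.4% of trials by construction; that is a statement about the data assuming the null, not the probability that a given positive result is a “fluke.”

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_per_arm, sd = 20_000, 41, 3.5  # sizes/SD loosely based on the trial

# Simulate trials in which the null is true: both arms share one distribution.
p_values = np.empty(n_trials)
for i in range(n_trials):
    a = rng.normal(0.0, sd, n_per_arm)
    b = rng.normal(0.0, sd, n_per_arm)
    p_values[i] = stats.ttest_ind(a, b).pvalue

# Under the null, P(p <= 0.014) is 0.014 by definition; the p-value alone
# cannot tell you the probability that an observed result is a fluke.
print(np.mean(p_values <= 0.014))  # close to 0.014
```

Getting from “1.4% of null trials look like this” to “1.4% chance this trial is a fluke” requires a prior probability that the drug works, which is exactly the Bayesian step the editorial skipped.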


So it seems the FDA is getting it mostly right and probably would not suggest that a p-value threshold of 0.01 dichotomously distinguishes “flukes” from “non-flukes.” Wonder how to educate the editorialists? Or is this simply an attempt to sway public (and investor) opinion in order to influence the drug-review process in favor of approval, in which case, cynically, the editorialist doesn’t want to be educated.

Without more familiarity with the mFARS measure of progression, the claim that the magnitude of improvement is “equivalent to approximately 2 years of FA disease progression” is hard to judge. An understanding of this measure of benefit may drive the “clinical relevance” assessment.

For devastating conditions with no treatments, the job of the FDA (and of drug developers) is very difficult. And the desperation of patients and families is palpable.


It’s a very small trial, with a very high proportion excluded from the analysis:

> A full analysis set (FAS) was used for primary analysis of efficacy and limited to patients without pes cavus who had at least one postbaseline measurement. Safety analyses included all randomized patients (ARP).

There were 40+42 non pes cavus patients, and only one exclusion due to lack of postbaseline measurement. The trial registration doesn’t mention excluding pes cavus from the primary analysis (it doesn’t mention pes cavus at all): NCT02255435. That decision seems to have been based on part one of the study (reference 17), which is described in the same trial registration:

> Randomization was generated using a centralized interactive web response system, and stratified by pes cavus status (with pes cavus and without pes cavus) based on findings from a previous study.[17] Moreover, patients with severe pes cavus have a musculoskeletal foot deformity and may represent a subtype of FA with subtly different clinical phenotypes, so patients with pes cavus were included in the study but limited to 20% of subjects enrolled.

Ideally the trial registration would have a time-stamped amendment for part two but let’s shrug and move on.

It also doesn’t include a sample size calculation, which is an annoying deficit of this and most other trial registers. The registration states a target sample size of 172 for both studies combined. Part one reports 69 patients, so the total numbers recruited do add up (69 + 103 = 172). It looks like they did achieve the original total target, but decided to exclude 20 from part two after part one was completed without upping the sample size to compensate for the exclusions (bringing the timing of these decisions into question).

In the paper, they report 85% power at a 2-sided alpha of 0.05 to detect a difference of 2 points assuming an SD of 3.5 points. It looks a lot like a retrospective power calculation based on the sample they had:

> We calculated that with 80 patients, a longitudinal analysis would provide approximately 85% power to test a mean (± standard deviation) difference in the change from baseline in the mFARS of 2.0 ± 3.5 points between those randomized to omaveloxolone and those randomized to placebo, assuming a 2‐sided type I error rate of 0.05.
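As a rough cross-check of that quoted calculation, here is a plain two-sample t-test power computation for 40 per arm, a 2.0-point difference, SD 3.5, and 2-sided alpha 0.05 (a sketch that deliberately ignores the longitudinal modelling the authors credit for the extra power):

```python
import numpy as np
from scipy import stats

n_per_arm = 40          # 80 patients total, as in the quoted calculation
diff, sd = 2.0, 3.5     # assumed difference and SD in mFARS change
alpha = 0.05            # 2-sided type I error rate

df = 2 * n_per_arm - 2
ncp = (diff / sd) * np.sqrt(n_per_arm / 2)   # noncentrality of the t statistic
t_crit = stats.t.ppf(1 - alpha / 2, df)

# Power = P(|T| > t_crit) when the true difference is `diff`.
power = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
print(round(power, 2))  # roughly 0.7 for a plain two-sample t-test
```

The simple approximation lands well below the quoted 85%, which is at least consistent with the extra power being attributed to the longitudinal (repeated-measures) analysis rather than to a single endpoint comparison.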

So what does p=0.014 mean in this context?

I know nothing of the background to this treatment but if it’s the first RCT aimed at measuring effectiveness then it’s not unreasonable to assume that the prior probability of the treatment being effective enough to be worthwhile is about 10%.

Using their reported power calculation and 2 points as the minimum clinically significant difference, and a simplified (dichotomised) calculation, this means that in 1000 trials of similar hypotheses, there would be 100 true positives and 85 of these would be detected (85% power) alongside 45 false positives (5% of the 900 trials where there is no meaningful difference to detect).

So the positive predictive value is 85/130 = 65%. The false discovery rate is 35%.

If you step outside the Neyman-Pearson framework and use the observed p-value instead of the pre-specified alpha, then you get 85 true positives and 13 false positives, a false discovery rate of 13%. Still a long way from what people often erroneously think p<0.05 means.

If they repeated the trial, the prevalence of false null hypotheses would now be ~65% and a second similar result would be a lot more convincing, with a false discovery rate of around 3% for the confirmatory trial.
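The arithmetic above can be written out directly (a sketch; the 10% prior and the dichotomised positive/negative framing are the assumptions stated in the text):

```python
def false_discovery_rate(prior, power, alpha, n_trials=1000):
    """Expected FDR among 'positive' trials, in the dichotomised framing above."""
    true_effects = prior * n_trials          # trials with a real, meaningful effect
    nulls = n_trials - true_effects          # trials with nothing to detect
    true_pos = power * true_effects          # real effects detected
    false_pos = alpha * nulls                # nulls wrongly declared positive
    return false_pos / (true_pos + false_pos)

# First trial: 10% prior, 85% power, alpha = 0.05 -> FDR ~ 35%
print(round(false_discovery_rate(0.10, 0.85, 0.05), 2))   # 0.35

# Plugging in the observed p = 0.014 in place of alpha -> FDR ~ 13%
print(round(false_discovery_rate(0.10, 0.85, 0.014), 2))  # 0.13

# A confirmatory trial starting from a ~65% prior -> FDR ~ 3%
print(round(false_discovery_rate(0.65, 0.85, 0.05), 2))   # 0.03
```

The confirmatory-trial line is the quantitative case for the two-trial rule: the same alpha and power buy far more assurance once the first trial has raised the prior.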

This is part of the reason why we require two trials and if you’re going to step outside that paradigm [peers sternly at FDA], an alpha of 0.01 will not usually be enough, even ignoring other question marks over exclusions and pre-registration and no doubt other details I’ve missed. The dose-ranging study in part one doesn’t seem to add enough evidence to overcome this weakness.

The WSJ do not understand what a p-value is but the FDA are probably not being careful enough here, and they do have form.

Disclaimer: I don’t have a whole lot of time to tear this apart properly and I don’t have any clinical background to work with; apologies for any errors or omissions.


Thanks for writing this. Here’s a link to the publication reporting on the trial results that is accessible online full-text.


I’d be interested in a comment on the imbalances in GAA repeat length and history of cardiomyopathy comparing placebo and active treatment and the results of the post-hoc analysis using ANCOVA shown in Figure 3.

Imbalances with such a small sample (and a large number of baseline characteristics) are inevitable. How much they matter depends on how much they influence prognosis (and may raise the index of suspicion if there are concerns about the integrity of randomisation). They didn’t stratify the randomisation by anything other than pes cavus, which is unwise with such a small sample.

Stratification by multiple factors is also difficult with such a small sample but minimisation on a couple of important prognostic factors would have been possible, and would also have forced them to declare in advance which factors were considered most important.

A post hoc adjusted analysis is not an unreasonable thing to do under these circumstances but there are a lot of researcher degrees of freedom. There are other imbalances which may be of prognostic importance even if they don’t look as large. The imbalance by sex is very large and should probably be included in an adjusted model regardless. The way they report the post hoc analysis suggests they ran a bunch of different models and selectively reported these three. Ideally they would have pre-specified an adjusted analysis, including what variables were going to be included and how they would be modelled. Given that they didn’t, they should report a more complete set of results instead of just (apparently) cherry-picking these two factors.
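For concreteness, here is a minimal sketch of the kind of covariate-adjusted (ANCOVA-style) analysis being discussed, on simulated data; every number below is made up, and the only point is that the treatment coefficient is estimated after adjusting for a pre-specified baseline covariate:

```python
import numpy as np

rng = np.random.default_rng(1)
n_placebo, n_active = 42, 40
treat = np.repeat([0.0, 1.0], [n_placebo, n_active])

# Hypothetical baseline covariate (standing in for something like GAA repeat
# length) and a simulated outcome with a true treatment effect of -2.4 points.
covariate = rng.normal(0.0, 1.0, n_placebo + n_active)
outcome = -2.4 * treat + 1.0 * covariate + rng.normal(0.0, 3.5, treat.size)

# ANCOVA via ordinary least squares: outcome ~ intercept + treatment + covariate.
X = np.column_stack([np.ones(treat.size), treat, covariate])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(round(beta[1], 2))  # covariate-adjusted treatment-effect estimate
```

Pre-specifying exactly this kind of model, including which covariates go in and how they are coded, is what removes the researcher degrees of freedom described above; run post hoc, the same model invites selective reporting.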

They only had data on GAA for 67 of 82 randomised individuals and there was more missing data on the treatment arm (31/40 vs 36/42), so the GAA adjustment is particularly unreliable.

The reported imbalances in GAA and a history of cardiomyopathy both favour placebo, which is not suggestive of an intentionally or unconsciously biased randomisation procedure causing these imbalances. But these factors may have been highlighted in the post hoc analysis precisely because they moved things in the ‘right’ direction. Without deeper knowledge of the disease, or more complete reporting of the various adjustments they tried (or should have tried), it’s hard to comment further.