It’s a very small trial, with a very high proportion excluded from the analysis:
> A full analysis set (FAS) was used for primary analysis of efficacy and limited to patients without pes cavus who had at least one postbaseline measurement. Safety analyses included all randomized patients (ARP).
There were 40 + 42 patients without pes cavus, and only one exclusion due to a missing postbaseline measurement. The trial registration doesn’t mention excluding pes cavus from the primary analysis (it doesn’t mention pes cavus at all): NCT02255435. That decision seems to have been based on part one of the study (reference 17), which is described in the same trial registration:
> Randomization was generated using a centralized interactive web response system, and stratified by pes cavus status (with pes cavus and without pes cavus) based on findings from a previous study. Moreover, patients with severe pes cavus have a musculoskeletal foot deformity and may represent a subtype of FA with subtly different clinical phenotypes, so patients with pes cavus were included in the study but limited to 20% of subjects enrolled.
Ideally the trial registration would have a time-stamped amendment for part two but let’s shrug and move on.
It also doesn’t include a sample size calculation, an annoying deficit of this and most other trial registries. The registration states a target sample size of 172 for both studies combined. Part one reports 69 patients, so the total numbers recruited do add up. It looks like they achieved the original total target, but decided, after part one was completed, to exclude 20 patients from part two without increasing the sample size to compensate for the exclusions (which brings the timing of these decisions into question).
In the paper, they report 85% power (at a two-sided alpha of 0.05) to detect a difference of 2 points, assuming an SD of 3.5 points. It looks a lot like a retrospective power calculation based on the sample they had:
> We calculated that with 80 patients, a longitudinal analysis would provide approximately 85% power to test a mean (± standard deviation) difference in the change from baseline in the mFARS of 2.0 ± 3.5 points between those randomized to omaveloxolone and those randomized to placebo, assuming a 2‐sided type I error rate of 0.05.
So what does p=0.014 mean in this context?
I know nothing of the background to this treatment, but if this is the first RCT aimed at measuring effectiveness, it’s not unreasonable to assume that the prior probability of the treatment being effective enough to be worthwhile is about 10%.
Using their reported power calculation, taking 2 points as the minimum clinically significant difference, and applying a simplified (dichotomised) calculation: in 1000 trials of similar hypotheses, there would be 100 where the treatment genuinely works, and 85 of these would be detected (85% power), alongside 45 false positives (5% of the 900 trials where there is no meaningful difference to detect).
So the positive predictive value is 85/130 = 65%. The false discovery rate is 35%.
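The dichotomised calculation above is simple enough to sketch in a few lines. This is a minimal illustration, with the 10% prior assumed earlier and the paper’s 85% power and 0.05 alpha plugged in:

```python
def false_discovery_rate(prior, power, alpha, n_trials=1000):
    """Expected false discovery rate across n_trials hypothetical trials,
    given the prior probability the treatment works, the study's power,
    and the significance threshold."""
    n_true = prior * n_trials                 # trials where the treatment works (100)
    true_positives = power * n_true           # detected at 85% power (85)
    false_positives = alpha * (n_trials - n_true)  # 5% of the 900 null trials (45)
    return false_positives / (true_positives + false_positives)

false_discovery_rate(prior=0.10, power=0.85, alpha=0.05)  # 45/130 ≈ 0.35
```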
If you step outside the Neyman-Pearson framework and use the observed p-value instead of the pre-specified alpha, then you get 85 true positives and about 13 false positives (0.014 × 900 ≈ 13), a false discovery rate of 13%. Still a long way from what people often erroneously think p<0.05 means.
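The same arithmetic with the observed p-value of 0.014 substituted for the pre-specified alpha (a sketch only; whether this substitution is legitimate is exactly the Neyman-Pearson question):

```python
prior, power = 0.10, 0.85
alpha = 0.014  # the observed p-value, used in place of the pre-specified 0.05

true_positives = power * prior * 1000          # 85
false_positives = alpha * (1 - prior) * 1000   # 0.014 * 900 = 12.6
fdr = false_positives / (true_positives + false_positives)  # ≈ 0.13
```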
If they repeated the trial, the prevalence of false null hypotheses would now be ~65% and a second similar result would be a lot more convincing, with a false discovery rate of around 3% for the confirmatory trial.
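Running the same calculation again with the prior updated to the ~65% positive predictive value from the first trial gives the figure quoted for the confirmatory trial (again a sketch, using the original alpha of 0.05):

```python
prior = 0.65  # updated: the PPV after one positive result at alpha = 0.05
power, alpha = 0.85, 0.05

true_positives = power * prior * 1000          # 552.5
false_positives = alpha * (1 - prior) * 1000   # 17.5
fdr = false_positives / (true_positives + false_positives)  # ≈ 0.03
```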
This is part of the reason why we require two trials and if you’re going to step outside that paradigm [peers sternly at FDA], an alpha of 0.01 will not usually be enough, even ignoring other question marks over exclusions and pre-registration and no doubt other details I’ve missed. The dose-ranging study in part one doesn’t seem to add enough evidence to overcome this weakness.
The WSJ do not understand what a p-value is, but the FDA are probably not being careful enough here, and they do have form.
Disclaimer: I don’t have a whole lot of time to tear this apart properly and I don’t have any clinical background to work with; apologies for any errors or omissions.