One common method in the analysis of seizure-trial data is a ranked ANCOVA for evaluating the median reduction in percent change (PCHG) from baseline in monthly seizure frequency. I am attaching a reference as an example, in which the authors reported the median reduction from baseline in PCHG by treatment arm (placebo, 10mg, 20mg, 25mg), along with the Hodges-Lehmann bounds for each active arm versus placebo (using PROC NPAR1WAY) and p values from the ranked ANCOVA using a one-sided test: Efficacy and Safety of XEN1101 in Adults With Focal Epilepsy
For their ranked ANCOVA I believe the following was performed:
Calculate the rank of the PCHG endpoint and the rank of baseline seizure frequency (BASE); ties are assigned mean (mid) ranks.
Run an ANCOVA model with rank transformation on PCHG and BASE: PCHG_rank = BASE_rank + treatment + other covariates
Global test: whether there is a treatment effect for at least one active arm compared with placebo at p<0.05
If the global null is rejected, pairwise comparisons are performed in sequential order (25mg vs. placebo, 20mg vs. placebo, then 10mg vs. placebo), each at p<0.05, within the same ranked ANCOVA model.
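If it helps to make the procedure concrete, here is a minimal sketch of the steps as I understand them, on simulated data (the arm effects, sample size, and the F-test construction are my own illustrative assumptions, not the trial's actual code):

```python
import numpy as np
from scipy.stats import rankdata, f

rng = np.random.default_rng(0)

# Hypothetical trial data: four arms coded 0=placebo, 1..3=active doses,
# baseline monthly seizure frequency, and percent change from baseline.
n = 200
arm = rng.integers(0, 4, size=n)
base = rng.lognormal(mean=2.0, sigma=1.0, size=n)
pchg = rng.normal(loc=-10.0 * arm, scale=40.0, size=n)

# Step 1: rank-transform endpoint and baseline covariate;
# rankdata assigns mean (mid) ranks to ties by default.
r_pchg = rankdata(pchg)
r_base = rankdata(base)

# Step 2: ANCOVA on the ranks: rank(PCHG) ~ rank(BASE) + treatment dummies.
X_full = np.column_stack([np.ones(n), r_base] +
                         [(arm == k).astype(float) for k in (1, 2, 3)])
X_red = X_full[:, :2]  # reduced model drops the treatment dummies

# Step 3: global F-test that all three treatment coefficients are zero.
def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ beta) ** 2)

df1, df2 = 3, n - X_full.shape[1]
F = ((rss(X_red, r_pchg) - rss(X_full, r_pchg)) / df1) / (rss(X_full, r_pchg) / df2)
p_global = f.sf(F, df1, df2)
print(f"global F = {F:.2f}, p = {p_global:.4f}")
```

If the global test rejects, one would then test the dose-versus-placebo contrasts sequentially within the same fitted model.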
Aside from the issues of the PCHG endpoint, is there any merit or validity to this approach for the analysis of seizure data?
Below are a few arguments that I’ve gathered to help steer people away from the approach used in the published article:
It appears that the authors reported the median (IQR) percent change from baseline (PCHG) by treatment arm and simply attached the p values from an ANCOVA model (with rank transformation applied to both PCHG and baseline seizure frequency) as their inferential statistics.
Argument 1: The ANCOVA model with rank transformation is not a statistical test comparing median PCHG. There is a disconnect between their claim of statistical significance and the reported endpoint.
Argument 2: The ANCOVA model with rank transformation showed a low adjusted R-squared (i.e., poor model fit and little predictive value).
Argument 3: For as long as the endpoint remains PCHG, no statistical method can rescue it. (Issues: it is not a symmetric measure, it has poor statistical properties, and it showed poor model fit in various modelling attempts; a list of references is attached below.)
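The asymmetry in Argument 3 is easy to demonstrate with two numbers: the same ratio of change produces wildly different PCHG values depending on direction, since PCHG is unbounded above but bounded at -100% below.

```python
# Percent change from baseline: asymmetric by construction.
def pchg(base, followup):
    return 100.0 * (followup - base) / base

up = pchg(10, 20)    # 10 -> 20 seizures/month: a doubling
down = pchg(20, 10)  # 20 -> 10 seizures/month: a halving
print(up, down)      # 100.0 vs -50.0
```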
Question: Are there other arguments that could be added to the list? Many thanks.
I am an epileptologist doing a network meta-analysis on outcomes of RCTs of antiseizure medications, and I am wrestling with how to meta-analyze median percentage change from baseline. So I sympathize and agree with Frank that, from a statistical point of view, there are better ways nowadays.
But to answer your question: before one-sidedly trying to “steer people away” with purely statistical arguments, you need to consider three things: 1) the nature of seizures and epilepsy; 2) the regulatory and practical framework for the trials; and 3) how appalling meaningless numbers are to patients and physicians.
Regarding the first two points, I read some of the links you sent, and one in particular caught my attention. Frank cites the CBD trial for Dravet syndrome by Devinsky (Statistical Thinking - Statistical Errors in the Medical Literature) and critiques the authors with terms like “the authors failed to recognize” and “the authors engaged in improper […]”. You need to understand that RCTs for medications are regulated, and that includes the statistical analysis! In the case of antiseizure medications, the FDA requires a parameterized measure of seizure frequency (historically, median percent reduction in most trials) as the primary efficacy outcome, while the European Medicines Agency requires a dichotomized one (the 50% responder rate) as the primary efficacy outcome AND median percent reduction as a secondary outcome (the references are below). They also require that the clinical trial be a parallel RCT, yes, reporting within- and between-group comparisons.

The reason for that is the nature of the disease. No, seizure frequency does not follow a nice distribution like the Poisson; it is quite chaotic, and one patient may have 1 seizure in the baseline period and 20 post-baseline, while another may have 200 and 1, respectively. Randomization doesn’t completely assure baseline balance because it is not possible to stratify by the features of the disease in the vast majority of pivotal trials. There are many other features of epilepsy and seizures (those are distinct concepts) that I can’t get into here, but they include how heterogeneous the disease is, the many different seizure types, and how these issues affect individual patients’ responses to medications. Epilepsy is really “the epilepsies,” in the same way that there are many types of cancer. Imagine doing a trial for “the treatment of cancer” when there are so many types. The one thing the epilepsies have in common is the main treatable symptom, which is seizures (of many types).
By the way, in the XEN1101 trial you cited, the analysis is probably poorly written up, not mistaken. They are reporting (1) the p value for the primary efficacy outcome based on a linear contrast of the least-squares means, (2) the percent reduction with IQR, and (3) the p value of the Drug vs. Placebo comparison. More details of the statistical analysis are in the 146-page supplement here
References in a separate post as it will not allow me to insert more than 2.
Just a general response: the fact that some regulatory agencies have become accustomed to accepting substandard statistical methods should never be interpreted as a recommendation that a sponsor use them. Sure, the sponsor needs to make strong, clear arguments and cite the literature or simulation studies, but such battles are definitely worth fighting and are often winnable.
The percent change (PCHG) endpoint may have originated from a clinical need to track an individual patient’s seizures against their own baseline over time. But it is not a valid endpoint for statistical analysis, regardless of the disease area. A simple exercise is to regress PCHG on the baseline values (one could also try non-linear smoothing). One may find that baseline hardly predicts PCHG at all, even though baseline measurements are expected to be a strong predictor of the outcome, and that baseline predicts the raw or properly transformed follow-up measures far better, as expected.
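That exercise can be run in a few lines on simulated data (the count-generating model below is my own toy assumption, chosen only so that baseline and follow-up share a patient-level rate): baseline barely explains PCHG, while log baseline explains log follow-up well.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical seizure counts: baseline and follow-up share a patient-level
# log rate, so follow-up tracks baseline even though PCHG need not.
n = 500
log_rate = rng.normal(2.0, 1.0, size=n)
base = rng.poisson(np.exp(log_rate)) + 1        # +1 avoids division by zero
follow = rng.poisson(np.exp(log_rate - 0.3)) + 1

pchg = 100.0 * (follow - base) / base

def r2(x, y):
    """Squared Pearson correlation = R^2 of a simple linear regression."""
    return np.corrcoef(x, y)[0, 1] ** 2

print(f"R^2, baseline -> PCHG:          {r2(base, pchg):.3f}")
print(f"R^2, log(base) -> log(follow):  {r2(np.log(base), np.log(follow)):.3f}")
```

The contrast is the point: the same baseline information that is nearly useless for predicting PCHG is a strong predictor of the (transformed) follow-up outcome.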
Another issue with the percent “change from baseline” concept is that the purpose of a parallel-group RCT is to compare the parallel groups, not individual subjects with themselves at baseline. Profs F. Harrell and S. Senn have written elsewhere about this topic.
The purpose of randomization isn’t to balance patient characteristics; it’s to remove bias. As a result of randomization, patient characteristics tend to be balanced over all possible randomizations, but there can be some imbalance in any given trial. I don’t feel that “imbalance” is really an issue for RCTs. The key is to include strong, known predictors of the outcome in the model (i.e., covariate adjustment), which improves the precision of the estimated treatment effect and the efficiency of the model.
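The precision gain from covariate adjustment is easy to see in a small simulation (a toy two-arm trial of my own construction, with an arbitrary effect size and a strong baseline predictor): the standard error of the treatment coefficient shrinks once the baseline covariate enters the model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical two-arm trial: the outcome depends strongly on a baseline
# covariate; compare SE(treatment effect) with and without adjustment.
n = 400
trt = rng.integers(0, 2, size=n).astype(float)
base = rng.normal(0.0, 1.0, size=n)
y = -0.5 * trt + 1.5 * base + rng.normal(0.0, 1.0, size=n)

def ols(X, y):
    """OLS fit; return (coefficients, standard errors)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

_, se_unadj = ols(np.column_stack([np.ones(n), trt]), y)
_, se_adj = ols(np.column_stack([np.ones(n), trt, base]), y)
print(f"SE(treatment), unadjusted: {se_unadj[1]:.3f}")
print(f"SE(treatment), adjusted:   {se_adj[1]:.3f}")
```

Adjustment soaks up the outcome variance explained by baseline, so the same data yield a tighter treatment-effect estimate, without any appeal to "balance".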