Electroencephalographic signature predicts antidepressant response in major depression?

Nature Biotechnology has just published An electroencephalographic signature predicts antidepressant response in major depression by Wei Wu, Yu Zhang, et al. Gustav Nilsonne has posted a series of tweets about it which I hope that others can elaborate on. Gustav has pointed out what seem to be major flaws in the analytic approach.

It is ironic, but not uncommon, that a paper using advanced machine learning methods fails to get the simplest things right. Key to predicting response to depression therapy is using the response variable correctly. In antidepressant drug trials, when one plots, say, the 12-week Hamilton D depression scale (HamD) against baseline HamD, the resulting plot is extremely nonlinear. This nonlinear relationship is caused by patients with severe depression (high HamD) having a much larger drug response than those with lower HamD. This implies that change from baseline for HamD is meaningless. The baseline needs to be kept in context. This is easily done by fitting a proportional odds ordinal logistic model that is tailored to this situation:

Follow-up HamD = restricted cubic spline in baseline HamD + treatment effect
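
One way to write this model out, as a sketch only: the symbols are notation introduced here (not taken from the paper), and the convention of modelling the probability of being at or above each score level is an assumption about parameterization.

```latex
\operatorname{logit} \Pr(Y \ge y \mid X, T) = \alpha_y + f(X) + \beta T
```

Here Y is follow-up HamD, X is baseline HamD, T is the treatment indicator, f is a restricted cubic spline function, and there is one intercept \alpha_y for each distinct follow-up value above the lowest. The treatment effect \beta is a log odds ratio that applies at every cutoff y.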

In the Nature Biotechnology paper, the authors improperly used ordinary change from baseline HamD, so they failed to notice that the most important predictor of change in HamD is the baseline HamD. And failure to take baseline HamD into account distorts their analysis in unknown ways.

Since the response variable used in the paper is defective, it is possible that what the machine “learned” is how to predict baseline Hamilton D. But we do not need to predict what is observable.
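
To make that concern concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the data are simulated, the relationship is kept linear for simplicity, and the "signature" is constructed to be nothing more than a noisy proxy for baseline severity. It is not a reanalysis of the paper's data.

```python
import random

def pearson(x, y):
    """Pearson correlation computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ols_residuals(x, y):
    """Residuals of y after a straight-line least squares fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

rng = random.Random(1)
n = 2000

# Hypothetical baseline severity scores.
baseline = [rng.gauss(20, 5) for _ in range(n)]
# Assumed follow-up: depends on baseline with slope 0.5, plus noise only.
followup = [5 + 0.5 * b + rng.gauss(0, 4) for b in baseline]
# Hypothetical "signature": carries information about baseline and nothing else.
signature = [b + rng.gauss(0, 3) for b in baseline]

change = [f - b for f, b in zip(followup, baseline)]

# The signature appears to "predict" the change score ...
r_change = pearson(signature, change)
# ... but not follow-up once baseline has been adjusted for.
r_adjusted = pearson(signature, ols_residuals(baseline, followup))

print(f"corr(signature, change score)          = {r_change:.2f}")
print(f"corr(signature, baseline-adjusted f/u) = {r_adjusted:.2f}")
```

Under these assumptions the first correlation is clearly negative and the second is near zero, so the apparent predictive value comes entirely from the baseline hiding inside the change score.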

I note that no biostatisticians were involved in the paper. In this biostatistician’s opinion, it shows. I am also wondering about the peer review system at Nature Biotechnology.

General problems with the use of change scores are detailed in BBR Section 14.4.


In psychiatric research, the relative (percentage) change is normally used, with a decrease of about 50 percent in HDRS as the response criterion. This overcomes the problem with absolute values and the association between baseline and retest values.

That does not solve the problem. Hamilton-D is not a scale that operates on a proportionate basis. And whenever a follow-up value is nonlinearly related to the baseline measurement, neither a log ratio nor a simple difference works properly.


Thank you so much for the post, Frank!

I see this issue a lot in medical journals.

This reminded me of a post by Dr. Harrell in another thread that interested readers might want to check out. I need to study this (as well as Regression Modeling Strategies) in more depth, as it suggests a widely applicable method for heterogeneous conditions (e.g., CVA) that are prevalent in the rehabilitation literature.


Thanks Frank for an illuminating post. I strongly dislike using difference from baseline as a ‘response’ variable, as you know. Quite apart from anything else, the baseline on its own has no causal influence from treatment, so putting it on the left hand side of a regression equation (albeit smuggled in as part of the “response” definition) strikes me as perverse. The treatment affects the outcome. If the baseline helps to predict what the outcome would be in the absence of treatments or acts as a treatment modifier, it should be on the right hand side where it can be handled in a principled (and if necessary flexible) way.

Creating a dichotomy based on change from baseline simply adds further problems.
However, I wonder if we statisticians are not partly to blame for this mess. We have (and have had for decades) good arguments in favour of sophisticated modelling of outcomes in clinical trials, yet a prejudice in favour of (algorithmically) simpler alternatives is commonly encountered, sometimes even amongst statisticians. Other disciplines have crowded in to fill the space we have vacated. Fifteen years ago, arguing for more use of models in analysing clinical trials, I wrote:
“So I will nail my colours to the mast. Where they disagree, I generally prefer the results of analysis of covariance to simpler models, of proportional hazards analyses to the log-rank test and of logistic regression to chi-square tests and, for that matter, the within-patient estimates of treatment effects in AB/BA cross-over trials to those based on the first period alone. Analyses that make intelligent use of prognostic variables can considerably increase the precision of our inferences.”

S. J. Senn (2005) An unreasonable prejudice against modelling? Pharm Stat, 4, 87-89.


I worry that all the talk about loss of information from, e.g., dichotomisation may have pushed us towards discussion about, e.g., “optimal weights” for composite endpoints that maximise power. There seems to be a trade-off between cogency and sophistication of method, and we don’t know where to position ourselves on this spectrum, fixating on one at the expense of the other… We have now arrived at some confused middle ground where statos “enhance” crude ideas. Statos aren’t promoting simplicity; I believe they are just eager to please, positioning themselves in a service role among their clinical folk.

The Hamilton D depression scale is an example of a scale that was derived to maximize observable pharmacologic response. Not a good basis. I recommend against solving for weights that optimize power and instead would rather derive the most clinically meaningful (and meaningful to the patient) outcome scale that minimizes within-category heterogeneity. This will also happen to have improved power over current approaches.

I’m working on several studies with unfamiliar scales, e.g. fatigue and cognition: patient reported, summing scores across questions. I read your pre-print (response to the Nature paper) and then reread chapter 14 of BBR. It’s quite useful in responding to requests for change from baseline. We have repeated measures, which I’ll leave as categorical, and random effects for sites. I’m reluctant to use proportional odds modelling because it wouldn’t be straightforward in SAS(?), being nonlinear mixed modelling, and I’m thinking of the audience. I’ve read Senn’s criticisms of the probability index, but how else to present results for these scales that are not meaningful? E.g. a difference of 10 between groups = ?? They typically use Cohen’s d, I think, defining regions of effect, which is not good. I’m yet to see the end of BBR11, maybe I need to see it…

SAS fits the proportional odds model; I just don’t know how easy the fit is to manipulate to produce derived quantities like the mean. With the proportional odds model you can get predicted means, quantiles, and probabilities to estimate patient-specific outcomes or differences in means, etc.
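
As a rough illustration of how those derived quantities follow from a fitted proportional odds model, here is a minimal sketch. It assumes the model is written as P(Y >= y_j) = expit(alpha_j + lp); the function names, intercepts, and linear predictor values are all hypothetical and are not output from any real fit or any particular package.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def po_predict(levels, intercepts, lp):
    """Cell probabilities and mean implied by a proportional odds model.

    levels:     sorted distinct outcome values y_1 < ... < y_k
    intercepts: alpha_2..alpha_k (decreasing), one per level above the lowest
    lp:         linear predictor for one patient (covariate effects only)
    """
    if len(intercepts) != len(levels) - 1:
        raise ValueError("need one intercept per level above the lowest")
    # Exceedance probabilities P(Y >= y_j), padded with 1 and 0 at the ends.
    exceed = [1.0] + [expit(a + lp) for a in intercepts] + [0.0]
    probs = [exceed[j] - exceed[j + 1] for j in range(len(levels))]
    mean = sum(y * p for y, p in zip(levels, probs))
    return probs, mean

def po_quantile(levels, probs, q):
    """Smallest level whose cumulative probability reaches q."""
    cum = 0.0
    for y, p in zip(levels, probs):
        cum += p
        if cum >= q - 1e-12:
            return y
    return levels[-1]

# Hypothetical 5-level score and made-up parameter values.
levels = [0, 1, 2, 3, 4]
intercepts = [2.0, 0.5, -1.0, -2.5]
lp_control, lp_treated = 0.0, -0.7  # assumed log odds ratio of -0.7 for treatment

probs_control, mean_control = po_predict(levels, intercepts, lp_control)
probs_treated, mean_treated = po_predict(levels, intercepts, lp_treated)

print("difference in means:", round(mean_treated - mean_control, 3))
print("medians:", po_quantile(levels, probs_control, 0.5),
      po_quantile(levels, probs_treated, 0.5))
print("P(Y >= 3):", round(sum(probs_control[3:]), 3), round(sum(probs_treated[3:]), 3))
```

The same arithmetic applies however many distinct levels the scale has, since each level just contributes one more intercept.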

I notice that in the video you analysed a variable with many response values. But when does it become too many? The fatigue scale exceeds 1000. Is that OK?

Getting back to the main topic, you as statisticians have to keep in mind what really matters. Psychiatry is different from other fields of medicine. While in oncology, for example, an objective measure of tumor growth is more reliable than subjective feelings of well-being, in psychiatry the HDRS and its deviations from baseline give quite good insight into changes due to treatment, irrespective of baseline measures. Take this from a clinical point of view. Statistics are only one side of the coin. Maybe one could use BDI measures as self-report questionnaires for that. However, there is a strong correlation between HDRS and BDI. If we could predict a decrease in one of those measures in combination with different treatments, this would still mean a decrease in burden for the individual patient.

There’s no limit. My R rms package function orm easily handles 6000 distinct levels, as does JMP.

This in my mind does not justify using a change score at all, and when the relationship between baseline and final score is nonlinear, the clinician is just getting misled by using the change score. In spinal surgery we really nailed this down when we showed that patients don’t care at all about their change in disability level; at 1y post surgery they only care about their disability level now.

OK, but in the study concerned, they showed a difference between two groups. They predicted the difference between two treatments. Even if there is a relation between the baseline and post-treatment measures, it differs between treatment groups.

What you say means physicians just should not rely on change scores just because they have a nonlinear association with change? No way in medical practice. This only implies the insufficiency of the measures used or of our understanding of psychiatric improvement. However, as long as we have no validated ground truth of “getting better”, the presented scenario is better than fitting an arbitrary curve by spline interpolation. Interpreting such a fit is impossible for a scale made up of different clinical items.

Because, as Harrell says in BBR chapter 14, “analysis of covariance is required to ‘rescue’ a change score analysis”, I think the main (remaining) problem with change scores is that it’s heading towards “responder analyses”, which is almost insinuated by your phrase “getting better”. Stephen Senn has said a lot about responder analyses, and there’s a good video out there of a presentation he did (for industry, I think). He used data simulations, and I found it very persuasive.

I see no justification for such statements, and there is a simple fix that you are avoiding: show the patient their initial score and their current score and do not use subtraction to convert this into a single number. Subtraction assumes that (1) the score is perfectly transformed, (2) the relationship is linear, and (3) the slope is 1.0. You’ll almost never see all three of these conditions satisfied.
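
Conditions (2) and (3) can be seen by writing the models side by side. This is a sketch in notation introduced here: Y is the follow-up score, X the baseline score, T the treatment indicator, and \varepsilon an additive error, which is itself an assumption tied to condition (1).

```latex
\begin{align*}
\text{change score analysis:}\quad & Y - X = \alpha + \beta T + \varepsilon
  \;\Longleftrightarrow\; Y = \alpha + 1 \cdot X + \beta T + \varepsilon \\
\text{linear ANCOVA:}\quad & Y = \alpha + \gamma X + \beta T + \varepsilon \\
\text{flexible ANCOVA:}\quad & Y = \alpha + f(X) + \beta T + \varepsilon
\end{align*}
```

The change score analysis is the special case in which f is forced to be a straight line with slope \gamma = 1. The covariate-adjusted models estimate the baseline relationship from the data instead of assuming it.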