Evaluating treatment response predictive models separately per treatment arm

There was a community challenge for predicting response to anti-PD-1 therapy a few years ago. The results of the challenge were later published.

Not all patients with NSCLC achieve a response with ICIs. Consequently, there is a strong need for predictive biomarkers of outcomes with ICIs [9]. Studies reporting associations with ICI response in NSCLC have been limited by small sample sizes from single ICI treatment arms [17, 19, 20]. This Challenge addressed these shortcomings by using two large and well-characterized phase III RCTs and by comparing predicted responses between ICI- and chemotherapy-treated arms, thereby distinguishing treatment response prediction from prognostic effects.

The thing to note here is that they evaluated response prediction models built from public datasets by applying them separately to each treatment arm of two clinical trials! Theoretically this seems very problematic to me, because we know that the treatment arms might not be comparable at all due to complex treatment effects, drop-out patterns, side effects, and so forth. The fact that the operating characteristics look different per arm is a very crude form of validation. What are all the ways that treatment arms might differ (for reasons only indirectly related to treatment assignment)? What are the known violations one might expect in trials like CheckMate? Do people in the field understand this?
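To make the concern concrete, here is a minimal sketch in Python (simulated data with hypothetical variable names, not the Challenge data): a marker that is purely prognostic can show respectable discrimination within each arm separately, while a pooled model with a treatment-by-marker interaction, which is the term that actually speaks to predictive value, shows nothing.

```python
# Simulated illustration (not the Challenge data): a purely prognostic marker
# can look good in per-arm evaluation even though it does not modify the
# treatment effect at all.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
trt = rng.integers(0, 2, n)        # 0 = chemo arm, 1 = ICI arm (randomized)
marker = rng.normal(size=n)        # hypothetical biomarker score

# Outcome depends on the marker (prognostic) and on treatment,
# but there is NO marker-by-treatment interaction (i.e. not predictive).
lin = -0.5 + 1.0 * marker + 0.7 * trt
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))

# Per-arm evaluation: discrimination looks fine in both arms.
for arm in (0, 1):
    auc = roc_auc_score(y[trt == arm], marker[trt == arm])
    print(f"arm {arm}: AUC = {auc:.2f}")

# Pooled model with a treatment-by-marker interaction: the interaction
# coefficient is the only term that addresses predictive value, and here
# it is estimated near zero, as simulated.
X = sm.add_constant(np.column_stack([marker, trt, marker * trt]))
fit = sm.Logit(y, X).fit(disp=False)
print(fit.summary(xname=["const", "marker", "trt", "marker:trt"]))
```

By construction there is no interaction, yet both per-arm AUCs are well above 0.5; comparing per-arm operating characteristics simply cannot distinguish this situation from a genuinely predictive marker.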

Am I being too harsh in thinking this paper doesn’t seem to understand the fundamental problems of evaluating predictive biomarkers?

2 Likes

Correct, there is a lot of noise out there, and it helps to be able to quickly separate it from rigorously detected signals in what is often a minority of manuscripts. I was lucky to attend a 2016 lecture by Bill Kaelin that greatly influenced my approach to this topic. He turned the lecture into this now-classic paper. That rigor served him well: he received the Nobel Prize in Medicine a few years later.

2 Likes

Thanks for sharing that story. Apparently I had already read this paper a while back :joy:, but my mental context was on a different problem.

1 Like

A pleasure! Here is also an example of a less noisy open challenge re: treatment effect heterogeneity. This was further discussed here as a commentary on this tantalizing article on frictionless reproducibility.

3 Likes

I very much appreciate the Frictionless Reproducibility piece. I think it is important for stats/biostats to understand how the pace of work in other data-science-adjacent fields has taken off and how progress is limited by barriers to accessing data.

2 Likes

Great discussion. In general, trying to learn from separate analyses that do not use pooled datasets is fraught with problems. A simple example is finding a big age effect in a study with a high sd(age) and not finding one in a study with a narrow age distribution. That discrepancy doesn't mean anything by itself.
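A quick simulated sketch of that point (hypothetical numbers, Python): the same true age slope, fit separately in a wide-age-range study and a narrow-age-range study, gives a clear "effect" in one and an apparent "null" in the other, purely because of sd(age).

```python
# Two hypothetical studies with the SAME true age effect; only sd(age) differs.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
beta_age = 0.5     # same true slope (per decade of age) in both studies
n = 500

for label, sd_age in [("wide age range", 15.0), ("narrow age range", 3.0)]:
    age = rng.normal(60, sd_age, n)
    y = 10 + beta_age * (age - 60) / 10 + rng.normal(0, 5, n)
    fit = sm.OLS(y, sm.add_constant((age - 60) / 10)).fit()
    print(f"{label:16s}: slope = {fit.params[1]:.2f}, "
          f"SE = {fit.bse[1]:.2f}, p = {fit.pvalues[1]:.3f}")
```

The narrow-range study yields a large standard error and an unimpressive p-value for the identical underlying effect, which is why separate-study comparisons of "effect found" vs. "effect not found" are uninterpretable; pool the data and model the covariate directly.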

2 Likes