Individual response

Erin

I will try to show how government financial support could be used to make medicine more ‘personalised’ by improving the diagnostic process.

The medical profession has already been practicing ‘personalised medicine’ down the ages by arriving at diagnoses. I will show that Pearl and Mueller have been trying to re-invent this wheel and basically getting a similar result using a very implausible example that made their reasoning difficult to follow. They claimed that they were addressing the outcome for a specific individual (hence @Stephen's title of this topic being ‘Individual response’). However, what they were actually doing was dealing with a smaller subpopulation of the RCT population, one resembling that individual much more closely. We would describe these smaller subpopulations based on covariates as diagnostic categories.

The purpose of assessing individual response is to choose the plan of most benefit to the individual patient. Turning the clock back to create a ‘counterfactual’ situation and comparing the effect of treatment and placebo by doing this many times can be simulated with an N-of-1 RCT. The process of randomisation does the same job as a time machine. Comparing treatment counterfactually between groups of individuals using a time machine can be simulated with a standard RCT.

Choosing a test as a covariate based on a sound hypothesis before randomisation is basically to create a diagnostic criterion. We may thus create a positive test result that acts as a sufficient criterion that confirms a diagnosis. If the latter is also a necessary criterion and therefore a definitive criterion, then a negative covariate test acts as a criterion that excludes a diagnosis. Those with a positive diagnosis are expected to respond better to treatment than those testing negative (the latter usually showing ‘insignificant’ or no improvement because they do not have the ‘disease’). The degree of difference in the probability of the outcome with or without treatment reflects the validity of the test as a diagnostic criterion.

Pearl and Mueller created an implausible diagnostic category not based on any reasonable medical hypothesis. Instead, their positive diagnostic test was based on the patient ‘feeling positive’ about a treatment (and inclined to accept it if allowed to choose). Their negative diagnostic test was based on the patient ‘feeling negative’ about a drug (and inclined to reject it if allowed to choose). There was no plausible medical hypothesis for the potential utility of this ‘test’. However, I shall refer to them as positive and negative test results to show how they behave as diagnostic criteria.

In their imaginary RCT on men, 49% of the total had a positive test and survived on treatment, whereas 0% of the total had a positive test and survived on placebo, showing a simulated counterfactual ‘Benefit’ of 49-0 = 49%. However, 0% had a negative test and survived on treatment and 21% had a negative test and survived on placebo showing a simulated counterfactual ‘Harm’ of 21-0 = 21%.

From this information we would advise a man with a future positive result to accept treatment, resulting in a survival of 49%, but the negative testing man to refuse it, resulting in a survival of 21% on placebo. By advising this for men in a subsequent observational study, 21% + 49% = 70% would survive.

In their imaginary RCT on females, 18.9% of the total had a positive test and survived on treatment, whereas 0% had a positive test and survived on placebo, giving a ‘Benefit’ of 18.9-0 = 18.9%. Moreover, 30% had a negative test result and survived on treatment and 21% had a negative test result and survived on placebo, also giving a ‘Benefit’ of 30-21 = 9%. Neither a positive nor negative result gave an outcome where placebo was better than treatment, hence ‘Harm’ in females was 0%. This gives a total ‘Benefit’ for females of 18.9+9 = 27.9%.

From this information we would advise a woman to accept a treatment whether she felt positive (giving 18.9% survival) or negative (giving 30% survival). By advising this for women in a subsequent observational study, 18.9 + 30 = 48.9% would survive. These figures are readily available from my Bayes P Map in post 224. (https://discourse.datamethods.org/t/individual-response/5191/224?u=huwllewelyn ) The P map representing the application of Bayes rule to the RCT data therefore confirms what Pearl and Mueller found by using their mathematics.
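The arithmetic for men and women above can be checked with a minimal Python sketch. The dictionary layout and the function name `summarise` are illustrative assumptions, not anything taken from the P&M paper; the numbers are the joint proportions quoted in this post. Summing the better arm in each test stratum gives 70% survival for men and 18.9 + 30 = 48.9% for women.

```python
# Joint proportions quoted in this post: each value is the share of the whole
# trial arm that had the given test result AND survived.
rct = {
    "men": {
        "positive": {"treatment": 0.49, "placebo": 0.00},
        "negative": {"treatment": 0.00, "placebo": 0.21},
    },
    "women": {
        "positive": {"treatment": 0.189, "placebo": 0.00},
        "negative": {"treatment": 0.30, "placebo": 0.21},
    },
}


def summarise(strata):
    """Return (benefit, harm, survival if each stratum takes its better arm)."""
    benefit = harm = survival_if_advised = 0.0
    for arms in strata.values():
        diff = arms["treatment"] - arms["placebo"]
        if diff > 0:
            benefit += diff   # treatment beats placebo in this stratum
        else:
            harm += -diff     # placebo beats (or equals) treatment
        # Advice: choose whichever arm had the higher survival.
        survival_if_advised += max(arms["treatment"], arms["placebo"])
    return round(benefit, 3), round(harm, 3), round(survival_if_advised, 3)


results = {sex: summarise(strata) for sex, strata in rct.items()}
for sex, (benefit, harm, survival) in results.items():
    print(f"{sex}: benefit={benefit:.3f} harm={harm:.3f} survival={survival:.3f}")
```

The sketch uses only the RCT figures stratified by the covariate; no observational study enters the calculation.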

I should add that many experienced physicians, especially endocrinologists, do not dichotomise test results into positive and negative, but interpret intuitively each numerical value on a continuous scale from very low to very high. This is done for the personal numerical result of each patient. There would be two such curves, one for patients on placebo and another for those on treatment (see Figure 1 in my post on ‘Risk based treatment and the validity of scales of effect’ https://discourse.datamethods.org/t/risk-based-treatment-and-the-validity-of-scales-of-effect/6649?u=huwllewelyn ).

At low / normal values the probability of an adverse outcome on placebo would be near zero, with the treatment and placebo curves superimposed. This would provide near certainty that treatment should be avoided as being of no benefit, for the purpose of personalised medicine. The remaining parts of the curves (see Figure 1 in my post on ‘Risk based treatment and the validity of scales of effect’ https://discourse.datamethods.org/t/risk-based-treatment-and-the-validity-of-scales-of-effect/6649?u=huwllewelyn ) would provide more personalised probabilities of outcome on treatment and control for each patient, depending on the precise numerical result of a test, than probabilities conditional on positive or negative results that lump together wide ranges of numerical results. If these curves were plotted for degrees of positivity / negativity from the example of Pearl and Mueller, the curves would cross over implausibly at some point.
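As a rough illustration of the two-curve idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the logistic shape, the coefficients, and the function names are hypothetical and are not taken from Figure 1 or from any trial. It only shows curves that coincide at low test values and separate as the value rises, so that the expected risk reduction depends on each patient's precise result.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only so that the two curves coincide at
# low test values and separate as the test value rises.
INTERCEPT = -6.0
SLOPE_PLACEBO = 0.06
SLOPE_TREATMENT = 0.04


def risk(test_value, treated):
    """Probability of the adverse outcome at a given numerical test result."""
    slope = SLOPE_TREATMENT if treated else SLOPE_PLACEBO
    return logistic(INTERCEPT + slope * test_value)


def risk_reduction(test_value):
    """Absolute risk reduction (placebo minus treatment) at this test value."""
    return risk(test_value, treated=False) - risk(test_value, treated=True)


for value in (0, 25, 50, 75, 100):
    print(
        f"test={value:3d} placebo={risk(value, False):.3f} "
        f"treatment={risk(value, True):.3f} reduction={risk_reduction(value):.3f}"
    )
```

With these made-up coefficients the reduction is zero at the lowest value and grows steadily across the range, which a single ‘positive’ label covering everyone above a cut-off would hide.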

Maybe the way forward for personalised medicine is for governments to support expert statistical input and other resources to help us clinicians to develop these curves and related concepts.

1 Like

Hi Huw

I appreciate your faith that the proposals in this paper will, at some point, permeate my thick skull if they are explained in different ways. Unfortunately, I think my brain resists attempts to make sense of arguments that hinge on wildly implausible medical/physiologic/behavioural premises and assumptions. It rebels against the futility of the whole exercise. Arguing that patient preference could serve as a reliable predictor of patient “responsiveness” to a treatment feels like a very deus ex machina way to overcome the barriers to personalized medicine. But it’s also very possible that I’m just not capable of thinking deeply enough about all this…

Since physicians have to make many treatment-related decisions every day, our decision-making framework needs to be compatible with rapid decision-making. I’m not saying that we won’t someday be able to fine-tune how we choose therapies for patients. Rather, I’m pointing out that if the process of “fine-tuning” takes an hour, it will never, realistically, be implemented. In that world, physicians would only be able to see two patients per day.

I see a lot more opportunity for “personalization” in fields where therapies have a combination of modest efficacy but guaranteed toxicity. Risk/benefit assessments can be challenging in these scenarios, so it’s important to elicit patients’ values and preferences (“utilities”) before recommending a particular course of action. But decisions of this complexity are not the norm in medicine. Instead, most physicians, without necessarily articulating exactly what they are doing, apply a more crude, yet efficient approach. They preferentially prescribe therapies with established intrinsic efficacy and then use informal versions of “N-of-1” trials to fine-tune their choices over time.

2 Likes

I agree that we make clinical decisions for individual patients quickly and intuitively by taking the preferences or utilities of patients into account. The process is often iterative, trying different diagnoses or treatments / doses to optimise outcome, which is what I understand by ‘personalized’ medicine. This depends on reliable information that predicts outcomes with as much certainty as possible to minimise the number of iterations required.

This discussion was about a different matter. It was how to assess the ability of diagnostic information to allow diagnoses and decisions to be made as successfully as possible by leading to favourable outcomes as often as possible. If diagnostic tests and treatments are of little or no value, then we should not use them. The way diagnostic tests are assessed at present is very confused as illustrated by the bizarre example used by intelligent lay people and the discussion here that required a huge effort to disentangle. Poor assessment of diagnostic tests leads to a waste of money and clinicians’ time and needs sorting out. As you say, there is a danger that funding will be taken away from more rational approaches unless the latter can assert themselves.

3 Likes

I’m completely sympathetic with your feeling here, Erin. My impression from the outset had been that the P&M argument was simply bad-faith shitposting, and that everybody responding to it on this thread was being taken for a ride. Your recent post in which P&M apparently double down on all this suggests now a different phenomenon is at work.

1 Like

Perhaps P&M did not set out in bad faith. Perhaps they were carried away by thinking that they had discovered something new and exciting to improve medical diagnosis and decision making by adding observational studies. They made many mistakes by trying to prove this using an implausible example and tortuous and erroneous reasoning. For example, the RCT control group was identical in females and males, so that P(y) – P(y_c) for females and males should be the same. However, for females they came up with an answer of 0.09, whereas for males it was 0.49 instead of being the same. This is how they arrived at the ‘correct’ values of P(Benefit) and P(Harm): by making a mistake when calculating one of them. We would only be able to know which was wrong if they had specified the value of P(y) (which could have been 0.3 or 0.7 when the value of P(y_c) was 0.21).

The point is that they created the RCT and observational study results and the values of P(Benefit) and P(Harm) by allocating values to them. I have shown already that these values could have been estimated directly from the RCT using covariates without having to resort to an observational study. I have already raised this with P&M on Twitter in the past but they do not appear to have understood. Erin pointed out that Judea Pearl continues to publicise their discovery in the preface to a new book this year, claiming it to be the way forward. On the contrary, I think the way forward is a better use of covariates in diagnosis and treatment selection, especially by analysing their numerical values carefully prior to any dichotomisation.

2 Likes

David

You might be right. If so, it’s important that criticism be properly contextualized. Questionable claims made in good faith by otherwise brilliant people, late in life, should not overshadow an impressive life’s work. I think that everyone in this forum would agree that compassion in such circumstances is essential. But it’s also important not to just sweep under the rug the harms that can flow when prominent figures dabble in fields they don’t understand. The COVID pandemic provided many disturbing examples of this phenomenon. People pay attention when prominent people speak, even if their ideas are (dangerously) off track.

I suspect that the author would have been more circumspect if he understood just how easily claims related to “personalized medicine” can be co-opted, in the modern world, by profit-seekers who lack any concern for patient welfare. These are not flames that most healthcare providers want to fan…

3 Likes

Quite possibly the post-hoc power team from MGH didn’t set out in bad faith, either; but we know how that ended up. The crucial principle is to keep the faith.

Indeed, and this is the only reason I even addressed P&M in this thread. Their argument effectively “floods the zone” Steve Bannon style, “stinking up the joint” for people who are keeping the faith with the concepts of individuality and person-centeredness in medicine.

2 Likes

You make another good point. There is more than one reason why people who actually work in the healthcare field might find claims like the ones in this paper so frustrating. Many physician scientists likely have a deep understanding of promising potential niches for more individualized approaches to therapy (e.g., dose selection to optimize risk/benefit; research to identify tumour markers that could result in development of new therapeutic targets…). But the expertise of these highly specialized applied scientists constrains them from making exuberant claims that might otherwise swallow huge sums of funding money and mislead upcoming young scientists.

2 Likes

Some thoughts I had about how to approach benefit bounds from a Bayesian perspective!

Richard McElreath published this cool video about how Bayesian statistics is just counting all the ways data can be produced given our assumptions:

Basically causal bounds for personalized net-benefit are not different. Assuming monotonicity, the estimated probability of Benefit must be bounded by

0 ⩽ P(benefit) ⩽ P(t, y) + P(c, y′)

Where P(t, y) stands for the observed treated patients with a positive outcome while P(c, y’) stands for the untreated patients with a negative outcome.

From McElreath’s perspective, it’s just the range of ways that the data can be produced.

Treated patients with a bad outcome [ P(t, y’) ] and untreated patients with a good outcome [ P(c, y) ] could never be compliers; they must be never-takers or always-takers and could have no benefit from the treatment.
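The bound described above can be sketched in a few lines of Python. The function name `benefit_bounds` and the example proportions are hypothetical, invented only to show the arithmetic; t and c denote treated and untreated, y and y′ a good and a bad outcome, as in the post.

```python
def benefit_bounds(p_t_y, p_t_ybad, p_c_y, p_c_ybad):
    """Bounds on P(benefit) from the four observed joint proportions.

    Only treated patients with a good outcome, P(t, y), and untreated
    patients with a bad outcome, P(c, y'), are compatible with benefit,
    so their sum is the upper bound; the lower bound is 0.
    """
    total = p_t_y + p_t_ybad + p_c_y + p_c_ybad
    if abs(total - 1.0) > 1e-9:
        raise ValueError("the four joint proportions must sum to 1")
    return 0.0, p_t_y + p_c_ybad


# Hypothetical proportions, for illustration only.
lower, upper = benefit_bounds(p_t_y=0.30, p_t_ybad=0.20, p_c_y=0.15, p_c_ybad=0.35)
print(f"{lower:.2f} <= P(benefit) <= {upper:.2f}")
```

The two cells left out of the upper bound, P(t, y′) and P(c, y), are exactly the groups described above as unable to have benefited.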

This is really cool because we don’t need any DAG or prior probabilities for these first steps in our estimation process.

Most prediction models implicitly assume an untreated population; that’s why we would like to have higher True-Positive / Lift / PPV / NB values.

But in fact, if the dataset stands for an untreated population, these metrics are equivalent, after simple transformations, to the upper bound of the causal bounds.

2 Likes

“Treated patients with a bad outcome [ P(t, y’) ] and untreated patients with a good outcome [ P(c, y) ] could never be compliers; they must be never-takers or always-takers and could have no benefit from the treatment.”

These types of deterministic statements will simply not resonate with physicians. The idea that patients are somehow “hardwired” to respond the same way to a treatment each time they are exposed is simply false in the vast majority of clinical situations. Clinical scenarios in which a patient’s response is highly predictable from one treatment episode to the next because it depends strongly on the patient’s “ingrained” immunology/genetics are actually relatively UNcommon. The overwhelming majority of treatments that physicians apply on a daily basis are NOT characterized by this degree of response predictability.

There’s been so much back-and-forth about the article that originally prompted this thread that it might be helpful to catalogue the exchange chronologically (some versions of the papers were initially published on arXiv and later in journals so it’s a bit hard to keep track of the entire exchange):

Original M&P paper:

Response from Senn and Dawid:

https://arxiv.org/pdf/2301.11976

Response from Stensrud and Sarvet:

https://academic.oup.com/aje/article/194/6/1743/7226668

M&P response to Stensrud and Sarvet (paywalled):

https://academic.oup.com/aje/article-abstract/194/6/1749/7909739?redirectedFrom=fulltext

Stensrud and Sarvet rejoinder (paywalled):

arXiv version: https://arxiv.org/pdf/2403.14869
Paywalled published version: https://academic.oup.com/aje/article-abstract/194/6/1752/8107613

The limitations of deterministic potential outcomes that were discussed in Andrew Gelman’s 2021 blog (see post # 235) prompted him to publish a followup paper on the topic. It seems to speak to the heart of the criticism of the original M&P paper with regard to implications for optimal decision-making:

https://sites.stat.columbia.edu/gelman/research/unpublished/russian_roulette.pdf

3 Likes

What exactly do you mean by “each time”?

Good question. Sometimes a patient will seem to respond to a treatment initially, but the treatment may lose effect over time (e.g., antibiotics, cancer treatments, antidepressants, painkillers). And sometimes, for diseases that are “episodic” (e.g., major depression), a patient who didn’t improve after treatment during a previous episode may improve when the same drug is tried during a subsequent episode (and vice versa).

I don’t look at my patient panel and label people as “responders” or “non-responders” to their various treatments; this type of classification is simply not valid for most diseases and treatments. But nor do I disregard the fact that a patient has seemed to benefit previously from one treatment but not another (though, for waxing/waning conditions, the causal role of a treatment in the patient’s improvement is often hard to gauge). If a patient got better after I prescribed a certain treatment for their condition in the past, I’m likely to select the same treatment for them the next time. But if they don’t get better with subsequent treatment episodes, I’m not overly surprised.

2 Likes

I think you aim at a more complicated problem that requires additional assumptions such as sequential exchangeability.

I see causal bounds as a baseline when the outcome is binary and well defined for a fixed time horizon. For this task, “Consistency” is enough.

Regardless, you might like this wonderful work by Ruth Keogh:

2 Likes

Thanks for the link, but, given my educational background, I don’t understand the nomenclature and math.

Even though I don’t understand what “sequential exchangeability” is, I think you have identified the crux of the problem here. Human biology and behaviour are a heck of a lot more complicated than M&P seem to acknowledge. M&P seem to know a lot about math and symbols but their framework betrays little understanding of the complexity of human physiology/biology. Conversely, most clinicians will have very little understanding of math/symbols but a very deep understanding of human physiology/biology/behaviour.

This thread, and the published rebuttals to M&P linked above, document significant push-back from statisticians and clinicians on the M&P framework. Unfortunately, the authors don’t appear to have been receptive to the feedback. But, fortunately, a decision-making framework that hinges on a rigidly deterministic view of human physiology and behaviour (and how these interact to determine treatment response) will never be adopted by the clinical community.

3 Likes

Honestly, you seem to know a lot about math! :innocent:

I think the pushback has more to do with cultural differences than with actual disagreement over a “deterministic view.” In fact, I didn’t see much pushback specifically about consistency here — but I’ll make an effort to reread the thread.

P.S.
Consistency is also debated within the “causal” camp: Pearl disagrees with the potential-outcome framework. You might find this interesting:
https://escholarship.org/uc/item/6nv2744w

2 Likes

I’m surprised no one mentioned this article by Philip Dawid & @Stephen :

2 Likes

Wow this paper is impressive. I hope that Mueller or Pearl respond to it.

It is also interesting to see the criticism of counterfactuals by AP Dawid back in 2000 under the title “Causal Inference without Counterfactuals”

And Pearl’s response:

2 Likes

The Dawid/Senn paper was first cited back in post #222 in this thread, and again in post #250. There’s also a lot more back-and-forth about the M&P paper involving Stensrud/Sarvet- I’ve linked to all the relevant responses in post #250.

3 Likes

Correct me if I’m wrong but it appears that Philip Dawid & @Stephen are claiming ethical superiority of their preferred methodology. However, there is no virtue in claiming ethical superiority once the bodies have already been laid to rest…and they have.

Are these conclusions, especially those about ethics, not based on DT being optimal given a valid causal model? If L is non-causal or incoherent under intervention (do(X,L)), a clinician’s intuitive counterfactual reasoning may well outperform formal DT predictions.

We observed that with the COVID pandemic. The DT conclusion was that EARLY mechanical ventilation based on the PaO2/FIO2 (P/F) was indicated with P/F adjusted PEEP tables. This instead caused considerable death and precipitated the revolt in 2020 by the bedside clinicians against the RCT based ventilator protocol, (which has now been abandoned). The reason was that Y= do(X,L) was incoherent with Y=(X=x,L).

It is often difficult at the bedside to see that the relevant covariate mix was fatally different from the tested covariate mix.

IMHO no analysis which presumes valid transportability of RCT can be promulgated as ethically superior as this leads to a false sense of superiority in all instances. It was this false sense of security which caused the most faithful to RCT based EBM (who most assuredly perceived ethical superiority) to try to hold the line in favor of the deadly DT “evidenced based” ventilator protocols.

Obviously there are times when counterfactual reasoning based on physician assessment is superior, and it is equally obvious that, provided the RCT is fully transportable, DT is the optimal choice. We at the bedside are often required to make decisions somewhere between those extremes.

Introspective words about the potential weaknesses of DT at the bedside, to the effect that practitioners must be “cautious and humble”, do not mitigate a claim of ethical superiority made in 2023, only 2 years after the world suffered the greatest DT failure in the history of mankind. The post-failure era is the time to determine the “why”: specifically, why DT, despite such beautiful internal validity, can fail the public in such a catastrophic way. Actionable epistemic humility seeks help. IMHO, that’s where the intellectual effort should be most greatly focused, but instead there is discussion of “cheating way round this” by assuming a hypothetical “exchangeable” target. That did not work in 2020.

The use of cSM as a tool to enhance RCT design (and mitigate the potentially deadly problem of poor RCT transportability) would be a good place to start. Working together to define synergy, with a common goal of improving the research ASAP for clinicians, would be IMHO a much more clinically efficacious use of all of this unmitigated brilliance.