Individual response

Yes. It’s important not to indict the RCT method simply because some researchers, historically, haven’t designed trials with enough foresight or care.

Assay sensitivity seems to be the key ingredient missing in the design of many of the RCTs Lawrence is concerned about. “Post-operative status” isn’t a disease, so maybe it’s not a great inclusion criterion for an RCT (in contrast to, e.g., acute occlusion MI or gallstone pancreatitis).

Randomly assigning an experimental intervention to subjects who are in a “physiologic state” that can be arrived at by many different biologic pathways, without a deep understanding of the prognostic distribution of untreated subjects, isn’t an ideal approach. If we could turn back the clock to a time before perioperative pulse oximetry became routine, we could maybe imagine a better way to design RCTs to reveal its benefits. For example, a reasonable first step might have been careful analysis of cases involving patients who died suddenly in the perioperative period while not being monitored. Attention to cause of death, and to measures that might plausibly have averted a bad outcome (e.g., a pulse oximeter alarming), might have identified a patient subset more likely to benefit from oximetry. After that, an RCT aiming to corroborate a benefit for oximetry could have been enriched with higher-risk patients. Observing more adverse outcomes might have allowed any intrinsic benefit of oximetry to be detected more efficiently. For example, maybe a trial that enrolled only post-op patients with COPD or neuromuscular disease could show a benefit, whereas an RCT involving “all-comers” in the post-op period would not.

The above process sounds logical enough. But once a medical practice has become firmly established, there will be many who argue that clinical equipoise has been lost. This is especially true if the intervention is cheap and doesn’t use a lot of resources, where downsides to empiric intervention, even without RCT “proof” of benefit, are hard to fathom, and where the potential consequences of not intervening are serious. The second of these points (that downsides to intervention are hard to fathom) is the real nub of the issue. Often, some stakeholders perceive potential downsides to an intervention where other stakeholders don’t (see the endless debate re mask mandates during the pandemic). In the case of pulse oximetry, every anesthetist can probably recall a few cases where a pulse oximeter was the first indicator of a patient’s unanticipated abrupt postoperative decompensation; those types of cases probably stick with a person for a very long time…

Finally, it seems important not to start seeing the potential for qualitative interactions everywhere we look. While their presence might be more plausible in a poorly designed RCT that has lumped together a pile of patients who have no business being part of the same experiment, a well-designed RCT focusing on patients with more homogeneous disease (e.g., acute occlusion MI) would probably be much less likely to involve important qualitative treatment-by-patient interactions.

Arguing that EVERY ostensibly neutral RCT plausibly might be “hiding” signals of efficacy that have simply been “obscured” by qualitative interactions assumes that EVERY treatment we can imagine plausibly has the potential to benefit some patients and harm others; we just need to keep examining people on a more and more granular level in order to distinguish “responders” from “non-responders.” But of course, this argument is susceptible to infinite regress and isn’t a realistic basis for approving new drugs and devices.

4 Likes

Yes, here they were studying a continuous testing device generating a dynamic testing result. Such dynamic testing invariably transitions from true negative to false negative to true positive. The period of false negativity poses a risk to any subset of patients requiring time-sensitive intervention because it induces a false sense of security, which may cause a delay in critical, time-sensitive intervention.

Now we see that the mix of pathophysiologies yields a mix of patients in whom the period of false negativity is long (harm) and others in whom it is short (benefit). So this is the same fundamental problem I discussed previously in relation to RCTs of poorly defined synthetic syndromes, like sleep apnea, sepsis, and ARDS.

The key here is that there has been a decades-long general misunderstanding of the applicability of RCTs to the investigation of testing or intervention in poorly defined populations, where the measurement (e.g., “all patients having procedure X”) is not a valid measurement for the treatment or test being studied.

Finally, the dynamic behavior of the individual confusion matrices in specific relation to the range of pathophysiologies under test must be understood. All of this requires deep observational research to learn the dynamic relational patterns of the target adverse conditions. The idea that an RCT can routinely replace discovery in complex heterogeneous environments is not true. Observational studies (OS) are the source of the requisite initial discovery.

I bring this to this “individual response” discussion because it shows the nuanced relationship between individual harm and individual benefit, how these can be routinely hidden within the average, and how this can result in wrongful conclusions. Applying the test in an RCT to a select population most likely to benefit might bias the result toward benefit if the test is then applied to a broader population, because the number of those most likely to be harmed might be diminished.

These fundamental considerations underlie the potential effects of the severity disparities induced by choice. Unless the OS is constructed with an informed and narrow focus, not only might the severity be different in the refuseniks, the pathophysiology itself might be different.

I cannot see why such a delay would be attributed to oximetry rather than to the monitoring process itself. If we assume that the monitoring process was indeed fully understood by the study investigators, then the question that should be asked is: why was such a study conceptualized and actually done?

Pulse oximetry and the “monitoring process” are the same thing here. The investigators were trying to determine whether perioperative pulse oximetry reduced complications, a benefit many take for granted (hence the parachute analogy). They failed to understand how the broad entry criteria (measurement) might affect the heterogeneity of treatment effects (HTE) (as explained below) and probably failed to recognize that pulse oximetry can cause harm. This lack of understanding of the relationship of “individual response” (and particularly HTE) to the entry criteria is ubiquitous.

I bring this to this “individual response” discussion because HTE underlies the type of analysis under discussion here. If the entry criteria (measurements) are broad and include different groups of pathophysiologies (e.g., anxiety, depression, sleep apnea, sepsis, ARDS, the perioperative state), then HTE is high and the average treatment effect (ATE) will be biased by the subset mix of the pathophysiologies captured in the instant RCT or OS. This may be a much larger effect than “refusenik bias” of the OS. However, if, in the alternative, the criteria were narrowly chosen with a measurement which reliably captures the target pathophysiology, so that the relevant (target) pathophysiology is generally present in the study population (e.g., a throat culture positive for group A strep in the investigation of the efficacy of a new antibiotic), then refusenik bias would rise as a more relevant issue…

HTE is a function of the entry criteria (measurement). HTE, as it relates to the mix of captured groupings of pathophysiologies, can have a similar effect on OS, so the key here in some settings is to use the OS as exploratory, to find the measurements which identify the target population so the RCT can be narrowed with a reliable measurement (e.g., a biomarker or other mathematical tool); see the sketch below…
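
To make the point about the mix concrete, here is a minimal sketch (in Python, with entirely hypothetical subgroup names and effect sizes) of how the same broad entry criterion can yield different average treatment effects depending on the mix of pathophysiologies that happens to be enrolled.

```python
# Hypothetical illustration: the ATE of a "broad criteria" trial is a weighted
# average of pathophysiology-specific effects, so it depends on the subset mix
# that a given trial happens to capture.

def average_treatment_effect(subgroup_effects, subgroup_mix):
    """Weighted average of subgroup effects; mix proportions must sum to 1."""
    assert abs(sum(subgroup_mix.values()) - 1.0) < 1e-9
    return sum(subgroup_effects[g] * subgroup_mix[g] for g in subgroup_mix)

# Assumed absolute risk reductions (positive = benefit, negative = harm)
effects = {"harmed_pathophysiology":    -0.05,   # e.g., false sense of security
           "benefited_pathophysiology": +0.10,   # e.g., early warning
           "no_target_pathology":        0.0}

# Two trials with identical entry criteria but different enrolled mixes
mix_ward_A = {"harmed_pathophysiology": 0.40, "benefited_pathophysiology": 0.20, "no_target_pathology": 0.40}
mix_ward_B = {"harmed_pathophysiology": 0.10, "benefited_pathophysiology": 0.50, "no_target_pathology": 0.40}

print(average_treatment_effect(effects, mix_ward_A))  # ~0.0   -> a "neutral" trial
print(average_treatment_effect(effects, mix_ward_B))  # ~0.045 -> a "positive" trial
```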

Oncologic RCTs have evolved over the past two decades to reduce HTE by narrowing (and rendering more homogeneous) the population under test relevant to the target pathophysiology. Many other disciplines have failed to make these moves and continue to produce useless RCT cargo-cult ruminations (Langmuir's pathological science).

My point is that HTE as a function of broad entry criteria induces the potential for high bidirectional “individual response” and marked individual response heterogeneity, which may dominate and render refusenik bias moot in many fields.

So let's actually look at a couple of potential “individual responses” to continuous pulse oximetry monitoring. (By the way, knowing these things may help you next time the nurse discounts your concern about a loved one’s shortness of breath by pointing out that your loved one’s pulse oximetry readings are satisfactory.)

Pulse Oximetry Harm
This is a common harm: a classic false sense of security caused by reliance on the pulse oximetry signal as an indicator of respiratory stability. In the Type I pattern the patient progressively increases ventilation, moving more air (and therefore more oxygen) into the lungs, so the oxygen saturation by pulse oximetry (SpO2) may remain normal till near death.

Here I present a typical Type I Pattern of Unexpected Hospital Death (PUHD). This is a typical relational time-series pattern of signals in sepsis, congestive heart failure, and other common conditions. Note that the SpO2 (oxygen saturation by pulse oximetry) can actually rise early in the death process and then remain stable and normal till close to death. This is caused by the compensatory response of increasing ventilation volume (Ve) and respiratory rate (RR).

Note the potentially fatal false sense of security provided by the pulse oximeter. Of course, death may not occur; rather, there may be complications such as organ injury or prolonged hospital stay due to late detection.

Pulse Oximetry Benefit
In stark contrast (but studied in the same black-box RCT), here is a Type II PUHD. Note that here pulse oximetry will provide benefit, because the SpO2 falls early in the death pattern, providing early warning and potentially preventing death or complications (particularly if the patient is not receiving supplemental oxygen).

Now the details are different from the @scott example cited by @Stephen but the fundamentals are the same.

There is individual benefit, individual harm, and potential for refusal (which refusal may benefit or harm). The harm or benefit cannot be predicted prior to the study without more knowledge (which might be provided by a prior high-density-data OS). The authors are probably unaware that the individual pathophysiology here affects the probability of harm or benefit. Two such RCTs could be done at different centers or on different wards with different pathophysiology sets…

You see how the trialists and statisticians were certain of the validity as long as N was sufficient and compliance was good enough: the RCT in black-box format (i.e., does X cause benefit or harm when applied to population Y, as defined by broad and/or capricious criteria Z?).

Here we see the contrast and consequence of this during the pandemic.

Does dexamethasone improve survival when given to patients with ARDS?
RCT answer: No

Does dexamethasone improve survival when given to patients with ARDS due to COVID?
RCT answer: Yes

How do you reconcile that without deciding that “RCT for ARDS” is a pitfall? The answer is the same as for pulse oximetry. Combining the pathophysiologies by using “the perioperative state” as the entry criterion rendered a SET with a mix of individual responses which averaged out to show neither harm nor benefit in the pulse oximetry RCTs, just as occurred with the pathophysiologic mix under test in the ARDS RCTs, as derived from the threshold-based, nonspecific criteria of the many ARDS RCTs. The mix of individual responses in the SET under test determined the average treatment effect of the study. However, that ATE may be markedly different from the average treatment effect of another population defined by exactly the same broad criteria.

So “individual response”, the focus of this thread, is not just a function of the individual but also a function of the specific pathophysiology affecting the individual; more importantly, it is a function of the percentage of individuals in the SET under test who meet the criteria for entry into the RCT but lack the fundamental target pathophysiology.

Finding means to narrow the criteria for the population under test to those with the target pathophysiology is the first step, before deciding if a valid RCT can be reliably performed…

Here, the synergies between OS and RCT are clear. An OS may provide an incorrect answer due to lack of randomization or suitable controls. Yet OS should be performed in complex populations because they provide much greater density of time data, for example time-series matrix data from the EMR, to render transparent potential individual responses. This was shown in the instant example wherein, before the RCT can be reliably performed, one must learn the lessons of the pathophysiologic basis for pulse oximetry benefit or harm.

During the COVID pandemic, most of those trusting in evidence-based medicine were absolutely sure they had RCT evidence that corticosteroids did not work for COVID ARDS and were highly critical of their empiric use. (Given that death was often due to an overwhelming inflammatory response, empiric dexamethasone would have made pathophysiologic sense if the RCT did not exist.) The number of lives lost due to that false EBM was likely quite high. We could not have known that, or what to do, but, like the pulse oximetry in the Type I pattern of unexpected hospital death, we had a false sense of security that dexamethasone would NOT work. Perhaps clinicians would have been less sure if everyone had been a little more forthcoming about the potential weakness of RCTs when applied with broad criteria. This is how failure to consider the heterogeneity of “individual response”, as noted by @scott, can cause false EBM and harm to the public.

We love RCT-based EBM; it's our base. Yet lack of reform is robbing EBM of its standing. This article shows the negative drift of the image of EBM, which could be prevented by converting into objective terms the qualities required for entry criteria.

I know I am off track from what the group wants to talk about in this thread, and it's a great thread and I do not want it to be ended. I will cease so you can get back to it. Regards.

Ref.

I can see how important patient/treatment qualitative interactions could be missed as a result of poor RCT design (e.g., inappropriate “lumping” of patients in disparate clinical states into a single trial). Failure to do adequate preparatory study to optimize disease definition, trial inclusion criteria, and measurement tools would be analogous to a drug company skipping preclinical or early phase clinical studies and jumping to phase III- the chance of success would be very low (see below).

I’m not sure whether this problem (which seems much more prevalent in certain medical specialties than others) could be described as suboptimal “construct validity”(?) Whatever it’s called, we’ve discussed how it could lead to noisy trial results, with a net benefit in some subgroups plausibly being obscured by net harm experienced by other subgroups (yielding an overall neutral trial result). Having said this though, I suspect that poor construct validity probably isn’t the “rate-limiting” step in the effort to discover efficacious new therapies in most disease areas. The fact is that it’s really hard to discover new treatments, even for stakeholders with every possible resource at their disposal- the pharmaceutical industry:

“Drug discovery and development is a long, costly, and high-risk process that takes over 10–15 years with an average cost of over $1–2 billion for each new drug to be approved for clinical use [1]. For any pharmaceutical company or academic institution, it is a big achievement to advance a drug candidate to phase I clinical trial after drug candidates are rigorously optimized at preclinical stage. However, nine out of ten drug candidates after they have entered clinical studies would fail during phase I, II, III clinical trials and drug approval [2,3]. It is also worth noting that the 90% failure rate is for the drug candidates that are already advanced to phase I clinical trial, which does not include the drug candidates in the preclinical stages. If drug candidates in the preclinical stage are also counted, the failure rate of drug discovery/development is even higher than 90%.”

Everybody knows how rigorous the drug development process is. Pharmaceutical companies expend colossal effort trying to optimize drug dose and trial inclusion criteria, in order to tease out the intrinsic efficacy of a new molecule, if it’s present. Since financial stakes are very high, every effort is made to minimize “noise” in trial results that could obscure an efficacy signal. And yet, even these maximally-financially-incentivized stakeholders have abysmal success rates for bringing new drugs to market. So viewing the situation in this light, maybe it’s not so surprising that researchers who are NOT affiliated with pharmaceutical companies (and therefore have fewer resources at their disposal) and who are often testing complex, nonspecific interventions (e.g., “sepsis bundles”, perioperative pulse oximetry) for heterogeneous/poorly-defined conditions, rather than intensively-targeted new molecules directed at highly-specific biologic pathways for homogeneous/well-defined conditions, rarely meet with success…

Discovering efficacious new treatments is very hard in medicine, across the board, even under “optimal” testing conditions. Since success is infrequent even in the noise-minimizing conditions created by pharmaceutical companies testing new molecules, should we really be surprised that success rates are near zero in fields where noise is rampant? “Insanity is doing the same thing over and over” and all that…

Apologies for resurrecting this old thread, but I came across some additional relevant publications and links (phrases that stood out to me are bolded).

Recently published:

“This formulation has the following simple implication: when policy-makers optimize counterfactual utilities, then, in general, more people will die. Proponents of a counterfactual approach may argue: but all deaths are not considered equal; they may argue that the true utility, given by…, is one that uses the possibly asymmetric counterfactual utility function on the principal strata that is appropriately formulated to reflect notions of counterfactual harm. However, a serious problem is that we have no direct evidence that these principal strata exist. Even if one makes the metaphysical commitment to their existence, a patient will never know their true principal stratum, except under extreme circumstances; thus, no patient will ever know their post-hoc utility, and no policy may ever be evaluated, or compared to an alternative policy, using direct observations. In other words, the counterfactual approach requires a faithful belief in metaphysical objects whose existence can neither be confirmed nor denied. While patients may be free to hold such beliefs, policy-makers should know the implications: when a counterfactual framework is deployed to determine social policies and regulations, it coerces conformity to an unverifiable metaphysics and a corresponding logic that deals in those terms. In contrast, when an interventionist framework is deployed in such a setting, no such coercion is made, and patient and group outcomes are observable and thus transparent…

…We have defined and contrasted counterfactual and interventionist approaches to decision making. Contrary to claims of some authors [15, 7], a counterfactual approach should not necessarily portend a revolution in personalized medicine… In tension with proponents of a counterfactual approach, we have reviewed several practical and philosophical considerations that seem to problematize its use and challenge some of its core premises, e.g. that it somehow naturally corresponds with prevailing medical ethics and legal practice. Perhaps most problematic from a population policy-maker’s perspective: when the outcome is death and a counterfactual approach is used, in general, more people will die under the identified optimal policy compared to that identified by an interventionist approach. A strong critique of the counterfactual approach calls it “dangerously misguided” and warns that “real harm will ensue if it is applied in practice” [24]. We take the following stance: as causal inference become increasingly embedded in the development of personalized medicine, it is important that stakeholders clearly understand the different approaches to decision making and their practical and philosophical consequences.”

A rebuttal:

“…In a recent paper Mueller and Pearl, 2023, we illustrate an example of a treatment that diminishes the death rate by 30 percentage points, from 80% to 50%, equally in both men and women…

These conclusions are not metaphysical but logically derivable from the available data (assuming that the treatment and outcomes are binary and that the system is deterministic [1], hence every individual must fall within one of the four possible response types, or principal strata, S ∈ {1, 2, 3, 4}, as defined by SS)…

1 The deterministic assumption was contested by Dawid [Dawid, 2000] and defended in [Pearl, 2000]. Dawid’s contention emanates from the observation that the response of each individual may vary with unknown factors (e.g. time of day, previous history, patient’s mood, etc) and cannot, therefore, be a deterministic function of the treatment. However, if we include those factors in the definition of a unit, determinism regains its legitimacy (barring quantum uncertainties).”
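
For readers trying to keep track of the four response types, here is a minimal sketch (in Python, assuming a binary treatment and outcome) of the bounds that RCT margins alone place on the “helped” and “harmed” strata, using the death rates quoted above (80% untreated, 50% treated). It is only an illustration of why the margins do not pin the strata down, not a reproduction of Mueller and Pearl’s calculation, which also brings in observational data.

```python
# Sketch: what an RCT alone can say about the four response types
# (always-survivor, helped, harmed, doomed) for a binary treatment and outcome.
# Numbers taken from the quoted example: death rate 80% untreated, 50% treated.

p_survive_treated = 0.50   # P(survive | do(treat))
p_survive_control = 0.20   # P(survive | do(no treat))

ate = p_survive_treated - p_survive_control           # 0.30

# Frechet-style bounds on the "helped" stratum P(survive if treated, die if untreated)
benefit_lo = max(0.0, ate)
benefit_hi = min(p_survive_treated, 1 - p_survive_control)

# P(helped) - P(harmed) = ATE for every joint distribution consistent with the
# margins, so the "harmed" stratum is bounded by shifting the same interval.
harm_lo = benefit_lo - ate                            # 0.00
harm_hi = benefit_hi - ate                            # 0.20

print(f"P(helped) in [{benefit_lo:.2f}, {benefit_hi:.2f}]")   # [0.30, 0.50]
print(f"P(harmed) in [{harm_lo:.2f}, {harm_hi:.2f}]")         # [0.00, 0.20]
```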

From post #213 above:

“This assumption of ‘consistency’ is therefore unverifiable and unrefutable by study and based on personal belief leading to a forceful assertion.”

This quote seems to be supported by a statement at minute 6:43 in the video linked below: “I strongly believe that we are deterministic machines…”

I’m no philosopher, but belief in causal determinism seems pretty controversial (likely because it’s unprovable/“metaphysical”?). Those who don’t agree with a deterministic view of human biology/physiology/behaviour seem unlikely to adopt any proposed method of decision-making in their field that depends on deterministic assumptions…

In Dr. A. Gelman’s blog from July 26, 2021, there was an interesting discussion between statisticians about deterministic versus stochastic counterfactuals and the potential outcomes framework (entitled “A counterexample to the potential-outcomes model for causal inference”; unfortunately, the link below seems a bit wonky):

https://statmodeling.stat.columbia.edu/2021/07/26/causal-counterexample/

“As regards “I guess the right way to think about it would be to allow some of the variation to be due to real characteristics of the patients and for some of it to be random”, I guess I like to think in terms of mechanisms. In the case of adjuvant chemotherapy, or cardiovascular prevention, an event (cancer recurrence, a heart attack) occurs at the end of a long chain of random processes (blood pressure only damages a vessel in the heart because there is a slight weakness in that vessel, a cancer cell not removed during surgery mutates). We can think of treatments as having a relatively constant risk reduction, so the absolute risk reduction observed in any study depends on the distribution of baseline risk in the study cohort. In other cases such as an antimicrobial or a targeted agent for cancer, you’ll have some patients that will respond (e.g. the microbe is sensitive to the particular drug, the patient’s cancer expresses the protein that is the target) and some that won’t. The absolute risk reduction depends on the distribution of the types of patient.”
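
The quoted point about a “relatively constant risk reduction” can be made concrete with a small sketch (all numbers below are hypothetical): a fixed relative risk reduction yields very different absolute risk reductions depending on the distribution of baseline risk in the enrolled cohort.

```python
# Hypothetical sketch: a treatment with a constant 25% relative risk reduction
# produces different absolute risk reductions depending on the baseline-risk
# distribution of the cohort that happens to be enrolled.

RELATIVE_RISK_REDUCTION = 0.25

def mean_absolute_risk_reduction(baseline_risks):
    """Average ARR across a cohort, assuming a constant relative risk reduction."""
    return sum(r * RELATIVE_RISK_REDUCTION for r in baseline_risks) / len(baseline_risks)

low_risk_cohort  = [0.01, 0.02, 0.02, 0.03]   # e.g., an "all-comers" population
high_risk_cohort = [0.20, 0.30, 0.40, 0.50]   # e.g., an enriched, higher-risk subset

print(mean_absolute_risk_reduction(low_risk_cohort))   # 0.005  (~200 treated per event prevented)
print(mean_absolute_risk_reduction(high_risk_cohort))  # 0.0875 (~11 treated per event prevented)
```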

In the comments section from this blog post:

Sander Greenland, on July 26, 2021, at 2:41 pm, said:

“Andrew: As an instructor like you are, I found deterministic models provide simple, intuitive results that often generalize straightforwardly to all models; but it seems Vickers was getting at how some mechanistic models are much better captured by stochastic potential outcomes. Thus early on I began using stochastic potential outcomes for general methodologic points…In light of my experiences (and the current episode you document) I have to conclude that adequate instruction in causal models must progress from the deterministic to the stochastic case. This is needed even when it is possible to construct the stochastic model from an underlying latent deterministic model. And (as in quantum mechanics) it is not always possible to get everything easily out of deterministic models or generalize all results from them; for example, it became clear early on that while some central results from the usual deterministic potential-outcomes model generalized to the stochastic case (e.g., results on noncollapsibility of effect measures), others did not (e.g., some effect bounds in the causal modeling literature don’t extend to stochastic outcomes). And when dealing with the issues of causal attribution and causation probabilities, Robins and I ended up having to present 2 separate papers for technical details, one for the deterministic and one for the stochastic case (Robins, Greenland 1989. “Estimability and estimation of excess and etiologic fractions”, Statistics in Medicine, 8, 845-859; and “The probability of causation under a stochastic model for individual risks”, Biometrics, 46, 1125-1138, erratum: 1991, 48, 824).”

I doubt that many physicians or social scientists (who contend with problems that reflect the interplay of biological, behavioural, and environmental complexity) would subscribe to a view of human beings as “deterministic machines” who will always react the same way when presented with a certain stimulus/input/treatment. It follows that few who work in these fields would entrust life or death decisions to a decision-making framework that is supported by such an assumption.

4 Likes

Once again, limited stats/epi training limits my understanding of the papers below, but it feels like they’re relevant to this thread. Is this an example of how statistical misunderstandings can become entrenched, perhaps, in this case, encouraging an unrealistically enthusiastic view of the potential for “personalized” medicine(?):

“Simply put, the D-value is the proportion of patients who got worse after the treatment.”

Rebuttal:

“Personalized medicine asks if a new treatment will help a particular patient, rather than if it improves the average response in a population. Without a causal model to distinguish these questions, interpretational mistakes arise. These mistakes are seen in an article by Demidenko [2016] that recommends the “D-value,” which is the probability that a randomly chosen person from the new-treatment group has a higher value for the outcome than a randomly chosen person from the control-treatment group. The abstract states “The D-value has a clear interpretation as the proportion of patients who get worse after the treatment” with similar assertions appearing later. We show these statements are incorrect because they require assumptions about the potential outcomes which are neither testable in randomized experiments nor plausible in general.”

“Over the ensuing two years it became one of the most downloaded articles in The American Statistician – which is alarming in light of the fact that all the causal claims in the article are incorrect…

…How were such profoundly erroneous claims justified? We will show that the claims can be derived by introducing a statistically nonidentified causal assumption, one which we regard as extremely implausible in every setting we can imagine. Because similar hidden assumptions appear to be behind other common misinterpretations of effect measures, and given the attention received by Demidenko [2016], we provide a detailed review of the core problem: failure to recognize when interpretations are based on strong and often implausible assumptions about the effect of treatment on outcome.”
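
To see concretely why the rebuttal says the “proportion who got worse” interpretation requires untestable assumptions, here is a minimal simulation sketch (hypothetical Gaussian outcomes, higher = worse): two scenarios with identical treated and control distributions, and therefore the same D-value, but completely different proportions of individual patients who actually got worse.

```python
# Hypothetical simulation: the D-value depends only on the two marginal outcome
# distributions, but the "proportion of patients who got worse" depends on the
# joint distribution of each patient's two potential outcomes, which an RCT
# cannot observe.
import numpy as np

rng = np.random.default_rng(0)
n, delta = 100_000, 1.0                    # delta = average shift under treatment

y_control = rng.normal(0.0, 1.0, n)        # potential outcome without treatment

# Scenario A: every patient shifts by exactly delta -> 100% are "worse"
y_treated_A = y_control + delta
# Scenario B: treated outcome independent of control outcome, same marginal
y_treated_B = rng.normal(delta, 1.0, n)

def d_value(treated, control):
    """P(outcome of a random treated patient > outcome of a random control patient)."""
    return np.mean(rng.choice(treated, n) > rng.choice(control, n))

for label, y_t in [("A", y_treated_A), ("B", y_treated_B)]:
    print(label, "D-value ~", round(float(d_value(y_t, y_control)), 2),
          "| proportion truly worse:", round(float(np.mean(y_t > y_control)), 2))
# Both scenarios print a D-value of about 0.76, yet the proportion of patients
# who actually got worse is 1.00 in scenario A and about 0.76 in scenario B.
```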

3 Likes

Wow, huge mistakes. Glad that someone is diligent in pointing out the errors.

I’ll just leave this here:

"In the causal modeling arena, the area that has seen the most transformative development in the past seven years has been counterfactual-based decision making, for example in personalized medicine. Many current health-care methods and procedures are guided by population data, obtained from controlled experiments or observational studies. However, the task of going from these data to the level of individual behavior requires counterfactual logic (Chapter 8), where significant new results have been obtained lately.

To exemplify these results, consider the problem of prioritizing patients who are in “greatest need” for treatment, or for testing, or for other scarce resources. “Need” is a counterfactual notion (i.e., patients who would have gotten worse had they not been treated) and cannot be captured by statistical studies alone, be they observational or experimental. A related notion is that of a “harmed” patient, namely, a patient who would die if treated and recover if not treated. Remarkably, despite the individualized character of these notions and the impossibility of observing the same patient both under treatment and under no-treatment, recent developments demonstrate that the “probability of harm” can be quantified by combining data from both experimental and observational studies. (See https://ucla.in/39Ey8sU.) The ramifications of these results are enormous, with applications in medicine, marketing and politics, since the essential criterion in every decision making context is always “situation specific,” be the situation a patient, a physician, an instrument, a location or a time."
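
For readers wondering what “combining data from both experimental and observational studies” looks like operationally, here is a minimal sketch of the Tian–Pearl style bounds that, as I understand it, underlie the quoted claim. The input numbers are illustrative placeholders loosely based on the men’s figures discussed later in this thread; in particular, the observational survival among refusers is an assumption made purely for this example.

```python
# Sketch of bounds on P(benefit) = P(survive if treated AND die if untreated)
# for binary treatment X and outcome Y, combining an RCT with an observational
# study (Tian-Pearl style bounds, as I understand them).

def benefit_bounds(p_y_do_x, p_y_do_notx, p_x, p_y_given_x, p_y_given_notx):
    # Experimental inputs: P(y|do(x)), P(y|do(x')).
    # Observational inputs: P(x), P(y|x), P(y|x').
    p_xy   = p_x * p_y_given_x                  # P(x, y)
    p_xny  = p_x * (1 - p_y_given_x)            # P(x, y')
    p_nxy  = (1 - p_x) * p_y_given_notx         # P(x', y)
    p_nxny = (1 - p_x) * (1 - p_y_given_notx)   # P(x', y')
    p_y    = p_xy + p_nxy                       # observational P(y)

    lower = max(0.0,
                p_y_do_x - p_y_do_notx,
                p_y - p_y_do_notx,
                p_y_do_x - p_y)
    upper = min(p_y_do_x,
                1 - p_y_do_notx,
                p_xy + p_nxny,
                p_y_do_x - p_y_do_notx + p_xny + p_nxy)
    return lower, upper

# Illustrative placeholder inputs:
lo, hi = benefit_bounds(p_y_do_x=0.49, p_y_do_notx=0.21,
                        p_x=0.7, p_y_given_x=0.7, p_y_given_notx=0.7)
print(round(lo, 2), round(hi, 2))   # here the bounds collapse to a point: 0.49 0.49
# P(harm) = P(benefit) - ATE, so its bounds are the same interval shifted by the ATE.
```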

3 Likes

Hi Erin

Your continued unease about this topic has prompted me to go back to my post no. 224 https://discourse.datamethods.org/t/individual-response/5191/224?u=huwllewelyn and to provide a narrative explanation. I shall try to show that what Pearl and Muller did was re-discover how informative covariates can allow more detailed estimates to be made of the probabilities of outcomes on treatment and control, as opposed to new insights about counterfactual reasoning.

They claimed that during the observational study their volunteers chose treatment or no treatment in a revealing way. They said that the proportion of men who accepted treatment in the observational study was 70% and the proportion who refused treatment was 30%. They said that if those in the RCT had been allowed to choose, the same percentages would have expressed positive and negative views about treatment. They also said that of the 70% of men who would have been positive about treatment (perhaps they had severe symptoms), 30% would have died. However, of the 30% of men who were negative about treatment (perhaps because their symptoms were mild and they had more to fear from adverse effects), all would have died in the RCT when allocated to treatment.

There is no reason why the RCT investigators could not have asked the views of the volunteers before they were randomised to treatment or control (especially as they planned to encourage others to choose according to their inclinations in a subsequent follow-up study). The positive or negative attitudes to treatment of the RCT volunteers could then have been used as covariates. If they had done so, they would have detected this surprisingly tragic outcome in those men who were negative about treatment but allocated to treatment. They would then have been in a position to advise men in a subsequent observational study to refuse treatment if they felt negatively about it.

The important point about this is that the information Pearl and Muller claimed could only become available from an observational study, would be available already from the prior RCT by applying a covariate analysis. The matter of the validity of how this information can be applied to counterfactual reasoning is a separate issue as discussed in my post 226 (https://discourse.datamethods.org/t/individual-response/5191/226?u=huwllewelyn).

3 Likes

Thanks Huw! You’re right about my “continued unease” :slight_smile:

The reason why this bothers me is that 1) the feedback provided on the paper in question does not seem to have been taken seriously; and 2) there are a LOT of unscrupulous people in the world trying to profit from the idea of “personalized medicine.” The more pie-in-the-sky promises we see in the literature around personalized medicine, the more likely it becomes that governments will get caught up in the hype cycle. And if they do, there’s a real risk that they will end up diverting government funds from food, shelter, and basic medical care toward cynical profiteers.

Your narrative summary of the paper lays bare (for me at least), the extreme implausibility of its core premises.

6 Likes

Erin

I will try to show how government financial support could be used to make medicine more ‘personalised’ by improving the diagnostic process.

The medical profession has already been practicing ‘personalised medicine’ down the ages by arriving at diagnoses. I will show that Pearl and Muller have been trying to re-invent this wheel, basically getting a similar result using a very implausible example that made their reasoning difficult to follow. They claimed that they were addressing the outcome for a specific individual (hence @Stephen 's title of this topic being ‘Individual response’). However, what they were actually doing was dealing with a smaller subpopulation of the RCT population that resembles that individual much more closely. We would describe these smaller subpopulations, based on covariates, as diagnostic categories.

The purpose of assessing individual response is to choose the plan of most benefit to the individual patient. Turning the clock back to create a ‘counterfactual’ situation and comparing the effect of treatment and placebo by doing this many times can be simulated with an N-of-1 RCT. The process of randomisation does the same job as a time machine. Comparing treatment counterfactually between groups of individuals using a time machine can be simulated with a standard RCT.

Choosing a test as a covariate based on a sound hypothesis before randomisation is basically to create a diagnostic criterion. We may thus create a positive test result that acts as a sufficient criterion that confirms a diagnosis. If the latter is also a necessary criterion, and therefore a definitive criterion, then a negative covariate test acts as a criterion that excludes a diagnosis. Those with a positive diagnosis are expected to respond better to treatment than those testing negative (the latter usually showing ‘insignificant’ or no improvement because they do not have the ‘disease’). The degree of difference in the probability of the outcome with or without treatment reflects the validity of the test as a diagnostic criterion.

Pearl and Muller created an implausible diagnostic category not based on any reasonable medical hypothesis. Instead, their positive diagnostic test was based on the patient ‘feeling positive’ about a treatment (and inclined to accept it if allowed to choose). A negative diagnostic test was the patient ‘feeling negative’ about the drug (and inclined to reject it if allowed to choose). There was no plausible medical hypothesis for the potential utility of this ‘test’. However, I shall refer to them as positive and negative test results to show how they behave as diagnostic criteria.

In their imaginary RCT on men, 49% of the total had a positive test and survived on treatment, whereas 0% of the total had a positive test and survived on placebo, showing a simulated counterfactual ‘Benefit’ of 49-0 = 49%. However, 0% had a negative test and survived on treatment and 21% had a negative test and survived on placebo showing a simulated counterfactual ‘Harm’ of 21-0 = 21%.

From this information we would advise a man with a future positive result to accept treatment, resulting in a survival of 49%, but a man with a negative result to refuse it, resulting in a survival of 21% on placebo. By advising this for men in a subsequent observational study, 21% + 49% = 70% would survive.

In their imaginary RCT on females, 18.9% of the total had a positive test and survived on treatment, whereas 0% had a positive test and survived on placebo giving a ‘Benefit’ 18.9-0 = 18.9%. Moreover, 30% had a negative test result and survived on treatment and 21% had a negative test result and survived on placebo also giving ‘Benefit’ of 30-21 = 9%. Neither a positive nor negative result gave an outcome where placebo was better than treatment, hence ‘Harm’ in females was 0%. This gives a total ‘Benefit’ for females of 18.9+9 = 27.9%.

From this information we would advise a woman to accept treatment whether she felt positive (giving 18.9% survival) or negative (giving 30% survival). By advising this for women in a subsequent observational study, 18.9 + 30 = 48.9% would survive. These figures are readily available from my Bayes P Map in post 224 (https://discourse.datamethods.org/t/individual-response/5191/224?u=huwllewelyn). The P map, representing the application of Bayes' rule to the RCT data, therefore confirms what Pearl and Muller found by using their mathematics.
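
Since all the percentages above are expressed as fractions of the whole RCT population for each sex, a small tally may help readers check the arithmetic; this simply restates the figures in the preceding paragraphs and adds nothing new.

```python
# Tally of the worked example above. Each figure is the percentage of the whole
# RCT population (per sex) in that attitude group surviving on that arm.

men   = {("positive", "treatment"): 49,   ("positive", "placebo"): 0,
         ("negative", "treatment"): 0,    ("negative", "placebo"): 21}

women = {("positive", "treatment"): 18.9, ("positive", "placebo"): 0,
         ("negative", "treatment"): 30,   ("negative", "placebo"): 21}

def survival_under_best_advice(table):
    """Survival if each attitude group is advised the arm with the better outcome."""
    return sum(max(table[(group, arm)] for arm in ("treatment", "placebo"))
               for group in ("positive", "negative"))

print(survival_under_best_advice(men))    # 49 + 21 = 70
print(survival_under_best_advice(women))  # 18.9 + 30 = 48.9
```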

I should add that many experienced physicians, especially endocrinologists, do not dichotomise test results into positive and negative, but interpret intuitively each numerical value on a continuous scale from very low to very high. This is done for the personal numerical result of each patient. There would be two such curves of outcome probability plotted against the numerical test result, one for patients on placebo and another for those on treatment (see Figure 1 in my post on ‘Risk based treatment and the validity of scales of effect’ https://discourse.datamethods.org/t/risk-based-treatment-and-the-validity-of-scales-of-effect/6649?u=huwllewelyn ).

At low / normal values the probability of an adverse outcome on placebo would be near zero, with the treatment and placebo curves superimposed. This would provide near certainty in avoiding treatment as of no benefit, for the purpose of personalised medicine. The remaining parts of the curves (see Figure 1 in my post on ‘Risk based treatment and the validity of scales of effect’ https://discourse.datamethods.org/t/risk-based-treatment-and-the-validity-of-scales-of-effect/6649?u=huwllewelyn ) would provide more personalised probabilities of outcome on treatment and control for each patient, depending on the precise numerical result of a test, than probabilities conditional on positive or negative results that lump together wide ranges of numerical results. If these curves were plotted for degrees of positivity / negativity from the example of Pearl and Muller, the curves would cross over implausibly at some point.
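
Below is a minimal sketch of the kind of curves described here, with entirely invented coefficients: the probability of an adverse outcome on placebo and on treatment modelled as smooth functions of the numerical test result, with the personalised absolute risk reduction read off as the vertical gap at the patient's own value. Real curves would of course be estimated from RCT data with the test result as a covariate.

```python
# Hypothetical sketch of the two curves described above: probability of an
# adverse outcome as a smooth (here logistic) function of the numerical test
# result, one curve for placebo and one for treatment. Coefficients are invented.
import math

def risk_placebo(test_value):
    # Near-zero risk at low/normal values, rising steeply at high values
    return 1 / (1 + math.exp(-(test_value - 8.0)))

def risk_treatment(test_value):
    # Same shape, shifted: treatment lowers risk mainly at higher test values
    return 1 / (1 + math.exp(-(test_value - 10.0)))

for x in (4, 6, 8, 10, 12):
    arr = risk_placebo(x) - risk_treatment(x)   # personalised absolute risk reduction
    print(f"test={x:>2}  placebo risk={risk_placebo(x):.2f}  "
          f"treatment risk={risk_treatment(x):.2f}  ARR={arr:.2f}")
# At low/normal values the two curves are essentially superimposed (ARR near 0,
# so little to gain from treatment); at higher values they separate, giving a
# personalised estimate of benefit for each numerical result.
```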

Maybe the way forward for personalised medicine is for governments to support expert statistical input and other resources to help us clinicians to develop these curves and related concepts.

1 Like

Hi Huw

I appreciate your faith that the proposals in this paper will, at some point, permeate my thick skull if they are explained in different ways. Unfortunately, I think my brain resists attempts to make sense of arguments that hinge on wildly implausible medical/physiologic/behavioural premises and assumptions. It rebels against the futility of the whole exercise. Arguing that patient preference could serve as a reliable predictor of patient “responsiveness” to a treatment feels like a very deus ex machina way to overcome the barriers to personalized medicine. But it’s also very possible that I’m just not capable of thinking deeply enough about all this…

Since physicians have to make many treatment-related decisions every day, our decision-making framework needs to be compatible with rapid decision-making. I’m not saying that we won’t someday be able to fine-tune how we choose therapies for patients. Rather, I’m pointing out that if the process of “fine-tuning” takes an hour, it will never, realistically, be implemented. In that world, physicians would only be able to see two patients per day.

I see a lot more opportunity for “personalization” in fields where therapies have a combination of modest efficacy but guaranteed toxicity. Risk/benefit assessments can be challenging in these scenarios, so it’s important to elicit patients’ values and preferences (“utilities”) before recommending a particular course of action. But decisions of this complexity are not the norm in medicine. Instead, most physicians, without necessarily articulating exactly what they are doing, apply a more crude, yet efficient approach. They preferentially prescribe therapies with established intrinsic efficacy and then use informal versions of “N-of-1” trials to fine-tune their choices over time.

2 Likes

I agree that we make clinical decisions for the individual patients quickly and intuitively by taking the preferences or utilities of patients into account. The process is often iterative by trying different diagnoses or treatments / doses to optimise outcome, which is what I understand by ‘personalized’ medicine. This depends on reliable information that predicts outcomes with as much certainty as possible to minimise the number of iterations required.

This discussion was about a different matter. It was how to assess the ability of diagnostic information to allow diagnoses and decisions to be made as successfully as possible by leading to favourable outcomes as often as possible. If diagnostic tests and treatments are of little or no value, then we should not use them. The way diagnostic tests are assessed at present is very confused as illustrated by the bizarre example used by intelligent lay people and the discussion here that required a huge effort to disentangle. Poor assessment of diagnostic tests leads to a waste of money and clinicians’ time and needs sorting out. As you say, there is a danger that funding will be taken away from more rational approaches unless the latter can assert themselves.

3 Likes

I’m completely sympathetic with your feeling here, Erin. My impression from the outset had been that the P&M argument was simply bad-faith shitposting, and that everybody responding to it on this thread was being taken for a ride. Your recent post in which P&M apparently double down on all this suggests now a different phenomenon is at work.

1 Like

Perhaps P&M did not set out in bad faith. Perhaps they were carried away by thinking that they had discovered something new and exciting to improve medical diagnosis and decision making by adding observational studies. They made many mistakes by trying to prove this using an implausible example and tortuous and erroneous reasoning. For example, the RCT control group was identical in females and males, so that P(y) – P(y_c) for females and males should be the same. However, for females they came up with an answer of 0.09, but instead of being the same for males it was 0.49. This is how they arrived at the ‘correct’ values of P(Benefit) and P(Harm): by making a mistake when calculating one of them. We would only be able to know which was wrong if they had specified the value of P(y) (which could have been 0.3 or 0.7 when the value of P(y_c) was 0.21).

The point is that they created the RCT and observational study results and the values of P(Benefit) and P(Harm) by allocating values to them. I have shown already that these values could have been estimated directly from the RCT using covariates, without having to resort to an observational study. I have already raised this with P&M on Twitter in the past, but they do not appear to have understood. Erin pointed out that Judea Pearl continues to publicise their discovery in the preface to a new book this year, claiming it to be the way forward. On the contrary, I think the way forward is a better use of covariates in diagnosis and treatment selection, especially by analysing their numerical values carefully prior to any dichotomisation.

2 Likes

David

You might be right. If so, it’s important that criticism be properly contextualized. Questionable claims made in good faith by otherwise brilliant people, late in life, should not overshadow an impressive life’s work. I think that everyone in this forum would agree that compassion in such circumstances is essential. But it’s also important not to just sweep under the rug the harms that can flow when prominent figures dabble in fields they don’t understand. The COVID pandemic provided many disturbing examples of this phenomenon. People pay attention when prominent people speak, even if their ideas are (dangerously) off track.

I suspect that the author would have been more circumspect if he understood just how easily claims related to “personalized medicine” can be co-opted, in the modern world, by profit-seekers who lack any concern for patient welfare. These are not flames that most healthcare providers want to fan…

3 Likes

Quite possibly the post-hoc power team from MGH didn’t set out in bad faith, either; but we know how that ended up. The crucial principle is to keep the faith.

Indeed, and this is the only reason I even addressed P&M in this thread. Their argument effectively “floods the zone” Steve Bannon style, “stinking up the joint” for people who are keeping the faith with the concepts of individuality and person-centeredness in medicine.

2 Likes

You make another good point. There is more than one reason why people who actually work in the healthcare field might find claims like the ones in this paper so frustrating. Many physician scientists likely have a deep understanding of promising potential niches for more individualized approaches to therapy (e.g., dose selection to optimize risk/benefit; research to identify tumour markers that could result in development of new therapeutic targets…). But the expertise of these highly specialized applied scientists constrains them from making exuberant claims that might otherwise swallow huge sums of funding money and mislead upcoming young scientists.

2 Likes