Individual response

Blockquote
[information about why treatments are selected in the field] prior information is all important and seldom obtained.

It is true that in the total absence of information on a prognostic variable, randomization is the minimax optimal experimental design. My concern is the use of reported randomization as a post hoc criterion of study quality that automatically trumps observational research.

Statistician Paul Rosenbaum describes the considerations that go into the evaluation of observational studies:

Blockquote
Dismissive criticism rests on the authority of the critic and is so broad and vague that its claims cannot be studied empirically

The retrospective consideration of design factors as measures of quality (commonly seen in meta-analytic guidelines) strikes me as meeting Rosenbaum’s definition of “dismissive criticism.” The relative credibility to place on a reported estimate is context sensitive. It isn’t the lack of randomization that needs to be noticed, but the assertion of a prognostic variate that should (and could) have been controlled but was not. Dismissal of data based on design features alone strikes me as scientifically unconstructive, if not completely irrational.

Your point about confounding by indication is important, and I concede there are likely many scenarios where I would want a randomized study.

My problem is that rehabilitation research is not going to get the funds to run RCTs that a drug or medical device would get, so the large sample sizes needed for credible RCTs are outside the typical budget. The decision-theoretic approach to experiments advises optimization over randomization when samples are small. Most interesting are mixed strategies that inject a bit of randomness into algorithms that create a priori balanced groups. This would satisfy the frequentists who demand randomization, while Bayesian decision theorists are happy that expected information is being maximized by prior information.
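
To make the idea concrete, here is a toy sketch of such a mixed strategy (my own illustration, not any specific algorithm from the papers cited below): sort subjects on a prognostic score, pair adjacent subjects, then randomize within each pair, so the pairing does the balancing while the coin flips preserve a randomization distribution.

```python
# Sketch: a priori balanced groups with a little injected randomness.
# Hypothetical illustration: pair subjects on a prognostic score, then
# randomize treatment within each pair.
import random

def paired_balanced_assignment(prognostic_scores, seed=0):
    """Return 1/0 treatment labels balanced on the prognostic score."""
    rng = random.Random(seed)
    order = sorted(range(len(prognostic_scores)), key=lambda i: prognostic_scores[i])
    assignment = [None] * len(prognostic_scores)
    # Walk the sorted list two at a time; each adjacent pair contributes one
    # treated and one control subject, chosen by a coin flip.
    for a, b in zip(order[::2], order[1::2]):
        t, c = (a, b) if rng.random() < 0.5 else (b, a)
        assignment[t], assignment[c] = 1, 0
    # With an odd sample size the leftover subject is randomized marginally.
    if len(order) % 2:
        assignment[order[-1]] = rng.randint(0, 1)
    return assignment

scores = [2.1, 0.3, 1.7, 0.4, 2.0, 1.1, 0.9, 1.6]
print(paired_balanced_assignment(scores))
```

Within-pair randomization keeps a valid permutation distribution for inference while the deterministic pairing does most of the balancing work.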

Absent well done small experiments, I’d like to see well done observational studies, where there is a lot of room for improvement.

Nathan Kallus (Cornell) has a number of recent papers that go into the technical details and decision-theoretic analysis of the utility of algorithmically creating a priori balanced groups for controlled experiments, and of modifications of these methods for observational causal inference.

Some recent papers on the a priori nonrandom construction of balanced groups, as well as a recent paper with Art Owen (who developed empirical likelihood) about using observational studies to inform RCTs:

Blockquote
Relying on modern optimization methods, kernel allocation, which ensures nearly perfect covariate balance without biasing estimates under model mis-specification, offers sizable advantages in precision and power as demonstrated in a range of real and synthetic examples. We provide strong theoretical guarantees on variance, consistency and rates of convergence and develop special algorithms for design and hypothesis testing.

Blockquote
Motivated by an observational study in spine surgery, in which positivity is violated and the true treatment assignment model is unknown, we present the use of optimal balancing by kernel optimal matching (KOM) to estimate ATE. By uniformly controlling the conditional mean squared error of a weighted estimator over a class of models, KOM simultaneously mitigates issues of possible mis-specification of the treatment assignment.

Blockquote
We propose a novel framework for matching estimators for causal effect from observational data that is based on minimizing the dual norm of estimation error when expressed as an operator. We show that many popular matching estimators can be expressed as optimal in this framework, including nearest-neighbor matching, coarsened exact matching, and mean-matched sampling. This reveals their motivation and aptness as structural priors formulated by embedding the effect in a particular functional space. This also gives rise to a range of new, kernel-based matching estimators that arise when one embeds the effect in a reproducing kernel Hilbert space. Depending on the case, these estimators can be found using either quadratic optimization or integer optimization. We show that estimators based on universal kernels are universally consistent without model specification. In empirical results using both synthetic and real data, the new, kernel-based estimators outperform all standard causal estimators in estimation error.

Branson, Z. (2021). Randomization Tests to Assess Covariate Balance When Designing and Analyzing Matched Datasets. Observational Studies 7(2), 1-36. doi:10.1353/obs.2021.0031.

Blockquote
In this work, we develop a randomization test for the hypothesis that a matched dataset approximates a particular experimental design, such as complete randomization, block randomization, or rerandomization. Our test can incorporate any experimental design, and it allows for a graphical display that puts several designs on the same univariate scale, thereby allowing researchers to pinpoint which design—if any—is most appropriate for a matched dataset. After researchers determine a plausible design, we recommend a randomization based approach for analyzing the matched data, which can incorporate any design and treatment effect estimator. Through simulation, we find that our test can frequently detect violations of randomized assignment that harm inferential results

Blockquote
The increasing availability of passively observed data has yielded a growing interest in “data fusion” methods, which involve merging data from observational and experimental sources to draw causal conclusions. Such methods often require a precarious tradeoff between the unknown bias in the observational dataset and the often-large variance in the experimental dataset. We propose an alternative approach, which avoids this tradeoff: rather than using observational data for inference, we use it to design a more efficient experiment.

There have been too many failures of observational treatment comparisons upon attempted validation for me to resonate with what you’ve written. And to my eye, rehabilitation research will not get a lot of respect until it routinely does RCTs with intention to treat as the basis for analysis. I have seen some doozy rehab efficacy claims in the face of dropouts that could explain away all of the beneficial results. But I do think that a middle ground is still useful: prospectively designed non-randomized rehab studies with prospective, expensive data collection, minimal missing data, and analysis of all data according to the original treatment choice (as in ITT).

2 Likes

Blog quote from Mueller and Pearl:

Reading that quote and the blog, the first question that must be asked is: what, mathematically, is expressed by the imprecise but pivotal word “resembling”? The word is imprecise because there is no precise word or mathematical formula to replace it, yet it comprises a mathematical basis for a major distinction between RCTs and OBS trials. Further, the word defines the probability of the applicability of the RCT output to the instant patient under care.

The problem is that “resembling” is multidimensional. By its nature the word is more precise when dealing with vegetables: a pea does not resemble, from a scientific perspective, anything other than a pea. Even when considering human diseases with relatively precise definitions the word “resembling” works. Resembling in duration of disease, demographics, complications of the disease, and comorbidities? Sure. However, those are the public-facing portions; what about the target condition itself?

Let’s consider two adverse clinical conditions:

  1. Pharyngitis caused by Group A Streptococcus (GAS) and
  2. Adult Respiratory Distress Syndrome (ARDS).

The first, a “strep sore throat” caused by GAS, is fairly straightforward; the term “resembling” works and can be defined by conventional means.

Now, how about ARDS: what does “resembling” mean in the adverse condition ARDS? We have to know; it’s a pivotal word. Here is the problem: no one knows. ARDS is a syndrome defined by a set of criteria that captures a range of diseases with similar-appearing lung dysfunction. So what happens when we conflate a syndrome with a well-defined disease? Here, as it relates to the RCT, the word “resembling” actually becomes a word without definable mathematical meaning. The underlying pathophysiology may be completely different for many of the diseases within the syndrome of ARDS. Given that this is true (and it is), each disease might respond differently to decision x.

To quote the blog: “…population-based decision making optimizes the Conditional Average Causal Effect (CACE)…”

Great, but with ARDS the CACE is not just the average effect for an individual disease but also the average effect over the percentage “MIX” of the individual diseases within the population who met criteria for ARDS.
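
In rough symbols (my own shorthand, with \pi_k the fraction of trial patients whose underlying disease is k and \text{CACE}_k the effect within disease k):

\text{CACE}_{\text{ARDS}} = \sum_k \pi_k \, \text{CACE}_k

so the trial-level average depends on the \pi_k (the MIX) just as much as on the disease-level effects.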

The reason this matters is that the term “resembling,” as it relates to an RCT, now must also refer to the MIX of diseases in the RCT. In other words, the stability of that MIX has to be known. If the MIX is unstable, then the term “resemble” may be ephemeral, or even a mathematical illusion created by the construct of the syndrome itself.

So as long as the MIX is stable, the RCT term “resembling” (now two averages away) may still have some value. However, what if the MIX in the ICU using the output of the RCT is not stable? More importantly, from the perspective of the mathematical integrity of the RCT function, is it possible to measure the MIX and account for it mathematically? Probably not.

The danger of the term “resembling,” and of its lack of mathematical specificity, was demonstrated to the world’s sorrow with the SARS-CoV-2 pandemic. Here the MIX of diseases in ICUs across the world which comprised ARDS suddenly changed. The changing of the MIX was the equivalent of a worldwide OBS trial examining the integrity of the RCTs which defined the protocols for ARDS (treatment x). Now you might ask why an RCT from the past would apply to the pandemic. The answer is that the ARDS syndrome is defined by criteria, NOT by any disease, so it made complete sense to thought leaders to apply the results of the RCTs to pulmonary dysfunction caused by SARS-CoV-2 that met the past criteria for ARDS. It’s not the disease, it’s the criteria.

The result was predictable (at least to me, as I have been sounding the alarm for a decade, and for quite a while in this forum). ARDS protocols included early intubation. Death rates were high. The protocols were eventually abandoned in favor of delayed intubation as a function of clinical observation by rebels on the front line. However, this did not occur quickly, because faith in the ARDS RCTs was anchored deeply in the institutional sand.

Now it is not hard to see that OBS trials are required to track the performance of decision x in the real world. The question is: what is the role of the RCT in complex conditions? What is the price paid for increasing N at the expense of adding unmeasurable heterogeneity? It’s not worth the price.

Here is a review of the results of a worldwide OBS trial of RCT-driven COVID treatment, written for a wide audience.

In critical care the RCT function (the entire function) has to be considered mathematically. Recent expediting concepts such as REMAP-CAP do not address this problem. No sepsis RCT has been reproducible in 3 decades, and no one has reproducibly proven that obstructive sleep apnea (OSA) is morbid in 4 decades. It should come as no surprise that both sepsis and OSA are syndromes defined by 20th-century threshold criteria and studied by RCT.

However, it’s not “thresholds” that are the problem. It is a deeper problem: an unwavering faith in randomization and adjustments to solve all hidden problems. The more dynamically complex the target, the more misplaced that faith becomes.

But is it even deeper than that? Jureidini and McHenry argue that RCTs are compromised by industry.

I think that’s a rare occurrence. Maybe the insidious shift to leaders with different incentives (if true) has an effect, but RCTs generally seem to be well intentioned.

The real problem is likely institutionalized anchor bias. It’s not a Popper problem, it’s a Kuhn problem. Key opinion leaders are disinclined to doubt the opinions that they lead. In fact, most scientists are somewhat anchored, but now the institutions are too. We call the story “Thomas Kuhn Meets the NIH”.
In our paper we talk about “protocol failure detection” and how those trusting RCTs would not think such detection is indicated because they have so much faith in the RCT as the final arbiter. That’s why they declined to build protocol failure detection into the system to sound the alarm early. Critical care protocol compliance is tracked, but failure of the protocol itself is not.

However, after COVID, something major has to happen now to right the ship in critical care science. Most still do not know the ship sank; they first have to realize that.

The concept of Scott Mueller and Judea Pearl, that a matching OBS trial subsequent to any RCT is a necessary part of the RCT, may be the answer. Certainly in critical care this is what has to happen: both trials proposed and funded together. They are both an integral part of finding truth.


1 Like

I think we have different definitions and that is causing the disconnect. The probability of benefit is defined in our paper as “the probability that an individual would both recover if assigned to the RCT treatment arm and die if assigned to control.” The idea is that a person doesn’t benefit if they recover with treatment but would’ve recovered regardless without the treatment. This is not the same thing as Risk Difference (RD) or Average Treatment Effect (ATE). And without strong assumptions, we can’t determine P(\text{benefit}) from an RCT. We can only get bounds. Similarly, we can generally only bound P(\text{harm}).

In any set of patients, with binary treatment and outcome, there are four types of patients: never-recoverer, benefiter, harmed-by-treatment, and always-recoverer. P(\text{benefit}) is the proportion of people in that set who are benefiters and P(\text{harm}) is the proportion of people who are harmed by treatment. Generally, we can’t count those people because we can’t test, at the individual level, how a person responds to both treatment and no treatment. The best we can do is put bounds on these probabilities. In our paper, I carefully chose experimental and observational data such that the bounds are infinitely narrow. That’s why you don’t see bounds.
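
To make the arithmetic concrete, here is a rough sketch (illustrative code only, not from the paper) of the Tian and Pearl (2000) bounds discussed further down the thread, applied to the extreme example quoted below (90% survival in both RCT arms; 100% survival among both drug-choosers and drug-avoiders in the observational study, with a 50/50 choice split assumed by me):

```python
# Illustrative sketch: Tian & Pearl (2000) bounds on
# P(benefit) = P(y_t, y'_c), combining experimental and observational data.
def benefit_bounds(p_yt, p_yc, p_ty, p_typ, p_cy, p_cyp):
    """p_yt, p_yc: P(recovery) under do(treat) / do(control) from the RCT.
    p_ty, p_typ, p_cy, p_cyp: observational joints P(t,y), P(t,y'), P(c,y), P(c,y')."""
    p_y = p_ty + p_cy  # observational P(recovery)
    lower = max(0.0, p_yt - p_yc, p_y - p_yc, p_yt - p_y)
    upper = min(p_yt, 1.0 - p_yc, p_ty + p_cyp, p_yt - p_yc + p_typ + p_cy)
    return lower, upper

# Demo: 90% survival in both RCT arms; observational study with 100% survival
# among both drug-choosers and drug-avoiders (assumed 50/50 choice split).
lo, hi = benefit_bounds(p_yt=0.9, p_yc=0.9, p_ty=0.5, p_typ=0.0, p_cy=0.5, p_cyp=0.0)
print(round(lo, 3), round(hi, 3))  # 0.1 0.1 -> the bounds collapse to a point
```

Here the bounds collapse, so P(\text{benefit}) = 0.1 exactly, and since the ATE is zero, P(\text{harm}) = 0.1 as well.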

So, with a single treatment and within the same set of patients we can have P(\text{benefit}) > 0 and P(\text{harm}) > 0, because some are benefiters and some are harmed by treatment.

Patient self-selection in an observational study is precisely what we want. We don’t want the randomization of a clinical trial. We want the confounding that occurs when a patient selects treatment and the reason for that selection also affects the outcome. This is normally problematic and should be adjusted for. But when estimating P(\text{benefit}), it provides insight into the underlying mechanism and allows us to narrow the bounds. So different selection mechanism = good.

We assume the patient administers the treatment exactly as the treatment would be administered in a clinical setting. Maybe it is in a clinical setting, with the patient choosing treatment. Of course, in the real world, it’s very possible that the patient deviates from proper treatment administration and the effect of that isn’t negligible. We account for that then in the causal graph or through other means. But they’re essentially taking a different treatment at that point.

This is all consistent with consistency :slight_smile: . What consistency says is that a person choosing treatment will have the same outcome as a person forced into treatment (by being randomly placed into the treatment group of an RCT). However, consistency says nothing about that person, who normally would’ve chosen treatment, if they were forced to not have treatment. Similarly, consistency says that a person not choosing treatment will have the same outcome as a person forced to not have treatment. And, again, it says nothing about that person if they were forced to have treatment. This is why probabilities of recovery differ in RCT vs observational studies.

Consistency is a way of stating that treatment is the same in both the experimental and observational settings. We should expect that different percentages of people recover in the observational setting because the reasons they chose treatment often affect outcomes.

2 Likes

From the paper in question:

We conduct an RCT and find no difference between treatment (drug) and control (placebo), say 10% in both treatment and control groups die, while the rest (90%) survive. This makes us conclude that the drug is ineffective, but also leaves us uncertain between (at least) two competing models:

• Model-1 – The drug has no effect whatsoever on any individual and

• Model-2 – The drug saves 10% of the population and kills another 10%.

From a policy maker viewpoint the two models may be deemed equivalent, the drug has zero average effect on the target population. But from an individual viewpoint the two models differ substantially in the sets of risks and opportunities they offer. According to Model-1, the drug is useless but safe. According to Model-2, however, the drug may be deemed dangerous by some and a life-saver by others.

….Consider an extreme case where the observational study shows 100% survival in both drug-choosing and drug-avoiding patients, as if each patient knew in advance where danger lies and managed to avoid it. Such a finding, though extreme and unlikely, immediately rules out Model-1 which claims no treatment effect on any individual. This is because the mere fact that patients succeed 100% of the time to avoid harm where harm does exist (revealed through the 10% death in the randomized trial) means that choice makes a difference, contrary to Model-1’s claim that choice makes no difference.

The second and third bolded phrases above betray a deep (yet widespread) misunderstanding of the types of inferences we can make from RCTs. The phrase “where harm does exist (revealed through the 10% death in the randomized trial)” implies that you have inferred individual-level causality from the results of the RCT- however, this is something that we cannot do given the trial design as proposed. The most fundamental purpose of an RCT is to identify treatments with meaningful intrinsic efficacy, not individual patients with an inherent “predisposition to respond.”

Only if we were to repeatedly cross individual patients over from one treatment to another could we make an inference about causality at the level of an individual trial subject. And, as an aside, clearly, if we were interested in death as the outcome of interest, a cross-over design would not be possible.

In the trial as you have proposed it, it is entirely possible that none of the deaths in either arm had anything to do with the treatment in question- we can not infer anything about the cause of death of an individual patient in either arm of the RCT.

This is not a correct statement. There is potentially another very large group of patients- those who will respond on some occasions but not others. This fact is perhaps not obvious to those without a clinical background. Just because a patient “responds” on one occasion to a treatment in no way guarantees that he will respond the next time. The reasons why this can occur are numerous e.g., the underlying condition being treated may manifest with differing severity from one episode to the next (making it difficult for a patient to perceive treatment benefit); or the definition of “responder” might be arbitrarily dichotomous and insufficiently granular to acknowledge lesser or greater degrees of response from one episode to the next.

Again, this statement betrays a belief that we can make inferences regarding treatment efficacy at the level of individual patients enrolled in an RCT- and this is not possible except under very specific circumstances (as noted above).

The article and video below articulate these ideas much better:

5 Likes

Thank you @scott for responding to my comments. You are right about our different definitions leading to a ‘disconnect’ between us. I would very much like to identify and understand these differences, which still seem to exist.

Firstly, you use the terms death, survival and recovery. Everyone in your RCT and observational study will die eventually of course. I am assuming that we mean death up to a time T and survival up to a time T. Would you please explain in this context what you mean by ‘recovery’. I don’t expect that you mean recovery from death.

You state that the probability of benefit is “the probability that an individual would both recover if assigned to the RCT treatment arm and die if assigned to control”. Are you suggesting that this is a single probability or two different probabilities that happen to be identical? Being assigned to treatment and control, and death and survival are counterfactuals, so to my mind the probability of both occurring simultaneously (also both survive up to time T and die up to time T) has to be zero. This is why I can’t understand your concept.

My understanding of measured benefit in this context is that it is the difference between the frequency (or estimated probability) of survival on treatment minus the frequency (or estimated probability) of survival on control. Benefit is therefore not an event that can be observed for an individual person and assigned a probability; it is merely an index equal to the difference between two proportions or two estimated probabilities.

The probability of ‘true’ benefit could be regarded as the probability of discovering that the frequency of survival up to T years on treatment is at least higher than the frequency of survival on control up to a time T if a RCT were continued until there were an infinite number of subjects in each limb, for example. It is therefore a statement about a study on groups of people, not an individual. Because of the large numbers in your RCT example, this probability of true benefit should be very high (subject to the assumptions used to estimate this probability).

I would be grateful if you could clarify what you mean by benefit, how you estimate its probability and how it differs from my understanding. Without understanding this difference between us, I won’t be able to understand the rest of your reasoning.

3 Likes

This is correct. Recovery and death were chosen as clear outcomes. We can imagine a very deadly disease that you either die from or recover from with no in-between. You can substitute anything else you’d like for recovery and death, any binary outcome.

P(\text{benefit}), the probability of recovery if treated and death if untreated, is a single probability. Imagine, out of 100 people, 30 of them are the type of people that will die unless they get treatment. And the treatment will make them recovered. Then P(\text{benefit}) = 0.3. Those are the only people you want to treat. Unfortunately, we usually don’t know which of the 100 people they are. In fact, we usually don’t even know how many there are. An RCT usually can’t tell us this. Often the best we can do is give bounds, something like, “there are between 20 and 40 people who would benefit.” The reason we can’t estimate this is because we can’t go back in time after giving a person treatment and then not give them treatment (or vice-versa).

No, you are describing the results of an RCT. The Average Treatment Effect (ATE) or causal Risk Difference (RD). That offers a lower bound on P(\text{benefit}), but it’s possible the two quantities are quite different.
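
(A quick aside on why the ATE is only a lower bound: in the four-type notation, P(y_t) = P(\text{benefit}) + P(\text{always-recoverer}) and P(y_c) = P(\text{harm}) + P(\text{always-recoverer}), so \text{ATE} = P(y_t) - P(y_c) = P(\text{benefit}) - P(\text{harm}). Since P(\text{harm}) \ge 0, the ATE can never exceed P(\text{benefit}); they coincide only when nobody is harmed.)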

In section 2 of our paper, we present RCT results where 90% of patients survive in both treatment and control arms. So the difference in survival is 0. It would seem the treatment is useless. That’s possible. Another possibility is that 10% of people benefit from treatment and 10% of people are harmed by treatment (will die if you give them treatment and will survive if you don’t give them treatment). This situation is consistent with the RCT results. We see 90% of patients treated as surviving because 10% of them benefitted from the treatment, 80% of them always recover regardless of treatment, and 10% died because of the treatment. And we see 90% of patients treated with a placebo as surviving because 10% of them died since they needed treatment, 80% of them always recover regardless of treatment, and 10% of them recovered because they weren’t harmed by the treatment.

Hopefully this clears things up?

3 Likes

Thank you again @scott. I think I understand what you are doing, although I still have difficulty with some of your terms (e.g. harm), as to my understanding those harmed and those benefiting have to belong to different sets (e.g. the set of those benefiting from treatment of the disease and the set of those harmed by unwanted adverse effects of the treatment). Based on your RCT results alone, I calculate the probability of ‘benefit’ as 0.279 and the probability of harm as zero (both the same as you). The probability of ‘never survive / never recover’ is 0.511 and the probability of ‘always survive’ is 0.210, for both males and females.

However by taking the RCT and female observational data into account the probability of never survive / never recover is 0.0175 or 0.0177, the probability of benefit is 0.113 or 0.114, the probability of harm is 0.623 or 0.630 and the probability of always surviving is 0.239 or 0.246 (the slightly different results depending on slightly different denominators).
PS on 8 April 2022: By using the percentages from Tables 1 to 3 and some rounding up, I get a different result: the probability of ‘never survive’ is now 0.02 (2%), the probability of ‘benefit’ is 0.28 (28%), the probability of ‘harm’ is 70% and the probability of ‘always survive’ is 0.

I should add that I do not regard the members of these 4 sets to be fixed. For example, if the study were repeated with the same subjects, the overall proportions would be similar next time, but some subjects in the ‘benefit group’, for example, might have moved to the ‘never survive group’ next time, and so on, because of stochastic issues. Also, I would normally base a decision to treat on the proportions dying from disease and dying from adverse treatment effects in the treatment and control groups. How might the proportions / probabilities of the above 4 groups (‘never survive’, ‘benefit’, etc.) result in a different decision?

In order to fully understand, on my own terms and based on medical / mechanistic processes, how you arrived at results that differ from mine and why I may be wrong or have misunderstood, I will have to ‘translate’ the mathematical expressions in your paper into verbal reasoning and a familiar medical example of how you arrived at your upper and lower bounds. Has such a ‘translation’ already been done?

1 Like

There is a parallel discussion on Twitter. This relates to our need for math help with RCTs applied to “syndromes” comprised of many diseases. So, considering the math, it’s not hyperbole to say that the world needs the help of this group with this problem.

ARDS (Acute Respiratory Distress Syndrome) is a set of diseases x1, x2, x3, …
Each disease may have a different response to treatment. An outlier disease may have a markedly different response (perhaps harm). The mix of diseases may change from RCT to RCT. How do we present the math showing the potential differences in outcome in response to a treatment across multiple RCTs (as a function of variations in the mix of diseases under test and/or variations in the response of each disease)?

Here the issue is whether or not “measurements” comprising a threshold set of lab values and/or vital signs (plus clinical suspicion of the syndrome) will render reproducible RCTs. In 30 years, RCTs applied to the syndrome sepsis have never been reproducible, but the trialists keep trying with new threshold sets. Since 1992, four different sets have been tried (three were SIRS or derivations of SIRS, and the fourth is a delta of 2 in SOFA). The problem is that these sets capture many diseases which look similar, but each disease included brings its own set of heterogeneity.

So, to explain further, with sepsis or ARDS we have:

  1. a clinical context such as “suspicion of infection” or a CXR or CT pattern
  2. a set of scores from lab or vital-sign threshold breaches (to which a function may be applied), rendering a diagnosis

For sepsis, the problem is that there are perhaps a hundred different diseases captured by this method, but all are encapsulated for the RCT under the syndrome “sepsis”. For ARDS, fewer diseases, but still probably more than 50.

With the pandemic, the issue of measurements defining ARDS came into question. COVID pneumonia met ARDS criteria, so it was automatically included in the diagnosis of ARDS. However, COVID pneumonia was pathologically distinct, exposing the problem with using threshold criteria sets to capture many diseases under one “syndrome”.

Unlike sepsis, where RCTs failed to show reproducible benefit, the thresholds seemed satisfactory in the ARDS trials. There were many diseases included, but evidence of benefit from a treatment was identified. This was perceived to indicate a fundamental effect which transcended the disease type, so protocols were developed. Then in 2020 COVID pneumonia met the ARDS threshold criteria and was added into the mix of diseases under the ARDS definition. However, this disease had outlier pathophysiology. At least a portion of the protocols (e.g., early intubation and PEEP titration by tables) seemed to fail. This exposed the ARDS criteria (also called the “Berlin criteria”) as potentially capturing too many different diseases (just as the sepsis criteria did).

So the critical care trialists do not know what to do. You might say “study one treatment in one disease with one RCT”. This is an option, but N would be low for some of the diseases unless multicenter studies were used. The reason these conditions were combined as sepsis and ARDS in the 1970s was that they looked similar clinically. Threshold decision making emerged with Palker and others in the early 1980s. Theories of common pathophysiology underlying the clinical similarity were advanced (these have not been proven). But it seemed logical at the time to define these conditions by threshold sets, study them together by RCT, and rely on randomization to solve for any heterogeneity.

Now, again to the math. How can we present to trialists, by mathematical model, the effect of variation in the mix of diseases (in relation to the potentially different treatment response of each disease) on the outcome of each RCT? For example, how can we model the effect of introducing many cases of one severe outlier disease (e.g., COVID pneumonia) into the mix of diseases under the ARDS definition? A toy sketch of the kind of model I mean is below.
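
Purely illustrative numbers, nothing measured: the trial-level effect is just a prevalence-weighted average of disease-specific effects, so shifting the mix (or adding an outlier disease) can flip the apparent result of the “syndrome” RCT.

```python
# Toy sketch: trial-level treatment effect for a "syndrome" that is really a
# mixture of diseases, each with its own effect. All numbers are hypothetical.
def syndrome_ate(mix, disease_effects):
    """mix: {disease: prevalence in the trial}, prevalences summing to 1.
    disease_effects: {disease: risk difference for that disease}."""
    return sum(mix[d] * disease_effects[d] for d in mix)

effects = {"disease_A": +0.15, "disease_B": +0.10, "disease_C": +0.05}

# Historical mix of diseases meeting the syndrome criteria:
old_mix = {"disease_A": 0.5, "disease_B": 0.3, "disease_C": 0.2}
print(round(syndrome_ate(old_mix, effects), 3))   # 0.115: apparent benefit

# The same criteria later capture a new outlier disease that is harmed:
effects["outlier_D"] = -0.20
new_mix = {"disease_A": 0.2, "disease_B": 0.1, "disease_C": 0.1, "outlier_D": 0.6}
print(round(syndrome_ate(new_mix, effects), 3))   # -0.075: net harm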

Also, it would be great if statisticians would contribute to a debate thread with a critical care trialist on Twitter which parallels this discussion. There is acute need. New variants may be coming. We need the math help.

Forgive me for moving the focus to the instant crisis of critical care RCTs, but reading the cited blog of Judea Pearl & Scott Mueller through the eyes of COVID ARDS research, it is right on target. The deeply rooted and present dogma of critical care trialists (which emerged in the early 1990s) is that a large grouping of different diseases with similar clinical characteristics can be combined into a massive syndrome by the use of a set of threshold criteria to define the syndrome. The syndrome can then be studied by RCT and treated with a one-size-fits-all protocol as if it were one disease.

This produces the hidden responder and harm subgroups the blog discusses. In fact, the response may comprise a spectrum from marked benefit to severe harm.

So the blog is particularly timely and relevant to the instant needs of critical care research theory.

I agree with what you say @llynn. Medical progress has been characterised by subdividing or ‘stratifying’ disease. The historic diagnosis of ‘dropsy’ was shown subsequently to include congestive cardiac failure, nephrotic syndrome, etc each with different treatments that could cause harm if given inappropriately to those with the ‘wrong’ diagnosis. However, it is important that those from different disciplines understand each other and their potential roles in order that they can tackle these issues together. I hope that the discussions here are helping to do that.

4 Likes

Hi again @scott. I have read your paper yet again and have a number of questions. You state that:

“In terms of our notation, consistency implies:

P(y_t|t) = P(y|t), P(y_c|c) = P(y|c). (4)

In words, the probability that a drug-chooser would recover in the treatment arm of the RCT, P(y_t|t), is the same as the probability of recovery in the observational study, P(y|t).”

For females, my understanding is that: P(y_t|t) = 0.489 and P(y|t) = 0.27, so that P(y_t|t) ≠ P(y|t). Also P(y_c|c) = 0.21 and P(y|c) = 0.7, so that P(y_c|c) ≠ P(y|c).

For males, my understanding is that: P(y_t|t) = 0.49 and P(y|t) = 0.7, so that P(y_t|t) ≠ P(y|t). Also P(y_c|c) = 0.21 and P(y|c) = 0.7, so that P(y_c|c) ≠ P(y|c).

I would be grateful if you could explain this; perhaps the above probabilities that I have inserted are incorrect. In my calculations, I assumed that consistency implied that the ATE represented by risk differences was the same in the RCT and observational group so that the above probabilities were not surprising. Do you think that this explains the difference between the results of our calculations?

You state that: “leveraging both experimental and observational data, Tian and Pearl (Tian and Pearl, 2000) derived the following tight bounds":
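
(For reference, the tight bounds referred to, as given by Tian and Pearl (2000), are

\max\{0,\, P(y_t) - P(y_c),\, P(y) - P(y_c),\, P(y_t) - P(y)\} \le P(\text{benefit}) \le \min\{P(y_t),\, P(y'_c),\, P(t, y) + P(c, y'),\, P(y_t) - P(y_c) + P(t, y') + P(c, y)\},

where P(y_t) and P(y_c) come from the RCT and P(t, y), P(c, y'), etc. are joint probabilities of treatment choice and outcome from the observational study.)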

However, I have been unable to translate this Equation (5) into a medical rationale using the example. Instead, based on my reasoning in previous posts I come to the conclusion from the data in your tables that:

For FEMALES based on the RCT alone:

P(Never survive) = P(y'_t, y'_c) = 0.511. P(Benefit) = P(y_t, y'_c) = 0.279, P(Harm) = P(y'_t, y_c) = 0 and P(Always survive) = P(y_t, y_c) = 0.21.

For FEMALES based on the RCT and observational study:

P(Never survive) = P(y'_t, y'_c) = 0.02, P(Benefit) = P(y_t, y'_c) = 0.27, P(Harm) = P(y'_t, y_c) = 0.71 and P(Always survive) = P(y_t, y_c) = 0.

In order for the observational data to provide perfect consistency with the RCT for females in terms of ATE / risk difference, your example data should have been slightly different so that P(Benefit) = P(y_t, y'_c) = 0.279 and P(Harm) = P(y'_t, y_c) = 0.701.

For MALES based on the RCT alone:

P(Never survive) = P(y'_t, y'_c) = 0.51. P(Benefit) = P(y_t, y'_c) = 0.28, P(Harm) = P(y'_t, y_c) = 0 and P(Always survive) = P(y_t, y_c) = 0.21.

For MALES based on the RCT and observational study:

P(Never survive) = P(y'_t, y'_c) = 0.02. P(Benefit) = P(y_t, y'_c) = 0.28, P(Harm) = P(y'_t, y_c) = 0.28 and P(Always survive) = P(y_t, y_c) = 0.42.

I should emphasise again that I do not regard the members of these 4 sets to be fixed. For example if the study were repeated with the same subjects, the overall proportions would be similar next time but some subjects in the ‘benefit group’ for example might have moved to the ‘never survive group’ next time and so on because of stochastic issues.

These are the results of my calculations based on your data. I would be grateful if you could explain the difference between these and the results of your calculations based on the rationale of Tian and Pearl (Tian and Pearl, 2000).

2 Likes

P(y_t|t, \text{female}) \ne 0.489
Conditioning on t means “among those choosing treatment”, but in the RCT we don’t know what they choose. The RCT tells us:
P(y_t|\text{female}) = 0.489

Similarly, P(y_t|\text{male}) = 0.49 \ne P(y_t|t, \text{male}) = 0.7

I think this may explain the differences in our calculations.

While this is correct, you wouldn’t be able to determine it from the RCT alone. Males have almost identical RCT results, yet these probabilities are very different for males.

Why would adding observational data change these probabilities? The RCT alone would only give you bounds. Adding observational data will narrow the bounds. In this case, it narrows the bounds so much that they become point estimates.

1 Like

Thank you again @scott. You said that:

My understanding from this is that P(y_t|t, male) [or alternatively P(y_t|t, female)] is based on a 3rd scenario, where the male subject chooses treatment or no treatment and is then supervised with care as in an RCT. The 1st scenario is the RCT (with the notation P(y_t|male)), where the subject does not choose but is randomised to treatment or placebo (note that placebo may not be the same as ‘no treatment’). The 2nd scenario is the observational study, where the subject chooses treatment or no treatment [with the notation P(y|t, male)] and is not supervised carefully and may be subject to errors of dosage etc. When M = male, F = female, y' = death, y = survival, t = choosing treatment and c = choosing control, then for the 3rd scenario:

For females, P(y'_t|t,F) = 0.511 and P(y'_c|c,F) = 0.79, so that P(Benefit|F) = 0.790 − 0.511 = 0.279. Similarly, P(y_t|t,F) = 0.489 and P(y_c|c,F) = 0.21, so that P(Benefit|F) = 0.489 − 0.21 = 0.279. P(Never survive|F) = P(y'_t|t,F) = 0.511 and P(Always survive|F) = P(y_c|c,F) = 0.21. Therefore P(Harm|F) = 1 − (0.511 + 0.21 + 0.279) = 0. This means that your Scenario 3 for females is identical to Scenario 1, the RCT on females.

For males, P(y'_t|t,M) = 0.3 and P(y'_c|c,M) = 0.79, so that P(Benefit|M) = 0.79 − 0.3 = 0.49. Similarly, P(y_t|t,M) = 0.7 and P(y_c|c,M) = 0.21, so that P(Benefit|M) = 0.7 − 0.21 = 0.49. P(Never survive|M) = P(y'_t|t,M) = 0.3 and P(Always survive|M) = P(y_c|c,M) = 0.21. You also calculated that P(Harm|M) = 0.21. This means that the probabilities add to 0.3 + 0.49 + 0.21 + 0.21 = 1.21 in your Scenario 3 for males, which is of course different from Scenario 1, the RCT on males, as you emphasised above. Have I misunderstood something here? (Note that I amended these two paragraphs about males on 27 April 2022.)

I still can’t identify the assumptions that you have made and therefore don’t understand why p(Harm) = p(Benefit) – ATE. I also cannot identify the assumptions behind your reasoning when calculating p(Benefit) from the maximum and minimum values obtained from the RCT and observational study. I would be grateful for clarification. It may be that your definitions of benefit and harm are different to mine and statisticians such as @f2harrell and @stephen.

An RCT is designed to show a difference between treatment and control and can only show harm or benefit for a single outcome. If the frequency (and therefore estimated probability) of a happy outcome (e.g. survival for 2 years) is HIGHER on treatment than on control, then this represents benefit. If the frequency (and therefore estimated probability) of the same happy outcome is LOWER on treatment than on control, then this represents harm. Both represent an ATE, but in opposite directions. According to this definition, harm and benefit cannot happen at the same time in an RCT result that focuses on one outcome. Therefore, when you talk of both benefit and harm happening at the same time in Scenario 3, you may mean something different from my understanding of benefit and harm. I would be grateful if you could spell out your definition of benefit and harm for comparison.

I agree that both harm and benefit can arise during an observational study or RCT (but there was no harm in your example RCT). However, in practice the nature of the harm and benefit must be different and independent (e.g. benefit by reducing the risk of death due to the disease and harm by increasing the risk of death due to an adverse treatment effect). I explain this in detail in a previous post (Post No 63). In that post I assume that the ATE and degree of benefit (represented by a risk difference of 0.279 for females and 0.28 for males) is the same for ‘death due to disease’ in the RCT population and in the observational study population (see Tables 2 and 5 in Post 63). However, this does not account for all the deaths in the observational studies, so I assume that these excess deaths are due to a different outcome, such as the adverse effect of treatment (see Tables 3 and 6 in Post 63). Note that in my example I only deal with populations, not individuals, as an ATE based on a risk difference cannot be applied to individuals unless they have identical disease severity and basal risk outcomes equal to the averages of the latter…

Personalised evidence and decisions

A physician like me, interpreting an RCT result designed by someone like @f2harrell or @stephen for an individual patient, will assume that the expected ATE / benefit arising from the RCT will be similar in other settings. However, in medicine the ATE is represented by a ratio (e.g. a risk ratio or odds ratio) rather than a risk difference. This means that if the basal untreated risk of an unhappy outcome for an individual is lower than the average basal untreated risk of the outcome (because the disease is less severe than average), then the individual’s personal absolute risk reduction will be lower too and the degree of benefit will be lower.

This approach to assessing personal risk and benefit based on risk ratios models a physician’s intuition that an individual with a mild disease has a lower absolute risk reduction and a lower degree of personal benefit than someone with moderate disease. The risk ratio or odds ratio model may not fit the data precisely, a problem that can be reduced by calibration (see section 5.6 in https://arxiv.org/ftp/arxiv/papers/1808/1808.09169.pdf ). Furthermore someone with no disease will have a basal untreated risk of an unhappy outcome of zero so that the absolute risk reduction is also zero. In other words, treating someone for a disease when a disease is not present does not confer benefit. Another problem of basing ATE on a risk difference (instead of a ratio) is that when the basal untreated risk is low, the treated risk may have a negative value.
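
As a small numerical illustration of this point (hypothetical numbers only, assuming the trial risk ratio transports unchanged to each individual's basal risk):

```python
# Illustrative only: translating a trial risk ratio into a personal absolute
# risk reduction for patients with different basal (untreated) risks.
def personal_arr(basal_risk, risk_ratio):
    """Absolute risk reduction for an individual, assuming the trial risk
    ratio applies at that individual's basal risk."""
    return basal_risk * (1.0 - risk_ratio)

trial_rr = 0.6  # hypothetical trial result: a 40% relative risk reduction
for basal in (0.00, 0.05, 0.20, 0.50):  # no disease, mild, moderate, severe
    print(f"basal risk {basal:.2f} -> ARR {personal_arr(basal, trial_rr):.3f}")
# A basal risk of zero gives an ARR of zero: treating absent disease confers
# no benefit, whereas applying the trial's average risk *difference* to
# everyone would not show this.
```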

The ATE based on risk difference has a role in economics and medical epidemiology. It can be applied to decisions by epidemiologists about what to do for a population by assuming an average disease severity for that population. This is the approach of EBM currently, which has been set up by epidemiologists. However, when it comes to risk assessment for individuals, the average will not do and a personal degree of disease severity and a corresponding personal basal untreated risk of an outcome are of crucial importance as explained above.

I hope that this may explain the reason for our different interpretations of the data: our assumptions and models seem to be different. Would it be possible for you to apply your approach to risk ratios instead of risk differences, in order that a better comparison can be made with a physician’s approach to personalised risk assessment?

3 Likes

I hope it’s OK I “resurrected” this thread (very good one!), but here is the paper Dr. Senn mentioned above in the initial post:

@scott, this paper is now out and the pdf can be found here. It focused on integrating selection diagrams with traditional biostatistics and the potential outcomes notation to inform patient-specific decisions. It ended up being quite long despite there being many topics that could have been expanded. Take a look if you get a chance as I am looking forward to seeing whether the way we approach these patient scenarios can be integrated with the concepts you are developing. @HuwLlewelyn, this may interest you as well.

4 Likes

Thank you @Pavlos_Msaouel. You explain the rationale and the use of DAGs very well. My difficulty might be that I am not an oncologist with detailed knowledge of the immune mechanisms connected with renal cell carcinoma and its treatment that would be necessary to understand such reasoning fully. Therefore my immediate thought is how you would set out to validate such a model in a general way. This also applies of course to models based traditionally on knowledge of physiology and biochemistry alone (e.g. as in my speciality of endocrinology). Do you think that calibrating the outcome probabilities generated by your model somehow would be a way forward?

4 Likes

Exactly. Creating robust well-calibrated models informed by mechanistic knowledge and flexible statistical methods is the goal. This may also help address the big data paradox discussed here.

3 Likes

The problem is how best to calibrate outcome probabilities. I will suggest some ‘wild thoughts’!

There is a recent article suggesting that this is not done or considered properly by advocates of AI, machine learning or perhaps those applying causal inference (Sloppy Use of Machine Learning Is Causing a ‘Reproducibility Crisis’ in Science | WIRED).

The above article suggests that some may calibrate only on the ‘training’ data and, if this looks promising, be prompted to make ‘hyped up’ claims. I suppose that if this ‘circular’ practice shows poor calibration, then it says that the model is very poor. If the above calibration is promising, then it is of course essential to calibrate on a separate test data set.
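
To make concrete what I mean by calibrating on a separate test set, here is a minimal sketch (illustrative only, using scikit-learn’s binned calibration_curve; a smooth calibration curve would be better in practice):

```python
# Minimal sketch: fit on training data, then check calibration on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Binned observed event rates vs mean predicted probabilities on the TEST set;
# points near the 45-degree line indicate good calibration.
obs, pred = calibration_curve(y_test, probs, n_bins=10)
for p, o in zip(pred, obs):
    print(f"predicted {p:.2f}  observed {o:.2f}")
```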

If a model is fitted separately to treatment and control sets that were created by randomisation, then the data from the treatment set might be used to calibrate the control model and vice versa. It would also be important to calibrate on data from another centre etc. to assess transportability of the model.

If there was no randomisation then calibration might be performed on data from those in one range that were treated and on data from individuals in another range that were not treated. If both were well calibrated then this does not provide evidence for treatment efficacy of course. Perhaps fitting logistic regression functions or other models to the data in both ranges might allow extrapolation into other ranges to suggest efficacy.

I would be grateful to @f2harrell for advice about all this and the currently recognised principles of calibration. An account about such principles that are easily understood by doctors, medical scientists and advocates of AI and causal inference would be invaluable.

5 Likes

Good points Huw. My material on calibration is in Biostatistics for Biomedical Research (Chapter 10: Simple and Multiple Regression Models and Overview of Model Validation) and in Regression Modeling Strategies (Chapter 4: Multivariable Modeling Strategies).

There are many situations where “test data set” is ill-defined and strong internal validation using resampling is more informative. Done correctly resampling does not reward any overfitting.

Developing a model on one of two treatment arms, if the sample size is not huge, can result in fitting idiosyncrasies in one treatment arm in a way that exaggerates treatment differences. I prefer not to do separate modeling by treatment group.
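
A bare-bones sketch of the resampling idea (optimism-corrected performance via the bootstrap; illustrative Python only, with the c-index as the metric — the rms validate and calibrate functions do this properly):

```python
# Bare-bones bootstrap optimism correction for a model's apparent performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

def fit_auc(Xf, yf, Xe, ye):
    m = LogisticRegression(max_iter=1000).fit(Xf, yf)
    return roc_auc_score(ye, m.predict_proba(Xe)[:, 1])

apparent = fit_auc(X, y, X, y)            # model judged on its own training data
optimism = []
for _ in range(200):                      # repeat the whole modeling process per bootstrap sample
    idx = rng.integers(0, len(y), len(y))
    boot_app = fit_auc(X[idx], y[idx], X[idx], y[idx])
    boot_orig = fit_auc(X[idx], y[idx], X, y)   # bootstrap model tested on the original data
    optimism.append(boot_app - boot_orig)

print(f"apparent AUC {apparent:.3f}, optimism-corrected {apparent - np.mean(optimism):.3f}")
```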

4 Likes