Individual response

Exactly! I wrote something very similar on the aforementioned draft regarding approaching such problems as an IV analysis using RCT data. Will post a link to the article once it is out.

Both points I made above are two sides of the same coin. It gets even better: the advantage of Bayes is that it gives us the flexibility to intuitively model these challenges as hybrid problems of both ignorance and randomness (or other physical design processes such as blocking etc). However, we have to be constantly mindful of connecting our models with putative causal mechanisms when we do that. This was insisted upon primarily by traditional frequentists. It is good advice.

1 Like

I’ve read the post which is really fantastic. So I conclude that what I was advocating applies to disease-free survival time but not to overall survival. For the latter, the ITT treatment effect estimates a policy estimand, e.g., compares those randomized to treatment B with all the subsequent treatment modifications that happened to them with those randomized to treatment A with all the subsequent developments happening to them. I think this is still a causal estimand, it’s just a policy estimand rather than a “we control what happens” estimand.

A side question for you: If we do a state transition model and use it to estimate state occupancy probabilities such as P(pt has disease returned by 6m and is alive), will that provide anything useful to the discussion? This is a simple unconditional probability (except for conditioning on treatment and baseline covariates). One can also get overall survival probabilities from this model, but they may have the same problem you wrote about (except that the model will fit better because you can allow different covariates for death vs. for the disease recurrence state).

2 Likes

Exactly! And this key point becomes clear when we draw the causal graphs. Otherwise it is very hard to see what the modeling challenges are here.

When thinking of overall survival, we are looking for decision rules that will lead to optimal long-term benefits. Such decision rules need more work to be estimated than standard RCT models will provide for chronic diseases, such as many cancers nowadays. Good problem to have, it’s a consequence of our patients living longer and better. But a challenge nonetheless for health systems, regulators, patients, clinicians, and methodologists.

I would expect yes. It is a more complete model. But having gone through the laborious exercise of estimating decision rules from RCTs that randomly allocated interventions sequentially (ideal scenario that almost never happens in oncology) what I learned is the importance of quality data. No amount of elaborate modeling can salvage an RCT that did not collect the right data. And the best way to see in the design phase ahead of time what we need is to draw the causal diagrams. Intuitively, people can sense that we’ll need information on subsequent therapies. But what the graphs reveal is that we’ll also need information on covariates at transition times. This way we can make sure to collect them.

One of my favorite models for such purposes is described in this JASA paper that generated a lot of discussion at the time within the statistical methodology community. Figure 1, which sets up the challenge, is a causal diagram. It is essential for the model. Causality lies at the foundations of all statistical modeling. In fact, some of the ongoing twitter discussions that prompted this thread repeat heated arguments between Fisher and Neyman, just using different terminology. They both undoubtedly thought causally despite lacking today's richer notation and approaches.

2 Likes

Hi Pavlos

As noted previously, I think that the type of work you're doing is truly unique. I hope that this uniqueness is being fully appreciated/recognized by your colleagues (I strongly suspect that it is).

You seem to belong to a very small/rarefied group of researchers in the world. Your specialization in oncology presents a multitude of important longstanding/unresolved challenges [e.g., how to identify therapies that, when used serially with other therapies, can be expected to improve a patient’s overall (not just progression-free) survival]. In turn, these challenges have, effectively, forced “outside the box” thinking, ultimately causing you to ask if there might be a role for causal modelling in optimizing RCT design. Then you went a huge step further, learning the language of causal inference epidemiology to see how this might work.

Now, after learning the language of causal modelling, you have identified an important potential niche for it in optimizing the design of oncology clinical trials. To me, this work feels like the perfect clinical application of DAGs, and the one with the most potential to impact patient care.

I suspect that a key problem, to date, has been that researchers trained in more modern causal inference methods have perhaps lacked the clinical background and/or clinical incentive to search for alternate applications of these methods that would be accepted by the clinical community.

Historically, clinicians/statisticians have balked at the seemingly never-ending promotion of DAGs as a way to derive causal inferences from observational data alone. Clinicians have pushed back, asking: 1) given that these methods have been around for years, why are we only seeing them used in a very small fraction of published observational research; and 2) why would we ever believe that these methods can generate results that are reliable enough to influence patient care decisions (except, perhaps, in the case of strong, consistent safety signals derived from well-conducted studies)? At the end of the day, it's unlikely that we will ever consider non-randomized evidence to be on par with randomized evidence for the purpose of assessing treatment efficacy.

What you have done, effectively, is to carve out a niche for these methods as a way to optimize RCT design (the study design that clinicians consider to be optimal, in most cases, for guiding clinical decision-making), de-emphasizing the historical push for their application as “stand-alone” methods to make causal inferences from observational data.

This is all very exciting…

2 Likes

You are very kind, thank you for summarizing these efforts better than I ever could!

A lot of this is just us standing on the shoulders of giants, many of whom are regulars on this forum. For example, @Stephen taught us that if something is not helpful in RCTs then it is even less likely to be useful in analogous observational studies, whereas the converse is not necessarily true. And it is also his insistence on debating Lord's paradox (latest post here, with discussion and links to previous entries) that nicely highlights limitations and challenges for the graph-based causal inference schools.

It is indeed true that when a framework is shown to work empirically, it gains acceptance. We are catching flaws in our studies earlier and designing them more efficiently, so that we learn from mistakes and gradually improve patient care. And because this draws more people in, they bring their own unique perspectives into the mix, creating a nicely dynamic ecosystem.

2 Likes

Philosophers have written enough papers to fill a library on why this position is logically untenable, and no amount of appeals by physicians to the unique aspects of patient care will overcome this.

My major gripes are: 1. the EBM position is logically false, and the vast majority of uncontroversial treatments would be ruled out by it; 2. other fields uncritically adopt EBM rhetoric and inference rules, leading to nonsense in the peer-reviewed literature.

One of the real-world cases that drew the attention of philosophers of science and ethics involved the study of ECMO for newborns in respiratory distress. Richard Royall, a Johns Hopkins professor of biostatistics, had this to say about this dogma of EBM from both an ethical and a statistical point of view. A link to the paper is in the thread.

Blockquote
We urge that the view that randomized clinical trials are the only scientifically valid means of resolving controversies about therapies is mistaken, and we suggest that a faulty statistical principle is partly to blame for this misconception.

To be blunt about this demand for randomization in all treatment contexts, how many infants would need to be randomized to ineffective interventions (and in essence, condemned to die with high probability) in the ECMO case?

Paul Rosenbaum, another statistician, has written a number of papers showing how a carefully done observational study can, by demonstrating insensitivity to unknown confounders, approximate a randomized experiment.

Blockquote
Randomized experiments and observational studies both attempt to estimate the effects produced by a treatment, but in observational studies, subjects are not randomly assigned to treatment or control. A theory of observational studies would closely resemble the theory for randomized experiments in all but one critical respect: In observational studies, the distribution of treatment assignments is not known…Using these tools, it is shown that certain permutation tests are unbiased as tests of the null hypothesis that the distribution of treatment assignments resembles a randomization distribution against the alternative hypothesis that subjects with higher responses are more likely to receive the treatment. In particular, these tests are unbiased against alternatives formulated in terms of a model previously used in connection with sensitivity analyses.

Rosenbaum elaborates on the randomization fallacy in this 2015 paper:

Rosenbaum, P. R. (2015). How to see more in observational studies: Some new quasi-experimental devices. Annual Review of Statistics and Its Application, 2, 21-48. (PDF)

Blockquote
The statistical literature may be misread to say that only the elimination of ambiguity, not its reduction, is acceptable. Such a misreading might result in skepticism about quasi-experimental devices that reduce, but do not eliminate, ambiguity.

He goes on to discuss the technical issues of identification (of model parameters), and that observational studies can still provide information when they might not provide identification.

Regarding Basu’s definition of identification:

Blockquote
The entry defines identifiable to mean in different states of the world … yield probability distributions for observable data that are themselves different. One could misread this statement as saying we learn nothing about \theta unless there is identification, nothing unless there is a consistent test for each level of \theta. More careful than most, Basu is aware we often learn in nonidentified situations.

From a decision theoretic perspective, this reduction in ambiguity might be enough evidence to promote an intervention or change policy. But that is context sensitive and cannot be decided a priori.

Further Reading

@R_cubed I think you've overstated things. A huge problem with medical practice is that randomized trials are not done at the right time. As Thomas "randomize early and often" Chalmers frequently argued, randomization should be undertaken before medical opinions are entrenched. The fact that clinicians (like most other experts) feel that they know what's best, without data, leads to a perception of non-equipoise that prevents clinical trials from being done later. History is full of medical reversals once randomized trials are actually done. Better decisions are made when we admit what we don't know and try to find out what we need to know. We need many times more clinical trials than we currently do, and we need to find out how to make trials easier to do. Bayes can help a lot with that, in addition to finally figuring out that we don't need to collect as much data as we currently do.

2 Likes

If you are going to claim that my position is “logically false”, then please show me that it relies on a logically invalid syllogism. Alternatively, please stop appropriating terms that have precise definitions in order to borrow authority for your position.

In my view, the overall discussion about EBM conflates two separate questions:
(1) To what extent should we rely on statistical evidence vs mechanistic/biological understanding when we predict the consequences of clinical decisions?
(2) If we rely on statistical evidence, to what extent should data from randomized trials be preferred over observational data?

The first question is interesting, and I do not claim to have a conclusive answer. It is my intuition that human biology is so complex that if we tried to make decisions based on biological understanding, it would almost always lead to unpredictable consequences. Since most of these interventions are going to be used in a large number of people with similar clinical presentations, I think it usually makes sense to empirically observe how the intervention works in practice, but I am not going to claim that this will universally be the case. Perhaps we can have a future of “Star Trek medicine” where advanced technology can scan the body and use its complete understanding of mechanism to accurately predict the consequences of intervention. And perhaps there are interventions that exist already, that are known to interact primarily with some completely understood subsystem of the human body, such that it is obvious that the benefits are real and large, and cannot possibly be outweighed by unpredicted adverse effects. I can believe that ECMO and parachutes fall into this category.

The second question is really not even worth discussing. Any rational decision maker should give much more credence to the validity of their beliefs if those beliefs are informed by a large and properly conducted randomized trial than if they are informed by observational data. This completely dwarfs almost all other considerations, such as sample size. For most medications, without randomization it is simply not possible to get any real confidence that we know their effects, no matter how good the data are in terms of every other relevant consideration.

When I said that "EBM was correct" I said specifically that they were right about randomized vs observational evidence. I do not support the stupid form of evidence-based medicine that insists on having evidence from randomized trials for everything. Even without evidence, you will always have beliefs, and you should act to optimize the consequences of your actions given those beliefs.

The first essential point is that if you want to calibrate your beliefs to reality by analysing data, you can either do that correctly using randomization, or you do so in a noisy and biased way with observational data. If you choose the second approach, your beliefs are not going to improve much.

The second essential point is that the human body is incredibly complex and responds unpredictably to those interventions that were not present in the ancestral environment. Adverse effects can often be catastrophic. It is fundamentally rational to insist that we test the medication in a small group of people, and rigorously observe its effect, before concluding that we “know” the consequences of using that drug. But that doesn’t mean you let that get in the way of acting on your beliefs if you have enough certainty that the benefits exceed the costs, based on a mechanistic understanding of human biology. And it most certainly does not mean you cannot act even when you know that the consequences of not acting are dire, such as with ECMO or parachutes.

1 Like

Blockquote
If you are going to claim that my position is “logically false”, then please show me that it relies on a logically invalid syllogism. Alternatively, please stop appropriating terms that have precise definitions in order to borrow authority for your position.

While I think the entire history of human beings learning things long before randomization was discovered is proof enough, Royall already rebutted this demand for randomization in the paper I cited, in his discussion of the Likelihood Principle:

Blockquote
Statistical theory explains why the randomization principle is unacceptable. It does this in terms of the concepts of conditionality (ancillarity) and likelihood… The conditionality principle asserts that when there is an ancillary statistic C present, inferences should be based upon the observed value of C. The problem this creates for the randomization principle is that the statistic representing the result of the randomization is ancillary; thus the conditional randomization distribution is degenerate, assigning one to the actual allocation used … the only “inference” on the observed data is “I saw what I saw”.

My in-thread post of a quote from Dennis Lindley re: randomization being useful but not necessary is also relevant.

Design considerations are very important before the data are collected (for the experimenter), but the Likelihood Principle does not use them to create "hierarchies of evidence" after the data are collected. There should be no a priori reason for a reader to grant randomized studies greater weight than observational ones simply by a pre-data design criterion.

Goutis, C., & Casella, G. (1995). Frequentist Post-Data Inference. International Statistical Review / Revue Internationale de Statistique, 63(3), 325–344. Frequentist Post-Data Inference on JSTOR

Blockquote
The end result of an experiment is an inference, which is typically made after the data have been seen (a post-data inference). Classical frequency theory has evolved around pre-data inferences, those that can be made in the planning stages of an experiment, before data are collected. Such pre-data inferences are often not reasonable as post-data inferences, leaving a frequentist with no inference conditional on the observed data.

None of this is important if powerful interests can manipulate what studies are published, and which ones are not, via subversion of the peer review process, which Ioannidis discusses in those papers.

Much to the disappointment of those who want easy answers, there are no rules based upon design features alone that can decide this a priori, especially when there is an appreciable probability of fraud or deception. My position is that this needs to be done on a case-by-case basis, according to formal decision-theoretic principles. Michael Rawlins presents many historical examples of medical learning from what we know are imperfect data, examples that many EBM proponents would find problematic.

The world we have to deal with consists of:

  1. large, economically and politically powerful, coordinated actors who can afford to produce randomized designs (or the illusion of them), vs.
  2. disorganized and politically weak agents who might be able to rebut biased presentations of RCTs with observational evidence or approximate the relevant RCT when powerful interests have no incentive to conduct a credible RCT.

That is the problem I am worried about, and am working out how to rigorously formalize the notion.

Related Reading

Rubin, D. B. (1992). Meta-Analysis: Literature Synthesis or Effect-Size Surface Estimation? Journal of Educational Statistics, 17(4), 363–374. https://doi.org/10.3102/10769986017004363

Blockquote
In contrast to these average effect sizes of literature synthesis, I believe that the proper estimand is an effect-size surface, which is a function only of scientifically relevant factors, and which can only be estimated by extrapolating a response surface of observed effect sizes to a region of ideal studies. This effect-size surface perspective is presented and contrasted with the literature synthesis perspective.

In a robust Bayesian meta-analytic approach, design considerations could be treated as a nuisance factor and integrated out.
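As a very rough sketch of what that marginalization could look like, assume a simple normal-normal model in which each observational estimate carries an independent additive design bias with prior scale tau that is then integrated out; the estimates, standard errors, and tau below are all invented for illustration.

```python
import numpy as np

# Hypothetical effect estimates (e.g., log hazard ratios) and standard errors;
# every number here is invented purely for illustration.
estimates = np.array([-0.30, -0.25, -0.45, -0.10])
std_errs  = np.array([ 0.10,  0.12,  0.08,  0.15])
is_observational = np.array([False, False, True, True])

# Assumed prior for a study-specific design bias added to each observational
# estimate: bias_i ~ Normal(0, tau). Marginalizing (integrating) the bias out
# simply inflates the sampling variance of those studies.
tau = 0.20
marginal_var = std_errs**2 + np.where(is_observational, tau**2, 0.0)

# With a flat prior on the common effect, the marginal posterior is normal
# with a precision-weighted mean: design enters only as extra noise.
w = 1.0 / marginal_var
post_mean = np.sum(w * estimates) / np.sum(w)
post_sd = np.sqrt(1.0 / np.sum(w))
print(f"posterior mean {post_mean:.3f}, sd {post_sd:.3f}")
```

Under these assumptions the observational studies still contribute information, but they are automatically down-weighted in proportion to the presumed design bias.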

There are some errors in statistical reasoning here, but the main points are important.

Blockquote
Seemingly well designed, executed, and reported, RCTs with exciting results can also be misleading due to the hijacked research agenda. These trials are designed to deceive and the methods of deception are alarmingly simple, but effective. The main tactics used relate to the choice of comparators, the choice of outcomes, and the manipulation of statistics to produce desired outcomes, and selectively report them.

Nothing in EBM textbooks or literature prepares a scholar for the adversarial context of the “real world.”

A good discussion on individual response/personalized medicine has been going on in twitter with several tweets from Judea Pearl. I hope that we can summarize and refactor that discussion here. My suggestion for doing so is to take a hypothetical but realistic example where we have a treatment benefit for females and an apparent benefit for males based on a randomized clinical trial, but under the surface there are really two populations of males based on the presence or absence of a genetic variant that was not designed to be collected and analyzed in the RCT. Males with the initially hidden variant have a disastrous outcome if treated. I’ll leave it to others to insert some numerical examples for the basis of discussion.
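To get the ball rolling, here is one hedged numerical sketch; every quantity below is invented purely for illustration, and the variant prevalence and outcome probabilities are placeholders for whatever numbers others prefer.

```python
# Hypothetical numbers only, offered as a starting point for discussion.
# Suppose 10% of males carry a genetic variant that the RCT never collected.
p_variant = 0.10

# Probability of a good outcome under control vs treatment in each subgroup.
p_ctrl_noncarrier, p_trt_noncarrier = 0.40, 0.70   # treatment clearly helps
p_ctrl_carrier,    p_trt_carrier    = 0.40, 0.00   # treatment is disastrous

# Marginal outcome probabilities among all males, as the RCT would see them.
p_ctrl = (1 - p_variant) * p_ctrl_noncarrier + p_variant * p_ctrl_carrier
p_trt  = (1 - p_variant) * p_trt_noncarrier  + p_variant * p_trt_carrier

print(f"apparent male treatment effect: {p_trt - p_ctrl:+.2f}")                  # +0.23
print(f"effect in hidden-variant males: {p_trt_carrier - p_ctrl_carrier:+.2f}")  # -0.40
```

Unless the variant is measured, nothing in the trial data distinguishes this situation from a uniform benefit for all males.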

2 Likes

And therein lies the rub.

Hypothetical example:

A variant of a certain allele, when present on the Y chromosome (i.e., it only has the potential to occur in men), causes a fatal adverse drug reaction when carrier men are exposed to a certain medication. Only a small fraction of all men (e.g., men from a certain limited geographic location in the world) might carry this allele. No men from this geographic location were included in the RCTs used as the basis for drug approval.

Hypothetical AND Realistic example:

None (?) I can’t think of a single recognized clinical scenario that would conform to the hypothetical scenario above. Maybe others can offer one…

3 Likes

You are very correct, and these twitter conversations are the ones I mentioned above recapitulating old Fisher vs Neyman debates, which contradicts the claim that 20th century statistics was causally blind.

In contemporary terms, Neyman was focused on estimating the average treatment effect (ATE) in a randomized controlled trial (RCT). The ATE can be zero even when some or all patient-level treatment effects differ from zero. Accordingly, he was interested in the long-run operating characteristics of statistical procedures under both random sampling and random allocation, which we discussed here. The random sampling assumption is very unrealistic, which is why we essentially never invoke it in practice, at least not in medicine.

Fisher, on the other hand, did not care about the ATE and instead focused on a form of abductive inference from RCTs assuming only random allocation and not random sampling. He accordingly tested the sharp null hypothesis that the effect is zero for every single unit. The implicit assumption is that it is unrealistic for benefit in one subgroup and harm in another to cancel out to exactly zero on average.

@Stephen, an applied statistician with tons of experience and thus closer to the Fisherian practical view, summed this up here as: 'Fisher’s null hypothesis can be described as being, “all treatments are equal”, whereas Neyman’s is, “on average all treatments are equal”.’ The ATE framework can also violate Nelder’s marginality principles, which may be one conceptual source of disagreement behind the Lord’s paradox debates with Pearl.
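To make the distinction concrete, here is a minimal simulated sketch (all numbers invented): every unit has a nonzero individual effect, so Fisher's sharp null is false, yet the effects cancel so that Neyman's average null holds, and a randomization test based on the mean difference has essentially no power.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical potential outcomes: +1 effect for half the units and -1 for the
# other half, so the sharp null ("all treatments are equal") is false for every
# unit while the ATE ("on average all treatments are equal") is exactly zero.
y0 = rng.normal(0, 1, n)
tau = np.where(np.arange(n) < n // 2, 1.0, -1.0)
y1 = y0 + tau
print("true ATE:", (y1 - y0).mean())  # 0.0

# One random allocation and the observed difference in means.
z = rng.permutation(np.repeat([0, 1], n // 2))
y_obs = np.where(z == 1, y1, y0)
diff = y_obs[z == 1].mean() - y_obs[z == 0].mean()

# Randomization test of the sharp null using the mean difference as the test
# statistic: it rarely rejects here, because this alternative hides in the average.
perm_diffs = []
for _ in range(2000):
    zp = rng.permutation(z)
    perm_diffs.append(y_obs[zp == 1].mean() - y_obs[zp == 0].mean())
p_value = np.mean(np.abs(perm_diffs) >= abs(diff))
print(f"observed difference {diff:+.3f}, randomization p-value {p_value:.3f}")
```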

In the contemporary causal inference world, these considerations are well summarized, e.g., by Guido Imbens and Donald Rubin here.

This disconnect between theory and practice can yield endless debates, which can be quite fruitful and useful to follow albeit exhausting for participants.

5 Likes

Addendum, after further reflection and modification of the hypothetical scenario described in post 198 above. The modification is designed to illustrate how drug regulators approach rare adverse events:

The reason why I said that I couldn’t conceive of a “realistic” clinical scenario exactly like the hypothetical one above, is that I couldn’t think of an example of an idiosyncratic drug reaction that is known only to occur in males. The scenario becomes much more plausible if we don’t require that the underlying genetic predisposition for the adverse reaction occurs only in males. Instead, let’s imagine that there could be a genetic allele present in a small number of men OR women, that confers risk for a fatal drug hypersensitivity reaction (as described below).

Let’s pretend that the new drug in question is an antimicrobial designed to combat a chronic, hard-to-treat infection. In a clinical trial, its ability to “cure” patients of their infection was compared with the “standard of care” antibiotic. Let’s further pretend that the new antibiotic performed very well in the trial- a much higher proportion of patients treated with the new antibiotic were cured compared with those treated with the standard of care antibiotic. This effect was seen in both men and women, but, unfortunately, one male subject died within 4 days of exposure to the new antibiotic from a multi system hypersensitivity reaction.

What would be the likely outcome from the above trial? Answer- it is very unlikely that the new antibiotic would be approved by the drug regulator. The regulator would be concerned about the potential for additional idiosyncratic reactions to the new drug following marketing. After observing only one event, it would be difficult to identify any occult genetic cause (if present). Therefore, the regulator would realize that prescribers wouldn’t be able to mitigate the risk through pre-treatment patient screening. Furthermore, seeing such an event occur in a modestly-sized trial would raise serious red flags about the potential frequency of this adverse event in the wider target population if the drug were to be approved.

In theory, idiosyncratic reactions are possible with most drugs on the market. But because these types of reactions are, for the most part, known to be very rare, a regulator would become very concerned if such an event were to be captured during a typically-sized clinical trial.

Now let’s modify the scenario, such that the (unknowingly) genetically “vulnerable” male patient had not been invited to participate in the trial. The trial again generates impressive efficacy results, but this time, no cases of hypersensitivity reaction occur. The antibiotic is approved by the regulator. But lo and behold, within a week of its approval (now in the “postmarket” setting), the regulator starts receiving sporadic case reports describing fatal hypersensitivity reactions among patients treated with the new antibiotic. Now what? Answer: This situation will become an urgent priority for the regulator, with letters of warning sent to prescribers, and, perhaps, even instructions to stop prescribing the drug. There will be a concerted effort to try to estimate the frequency of such events.

The more people that have been exposed to the drug before such events are reported, the more confident the regulator can be that the incidence of the idiosyncratic reaction, in target population for the drug, is likely to be “acceptably” low (with the “acceptable” rate defined by the regulator). This is why, even though idiosyncratic reactions have been reported with many drugs, many of them nonetheless remain on the market to this day. If a drug has unique and important benefits for treating a serious condition, and if its risk/benefit profile, for the vast majority of patients who use it, is expected to be favourable, then more harm than good can come, at a population level, from withdrawing the drug than leaving it on the market with cautionary advice to prescribers (e.g., to be on the alert for early signs of hypersensitivity reaction so that the drug can be stopped promptly). In some cases (e.g., certain antiepileptics), we have identified genetic alleles that increase risk (allowing us to screen patients before treatment), but in many cases we have not. This is an accepted risk with the practice of medicine.

4 Likes

I am not sure what Judea Pearl was trying to do by offering this example based on alleles and tasty medication. His and Scott's paper contains the necessary data to illustrate their argument of how to place bounds on the probabilities of 4 counterfactual situations: (1) survival on both control and treatment, (2) no survival on control but survival on treatment ('benefit'), (3) survival on control but not on treatment ('harm'), and (4) non-survival on both control and treatment. I proposed one hypothetical narrative that would make sense of their data from a clinical point of view (see Individual response - #65 by HuwLlewelyn). With imagination, there could be many such appealing narratives, but the allele / tasty drug example as proposed by Judea Pearl does not appear to be one of them.

It seems to me that dreaming up appealing illustrative narratives is not the issue. From my viewpoint there are 4 important questions:

  1. The paper's use of the words 'benefit' and 'harm' in a counterfactual situation differs from their use when describing the probabilities of an outcome conditional on control and treatment and designating the treatment beneficial or harmful.
  2. If we could derive probabilities for the above, how would they be used to make medical decisions by also taking into account probabilities of adverse effects and their various utilities (i.e. effects on well-being)? In other words, what is the purpose of calculating probabilities of these 4 counterfactual situations?
  3. They are estimating the probabilities of counterfactual situations that are by definition inaccessible for the purposes of verification or calibration, unlike other models of prediction for example.
  4. In view of (3) are their assumptions and reasoning about using various results from RCTs and observational studies to arrive at inequality probabilities of these counterfactual situations sound, culminating on page 8 of the paper?
3 Likes

@Pavlos_Msaouel I want to rethink this and disagree with you a bit.

The problem with overall survival (OS) in oncology studies stems solely from the existence of rescue therapy that can prevent or delay death (and can also prevent cancer recurrence, but we are talking mainly about post-recurrence changes in therapy). I posit that almost any method that tries to estimate OS will be hard to interpret. The only easy clinical interpretation comes from estimating things like the probability of either the need for rescue therapy or death. A state transition model can easily distinguish between need for rescue therapy with and without a later death, and it can count death as worse than rescue therapy. State occupancy probabilities can be computed, by treatment, for death, rescue tx or death, rescue tx and alive, etc.

1 Like

Love it! Obviously we currently have some differences, but this is an open problem and I am glad to see you using your skills and experience to address it your way.

In fact, if I recall correctly, part of the motivation for writing that DFS vs OS post was that you had posted a comment on another datamethods thread at the time describing your state transition model approach in the context of COVID-19 trials. Intuitively I can see connections but am unable to go deeper, in part because the cancer that truly motivates my research (renal medullary carcinoma) is highly aggressive and we are not yet at the point therapeutically where we need to use new methods for survival estimation. Thus, I think about this topic far less than I do other challenges.

While our team has very elaborate and efficient Bayesian non-parametric models we use to attack this problem, it is not certain that they will be the optimal approach. A major reason why is that they are time-consuming, hard to intuit/interpret, and lack user-friendly tools. On the other hand, you are an expert with decades of experience at creating popular and powerful modeling tools for the community. Would love to see your group approach these problems.

1 Like

Thanks Pavlos. Ordinal longitudinal Markov state transition models have lots of interpretational advantages, as detailed here, besides being very true to the data generation process. The advantages come from the variety of causal estimands available through state occupancy probabilities, and the fact that all of these estimands are simple unconditional probabilities (except for conditioning on treatment and baseline covariates). What I think is needed to make this work in your context is the following:

  • rescue therapies must have a clinical consensus around them to be considered for the list (and note that you can distinguish various levels of “rescue” with an ordinal outcome, e.g. surgical vs chemo vs radiation vs chemo+rad)
  • in a multi-center RCT the practice patterns for use of rescue therapy are fairly uniform or can be somewhat dictated by a protocol

To me the only problems that are really hard to solve in this context are the existence of non-related follow-up therapies and non-related causes of death such as accidental death.
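For concreteness, here is a minimal discrete-time sketch of how state occupancy probabilities fall out of such a model. The states, time step, and transition probabilities below are invented, and a real analysis would estimate the transition probabilities as functions of treatment and baseline covariates rather than fix them.

```python
import numpy as np

# Hypothetical ordinal states and monthly transition probabilities for one arm.
states = ["event-free", "recurred, on rescue tx", "dead"]
P = np.array([
    [0.92, 0.05, 0.03],   # from event-free
    [0.00, 0.90, 0.10],   # from rescue therapy: remain, or die (no return)
    [0.00, 0.00, 1.00],   # death is absorbing
])

occ = np.array([1.0, 0.0, 0.0])   # everyone starts event-free
for month in range(12):
    occ = occ @ P                 # state occupancy probabilities after each month

# Unconditional estimands read straight off the occupancy vector, e.g.
# P(alive and on rescue therapy at 12 months) and P(alive) = 1 - P(dead).
print(dict(zip(states, occ.round(3))))
print("P(alive at 12 months):", round(1 - occ[2], 3))
```

Ordering the states by severity is what lets the model count death as worse than rescue therapy, and comparing the occupancy vectors by treatment arm gives the estimands discussed above.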

1 Like

I think these excellent considerations deserve an ongoing discussion/panel particularly with regulators such as the FDA because the challenge is becoming progressively more common across diseases.

Right now I’m trying to start just such a project at FDA but for neurodegenerative disease.

Presently at FDA, rescue therapy is something that is worked around rather than directly addressed, thinking of it as more of a censoring event than an outcome. That always makes results hard for me to interpret.

1 Like

I have gone over the paper again carefully. The assumption on which the whole paper is based is ‘consistency’. To quote from the paper: “At the individual level, the connection between behaviors in the two studies relies on an assumption known as ‘consistency’ (Pearl, 2009, 2010), asserting that an individual response to treatment depends entirely on biological factors, unaffected by the settings in which treatment is taken. In other words, the outcome of a person choosing the drug would be the same had this person been assigned to the treatment group in an RCT study. Similarly, if we observe someone avoiding the drug, their outcome is the same as if they were in the control group of our RCT. In terms of our notation, consistency implies: P(yt|t) = P(y|t), P(yc|c) = P(y|c).”

However, according to their example data for females:
P(yt|t) = 489/1000 = 0.489, P(y|t)= 378/1000 =0.378, P(yc|c) = 210/1000=0.210, P(y|c)=420/600 = 0.7
So that P(yt|t) ≠ P(y|t), P(yc|c) ≠ P(y|c)

According to their example data for males:
P(yt|t) = 490/1000 = 0.49, P(y|t) = 980/1400 = 0.7, P(yc|c) = 210/1000 = 0.210, P(y|c) = 420/600 = 0.7
So that P(yt|t) ≠ P(y|t), P(yc|c) ≠ P(y|c)

This means that the assumption of consistency is not applicable to the paper's example data from their RCT and observational study. However, they go on to make the calculations nevertheless: "based on this assumption (i.e. of 'consistency'), and leveraging both experimental and observational data, Tian and Pearl (Tian and Pearl, 2000) derived the following tight bounds on the probability of benefit, as defined in equation (3): P(benefit) = P(yt, y′c)." Therefore the estimated probability bounds in their inequality equation (5) do not follow from their assumptions and reasoning. However, by applying these probability bounds they arrive at point estimates of the probability of counterfactual 'benefit' and 'harm' (the latter are defined in my previous post: Individual response - #205 by HuwLlewelyn).

I created new example data for males and females where the RCT data were identical and the proportions choosing to take the drug and not to take it were the same as in their observational example. However in my new observational study data, P(yt|t) = P(y|t) and P(yc|c) = P(y|c) so that the assumption of ‘consistency’ could be applied. I then applied their calculations to this data that I had created.

Instead of getting point estimates I got probability ranges. For females the probability of ‘benefit’ counterfactually (e.g. by giving treatment, turning the clock back and giving placebo) was between 0.279 and 0.489 and the probability of ‘harm’ was between 0 and 0.21. For males the probability of ‘benefit’ was between 0.28 and 0.49 and the probability of ‘harm’ was between 0 and 0.21. (p(Harm) = p(Benefit) – CATE, which was 0.279 and 0.28 for females and males respectively). The calculations are in the Appendix below. If the assumption of consistency is valid, we can tell all this from the RCT alone: p(Benefit) ≥ Pr(yt) - Pr(yc) and P(Benefit) ≤ Pr(yt) and p(Harm) = p(Benefit) - (Pr(yt) - Pr(yc)), so that the observational study adds nothing in this context. As we have discussed already, observational studies can be useful in other ways such as detecting adverse effects.

Appendix
Female RCT and ‘consistent’ observational study

Calculations for females (replacing those on pages 8 and 9 in the paper), where Pr indicates that the probability is from the RCT and Po indicates that the probability is derived from the observational study:
P(Benefit) ≥ 0
P(Benefit) ≥ Pr(yt) - Pr(yc) = 489/1000 -210/1000 = 279/1000 = CATE = 0.279
P(Benefit) ≥ Po(y) - Pr(yc) = 0.4053 – 0.21 = 0.1953
P(Benefit) ≥ Pr(yt) - Po(y) = 0.489-0.4053 = 0.0807

P(Benefit) ≤ Pr(yt) = 489/1000 = 0.489
P(Benefit) ≤ Pr(y’c) = 790/1000 = 0.79
P(Benefit) ≤ Po(t, y) + Po(c, y’) = 686/2000 + 474/2000 = 0.343 +0.237 = 0.58
P(Benefit) ≤ Pr(yt) − Pr(yc) + Po(t, y′) + Po(c, y) = 0.489-0.21 +0.357+0.063 = 0.6997
Therefore:
0.279 ≤ p(Benefit) ≤ 0.489 and (0.279-0.279) = 0 ≤ p(Harm) ≤ 0.21 = (0.489-0.279)

Male RCT and ‘consistent’ observational study

Calculations for Males (replacing those on pages 8 and 9 in the paper):
P(Benefit) ≥ 0
P(Benefit) ≥ Pr(yt) - Pr(yc) = 490/1000 - 210/1000 = 280/1000 = CATE = 0.28
P(Benefit) ≥ Po(y) - Pr(yc) = 0.406 – 0.21 = 0.196
P(Benefit) ≥ Pr(yt) - Po(y) = 0.49-0.406 = 0.084

P(Benefit) ≤ Pr(yt) = 490/1000 = 0.49
P(Benefit) ≤ Pr(y’c) = 790/1000 = 0.79
P(Benefit) ≤ Po(t, y) + Po(c, y’) = 686/2000 + 474/2000 = 0.343+0.237 = 0.58
P(Benefit) ≤ Pr(yt) − Pr(yc) + Po(t, y′) + Po(c, y) = 0.49 - 0.21 + 0.357 + 0.063 = 0.7

Therefore:
0.28 ≤ p(Benefit) ≤ 0.49 and (0.28-0.28) = 0 ≤ p(Harm) ≤ 0.21 = (0.49-0.28)
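For anyone who wants to check or vary these figures, here is a short sketch of the lower and upper bound expressions used above. The function and argument names are mine; Pr quantities come from the RCT, Po quantities from the observational study (with Po(y) computed as Po(t, y) + Po(c, y)), and the inputs shown are the female values from this Appendix.

```python
def benefit_bounds(Pr_yt, Pr_yc, Po_ty, Po_typrime, Po_cy, Po_cyprime):
    """Bounds on P(Benefit) = P(yt, y'c) from RCT (Pr) and observational (Po) data."""
    Po_y = Po_ty + Po_cy                           # observational P(y)
    lower = max(0.0,
                Pr_yt - Pr_yc,                     # CATE
                Po_y - Pr_yc,
                Pr_yt - Po_y)
    upper = min(Pr_yt,
                1.0 - Pr_yc,                       # Pr(y'c)
                Po_ty + Po_cyprime,
                Pr_yt - Pr_yc + Po_typrime + Po_cy)
    return lower, upper

# Female inputs taken from the Appendix above.
lo, hi = benefit_bounds(Pr_yt=0.489, Pr_yc=0.210,
                        Po_ty=0.343, Po_typrime=0.357,
                        Po_cy=0.063, Po_cyprime=0.237)
cate = 0.489 - 0.210
print(f"{lo:.3f} <= P(Benefit) <= {hi:.3f}")             # 0.279 <= P(Benefit) <= 0.489
print(f"{lo - cate:.3f} <= P(Harm) <= {hi - cate:.3f}")  # 0.000 <= P(Harm) <= 0.210
```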

4 Likes