I have decided to delete the previous post and put up a more intuitive one, in the hope that comments from @scott and @HuwLlewelyn may be more helpful in understanding the issues raised.
I saw an exchange recently on Twitter and thought it was relevant to this thread. One poster wrote:
Can we express this sentence in mathematics: “similar patients given identical treatments will have different values in different studies”.
The assumption underlying counterfactuals is that Y(1, u) and Y(0, u) exist and are immutable properties of u (the patient). Is it wrong?
Clinically speaking, the answer is “yes,” this assumption usually IS wrong when “u” is a human being. And it is exactly this fundamental misunderstanding that clinicians and statisticians find so frustrating about the hype surrounding the potential for “personalized medicine.”
Those who traffic, professionally, in stochasticity (as physicians and statisticians do) seem better placed to appreciate its scope (and therefore to give it the respect it deserves) than those in other fields. Human biology/physiology and behaviour are each far more complex and far less predictable than a circuit board; and when human physiology and behaviour interact with each other in determining “response” to a treatment, look out: the number of possible outcomes is unfathomable.
Provided that nobody has tampered with the wiring in my house, I expect that when I turn off the breaker to my stove, it will shut off. When I flip the breaker the other way, I expect my oven to turn on. Sadly, patients are not as predictable as my oven. “Responsiveness” to a treatment is only credibly viewed as an “immutable” property of a patient in a very narrow set of clinical scenarios. Even in situations where a patient’s response to a given exposure has, historically, been highly predictable (e.g., allergic reactions), responses nonetheless often attenuate over time. In oncology, where a tumour might initially respond to a treatment that blocks a biologic pathway driving the tumour’s growth, patients eventually, unfortunately, often stop responding to treatment.
Physicians have seen so many “unexpected” outcomes in their careers, that the unexpected is the only thing we have learned to expect in terms of patients’ response to treatment. If I have a patient with recurrent major depression, I am not the least bit surprised if the antidepressant that worked for her 5 years ago does not seem to work this time around. The same is true for treatment of many other conditions, including acute and chronic pain (e.g., migraine), therapies for substance abuse, epilepsy, lung disease, gynecologic disease, infectious disease…the list is endless. Rarely will a physician be surprised when a previously effective treatment does not generate the “expected” response.
An illustration of the pervasiveness of stochasticity in medicine and its impact on treatment “response”: I am not overly surprised if an otherwise healthy older patient who is anticoagulated for chronic atrial fibrillation nonetheless presents one day to the ER with a TIA. This is an “unexpected” event only in the sense that we had hoped that her anticoagulant would have made her absolute risk for TIA/stroke very low. But then, in follow-up a few days later in my office, the mystery is solved, when, on taking a careful history, the patient recalls that she had been distracted by an unplanned visit from her daughter and forgot to take her DOAC for 3 days prior to the event…
In short, human biology/physiology changes constantly and a patient’s “response” to a treatment is affected not only by these changes (which, in turn, are often affected by his environment), but also by innumerable ways in which his comorbidities interact over time, by his behaviour/decisions (in which case, physiologic complexity is effectively multiplied by behavioural complexity), and by innumerable stochastic factors that are part of everyday life. Physicians know that there are very few “immutable properties of u.”
I agree that this Tweet was imperfectly phrased, and that it makes an unconvincing and strong claim about deterministic counterfactuals. I had anticipated that there would be responses such as this one, which are very reasonable, and I had already planned to add my thoughts on Twitter. I can no longer find the original tweet, it appears as if the author may have recognised that it could be misinterpreted and deleted it (?). Either way, thank you for giving me an opportunity to respond here instead, without being constrained to 140 characters!
When using causal models, you can either rely on an ontology with deterministic counterfactuals (as in the tweet) or one with stochastic counterfactuals (where every individual u has an outcome that is drawn randomly from their individual counterfactual distributions f_u(Y_u)(1) and f_u(Y_u)(0)). My understanding is that almost all the foundational results in causal inference are invariant to whether the causal model uses deterministic or stochastic counterfactuals, but that it is sometimes didactically useful to focus on deterministic variables, which are often easier to understand. Given that the results hold under either model, this simplification often appears justifiable, not least because it significantly reduces notational load.
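To make the distinction concrete, here is a minimal sketch (in Python, with entirely made-up numbers) of what a stochastic counterfactual looks like for a single patient u: instead of two fixed values Y(1, u) and Y(0, u), the patient has two outcome distributions, and the individual-level causal effect is a contrast between those distributions.

```python
import random

random.seed(42)

# Hypothetical stochastic counterfactual distributions for one patient u:
# under treatment (a=1) and control (a=0), the outcome is drawn at random
# from a patient-specific distribution rather than being a fixed number.
def draw_outcome(a):
    if a == 1:
        return random.gauss(mu=120.0, sigma=8.0)  # e.g. systolic BP under treatment
    return random.gauss(mu=135.0, sigma=8.0)      # e.g. systolic BP under control

# The individual-level causal effect is then a contrast of distributions
# (here, a difference in means), not a difference of two fixed numbers.
n = 100_000
effect = (sum(draw_outcome(1) for _ in range(n)) / n
          - sum(draw_outcome(0) for _ in range(n)) / n)
print(round(effect, 1))  # close to 120 - 135 = -15
```

Note that averaging over patients (or over draws) gives the same mean contrast whether one writes the model with deterministic or stochastic counterfactuals, which is why the foundational results are invariant to the choice.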
I fully agree that treatment outcomes are highly situational and often unpredictable, and that they depend on a large number of factors which are too complex to model explicitly and are therefore best understood as “stochasticity”. But the crucial point I want to make is that if we want to make decisions to optimize outcomes, it is necessary to find a structure to the randomness: if everything were fully random (without any structure), we could never have any rational reason for preferring one treatment option over another.
Stochastic counterfactual distributions are ideally suited for representing the relevant kind of structure. If a person’s outcomes depend on a specific aspect of physiological complexity, there are some settings where this may be irrelevant to your decision making (such that it is acceptable to consider it “randomness” or “noise”), and other settings where causal reasoning will reveal that it needs to be tackled head-on in the analysis. Only with a causal model is it possible to fully clarify the reasoning that optimises predicted outcomes under intervention, and these causal models will use something like a counterfactual distribution to represent the structure that they impose on the stochasticity.
I don’t think anyone in causal inference expects treatment response to be constant over time. When they talk about counterfactuals as “immutable personal characteristics”, I believe they are talking about a highly situational construct that is in some sense “known to God” and that represents what will happen to the patient if they take treatment at some specified time, but which is certainly not assumed to be stable over time. At the very least, the original Tweet should have given a time index to the counterfactuals.
The distinction between stochastic and deterministic counterfactuals is philosophically very interesting, and there may exist multiple rational ways to think about these constructs. I think it was a mistake for the original tweet to imply that causal inference depends on a “realism” about deterministic counterfactuals.
Wonderful post. @Stephen has written extensively about this. My simple take on this work is this:
- If you enter the same patient multiple times in a clinical trial and each time measure their change from baseline in systolic blood pressure you’ll get a different change each time. Patients are inconsistent with themselves so there is a limit to “personalization”.
- If you did a 6-period 2-treatment randomized crossover study you can estimate the treatment effect for individual patients if blood pressure is measured precisely enough. That means that we can estimate the long-term tendencies for a patient (average of 3 measurements per treatment). But we still can’t estimate very well a patient’s current singly-measured blood pressure.
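The precision gain from averaging repeated measurements in a crossover can be sketched with a small simulation (hypothetical numbers; within-patient variability is the “inconsistency with themselves” above):

```python
import random
import statistics

random.seed(1)

# Hypothetical 6-period, 2-treatment crossover for a single patient:
# 3 periods on treatment A, 3 on treatment B, each measurement subject
# to within-patient variability.
true_A, true_B, within_sd = 140.0, 130.0, 6.0

def one_crossover():
    a = [random.gauss(true_A, within_sd) for _ in range(3)]
    b = [random.gauss(true_B, within_sd) for _ in range(3)]
    # Per-patient treatment effect estimated from the period averages
    return statistics.mean(b) - statistics.mean(a)

estimates = [one_crossover() for _ in range(20_000)]
print(round(statistics.mean(estimates), 1))   # long-run tendency: about -10
print(round(statistics.stdev(estimates), 1))  # ~ within_sd * sqrt(2/3), smaller than a single measurement's SD
```

The average of three measurements per treatment recovers the patient’s long-term tendency well, but any single measurement still carries the full within-patient noise, which is the point of the second bullet.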
Thanks for your thoughtful response.
I think this is really the crux of the issue. Some might argue that trying to find structure in randomness is to deny the very existence of randomness… Opinions about whether or not it’s futile to even try to find patterns in chaos seem (crudely) to delineate two ideologies: causal inference epidemiology and statistics.
In fairness, I don’t think that views of the two camps are as black and white as this. Plainly, causal inference proponents don’t deny the existence of randomness and statisticians don’t deny the possible existence of occult cause/effect relationships in the world. Rather, the two groups seem to differ in their opinions about what type(s) of evidence will allow us to make the best “bet” when we are treating our patients, with a view to optimizing their outcomes.
Causal inference epidemiologists seem to feel that the world isn’t quite as random as many believe, and that if we can chisel out even a few more cause-effect relationships from the apparent chaos, then maybe we can make better treatment decisions (?) Maybe there’s some truth to this view, but if the number of clinical scenarios in which it might apply is small (e.g., conditions with strongly genetically-determined treatment responses), then the cost/effort involved in trying to identify such relationships could easily become prohibitive. In the long run, addressing social determinants of health would likely pay off much more handsomely with regard to improving the health of societies. Seeing governments/research funders throw huge sums of money toward what many consider to be a fundamentally doomed enterprise is aggravating, to say the least.
But getting back to the paper linked at the beginning of this thread: Maybe I’m misconstruing, but the authors seem to believe that it’s possible to infer treatment-related causality for individual subjects enrolled in an RCT, simply by virtue of the fact that they had been randomly assigned to the treatment they ended up receiving. For example, they seem to assume that anybody who died while enrolled in a trial must have died as a direct result of the treatment (or lack thereof) he received during the trial. In turn, it seems that this belief stems from a more deep-seated conviction that a patient’s probability of “responding” to a treatment is somehow “predestined” or engrained in his DNA, and therefore will be consistent from one exposure to the next. People who have posted in this thread are trying to point out that this conviction is incorrect. And if this is the underlying assumption on which the promise of “personalized medicine” hinges, then health systems are throwing away a whole lot of money in trying to advance the cause.
If you were to review case reports for all subjects who died (unfortunately) while they happened to be enrolled in a very large, long-term clinical trial, you wouldn’t be surprised to find some subjects who died from airplane crashes, slipping on banana peels, pianos falling from the sky, aggravated assault, accidental drug overdose, electrocution, and myriad medical conditions that were completely unrelated to the treatment they received (e.g., anaphylactic reaction to peanuts, bacterial meningitis outbreak, forgetting to take an important medication…). Events like this are recorded in both arms of clinical trials and have nothing to do with the treatment being tested. Presumably, though, the more deaths that are recorded, the more convinced we can be that between-arm differences in the proportion of patients who died might be due, at a group level, to the treatment in question, rather than simple bad luck.
Even if the tweet referenced above is now deleted, there’s plenty of circumstantial evidence that the author fundamentally believes that patients can be viewed like circuit boards, as though they are intrinsically “programmed” (like a computer) to respond the same way whenever the “input” is the same.
I’m not a statistician nor an epidemiologist, so I don’t know how to phrase what I’m trying to say using math. But after practising medicine for 25 years, I’m not sure that it’s possible for physicians to make a better “bet” (in most, but not all, cases) regarding the treatments we select for patients than one that is grounded in the results from well-designed RCTs. This approach seems completely rational to me. Conversely, I perceive innumerable ways that we could fool ourselves in the process of trying to identify engrained “individual responses” in a sea of potentially random events.
This needs clarification: There are at least two major types of probabilities used in stochastic inference:
- Aleatory probabilities that are connected to physical processes such as the random treatment allocation in randomized controlled trials (RCTs). This is randomness that is based on a well-defined physical process, and its uncertainty can thus be validly quantified by standard statistical methodology.
- Epistemic probabilities that express our ignorance.
The two can be numerically equivalent and considered to express “randomness”. But they are fundamentally different as nicely described, e.g., here.
Because frequentism focuses on aleatory probabilities whereas Bayes allows both, these considerations can degenerate into the endless frequentist vs Bayes debate that would be counterproductive in this thread. Whether using a Bayesian or frequentist lens, a major task for physician scientists is to convert epistemic probabilities into aleatory ones as much as possible, chiefly through experimental design along with careful observations such as correlative analyses of patient samples. These need to be embedded in statistical models informed by causal considerations.
Almost every procedure used by “classical” statisticians imposes structure on randomness. Without such structure, there would be no way to do any kind of inference.
The difference between classical statisticians and causal inference epidemiologists is not that epidemiologists assume “more” structure and “less” randomness. In practice, we use the same estimation procedures, and to the extent that we have different preferences about model choice, those differences do not reflect different allowances for “randomness”.
Rather, the difference is that we (that is, epidemiologists) insist on having a language for reasoning about whether the assumptions we make about the structure of randomness are consistent with our beliefs about how reality works, so that those assumptions can be evaluated as an integral (and essential) part of the overall scientific inferential procedure.
Without cause-effect relationships, statistics gives absolutely no basis for making rational decisions; you might as well read tea leaves. Cause and effect is there whether we believe in it or not. We can either tackle this head-on with a scientific language for determining whether and how we can learn about causal effects from the data, or we bury our heads in the sand and hope that we magically get the right causal answer from non-causal statistical inference.
There are some statisticians who deny the usefulness of the counterfactual language. In my view, they are invoking the magical category of “randomness” to sweep the issue under the carpet, unilaterally declaring that their preferred modelling approach is the canonical one-size-fits-all procedure for imposing structure on randomness, even when alternatives are just as consistent with living in a stochastic world.
Finally, I want to note that despite its exaggerated claims of importance, the paper that is discussed in this thread describes a very idiosyncratic approach that can only be used in some highly artificial settings, and even then, with highly questionable utility. It is most certainly not an accurate summary of current thinking in personalized medicine, and it does not reflect consensus among the causal inference crowd.
This depends on what you mean. When applied to a causal design (e.g., a randomized experiment where there is no post-randomization trickery) causal language is hardly needed at all.
The real problem with causal epidemiology is when the rubber hits the road. Lots of methodologists talk about notation and theory but can’t give us a real complete case study based on real data – a case study in which the DAG is justified by the subject matter and all needed measurements are available in the data. A case study where the rest of us can learn how to do real and not theoretical causal inference. See the call for examples here.
Disagree. Causal language is absolutely pertinent to both my lab experiments and clinical trials. Just finalized and sent to my co-authors a draft manuscript showcasing how causal inference can inform the interpretation of RCTs in ways that you would very much agree with. In fact, I am using your RMS package among other tools to provide practical examples.
Finally we have found something we can mostly agree upon (even if others from “my side of the aisle” might take issue). In my view, the vast majority of the utility of randomized trials comes from the ITT analysis; and while the ITT analysis can certainly be understood from the perspective of causal inference, the required “causal” methodology is so trivial that there is no clear benefit to formalizing it.
The real problem with causal epidemiology is when the rubber hits the road. Lots of methodologists talk about notation and theory but can’t give us a real complete case study based on real data – a case study in which the DAG is justified by the subject matter and all needed measurements are available in the data. A case study where the rest of us can learn how to do real and not theoretical causal inference. See the call for examples here.
I would even mostly agree on this. It is indeed rare that DAGs are justified by subject matter knowledge, and I have very little confidence in most applications of observational causal inference. However, that is in no way an argument in favour of using classical statistics applied to observational data. Such analysis will have all the same problems, and just lack a framework for clarifying why its conclusions are likely biased.
As I have previously stated on Twitter, the vast majority of the benefit of the causal inference framework is going to arise from the incorrect causal conclusions that it helps us avoid, rather than the correct causal inferences that it assists us in making. Causal inference makes it possible to evaluate the plausibility of the assumptions that are required for the study to provide unbiased estimates of something that matters for decision making. In practice, a sincere analyst will almost always conclude that those assumptions are not plausible. In most settings, decision makers would be right to insist on randomized trials. The “Evidence Based Medicine” movement was fundamentally correct in their assessment of observational evidence (whether analyzed with traditional or causal methods).
I do however believe there are some settings where causal inference is worthwhile. In my view, the best “case studies” for showcasing causal inference from observational data will almost always be post-marketing studies on the adverse effects of medications. These are high-stakes decisions where we need to rely on the best available evidence, even if that evidence is flawed. Adverse effects tend to be very rare (meaning that RCTs are usually underpowered to detect them). Moreover, unintentional effects are much less subject to confounding by indication, meaning that it is much more plausible that we will be able to control approximately for confounding.
It is true that in most cases when a drug is convincingly found to have an adverse effect, the safety signal will be so strong that there is little risk of getting a different result if we rely on non-causal statistics. But if we are going to rely on observational data, I don’t think it hurts to do it correctly…
Love disagreements! The ITT analysis allows physically justifiable measurement of uncertainty. However, for clinical practice, as opposed to health policy, it is the PP analysis (or “as-treated” more often used for medical devices) that is actually more pertinent. And much harder to debias without the use of causal tools. Nice overview here, as I am sure you are aware.
Excellent note. I especially like this:
I think you are overrating the value of causal language for clinical trials that are in the ITT mode. I am very willing to say succinctly in such cases that, from our data generating model, E(Y|X, tx=B) - E(Y|X, tx=A) is our causal estimand.
Now something to really disagree on! As-treated/PP is not very useful to the physician making (with the patient) a treatment decision at time zero. It doesn’t yield time-forward prospective estimates.
Nope. See this thread for why, even in full ITT mode, trialists will make tons of mistakes when not thinking about the processes generating the RCT data. This is the basis of the whole mistaken notion whereby simple correlations of overall survival (OS) with an intermediate endpoint magically treat the OS estimate as a gold standard despite using a bogus OS estimand. The field of dynamic treatment regimes in statistics evolved to protect us against exactly these mistakes.
This mistakes valid statistical inference (ITT) with the estimand we clinicians truly want. Take for example this RCT that recently created commotion on Twitter. In the ITT analysis, the “colonoscopy” group is patients who were allocated to receive an invitation to undergo colonoscopy. When I discuss in clinic with my patients, we are interested in what happens if they actually get the colonoscopy, not what happens if they receive an invitation. That’s because they won’t receive such an invitation from a trial group. We will make decisions together on whether or not to actually do the colonoscopy. And for that we need to estimate the potential outcomes of actual colonoscopy versus no colonoscopy. This is much harder to estimate than the ITT, but it is nevertheless what we actually want.
I can see this more for the colonoscopy example than for medications. The point about meds is that it’s not “do this now or don’t do it at all” but rather degrees of adherence, and the adherence over time is unpredictable. The ITT estimate averages over adherence observed in the trial, assumes adherence in the field is fairly similar, and that is our current best guess of what benefit the patient will receive.
If you really want a hypothetical “if you adhere to the treatment fully” estimate, I’m sure we’ll both agree that decent estimates of that come only from RCTs, where, under one assumption, randomization gives you a perfect instrument for an instrumental variable analysis to estimate efficacy under perfect adherence. Causal inference comes into that.
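The instrumental-variable logic can be sketched with the standard Wald estimator and hypothetical trial numbers (all figures below are made up for illustration; randomized assignment/invitation serves as the instrument):

```python
# Sketch of the instrumental-variable (Wald) estimator using randomization
# as the instrument, with hypothetical numbers. Under monotonicity and the
# exclusion restriction, ITT effect / difference in treatment uptake
# estimates the effect among compliers ("efficacy under adherence").

# Hypothetical colonoscopy-style trial data (invitation = instrument):
p_outcome_invited     = 0.008  # event risk in the invited arm
p_outcome_not_invited = 0.010  # event risk in the usual-care arm
p_treated_invited     = 0.42   # uptake among those invited
p_treated_not_invited = 0.00   # no access without invitation

itt_effect  = p_outcome_invited - p_outcome_not_invited  # the diluted ITT contrast
uptake_diff = p_treated_invited - p_treated_not_invited  # first-stage strength

complier_effect = itt_effect / uptake_diff
print(round(complier_effect, 4))  # larger in magnitude than the ITT effect
```

The division by uptake “undilutes” the ITT effect, which is exactly why the complier estimate speaks to the patient who has already decided to actually undergo the procedure.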
I’ll read the links you provided before commenting on the other part. Thanks for a great dialog.
Exactly! I wrote something very similar on the aforementioned draft regarding approaching such problems as an IV analysis using RCT data. Will post a link to the article once it is out.
Both points I made above are two sides of the same coin. It gets even better: the advantage of Bayes is that it gives us the flexibility to intuitively model these challenges as hybrid problems of both ignorance and randomness (or other physical design processes such as blocking etc). However, we have to be constantly mindful of connecting our models with putative causal mechanisms when we do that. This was insisted upon primarily by traditional frequentists. It is good advice.
I’ve read the post which is really fantastic. So I conclude that what I was advocating applies to disease-free survival time but not to overall survival. For the latter, the ITT treatment effect estimates a policy estimand, e.g., compares those randomized to treatment B with all the subsequent treatment modifications that happened to them with those randomized to treatment A with all the subsequent developments happening to them. I think this is still a causal estimand, it’s just a policy estimand rather than a “we control what happens” estimand.
A side question for you: If we do a state transition model and use it to estimate state occupancy probabilities such as P(pt has disease returned by 6m and is alive), will that provide anything useful to the discussion? This is a simple unconditional probability (except for conditioning on treatment and baseline covariates). One can also get overall survival probabilities from this model, but they may have the same problem you wrote about (except that the model will fit better because you can allow different covariates for death vs. for the disease recurrence state).
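The kind of state transition model I have in mind can be sketched as a discrete-time multi-state (Markov) model; the monthly transition probabilities below are made up for illustration, and in practice they would be estimated per treatment arm and covariates:

```python
# Minimal sketch of a discrete-time multi-state (Markov) model with three
# states: 0 = alive & disease-free, 1 = alive with recurrence, 2 = dead.
# Transition probabilities are hypothetical monthly values.
P = [
    [0.90, 0.07, 0.03],  # from disease-free
    [0.00, 0.92, 0.08],  # from recurrence (no transition back to disease-free)
    [0.00, 0.00, 1.00],  # death is absorbing
]

def step(occ, P):
    # One month: new occupancy vector = occ multiplied by the transition matrix
    return [sum(occ[i] * P[i][j] for i in range(3)) for j in range(3)]

occ = [1.0, 0.0, 0.0]  # everyone starts alive and disease-free
for _ in range(6):
    occ = step(occ, P)

print(round(occ[1], 3))      # P(disease returned by 6m and patient alive)
print(round(1 - occ[2], 3))  # overall survival at 6m from the same model
```

Both the recurrence-and-alive occupancy probability and overall survival come out of the same fitted model, which is what makes the question about whether the OS estimand inherits the earlier problem an interesting one.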
Exactly! And this key point becomes clear when we draw the causal graphs. Otherwise it is very hard to see what the modeling challenges are here.
When thinking of overall survival, we are looking for decision rules that will lead to optimal long-term benefits. Such decision rules need more work to be estimated than standard RCT models will provide for chronic diseases, such as many cancers nowadays. Good problem to have, it’s a consequence of our patients living longer and better. But a challenge nonetheless for health systems, regulators, patients, clinicians, and methodologists.
I would expect yes. It is a more complete model. But having gone through the laborious exercise of estimating decision rules from RCTs that randomly allocated interventions sequentially (ideal scenario that almost never happens in oncology) what I learned is the importance of quality data. No amount of elaborate modeling can salvage an RCT that did not collect the right data. And the best way to see in the design phase ahead of time what we need is to draw the causal diagrams. Intuitively, people can sense that we’ll need information on subsequent therapies. But what the graphs reveal is that we’ll also need information on covariates at transition times. This way we can make sure to collect them.
One of my favorite models for such purposes is described in this JASA paper that generated a lot of discussion at the time within the statistical methodology community. Figure 1, which sets up the challenge, is a causal diagram. It is essential for the model. Causality lies at the foundations of all statistical modeling. In fact, some of the current ongoing twitter discussions that prompted this thread repeat heated arguments between Fisher and Neyman, just using different terminology. They both undoubtedly thought causally despite lacking today’s richer notation and approaches.
Hi Pavlos
As noted previously, I think that the type of work you’re doing is truly unique. I hope that this uniqueness is being fully appreciated/recognized by your colleagues (I strongly suspect that it is).
You seem to belong to a very small/rarefied group of researchers in the world. Your specialization in oncology presents a multitude of important longstanding/unresolved challenges [e.g., how to identify therapies that, when used serially with other therapies, can be expected to improve a patient’s overall (not just progression-free) survival]. In turn, these challenges have, effectively, forced “outside the box” thinking, ultimately causing you to ask if there might be a role for causal modelling in optimizing RCT design. Then you went a huge step further, learning the language of causal inference epidemiology to see how this might work.
Now, after learning the language of causal modelling, you have identified an important potential niche for it in optimizing the design of oncology clinical trials. To me, this work feels like the perfect clinical application of DAGs, and the one with the most potential to impact patient care.
I suspect that a key problem, to date, has been that researchers trained in more modern causal inference methods have perhaps lacked the clinical background and/or clinical incentive to search for alternate applications of these methods that would be accepted by the clinical community.
Historically, clinicians/statisticians have balked at the seemingly never-ending promotion of DAGs as a way to derive causal inferences from observational data alone. Clinicians have pushed back, asking: 1) given that these methods have been around for years, why are we only seeing them used in a very small fraction of published observational research?; and 2) why would we ever believe that these methods can generate results that are reliable enough to influence patient care decisions (except, perhaps, in the case of strong, consistent safety signals derived from well-conducted studies)? At the end of the day, it’s unlikely that we will ever consider non-randomized evidence to be on par with randomized evidence for the purpose of assessing treatment efficacy.
What you have done, effectively, is to carve out a niche for these methods as a way to optimize RCT design (the study design that clinicians consider to be optimal, in most cases, for guiding clinical decision-making), de-emphasizing the historical push for their application as “stand-alone” methods to make causal inferences from observational data.
This is all very exciting…
You are very kind, thank you for summarizing these efforts better than I ever could!
A lot of this is just us standing on the shoulders of giants, many of whom are regulars on this forum. For example, @Stephen taught us that if something is not helpful in RCTs then it is even less likely to be useful in analogous observational studies, whereas the converse is not necessarily true. And it is also his insistence on debating the Lord’s paradox (latest post here with discussion and links to previous entries) that is nicely highlighting limitations / challenges for graph-based causal inference schools.
It is indeed true that when a framework is shown to work empirically, it gains acceptance. We are catching flaws in our studies earlier and designing them more efficiently to learn from mistakes and gradually improve patient care. And because this makes more people get involved, they then bring their own unique perspectives into the mix, creating a nicely dynamic ecosystem.