When people on a statistics forum criticise causal inference experts in medicine, I am going to insist on interpreting the intended target of that criticism as the actual experts working on causal inference in medicine.
There are multiple books written on this topic. Pick up one of them, for example “Causal Inference: What If” by Hernán and Robins or “Causality” by Pearl. Both build a complete and coherent theory of causal inference that simply does not depend on the choice of effect measure.
I just don’t get what would drive a person to make completely unsubstantiated claims that the risk difference plays a central role in causal inference, and then to psychoanalyse causal inference experts, claiming that their alleged defence of the risk difference stems from their theory depending on it.
I was just using it informally to mean that the theory covers all the relevant inferential steps, and that you can always use it to find out whether an estimand is identified given your causal model (and, if so, what the identifying expression is). Maybe I should have phrased that differently. Either way, I don’t think it is very central to this discussion.
To answer Anders’ request, which I think points to a view seen at least by the 1970s, when the modern “causal inference” movement was gestating: Suppose we are dealing with a real-world problem about comparing and choosing interventions, rather than starting as in statistics books with a contextually unmotivated parametric model. In that case we could say we want a family of distribution functions f(y_x;z,u) that predict, as accurately as possible given the available information, what the outcome y of individual unit u will be when given different treatments x; z is the unit’s measured-covariate level and so part of the information available for the task. There are citations expressing that goal or functional target going back to the 1920s, allowing for the radical changes in language and notation since. By 1986 Robins had extended x and z to include longitudinal treatment regimes and covariate histories in the outcome function, and by 1990 fairly stable terminology and notation had appeared.
In this view, causal estimation or prediction then comes down to fitting models for the potential-outcome distribution family f(y_x;z,u), not some regression analog E(y;x,z,u) - although under some “no-bias” or identification assumptions made all too casually, the PO family will functionally equal a regression family. The key point now, however, is that having the function f(y_x;z,u) would render the issue of effect measures pointless for any practical decision-making.
The problem I see with most effect-measure debates is that they lose sight of the original nonparametric target. No one should care about ratios f(y_1;z,u)/f(y_0;z,u), differences f(y_1;z,u)-f(y_0;z,u), or more complex contrasts among the f(y_x;z,u) such as odds ratios, except as conveniences under (always questionable) models in which they happen to reduce to parameters in the models, or to exponentiated parameters. These parametric models are simply artifices that we use to smooth down or reduce the dimensionality of the data and fill in the many blanks in our always-sparse information about f(y_x;z,u), a function which we never observe for more than one x per u.
Parametric smoothing can reduce variance at the cost of bias, a trade-off that varies quite a bit across target quantities. Background substantive theory (as Norris calls for) - or, as I prefer to call it, contextual information - can help choose a smoothing model that handles this trade-off better than the usual defaults. But one should not confuse minimizing loss for estimating a measure or model parameter with minimizing loss for estimating a practical target.
In this regard, a central point of Rothman et al. (AJE 1980) can now be restated as follows: often, when the target outcome is risk (not log risk or log odds), the z-specific causal risk differences (cRD) are proportional to the estimated loss differences computed from f(y_x;z,u), which makes the cRD the most relevant summaries in those situations. This does not, however, justify additive-risk models for estimation, nor does it justify reporting some (often absurd) “common risk difference” instead of computing the cRD as a function of z. Without solid contextual information to do better, smoothing will often best be based instead on shrinkage toward a loglinear model, which will automatically obey logical range restrictions and yield estimates that can be easily and accurately calibrated in typical settings.
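For concreteness, here is a minimal sketch of computing the cRD as a function of z (simulated data with hypothetical variable names; plain logistic smoothing stands in for the shrinkage approach described above):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)                  # hypothetical measured covariate
x = rng.integers(0, 2, size=n)          # randomized binary treatment
true_risk = 1 / (1 + np.exp(-(-1.0 + 0.8 * z - 0.7 * x)))
y = rng.binomial(1, true_risk)

# Logistic smoothing of our sparse information about f(y_x; z, u)
X = sm.add_constant(np.column_stack([x, z]))
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

# z-specific causal risk difference: predicted risk with x set to 1
# minus predicted risk with x set to 0 (justified here by randomization)
zg = np.linspace(-2, 2, 9)
X1 = sm.add_constant(np.column_stack([np.ones_like(zg), zg]), has_constant='add')
X0 = sm.add_constant(np.column_stack([np.zeros_like(zg), zg]), has_constant='add')
for zi, d in zip(zg, fit.predict(X1) - fit.predict(X0)):
    print(f"z = {zi:+.1f}   cRD = {d:+.3f}")
```

The point of the printout is that the cRD varies with z as a matter of course; no “common risk difference” is assumed anywhere.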
To conclude: effect summarization seems essential for communicating results, but it also seems to mislead when taken as the ultimate, proper, or only goal of estimation. As a prime example, the effect-measure debate falls prey to replacing a difficult but core practical question - what is f(y_x;z,u)? - with mathematically more tractable oversimplifications and an exclusive focus on the pros and cons of effect summaries, such as contrasts of x-specific averages over the f(y_x;z,u). This focus fails to directly answer the original question, or even to recall that f(y_x;z,u) is what we ultimately need for rationally answering practical questions like “what treatment x should we give to patient u or group u whose known characteristics are z?”
This formulation of the problem seems equivalent to the approach to statistical inference recommended by Seymour Geisser that I discussed earlier in this thread:
On the first two pages of chapter 1 of Predictive Inference: An Introduction, he summarizes a number of your cautions in one lengthy paragraph:
> …This measurement error convention, often assumed to be justifiable from central limit theorem considerations…was seized upon and indiscriminately adopted for situations where it is dubious at best and erroneous at worst. … applications of this sort regularly occur … much more frequently in the softer sciences. The variation here is rarely of the measurement error variety. As a true physical description, the statistical model used is often inappropriate if we stress testing and estimation of “true” entities, the parameters. If these and other models are considered in their appropriate context they are potentially very useful, i.e. their appropriate use is as models that can yield approximations for prediction of further observables, presumed to be exchangeable in some sense with the process under scrutiny. Clearly hypothesis testing and estimation as stressed in almost all statistics books involve parameters. Hence this presumes the truth of the model and imparts an inappropriate existential meaning to an index or parameter. Model selection, contrariwise, is a preferable activity, because it consists of searching for a single model (or a mixture of several) that is adequate for the prediction of observables even though it is unlikely to be the true one. This is particularly appropriate in the softer areas of application, which are legion, where the so-called true explanatory model is virtually so complex as to be unattainable…
Geisser later points out (page 3):
> As regards to statistical prediction, the amount of structure one can reasonably infuse into a given problem could very well determine the inferential model, whether it be frequentist, fiducialist, likelihood, or Bayesian. Any one of them possesses the capacity for implementing the predictive approach, but only the Bayesian mode is always capable of producing probability distributions for prediction.
Given that background, what am I to make of the criticisms of statisticians by causal inference proponents? It seems to me that none of their criticisms are relevant to either a Bayesian approach or the frequentist approach recommended by @f2harrell in Regression Modeling Strategies. At best, they criticize a practice of statistics that no one competent would advocate.
For illustration you could take any paper that assumes f(y_x;z,u) = E(y;x,z,u) or equivalent (e.g., that z is sufficient to “control bias” in the sense of forcing this equality) and then illustrates fitting of E(y;x,z,u) with an example. A classic used when I was a student was Cornfield’s fitting of a logistic model and then using that model, with x fixed at a reference level across patients, to compute risk scores for patients. See for example p. 1681 of Cornfield, ‘The University Group Diabetes Program: A Further Statistical Analysis of the Mortality Findings’, JAMA 1971. His is a purely verbal description, there being no established notation back then. In modern terms, what he was doing there is arguing against there being much confounding in the trial, because the distribution of the fitted E(y;0,z,u) - his proxy for the f(y_0;z,u) distribution - was similar in the treated and untreated groups.
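In modern code, that maneuver might be reconstructed roughly as follows (simulated data; the variable names and data-generating model are my hypotheticals, not Cornfield’s):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
z = rng.normal(size=n)                  # baseline covariate
x = rng.integers(0, 2, size=n)          # treatment indicator
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.6 * z - 0.4 * x))))

X = sm.add_constant(np.column_stack([x, z]))
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

# Fix x at the reference level 0 for all patients and compute risk scores:
# the fitted E(y; 0, z, u), serving as a proxy for the f(y_0; z, u) distribution.
X_ref = sm.add_constant(np.column_stack([np.zeros(n), z]), has_constant='add')
score = fit.predict(X_ref)

# Compare the risk-score distribution between the actual treatment groups;
# similar distributions suggest little confounding by the measured z.
print("mean score, treated:  ", round(score[x == 1].mean(), 3))
print("mean score, untreated:", round(score[x == 0].mean(), 3))
```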
More recent examples in modern notation can be found under the topics of g-estimation and, especially, finding “optimal” treatment regimes. Once the fitted function f(y_x;z,u) is in hand, focus then turns to external validity: whether z is sufficient for transporting the function to a new target population beyond those studied. This problem is more general and difficult than the internal validity issue of whether the fitted potential outcomes can be transported across the treatment or exposure groups within the study - see p. 46 of the Hernan-Robins book for a discussion.
There are now many papers on finding functions for making treatment choices (“optimizing” treatment regimes) and transporting those functions. I haven’t even begun to read them all and would not claim to be able to recommend one best for your purposes. Nonetheless, most I see are illustrated with real examples. One clearly centered around individual patient choices is Msaouel et al. “A Causal Framework for Making Individualized Treatment Decisions in Oncology” in Cancers 2022, which I believe was posted earlier in this thread. One could object to its use of additive risk models and risk differences, but the same general framework can be used with other models.
Regardless of the chosen smoothing model, one can display the risk estimates directly instead of their differences. Survival times and their differences might however be more relevant than risks (probabilities). In either case, the use of differences is defensible to the extent that the chosen differences are proportional to loss differences, which I am pretty sure is the case far more often for risk and survival-time differences than for odds ratios (I’ve yet to see a real medical example in which odds ratios are proportional to actual loss differences).
R^3: I don’t see where Geisser or others making similar arguments (there have been many) address the key distinction between the causal (potential-outcome) function f(y_x;z,u) and the purely predictive (regression) function E(y;x,z,u). That is the core criticism as I see it. The Bayesian framework does not include this key component, so it has to be added on; this gap in the framework may explain why some Bayesians failed to understand the role of randomization. Interestingly, in e-mails with Pearl toward the end of his life, Lindley apparently recognized and conceded the need for a causal extension.
In contrast, starting in the 1920s frequentists developed the necessary language for that component as it arose naturally from randomization theory, so it is rather startling that it failed to take firm hold until recent decades. Yes, experienced, intuitively smart statisticians got by without causal formalisms, and “causal inference” can be framed as a prediction problem, enabling the vast toolboxes of statistical prediction to be applied (whether frequentist, Bayesian, hybrid, etc.); see, e.g.,
Greenland, S. (2012). Causal inference as a prediction problem: Assumptions, identification, and evidence synthesis. Ch. 5 in: Berzuini, C., Dawid, A.P., and Bernardinelli, L. (eds.). Causality: Statistical Perspectives and Applications. John Wiley and Sons, Chichester, UK, 43-58.
But methodologies still need to make the fundamental distinction encapsulated as “correlation is not causation” in order to derive sound algorithms for making decisions.
I argue further that any sound statistical algorithm needs an explicit causal foundation, even if it is only a survey method, because all studies need to consider causes of observation selection and missing entries:
Greenland, S. (2022). The causal foundations of applied probability and statistics. Ch. 31 in: Dechter, R., Halpern, J., and Geffner, H., eds. Probabilistic and Causal Inference: The Works of Judea Pearl. ACM Books, no. 36, 605-624; corrected version at arXiv:2011.02677.
I think recognition of this need for explicit causal models is one way the AI/computer-science literature pulled ahead of the statistics literature in the 1990s-2000s.
Appreciate the reference. I am stunned by how elegantly you have been summarizing such complex topics in these last few posts. And you even predicted our upcoming project which is indeed to apply the framework using more flexible models. It is foreshadowed in the parts where we discuss Bayesian nonparametric modeling. Having said that, because we do not always have the data or the methodology available to apply such flexible modeling for many clinical decisions, I do indeed also think that the research done by @AndersHuitfeldt and others can be important for practical applications in medicine. My intuition is that his insistence on anchoring relative treatment effect summary measures to potential biological / mechanistic interpretation will pay off.
Isn’t an RCT directly transportable to clinical settings? If an RCT shows ATE > 0, then, on average, the treatment has a positive effect. That seems like useful information in a clinical setting.
The problem from a doctor’s point of view is that there are varying degrees of disease severity, so when the result of an RCT is applied to an individual patient, the risk difference (AKA absolute risk reduction) has to take this into consideration. This means that baseline numerical test results providing a measure of disease severity for the treatment and control groups have to be available from the RCT in order to make the above assessment.
Confounding can occur when patients (or their doctors) choose treatment.
A doctor advising a patient whether to accept or choose treatment will take into account not only the severity of the condition but also what dose to take. Both affect the intended effect of treatment as well as its adverse effects. I can’t think offhand of an example where this process results in confounding, i.e., where the choice and ‘doing’ of treatment result in an increased frequency of a desired outcome over and above what would be expected from an RCT.
Why couldn’t this excess death rate be due to poor choices of treatment in the observational setting?
I suppose that this could happen if a drug was given inappropriately (e.g. giving intravenous fluids to someone in congestive heart failure who is already overloaded with fluid, which would be perverse and malpractice). Excess death due to a drug would by definition be an adverse effect or ‘harm’. All treatments can cause benefit and harm, but by different causal mechanisms. One cannot prove benefit or harm in an individual because, as I understand it, many may experience a good outcome anyway without a drug (e.g. no MI without a statin) and many will also suffer what appears to be a drug’s adverse outcome without taking the drug (e.g. muscle pain without taking a statin).
Context is important, so I would have thought any discussion about cause and effect must take place with a detailed understanding of the biological mechanisms involved, including the multiple feedback mechanisms. The result of such reasoning will always be a hypothesis that has to be tested by experimentation (i.e. RCTs, observational studies, etc.), the results of which will be uncertain due to limited data and stochastic processes. I have experimented with fitting parametric distributions, splines and logistic regression functions and then placing confidence intervals on the estimated probabilities of outcomes conditional on disease severity, treatment and control, and also confidence intervals on the differences between these probabilities.
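As a minimal sketch of that exercise, assuming a logistic model for the outcome given severity and treatment, with a percentile bootstrap for the interval on the difference between probabilities (all names and numbers are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 800
sev = rng.gamma(2.0, 1.0, size=n)       # hypothetical severity measure
trt = rng.integers(0, 2, size=n)        # treatment vs. control
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 0.5 * sev - 0.8 * trt))))

def risk_difference(sev_value, sev_a, trt_a, y_a):
    """P(outcome | treated) - P(outcome | control) at a given severity."""
    X = sm.add_constant(np.column_stack([trt_a, sev_a]))
    fit = sm.GLM(y_a, X, family=sm.families.Binomial()).fit()
    x1 = np.array([[1.0, 1.0, sev_value]])   # columns: [const, trt, sev]
    x0 = np.array([[1.0, 0.0, sev_value]])
    return (fit.predict(x1) - fit.predict(x0)).item()

# Percentile bootstrap for the difference at severity = 3
boot = []
for _ in range(500):
    i = rng.integers(0, n, size=n)
    boot.append(risk_difference(3.0, sev[i], trt[i], y[i]))
est = risk_difference(3.0, sev, trt, y)
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"risk difference at severity 3: {est:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```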
One way to connect these would be to take, for example, a patient who has prostate cancer. X is the choice between treatment and control (control here being surveillance, i.e., no treatment). Y is the outcome of interest, which here can be overall survival time (expressed for example as posterior mean overall survival probability). U denotes the specific patient for whom we are called upon to make the choice. Z is a vector of all his covariates that can influence Y. Thus Z includes disease severity, but also other considerations such as the patient’s comorbidities as well as biomarkers on the tumor cells reflecting the mechanisms targeted by the treatment. Small letters denote specific values of these variables.
While disease severity is certainly an important consideration when estimating the effect of X=x on Y, it is not enough. For example, the patient may have relatively indolent prostate cancer (disease severity) and severe cardiovascular comorbidities that could actually result in decreased survival time if we choose a therapy such as androgen deprivation therapy (ADT) as opposed to surveillance. His cancer may also harbor mutations in androgen receptor signaling (again different from disease severity) that can negate the effect of ADT on the cancer even though it will still yield cardiovascular toxicity.
I think these are important questions and look forward to specific clinical illustrations as answers. One thing I would add is that in your first post tagged above, you plotted probabilities in the example from a single model; different models (logit or log-binomial, for example) will return similar probabilities on the same dataset - what will differ is the value of the effect measure in strata with different baseline risks. A dataset can indeed be created that returns a constant RR, OR, or RD over strata that differ by baseline risk, and predictive margins plotted from the different models on datasets with different constant effects should be illustrative.
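A small sketch of that comparison (simulated data generated with a constant risk difference; the logit and log-binomial fits of the same dataset return similar predicted probabilities even though their native effect measures behave differently across strata; note that log-binomial fits can fail to converge in practice):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 3000
z = rng.uniform(0, 1, size=n)           # covariate driving baseline risk
x = rng.integers(0, 2, size=n)
y = rng.binomial(1, 0.05 + 0.15 * z + 0.07 * x)   # constant-RD generation

X = sm.add_constant(np.column_stack([x, z]))
logit_fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
logbin_fit = sm.GLM(
    y, X, family=sm.families.Binomial(link=sm.families.links.Log())
).fit()

# Predictive margins over a grid of z: compare the implied risk differences
zg = np.linspace(0.1, 0.9, 5)
G1 = sm.add_constant(np.column_stack([np.ones_like(zg), zg]), has_constant='add')
G0 = sm.add_constant(np.column_stack([np.zeros_like(zg), zg]), has_constant='add')
for name, fit in (("logit", logit_fit), ("log-binomial", logbin_fit)):
    print(name, np.round(fit.predict(G1) - fit.predict(G0), 3))
```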
Thank you @Pavlos_Msaouel and @S_doi. I agree that there are many ways of creating predictive models, many of which may give similar results. I also agree that there are a number of ways of measuring disease ‘severity’. However, we have to identify measures of severity that are highly predictive of the chosen outcome and which also identify subjects that will respond to the treatment. Instead of using one measure of severity (e.g. the albumin excretion rate (AER), as I do), one could incorporate other measures to form a score (e.g. combining the AER, HbA1c, etc.) and then compare the predictive power of the calibrated score with each individual measure (e.g. the AER or HbA1c alone). There might well be little difference, for example because the angiotensin receptor blocker (ARB) would have little effect on those diabetic patients with a high HbA1c. The baseline risk might be higher conditional on the placebo, HbA1c and the AER, but the risk reduction due to the ARB would be no greater than that conditional on the AER alone, as the ARB would not improve diabetic control.
Another central issue is how well calibrated the probabilities of the outcome are, which I don’t think featured in the recent discussion, including by @Sander_Greenland. This of course raises the question of how we should calibrate. One simple preliminary test of calibration that I use is that (1) the average of all probabilities read from a curve should match the overall frequency of the outcome and (2) the average of all probabilities read from a curve up to some point should match the overall frequency of the outcome up to that point, the same applying to those above the point. If the probabilities are not well calibrated then we have to derive some function that makes them so. The data used to calibrate then have to be regarded as ‘training’ data and the calibrated curve tested on another dataset. I think the issue of calibration needs clarifying.
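For concreteness, a minimal version of checks (1) and (2), assuming an array p of probabilities read from the curve and an array y of observed 0/1 outcomes:

```python
import numpy as np

def calibration_check(p, y, cut=0.5):
    """Crude mean-calibration checks of predicted probabilities p
    against observed binary outcomes y."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    # (1) overall: mean predicted probability vs. overall outcome frequency
    print(f"overall:   mean p = {p.mean():.3f}   observed = {y.mean():.3f}")
    # (2) split at a cut point: the same comparison below and above the cut
    for name, m in (("below cut", p <= cut), ("above cut", p > cut)):
        if m.any():
            print(f"{name}: mean p = {p[m].mean():.3f}   observed = {y[m].mean():.3f}")
```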
My journey to understand this thing called “evidence based decision making” started with a vague intuition that the norms taught to me and my colleagues were logically flawed. Since then I’ve been motivated to borrow tools from mathematical and philosophical logic to formalize this beautiful narrative description of principled and honest scientific discourse by Paul Rosenbaum in his book Observational Studies (section 1.3). Here is an excerpt, but the entire short section is worth thinking about.
I’ve learned much from Sander’s (and Frank’s) writings and posts. I credit his presentations on rebuilding statistics upon information theoretic grounds as crucial for scientific understanding.
I find it surprising that there are a large number of scholars who think practicing good science is distinct from (Bayesian) decision theoretic considerations. As a first order approximation of a rational scientific actor, an agent who attempts to maximize the information from “experiments” (defined to include observational studies) seems like a good starting point.
I acknowledge this is a minority position, but after much study, I have to disagree with the causal inference scholars who claim probability is not a strong enough language with which to express causal concepts. There were some interesting Twitter threads (now deleted, sadly) where Harry Crane and Nassim Taleb challenged Pearl on his position that causation is outside of standard statistical inference.
Causal inference is closely related to exchangeability, and disagreements about study design are better discussed in terms of what factors render the groups being considered not exchangeable.
Causal inference is just inference.[1] A community of scholars can be modeled as a group of bettors; those with the best models of future observations are the ones whose forecasts enable them to win more than they lose on a consistent basis. Converging to the best causal model ends the process of betting on outcomes - unless someone finds an anomaly worth betting on, of course.
Possession of good causal models enables one to be like the gambler in J. L. Kelly’s paper A New Interpretation of Information Rate.
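For a concrete version of Kelly’s gambler (my sketch; Kelly’s paper supplies the information-rate interpretation): the bettor stakes the fraction of wealth that maximizes expected log-growth, so bettors with better probability models compound their wealth faster:

```python
import numpy as np

def kelly_fraction(p, b):
    """Bet fraction maximizing E[log wealth] when the win probability
    is p and the bet pays net odds b-to-1: f* = p - (1 - p) / b."""
    return p - (1 - p) / b

def log_growth(f, p, b):
    """Expected log-growth per bet at fraction f under true win probability p."""
    return p * np.log(1 + f * b) + (1 - p) * np.log(1 - f)

p_true, b = 0.55, 1.0
for p_model in (0.50, 0.55, 0.60):      # three bettors' probability models
    f = max(0.0, kelly_fraction(p_model, b))
    print(f"model p = {p_model:.2f}  stake = {f:.2f}  "
          f"growth = {log_growth(f, p_true, b):+.4f}")
```

Here the bettor whose model matches the true p = 0.55 grows fastest, while the overconfident one (p = 0.60) overbets and loses in the long run - the sense in which the betting process converges on the best causal model.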
This comment from Daniel Lakeland in that thread sums up my attitude on this elegantly:
> I found it very frustrating to talk with Pearl regarding these issues (there was a long exchange between us on this blog about 3 or 4 years back), because I came to the conclusion just as you have that his understanding of what is probability theory and statistics is entirely frequentist … and my understanding was Bayesian… and so we talked past each other… He even acknowledged knowing about the development of the Cox/Jaynes theory of probability as extended logic, but seemed to gloss over any actual understanding of it.
The outcome in this case is a continuous variable, or categorical with bins for ranges of severity. Expected values comprise this ATE: E[Y_1 - Y_0]. I understand that a physician will still want more information from the studies. Separating the ATE into its components, E[Y_1] and E[Y_0], would seem useful. Nevertheless, if no other information is available, an RCT’s ATE result would seem to be useful to some degree.
One simple example of a confounder would be income. A patient with high income can afford a treatment recommendation and is also less likely to suffer from low-income conditions like malnutrition or financial stress. This can increase both the likelihood of undergoing the treatment and the likelihood of a positive outcome. This particular confounder may be weak, and a given situation may simply not have any strong confounders. In that case the data from the observations would not contribute much to probabilities of causation such as P(\text{benefit}) and P(\text{harm}).
Confounding in observational studies is actually helpful when estimating probabilities of causation, because it can narrow the bounds on them. But a good observational study may not be feasible or available.
True, we can’t go back in time and withhold a drug that was administered (or vice-versa) to see what the two results would have been for an individual. We can only estimate ranges of probabilities of how a person will react, with and without treatment.
Yes, ideally we encapsulate these biological mechanisms in the form of a DAG and possibly functional mechanisms between variables (or at least constraints on these functions). However, in the absence of biological mechanisms, we can still compute probability ranges on probabilities of causation. Those ranges might be looser/wider without the expert knowledge, but they can often be narrow enough to make good decisions from.
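As one concrete rendering of those ranges, here is my sketch of the Tian-Pearl bounds on the probability of benefit, P(y_t, y'_c), from experimental data alone and then narrowed with observational data (the formulas are from the published literature; all input numbers are hypothetical):

```python
def benefit_bounds(p_yt, p_yc, obs=None):
    """Bounds on P(benefit) = P(y_t, y'_c) for binary treatment and outcome.
    p_yt = P(y | do(treatment)), p_yc = P(y | do(control)) from an RCT.
    obs, if given, holds observational joint probabilities
    P(x,y), P(x,y'), P(x',y), P(x',y') under free treatment choice."""
    lo = max(0.0, p_yt - p_yc)
    hi = min(p_yt, 1.0 - p_yc)
    if obs is not None:
        p_y = obs["xy"] + obs["x'y"]    # observational P(y)
        lo = max(lo, p_y - p_yc, p_yt - p_y)
        hi = min(hi, obs["xy"] + obs["x'y'"],
                 p_yt - p_yc + obs["xy'"] + obs["x'y"])
    return lo, hi

# Hypothetical inputs: RCT risks 0.49 (treated) and 0.21 (control),
# plus observational joint frequencies of choice x and outcome y.
print(benefit_bounds(0.49, 0.21))                    # ~ (0.28, 0.49)
print(benefit_bounds(0.49, 0.21, obs={"xy": 0.14, "xy'": 0.16,
                                      "x'y": 0.02, "x'y'": 0.68}))  # ~ (0.33, 0.46)
```

With these made-up numbers, adding the observational joint frequencies tightens the range from roughly [0.28, 0.49] to [0.33, 0.46] - the sense in which confounded observational data can contribute.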
Upon reading your sentence again, I think I misunderstood. By disease severity, I thought you were referring to the outcome, but now I think you were referring to a covariate. In that case, what you’d want is a Conditional Average Treatment Effect (CATE): E[Y_1 - Y_0|S], where S is disease severity. So the RCT results would either be grouped by S or a function would be fit with S as input.
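A minimal sketch of the second option, fitting a function with S as input (simulated RCT data with hypothetical names; a linear model with a treatment-by-severity interaction):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
s = rng.uniform(0, 10, size=n)          # disease severity covariate S
t = rng.integers(0, 2, size=n)          # randomized treatment
y = 50 - 2.0 * s + (1.0 + 0.4 * s) * t + rng.normal(0, 5, size=n)

# Outcome model with a treatment-by-severity interaction
X = sm.add_constant(np.column_stack([t, s, t * s]))
fit = sm.OLS(y, X).fit()

# Under this model, CATE(s) = E[Y_1 - Y_0 | S = s] = b_t + b_ts * s
b = fit.params
for s0 in (1.0, 5.0, 9.0):
    print(f"S = {s0:.0f}: estimated CATE = {b[1] + b[3] * s0:.2f}")
```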
Thank you @Scott for answering my questions. You are proposing different ways of reasoning with medical knowledge by invoking some principles of causal inference. @R_cubed points out that these methods are debatable. There is also, of course, much disagreement amongst medical scientists when they advance alternative hypotheses based on different theories and background knowledge. The way forward, of course, is to conduct RCTs and observational studies to test these hypotheses, to find out what happens in practice, and to calibrate probabilities arrived at from RCTs with or without causal inference.
I remain concerned about the different advice that you and I would give a patient based on the results of the RCT and observational study described in your paper. You would assure a female patient that no harm can occur by choosing to take an ‘over-the-counter’ drug, whereas I would warn her that it is unsafe unless taken with the same close supervision as during the RCT. I would be grateful if you could explain why we would give this different advice.
My numbers and hypothetical advice assume that the patient would administer treatment properly (at the same level as was done in the RCT). Of course, if there’s a risk of the patient not administering treatment properly then that has to be taken into account in the advice and discussion about the treatment.
But the observational study shows clear evidence that more die after choosing to take the drug than die after choosing not to take it. This implies that allowing the patient to choose ‘causes’ many more to die (due to an adverse effect from one causal mechanism, such as not taking the drug properly) than are ‘saved’ (by some other causal mechanism suggested by the RCT). Accordingly you surely cannot assume that such an adverse effect will not happen again and then conclude that a future patient choosing to take the drug has a zero probability of dying from the adverse effect. In the light of the observational study, the FDA would surely refuse a license for the proposed use (i.e. allowing the patient to choose). There seems to be a divergence here between our causal inference processes! It seems that, amongst other things, you may not be allowing for the possibility of more than one causal mechanism operating ‘in parallel’ at the same time.
I found a number of excellent videos on this dispute about causal inference that should be of interest.
Speaker bios
The presenters include Larry Wasserman giving the frequentist POV, with Philip Dawid and Finnian Lattimore giving the Bayesian one.
The ones I’ve made it through so far: Larry Wasserman - Problems With Bayesian Causal Inference
His main complaint is that Bayesian credible intervals don’t necessarily have frequentist coverage. I’d simply point out that, from Geisser’s perspective, in large areas of application there is no parameter; parameters are useful fictions. This is evident in extracting risk-neutral implied densities from options prices (i.e. in a bankruptcy or M&A scenario). He also discusses how causal inference questions are more challenging from a computational POV than the simpler versions seen in assessment of interventions.
Philip Dawid - Causal Inference Is Just Bayesian Decision Theory
This is pretty much my intuition on the issue, but Dawid shows how causal inference problems are merely an instance of the more general Bayesian process and equivalent to Pearl’s DAGs, given the assumption of vast amounts of data.
Finnian Lattimore - Causal Inference with Bayes Rule
Excellent discussion of how to do Bayesian (causal) inference with finite data. I’m very sympathetic to her POV that causal inference isn’t all that different from “statistical” inference, but one participant (Carlos Cinelli) kept insisting there is some distinction between “causal inference” and decision-theory applications. They seem to insist that modelling the relations among probability distributions is meta to the “statistical inference” process, while that step seems a natural part of it to a Bayesian.
Panel Discussion: Does causality mean we need to go beyond Bayesian decision theory?
The key segment is the exchange between Lattimore and Carlos Cinelli from the 19:00 mark to 24:10.
From 19:00 to 22:46, Carlos makes the claim that “causal inference” is deducing the functional relations of variables in the model, while statistical inference, for him, is simply an inductive procedure for choosing the most likely distribution from a sample. He calls the deductive process “logical” and, in an effort to make Bayesian inference appear absurd, concludes that “If logical deduction is Bayesian, everything is Bayesian inference.”
From 22:46 to 23:36 Lattimore describes the Bayesian approach as a general procedure for learning, and uses Planck’s constant as an example. In her description, you “define a model linking observations to the Planck’s constant that’s going to be some physics model” using known physical laws.
Cinelli objected to characterizing the use of physical laws to specify variable constraints, asking:
Cinelli: Do you consider the physics part as part of statistics?
Lattimore: Yes. It is modelling.
Cinelli: (Thinking he is scoring a debate point) But it is a physical model, not a statistical model…
Lattimore: (Laughing at the absurdity) It is a physical model with random variables. It is a statistical model.
Lattimore effectively refuted the Pearlian distinction between “statistical inference” and “causal inference.” Does “statistical mechanics” cease to be part of physics because it uses “statistical” models?
The Pearlian causal graphical modeling (CGM) school of thought essentially cannot conceive of the use of informative Bayesian inference, which the physicists R. T. Cox and E. T. Jaynes advocated.
Another talk by David Rohde, with the same title as the paper mentioned above, from March 2022.
David Rohde - Causal Inference is Inference – A beautifully simple idea that not everyone accepts
There are still interesting questions and conjectures I took away from these talks:
Why do frequentist procedures, which ignore information in this context, do so well?
How might a Bayesian use frequentist methods when the obvious Bayesian mechanism is not computable?
How does the Approximate Bayesian Computation (ABC) program relate to frequentist methods?
This paper by Donald Rubin sparked the research into avoiding intensive likelihood computations that characterizes ABC.