The relation between Statistical Inference and Causal Inference

A deep philosophical question came up in the OR vs RR thread.
@AndersHuitfeldt wrote:

Blockquote
My personal view is that trading off bias for precision is “cheating” and gives you uninterpretable inferential statistics. In my opinion, the correct move is to accept that the standard errors will be large unless we can increase sample size. I would rather have an honest but imprecise prediction.

As far as I know, trading bias for variance is rational behavior from the perspective of decision theory. Gelman wrote a short post about the bias-variance tradeoff:

Blockquote
My second problem is that, from a Bayesian perspective, given the model, there actually is a best estimate, or at least a best summary of posterior information about the parameter being estimated. It’s not a matter of personal choice or a taste for unbiasedness or whatever … For a Bayesian, the problem with the “bias” concept is that it is conditional on the true parameter value. But you don’t know the true parameter value. There’s no particular virtue in unbiasedness.

Later on he wrote:

Blockquote
I know that it’s a common view that unbiased estimation is a good thing, and that there is a tradeoff in the sense that you can reduce variance by paying a price in bias. But I disagree with this attitude.
To use classical terminology, I’m all in favor of unbiased prediction but not unbiased estimation. Prediction is conditional on the true theta, estimation is unconditional on theta and conditional on observed data.

I don’t believe there is any contradiction between statistical theory and the causal calculus. Hopefully this thread can show the relationship between the two formal theories.

The Wikipedia link has a decent reference section. Not so sure about the body of the entry, though.

2 Likes

I think the first thing to note is that the term “bias” means different things to statisticians and causal inference methodologists:

  • To statisticians, bias means that the expected value of an estimator (over the sampling distribution) is not equal to the true value of the parameter
  • To causal inference methodologists, bias means that the observational parameter is not equal to the causal estimand

These are not the same concepts. For example, in an observational study with confounding, the sample proportion may be an unbiased estimator of Pr(Y | A) (meaning there is no statistical bias), yet the true population Pr(Y | A) is not equal to Pr(Y(a)) because of confounding bias (non-identifiability of the counterfactuals).
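
To make the distinction concrete, here is a minimal simulation sketch (the variable names, prevalences, and effect sizes are invented for illustration): a binary confounder L drives both treatment A and outcome Y, while A has no effect on Y at all. The sample estimate of Pr(Y = 1 | A = 1) converges to its population target, so there is no statistical bias, yet that target differs from the counterfactual Pr(Y(1) = 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical data-generating process: confounder L affects both A and Y;
# A has NO causal effect on Y, so Pr(Y(1) = 1) = Pr(Y(0) = 1) = Pr(Y = 1).
L = rng.binomial(1, 0.5, n)                        # confounder
A = rng.binomial(1, np.where(L == 1, 0.8, 0.2))    # treatment depends on L
Y = rng.binomial(1, np.where(L == 1, 0.7, 0.3))    # outcome depends on L only

# Observational (associational) quantity: Pr(Y = 1 | A = 1).
p_obs = Y[A == 1].mean()

# Counterfactual quantity Pr(Y(1) = 1): because A does not affect Y here,
# it equals the marginal Pr(Y = 1) = 0.5 * 0.7 + 0.5 * 0.3 = 0.5.
p_causal = Y.mean()

print(f"Pr(Y=1 | A=1) estimate: {p_obs:.3f}")    # ~0.62, unbiased for its own target
print(f"Pr(Y(1)=1):             {p_causal:.3f}") # ~0.50, a different quantity
```

The first number is a perfectly good (statistically unbiased) estimate of the observational parameter; it is simply not the causal estimand.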

For commonly used estimators, statistical bias is generally well-behaved, well-understood, quantifiable and small. In contrast, causal bias is much messier and much more problematic.

My impression is that when statisticians talk about the bias-variance tradeoff, they generally have the statistical definition of bias in mind. Their arguments may well make sense for this type of bias, but causal bias is a completely different beast. Using efficiency gains as a justification for allowing causal bias is just terrifyingly wrong.

In practice, standard errors are used to quantify uncertainty due to sampling variability. There is something really messed up about allowing bias in the analysis because we would like to pretend that the uncertainty due to sampling error is lower than what it actually is.

If I can use a Bayesian metaphor in a frequentist discussion: Given a choice between having a well-calibrated posterior and a highly informative (but uncalibrated) posterior, what would you choose? Would you really trade off calibration in order to reduce the variance of the posterior?

4 Likes

Since no one has yet taken a shot at responding, I’ll offer the following thoughts on how to decide this type of question in a principled fashion.

I don’t see a sharp distinction between “causal” models and “statistical” models. For the purposes of this discussion, I will take the position of Seymour Geisser, who explored Bayesian predictive models in the 1970s (if not earlier).

From his book Predictive Inference: An Introduction:

Blockquote
Currently most statistical analyses generally involve inferences or decisions about parameters or indexes of statistical distributions. It is the view of the author that analyzers of data would better serve their clients if inferences or decisions were couched in a predictivistic framework.

In this point of view, “parameters” are useful fictions to help in the development of prediction models.

If this is agreed to, that places both “causal” models and “statistical” models within the underlying framework of information theory, where they can be compared on the accuracy of their predictions against actual observations.

In this predictive framework, there is training data and test data. Bias in the predictive framework can be seen as a function of model complexity. Simple models will not fit the training data all that well.

But complexity is not always a good thing. One can always fit a function that matches a data set exactly, but such a fit has no value if it doesn’t predict future observations well.

In my copy of Elements of Statistical Learning the authors write:

Blockquote
Training error tends to decrease whenever we increase model complexity [reduce bias, my emphasis]… However, with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (have large test error). In that case the predictions will have a large variance. In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization.
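
As a toy illustration of that passage (entirely synthetic data, with polynomial degree standing in for model complexity):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)   # hypothetical true curve plus noise
    return x, y

x_train, y_train = simulate(30)
x_test, y_test = simulate(1000)

for degree in (1, 3, 9, 15):
    coefs = np.polyfit(x_train, y_train, degree)               # fit to training data
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Training MSE keeps falling as the degree (complexity) grows, while test MSE
# is typically U-shaped: large at degree 1 (underfit, high bias), smallest near
# degree 3, and large again at degree 15 (overfit, high variance).
```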

This goes contrary to your statement in the other thread:

Blockquote
My personal view is that trading off bias for precision is “cheating” and gives you uninterpretable inferential statistics.

Models that make this tradeoff lead to predictions of future data, and I don’t know how to interpret a model other than through its predictive capabilities.

In my predictive inference philosophy, a good causal model can be expressed as constraints on probability distributions that reduce model complexity, or lead to better predictions relative to model complexity. Complex models in essence “pay for themselves.”

Statistical models are evaluated on their ability to predict outcomes when all variables take their natural values. Causal models make predictions under intervention, i.e. counterfactual predictions.

There is a sharp distinction between counterfactual prediction and standard prediction, and it needs separate terminology. The kind of bias I am talking about just cannot be defined in observational statistical terms. An ontology based only on observables just isn’t rich enough to capture the underlying concepts.

3 Likes

Blockquote
Statistical models are evaluated on their ability to predict outcomes when all variables take their natural values. Causal models make predictions under intervention, i.e. counterfactual predictions.

This is hard for me to accept, considering how much thought Fisher put into the design of experiments to show that a particular intervention caused an effect (as in regression models), as well as the debate over smoking as a cause of lung cancer, which was ultimately settled by overwhelming observational evidence.

People have been concerned about causation far longer than the formal causal calculus has existed.

But that is far from my initial point. All models are constrained by the framework of information theory, and we can use information theory to evaluate models, regardless of their source. Do you agree or disagree with this framework?

2 Likes

I don’t really know enough about information theory to answer that. But there is a difference between evaluating the mutual information between the exposure and a counterfactual, and evaluating the mutual information between the exposure and an observable. I can say confidently that information theory does not allow you to bypass this distinction.

The following paper might interest you in that it discusses causal models in the context of Bayesian inference. Causal inference is connected to Bayesian inference via the fundamental notion of exchangeability.

1 Like

New paper in JAMA that introduces causal graphs; I saw it mentioned by @sander on twitter.
Lipsky, A. M., & Greenland, S. (2022). Causal Directed Acyclic Graphs. JAMA. link

A more fundamental intro on algorithmic information theory by Nick Szabo that is worth preserving can be found below. His main scholarship is far afield from medical research, but much closer to another interest of mine – monetary economics and insurance.

Causal inference is directly linked to algorithmic information theory via the concept of “minimum description length.” A correct causal model will produce better predictions of future data (assuming the sample size is adequate) than any model that omits the causal factor. It will also balance parsimony (descriptive complexity) with predictive accuracy. Proceeding as Geisser recommends above (and building upon J. L. Kelly’s paper A New Interpretation of Information Rate (pdf)), those who understand causal factors will win bets against those who do not.
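
Here is a toy sketch of that Kelly-style argument (all of the numbers are invented): two bettors repeatedly stake an even-money bet on a binary event whose probability depends on a causal factor. The bettor whose model conditions on that factor bets the Kelly fraction implied by the correct conditional probability; the bettor who only knows the marginal probability never sees a favourable bet.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 20_000

# Hypothetical setup: a causal factor X shifts the win probability of an
# even-money bet between 0.7 (X = 1) and 0.3 (X = 0).
X = rng.binomial(1, 0.5, T)
p_true = np.where(X == 1, 0.7, 0.3)
event = rng.binomial(1, p_true)                  # 1 = the event happens

def kelly_log_growth(p_believed, event):
    """Cumulative log wealth from betting the Kelly fraction for an
    even-money bet, backing whichever side p_believed favours."""
    f = np.abs(2 * p_believed - 1)               # Kelly fraction at even odds
    win = np.where(p_believed >= 0.5, event == 1, event == 0)
    return np.sum(np.log1p(np.where(win, f, -f)))

# Bettor A conditions on the causal factor; bettor B uses only the marginal
# probability (0.5) and therefore never bets (Kelly fraction zero).
print("log wealth growth, causal model:  ", kelly_log_growth(p_true, event))
print("log wealth growth, marginal model:", kelly_log_growth(np.full(T, 0.5), event))
```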

Nick Szabo (1996) Introduction to Algorithmic Information Theory

Blockquote
The combinations that can arise in a long series of coin flips can be divided into regular sequences, which are highly improbable, and irregular sequences, which are vastly more numerous. Wherever we see symmetry or regularity, we seek a cause. Compressibility implies causation.

3 Likes

Interesting. Does this imply that the model that is best for causal inference will also be the model with the best predictive accuracy given the same features? I.e., assuming you have the necessary confounders, plus other causes that may not be confounders, and no colliders/mediators, does the “best” model for (causal) inference and for prediction coincide?

If something like this is the case, then it goes against the inference vs prediction dichotomy. A lot of stats and DS profs emphasize the differences between predictive and inferential models, but if we are to go by the newer causal inference material with DAGs, G-methods, and information theory, it seems to contradict this traditional viewpoint. Even with the G-methods you can do inference on black-box models, provided the set of variables you choose is correct.

There is still a distinction between inference and prediction that Efron describes here:

Efron, Bradley. “Prediction, estimation, and attribution.” International Statistical Review 88 (2020): S28-S59. (link)

I can’t see this paper as I am not on a college campus, but I think I have seen it before. My argument against it is that, outside of, for example, pure chemistry and physics, which are more deterministic (with random error often due only to the measurement equipment, such as an NMR scanner or spectroscopy, or potentially even the quantum level, which is very small in the grand scheme of things), we never know the true data-generating process (DGP). How can you ensure your inference is accurate in spite of model-specification issues in the structural equations?

Absent the true DGP, it makes sense to combine domain knowledge (via the DAG) with pure prediction algorithms, to do both causal inference and prediction. But fitting a naive linear additive (in x) model, as is often done, is not the solution. Somehow everybody in applied fields knows about normality but ignores linearity, which is far more important (Dr. Harrell tries to address this with splines; ML and DL are just an extension of this, and one theory is that a conv net with ReLU activations is doing dimension reduction plus piecewise-linear spline interpolation).

Mechanistic models will always be preferred, but as a statistician, can you really come up with a mechanistic (and thus parametric) nonlinear model? Take Michaelis-Menten kinetics, for example. That model is often taught in biochem as “invert the X and Y” without paying attention to what that does to the error. That’s where the statistician would come in and suggest, for example, a Normal GLM with an inverse link, or NLS, rather than OLS on the transformed data (see the sketch below). That model is well established, but for novel settings (what if you had multiple enzymes in a network?) it’s something to work out with domain experts, and then you as a statistician would deal with the model optimization and interpretation aspects.
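
Here is a small sketch of that Michaelis-Menten point (simulated data, made-up parameter values): fit the curve directly by nonlinear least squares, then compare with OLS on the Lineweaver-Burk double-reciprocal transform, which distorts the error structure.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import linregress

rng = np.random.default_rng(3)

def michaelis_menten(S, Vmax, Km):
    return Vmax * S / (Km + S)

# Simulated kinetics data with hypothetical "true" parameters and additive noise.
Vmax_true, Km_true = 10.0, 2.0
S = np.array([0.25, 0.5, 1, 2, 4, 8, 16, 32], dtype=float)
v = michaelis_menten(S, Vmax_true, Km_true) + rng.normal(0, 0.4, S.size)

# 1) Nonlinear least squares on the original scale.
(Vmax_nls, Km_nls), _ = curve_fit(michaelis_menten, S, v, p0=[8.0, 1.0])

# 2) Lineweaver-Burk: "invert the X and Y", then ordinary least squares.
lb = linregress(1 / S, 1 / v)
Vmax_lb = 1 / lb.intercept
Km_lb = lb.slope * Vmax_lb

print(f"true    : Vmax = {Vmax_true:.2f}, Km = {Km_true:.2f}")
print(f"NLS     : Vmax = {Vmax_nls:.2f}, Km = {Km_nls:.2f}")
print(f"L-B OLS : Vmax = {Vmax_lb:.2f}, Km = {Km_lb:.2f}")
# The reciprocal transform lets the noisiest points (small S, small v) dominate
# the fit, so the transformed-OLS estimates are usually the worse of the two.
```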

In fact, there are people working on combining neural nets with domain-knowledge-based models in the PK/PD space: Pumas-AI is doing that: https://pumas.ai/company/careers/. From what I understand, they are using NNs for the “unknown” components and combining them with standard nonlinear mixed models as an inductive bias that incorporates domain knowledge.

Increasingly, in my opinion, prediction and inference are converging rather than being vastly different. I think a lot of the Breiman-esque “two cultures” stuff is increasingly outdated as these methods are developed.

I found a number of excellent videos on this dispute about causal inference that should be of interest.
They were initially posted in this thread:

Videos

Speaker bios

The presenters include Larry Wasserman, giving the frequentist POV, and Philip Dawid and Finnian Lattimore, giving the Bayesian one.

The ones I’ve made it through so far:
Larry Wasserman - Problems With Bayesian Causal Inference

His main complaint is that Bayesian credible intervals don’t necessarily have frequentist coverage. I’d simply point out that, from Geisser’s perspective, in large areas of application there is no parameter.

Parameters are useful fictions. This is evident in extracting risk-neutral implied densities from options prices (i.e., in a bankruptcy or M&A scenario). But he also discusses how causal inference questions are more challenging from a computational POV than the simpler versions seen in the assessment of interventions.

Philip Dawid - Causal Inference Is Just Bayesian Decision Theory

This is pretty much my intuition on the issue, but Dawid shows how causal inference problems are merely an instance of the more general Bayesian process and are equivalent to Pearl’s DAGs under the assumption of vast amounts of data.

Finnian Lattimore - Causal Inference with Bayes Rule

Excellent discussion of how to do Bayesian (causal) inference with finite data. I’m very sympathetic to her POV that causal inference isn’t all that different from “statistical” inference, but one participant (Carlos Cinelli) kept insisting there is some distinction between “causal inference” and decision-theoretic applications. That school seems to insist that modelling the relations among probability distributions is meta to the “statistical inference” process, while to a Bayesian that step seems a natural part of it.

Panel Discussion: Does causality mean we need to go beyond Bayesian decision theory?

The key segment is the exchange between Lattimore and Carlos Cinelli from the 19:00 mark to 24:10.

From 19:00 to 22:46, Carlos made the claim that “causal inference” is deducing the functional relation of the variables in the model. Statistical inference, for him, is simply an inductive procedure for choosing the most likely distribution from a sample. He calls the deductive process “logical” and, in an effort to make Bayesian inference appear absurd, concludes that “If logical deduction is Bayesian, everything is Bayesian inference.”

From 22:46 to 23:36 Lattimore describes the Bayesian approach as a general procedure for learning, and uses Planck’s constant as an example. In her description, you “define a model linking observations to the Planck’s constant that’s going to be some physics model” using known physical laws.

Cinelli objected to characterizing the use of physical laws to specify constraints on the variables as part of statistics, asking:

Cinelli: Do you consider the physics part, as part of statistics?
Lattimore: Yes. It is modelling.
Cinelli: (Thinking he is scoring a debate point) But it is a physical model, not a statistical model…
Lattimore: (Laughing at the absurdity) It is a physical model with random variables. It is a statistical model.

Lattimore effectively refuted the Pearlian distinction between “statistical inference” and “causal inference.” Does “statistical mechanics” cease to be part of physics because of the use of “statistical” models?

The Pearlian CGM (causal graphical model) school of thought essentially cannot conceive of the use of informative Bayesian inference, which the physicists R. T. Cox and E. T. Jaynes advocated.

Another talk by David Rohde with the same title as the paper mentioned above. This was from March 2022.

David Rohde - Causal Inference is Inference – A beautifully simple idea that not everyone accepts

Causal Inference with Bayes rule with Finnian Lattimore and David Rohde

There are still interesting questions and conjectures I got from these talks:

  • Why do frequentist procedures, which ignore information in this context, do so well?
  • How might a Bayesian use frequentist methods when the obvious Bayesian mechanism is not
    computable?
  • How does the Approximate Bayesian Computation program (ABC) relate to frequentist methods?

This paper by Donald Rubin sparked the line of research on bypassing intensive likelihood computations that later developed into ABC.

A relatively recent primer on ABC

Here are some links to papers I posted in this thread:

5 Likes

Nick’s bolded statement is too strong for mere compressibility, but it is accurate when taken to the limit of Kolmogorov complexity (i.e., in the context of algorithmic information), and profoundly so.

Here’s what I mean:

Let’s say you have a scatter plot showing a linear correlation between two variables. The old Statistics 101 saw that “correlation doesn’t imply causation” holds if that is all you have: the data can be compressed simply by subtracting the predicted value of the dependent variable, yielding smaller-magnitude residuals that can be represented in fewer bits than the original dependent values, as sketched below.
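
A rough sketch of that compression claim (synthetic data, with quantization plus zlib standing in for an ideal coder): the residuals left after regressing the dependent variable on the independent one compress to fewer bytes than the raw dependent values.

```python
import zlib
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

# Hypothetical linearly correlated data.
x = rng.normal(0, 10, n)
y = 2.0 * x + rng.normal(0, 1, n)

def compressed_bytes(values, step=0.1):
    """Quantize to a fixed grid and measure the zlib-compressed size,
    a crude stand-in for description length."""
    q = np.round(values / step).astype(np.int32)
    return len(zlib.compress(q.tobytes()))

slope, intercept = np.polyfit(x, y, 1)          # the noticed "correlation"
residuals = y - (slope * x + intercept)

print("bytes to encode raw y:    ", compressed_bytes(y))
print("bytes to encode residuals:", compressed_bytes(residuals))
# Given x and the fitted line, the residuals span a much narrower range than y,
# so they can be stored in far fewer bits; the regression compresses the data.
```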

However, and this is where things get profound, all of our scientific models that predict observations start as merely noticing such “correlations” (which is to say, they start as statistics) and, upon further compression into algorithms simulating the generative processes of reality, turn into time-invariant dynamics. Dynamical laws are inherently recursion laws: they map one data point onto the next data point, and our laws of nature map one state of nature onto the next state of nature. We can therefore only speak of causality in terms of such recursive or dynamical laws.

1 Like

My gut impression is that this explanation is overly complex. I admit to being biased by unfamiliarity with the terms you use.

Perhaps this “Nature Machine Learning Communications” animation will be simpler, but I’m afraid it will not adequately describe the boundary between statistics and dynamics.

Here is an interesting paper that discusses an approach to causal inference via compression using algorithmic information theory.

Paraphrasing from the paper:
Essentially, if X → Y (→ means “causes”), then by the principle of minimum description length it is easier to describe the pair by describing X first and then Y given X than to go the other way around: the two factorizations, K(X) + K(Y|X) and K(Y) + K(X|Y), are no longer interchangeable in description length when there is a causal relation, and the causal direction yields the shorter one. This next paper goes into the details and was cited in the article above.
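
In the algorithmic-Markov-condition literature this asymmetry is usually stated for the distributions rather than for the individual data strings (my summary of that line of work, not a quote from either linked paper), with constant terms ignored:

$$
K\big(P(X)\big) + K\big(P(Y \mid X)\big) \;\le\; K\big(P(Y)\big) + K\big(P(X \mid Y)\big) \quad \text{if } X \rightarrow Y,
$$

so the factorization that follows the causal direction admits the shortest total description.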

I found a worthwhile video by @AndersHuitfeldt on his work on the switch risk ratio and causal inference more generally.

The tools for causal inference seem to me a useful complement to the Bayesian decision-theoretic approach to causal inference that the physical scientists in the other videos seem to take for granted.

BEMC FEB 2020 - Anders Huitfeldt - “A New Approach to the Generalizability of Randomized Trials”

Related:

Miguel A. Hernán, John Hsu & Brian Healy (2019) A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks, CHANCE, 32:1, 42-49, DOI: 10.1080/09332480.2019.1579578

1 Like