The relation between Statistical Inference and Causal Inference

A deep philosophical question came up in the OR vs RR thread.
@AndersHuitfeldt wrote:

My personal view is that trading off bias for precision is “cheating” and gives you uninterpretable inferential statistics. In my opinion, the correct move is to accept that the standard errors will be large unless we can increase sample size. I would rather have an honest but imprecise prediction.

As far as I know, trading bias for variance is rational behavior from the perspective of decision theory. Gelman wrote a short post about the bias-variance tradeoff:

My second problem is that, from a Bayesian perspective, given the model, there actually is a best estimate, or at least a best summary of posterior information about the parameter being estimated. It’s not a matter of personal choice or a taste for unbiasedness or whatever … For a Bayesian, the problem with the “bias” concept is that it is conditional on the true parameter value. But you don’t know the true parameter value. There’s no particular virtue in unbiasedness.

Later on he wrote:

I know that it’s a common view that unbiased estimation is a good thing, and that there is a tradeoff in the sense that you can reduce variance by paying a price in bias. But I disagree with this attitude.
To use classical terminology, I’m all in favor of unbiased prediction but not unbiased estimation. Prediction is conditional on the true theta, estimation is unconditional on theta and conditional on observed data.

I don’t believe there is any contradiction between statistical theory and the causal calculus. Hopefully this thread can show the relationship between the two formal theories.
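To make the decision-theoretic point concrete, here is a minimal sketch (my own toy example, not taken from Gelman's post): a deliberately biased shrinkage estimator can beat the unbiased sample mean on mean squared error, because the variance reduction more than pays for the squared bias.

```python
import random
import statistics

random.seed(0)
true_mu = 0.5   # the estimand; known here only because we simulate
n, reps = 10, 5000

mse_unbiased = 0.0
mse_shrunk = 0.0
for _ in range(reps):
    sample = [random.gauss(true_mu, 1.0) for _ in range(n)]
    xbar = statistics.fmean(sample)   # unbiased estimator of true_mu
    shrunk = 0.8 * xbar               # biased: E[shrunk] = 0.8 * true_mu
    mse_unbiased += (xbar - true_mu) ** 2
    mse_shrunk += (shrunk - true_mu) ** 2

mse_unbiased /= reps   # ~ Var(xbar) = 1/n = 0.1
mse_shrunk /= reps     # ~ 0.64/n + (0.2 * true_mu)^2 = 0.074
```

The shrinkage factor 0.8 is arbitrary; the point is only that an estimator with nonzero (statistical) bias can dominate the unbiased one on MSE, which is the sense in which the tradeoff is "rational behavior" under a squared-error loss.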

The Wikipedia link has a decent reference section. Not so sure about the body of the entry, though.


I think the first thing to note is that the term “bias” means different things to statisticians and causal inference methodologists:

  • To statisticians, bias means that the expected value of an estimator (over the sampling distribution) is not equal to the true value of the parameter
  • To causal inference methodologists, bias means that the observational parameter is not equal to the causal estimand

These are not the same concepts. For example, in an observational study with confounding, the sample proportion may be an unbiased estimator of Pr(Y | A) (meaning there is no statistical bias), but the true population Pr(Y | A) is not equal to Pr(Y(a)) due to confounding bias (non-identifiability of the counterfactuals).
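A toy simulation (hypothetical numbers, chosen only for illustration) makes the distinction concrete: the sample proportions converge to the true observational Pr(Y = 1 | A = a), so there is no statistical bias, yet those quantities differ sharply from the interventional Pr(Y(a) = 1) because an unmeasured confounder drives both exposure and outcome.

```python
import random

random.seed(1)

N = 200_000
y_given_a1, y_given_a0 = [], []
for _ in range(N):
    u = random.random() < 0.5                  # unmeasured confounder
    a = random.random() < (0.8 if u else 0.2)  # U raises the chance of exposure
    y = random.random() < (0.9 if u else 0.1)  # Y depends on U only: A has NO effect
    (y_given_a1 if a else y_given_a0).append(y)

p1 = sum(y_given_a1) / len(y_given_a1)  # consistent for Pr(Y=1 | A=1) ~ 0.74
p0 = sum(y_given_a0) / len(y_given_a0)  # consistent for Pr(Y=1 | A=0) ~ 0.26
# The causal estimands are Pr(Y(1)=1) = Pr(Y(0)=1) = 0.5: the exposure does
# nothing, yet the statistically unbiased conditional probabilities differ.
```

The gap between 0.74 and 0.26 here is entirely causal bias; no amount of extra sampling shrinks it.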

For commonly used estimators, statistical bias is generally well-behaved, well-understood, quantifiable and small. In contrast, causal bias is much messier and much more problematic.

My impression is that when statisticians talk about the bias-variance tradeoff, they generally have the statistical definition of bias in mind. Their arguments may well make sense for this type of bias, but causal bias is a completely different beast. Using efficiency gains as a justification for allowing causal bias is just terrifyingly wrong.

In practice, standard errors are used to quantify uncertainty due to sampling variability. There is something really messed up about allowing bias in the analysis because we would like to pretend that the uncertainty due to sampling error is lower than what it actually is.
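As a sketch of why this is "messed up" (again my own toy example): if we adopt a shrunken, biased estimator and report its genuinely smaller standard error, the nominal 95% intervals can cover the truth far less often than advertised.

```python
import random
import statistics

random.seed(2)
true_mu, sigma, n, reps = 2.0, 1.0, 25, 4000
z = 1.96
se = sigma / n ** 0.5        # standard error of the sample mean
se_shrunk = 0.7 * se         # the biased estimator really does have a smaller SE

cover_unbiased = cover_shrunk = 0
for _ in range(reps):
    xbar = statistics.fmean(random.gauss(true_mu, sigma) for _ in range(n))
    shrunk = 0.7 * xbar      # shrinkage toward zero: lower variance, but biased
    cover_unbiased += abs(xbar - true_mu) <= z * se
    cover_shrunk += abs(shrunk - true_mu) <= z * se_shrunk

coverage_unbiased = cover_unbiased / reps  # ~0.95, as advertised
coverage_shrunk = cover_shrunk / reps      # collapses far below 0.95
```

The reported uncertainty is smaller, but the intervals are no longer honest: calibration has been traded away for apparent precision.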

If I can use a Bayesian metaphor in a frequentist discussion: Given a choice between having a well-calibrated posterior and a highly informative (but uncalibrated) posterior, what would you choose? Would you really trade off calibration in order to reduce the variance of the posterior?


Since no one has yet taken a shot at responding, I’ll offer the following thoughts on how to decide this type of question in a principled fashion.

I don’t see a sharp distinction between “causal” models vs “statistical” models. For the purposes of this discussion, I will take the position of Seymour Geisser, who explored Bayesian predictive models in the 1970s (if not earlier).

From his book Predictive Inference: an Introduction:

Currently most statistical analyses generally involve inferences or decisions about parameters or indexes of statistical distributions. It is the view of the author that analyzers of data would better serve their clients if inferences or decisions were couched in a predictivistic framework.

In this point of view, “parameters” are useful fictions to help in the development of prediction models.

If this is agreed to, that places both “causal” models and “statistical” models within the underlying framework of information theory, where they can be compared on the accuracy of predictions to actual observations.

In this predictive framework, there is training data and test data. Bias can then be seen as a function of model complexity. Simple models will not fit the training data all that well.

But complexity is not always a good thing. One can always fit a function to a data set, but the fit has no value if it does not predict future observations well.
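This tradeoff can be sketched in a few lines (arbitrary simulated data, assuming NumPy is available): training error keeps falling as polynomial degree grows, while error on fresh data from the same process does not.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 20)
truth = np.sin(2 * np.pi * x)
y_train = truth + rng.normal(0, 0.3, x.size)
y_test = truth + rng.normal(0, 0.3, x.size)   # fresh noise, same signal

train_err, test_err = {}, {}
for deg in (1, 3, 9):
    coefs = np.polyfit(x, y_train, deg)       # least-squares polynomial fit
    pred = np.polyval(coefs, x)
    train_err[deg] = float(np.mean((pred - y_train) ** 2))
    test_err[deg] = float(np.mean((pred - y_test) ** 2))
# Training error falls monotonically with degree; test error typically
# bottoms out at a moderate degree and grows once the fit chases the noise.
```

Degree 1 underfits the sine badly (high bias), while degree 9 on 20 points starts chasing the noise; only the test error reveals the difference.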

In my copy of Elements of Statistical Learning the authors write:

Training error tends to decrease whenever we increase model complexity [reduce bias, my emphasis] … However, with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (have large test error). In that case the predictions will have a large variance. In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization.

This goes contrary to your statement in the other thread:

My personal view is that trading off bias for precision is “cheating” and gives you uninterpretable inferential statistics.

Models that make this trade produce predictions of future data. I don’t know how to interpret a model other than through its predictive capabilities.

In my predictive inference philosophy, a good causal model can be expressed as constraints on probability distributions that reduce model complexity, or lead to better predictions relative to model complexity. Complex models in essence “pay for themselves.”
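As a minimal illustration of “constraints that reduce model complexity” (a toy parameter count, assuming three binary variables):

```python
# Three binary variables (U, A, Y): how many free parameters?
full_joint = 2 ** 3 - 1   # 7: an unrestricted joint distribution
# A causal structure U -> A -> Y implies that Y is independent of U given A,
# so the joint factorizes as P(U) * P(A|U) * P(Y|A):
factored = 1 + 2 + 2      # 5: the causal constraint buys a simpler model
```

In this sense a causal assumption, if it is right, “pays for itself”: it restricts the family of distributions and thereby lowers the complexity that the predictive framework has to pay for.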

Statistical models are evaluated based on their ability to predict if all variables take their natural value. Causal models make predictions under intervention, i.e. counterfactual predictions.

There is a sharp distinction between counterfactual prediction and standard prediction, and it needs separate terminology. The kind of bias I am talking about just cannot be defined in observational statistical terms. An ontology based only on observables just isn’t rich enough to capture the underlying concepts.


Statistical models are evaluated based on their ability to predict if all variables take their natural value. Causal models make predictions under intervention, i.e. counterfactual predictions.

This is hard for me to accept, given how much thought Fisher put into the design of experiments to show that a particular intervention caused an effect (as in regression models), as well as the debate over smoking as a cause of lung cancer, which was ultimately settled by overwhelming observational evidence.

People have been concerned about causation far longer than the formal causal calculus has existed.

But that is far from my initial point. All models are constrained by the framework of information theory, and we can use information theory to evaluate models, regardless of their source. Do you agree or disagree with this framework?


I don’t really know enough about information theory to answer that. But there is a difference between evaluating the mutual information between the exposure and a counterfactual, and evaluating the mutual information between the exposure and an observable. I can say confidently that information theory does not allow you to bypass this.
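To sketch that distinction (reusing a hypothetical confounded model in which the exposure has no causal effect): the mutual information between exposure and the observable outcome is positive, while under a randomized intervention, which cuts the confounder's influence on the exposure, it is exactly zero.

```python
from math import log2

def mi(joint):
    """Mutual information (bits) for a joint over binary (a, y) pairs."""
    pa = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}
    py = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    return sum(p * log2(p / (pa[a] * py[y]))
               for (a, y), p in joint.items() if p > 0)

def bern(p, v):
    """P(V = v) for a Bernoulli(p) variable."""
    return p if v else 1 - p

p_u = 0.5                # P(U = 1), an unmeasured common cause
p_a = {1: 0.8, 0: 0.2}   # P(A = 1 | U = u)
p_y = {1: 0.9, 0: 0.1}   # P(Y = 1 | U = u): A has no causal effect on Y

# Observational joint: P(a, y) = sum_u P(u) P(a | u) P(y | u)
obs = {(a, y): sum(bern(p_u, u) * bern(p_a[u], a) * bern(p_y[u], y)
                   for u in (0, 1))
       for a in (0, 1) for y in (0, 1)}

# Interventional joint under randomized do(A = a): the U -> A arrow is cut,
# so A is independent of Y and P(y | do(a)) = sum_u P(u) P(y | u) = 0.5
do = {(a, y): 0.5 * sum(bern(p_u, u) * bern(p_y[u], y) for u in (0, 1))
      for a in (0, 1) for y in (0, 1)}

mi_obs = mi(obs)   # positive: A is informative about the observable Y
mi_do = mi(do)     # zero: A carries no information under intervention
```

The observational channel carries roughly 0.17 bits here even though the intervention carries none, which is the sense in which information about an observable cannot substitute for information about a counterfactual.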

The following paper might interest you in that it discusses causal models in the context of Bayesian inference. Causal inference is connected to Bayesian inference via the fundamental notion of exchangeability.
