Is coefficient interpretation unnecessary even for causal inference on observational data (via G-computation)?

I’ve been learning a lot about causal inference recently and it’s been fascinating. Of course, part of causal inference is identifying the proper adjustment set based on the DAG. So let’s say you have done all that and what remains is purely an estimation problem, where one of the issues is that the functional form is unknown.

ML is often criticized for its “black box” models. In many causal inference sources I see the “G-computation” technique for marginal estimation. Basically:

  1. Fit your model Y = f(X,W) (X is your treatment, W is the adjustment set)

  2. Make two copies of your dataset, one with X artificially set to 1 for everyone and one with X set to 0, and make predictions on each (on the response scale, if there is a link function as in logistic, Gamma, etc.). For continuous X, perturb it by ± eps.

  3. Take the mean difference (or ratio, etc depending on what you want the causal contrast units to be). In the case of continuous X, you can take the difference and divide by 2*eps (basically a numerical derivative)

  4. Use the delta method (this is done for GLMs in packages like marginaleffects: “Marginal Effects, Marginal Means, Predictions, and Contrasts”) or the bootstrap for the uncertainty. If a Bayesian method was used, you automatically have the uncertainty from the posterior of (3).
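Steps 1 to 3 can be sketched on a toy example. This is only an illustration under stated assumptions: the data are made up (binary W, X and Y), the "model" is the saturated one (the observed mean of Y in each X-by-W cell) rather than a GLM or an ML fit, and step 4 (uncertainty) is left out. All function names are hypothetical.

```python
# Toy G-computation (standardization) with a saturated outcome model.
# Hypothetical data: W = binary confounder, X = binary treatment, Y = binary outcome.
from collections import defaultdict

def make_rows(w, x, n, n_events):
    """Build n rows (w, x, y), of which n_events have y = 1."""
    return [(w, x, 1 if i < n_events else 0) for i in range(n)]

data = (
    make_rows(0, 0, 40, 4)     # W=0, untreated: observed risk 0.10
    + make_rows(0, 1, 10, 2)   # W=0, treated:   observed risk 0.20
    + make_rows(1, 0, 10, 5)   # W=1, untreated: observed risk 0.50
    + make_rows(1, 1, 40, 28)  # W=1, treated:   observed risk 0.70
)

def fit_cell_means(rows):
    """Step 1: 'fit' f(X, W) as the observed mean of Y in each (x, w) cell."""
    total, count = defaultdict(float), defaultdict(int)
    for w, x, y in rows:
        total[(x, w)] += y
        count[(x, w)] += 1
    return {cell: total[cell] / count[cell] for cell in count}

def g_computation_risk_difference(rows):
    f = fit_cell_means(rows)
    # Step 2: predict for every row with X forced to 1, then forced to 0.
    pred_treated = [f[(1, w)] for w, _, _ in rows]
    pred_control = [f[(0, w)] for w, _, _ in rows]
    # Step 3: average each set of predictions and take the contrast.
    n = len(rows)
    return sum(pred_treated) / n - sum(pred_control) / n

def naive_risk_difference(rows):
    """Unadjusted comparison of treated vs untreated, for contrast."""
    treated = [y for _, x, y in rows if x == 1]
    control = [y for _, x, y in rows if x == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

ate = g_computation_risk_difference(data)
naive = naive_risk_difference(data)
print(f"G-computation risk difference: {ate:.3f}")
print(f"Unadjusted risk difference:    {naive:.3f}")
```

On this toy data the standardized risk difference is 0.15 while the unadjusted comparison gives 0.42, because treatment is much more common in the high-risk W=1 stratum. Nothing in the procedure looks at a coefficient; swapping `fit_cell_means` for any other prediction function would leave steps 2 and 3 unchanged.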

This procedure seemingly doesn’t even require one to have knowledge of the coefficients/parameters. I could in theory do G-computation on a black-box algorithm.

It seems incredibly freeing too, because now even for GLMs you can include splines, interactions, various transformations, etc., regularize the higher-order effects, and still estimate the causal effect via the procedure above without constructing a particularly “interpretable” model. You also don’t need to worry about “what is a log odds”. I don’t know of anyone who naturally thinks in quantities like ORs (say, someone who didn’t know about logistic regression at all), while a risk difference, or the effect of X on the probability of Y, seems more interpretable anyway.

Is it too good to be true? It seems like the causal inference people really push this approach as the “closest” to the truth and that it is superior to traditional regression.

With regard to ML, there even seems to be the Targeted Maximum Likelihood Estimation (TMLE) method (see “About this book | Targeted Learning in R”), which is more advanced than the above but still does not require the model to be interpretable to estimate the causal effect. Based on my understanding, the idea is that if you get the functional form wrong, your estimate is not causal, so you should use ML to estimate f.

Seems like these recent advances in causal inference are shaking the prediction/inference dichotomy.

I’ve yet to see a write-up on the downsides of G-methods though, hence my curiosity.

There’s a lot to like, and TMLE may be able to get you covariate-specific estimates. The downside to the general approach is that it results in a misleading estimate since the estimate is marginalized over heterogeneous subjects. Risk differences are highly covariate-specific. An average risk difference will not apply to sicker-than-average patients. For this reason it is usually best to show the whole distribution of risk differences and to use explicit functions of covariates. More here and here.
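A rough numerical illustration of why risk differences are covariate-specific. It assumes, purely for the sake of the sketch, that treatment multiplies every patient's odds of the outcome by the same factor; the odds ratio and baseline risks below are made-up values, not taken from any study.

```python
def risk_difference(baseline_risk, odds_ratio):
    """Risk difference implied by a fixed odds ratio at a given baseline risk."""
    odds0 = baseline_risk / (1.0 - baseline_risk)
    odds1 = odds_ratio * odds0
    treated_risk = odds1 / (1.0 + odds1)
    return treated_risk - baseline_risk

ODDS_RATIO = 0.5  # hypothetical treatment effect, identical for every patient
baseline_risks = [0.02, 0.05, 0.20, 0.50]  # hypothetical untreated risks

rds = [risk_difference(p, ODDS_RATIO) for p in baseline_risks]
average_rd = sum(rds) / len(rds)

for p, rd in zip(baseline_risks, rds):
    print(f"baseline risk {p:.2f} -> risk difference {rd:+.3f}")
print(f"average risk difference: {average_rd:+.3f}")
```

Even with an identical relative effect for everyone, the absolute benefit grows with baseline risk here, so the single averaged number understates the benefit for the sickest patients and overstates it for the healthiest.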

Thanks, I’ll take a look at those articles.

I guess what I wonder is whether the traditional simple approach of assuming additivity is also misleading, right? Some kind of heterogeneity can mostly be taken as a given, and the extremely common additive, linear-in-predictors model doesn’t capture any of it; it assumes from the start that there is no effect modification. G-comp on models with nonlinearity is at least an improvement over this.

It seems like, at least with G-comp, you get the one-number summary but also the option to report individual-specific effects and plot their distribution. I don’t see the marginalization over other variables as a huge problem: if you knew nothing else about a new individual (not in this study) and just randomly selected them, it seems like your best estimate of the effect of the treatment would be the average marginal effect.

If you happened to know beforehand that this patient was sicker, then you could always calculate a “personalized ATE” just for that individual by plugging their known covariates into the model and marginalizing over any unknowns.
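A minimal sketch of what that could look like, assuming a fitted outcome model is already in hand. Everything here is hypothetical: the model `predicted_risk`, its coefficients, the covariates (severity, smoking status) and the study composition are invented for illustration. The unknown covariate is averaged over study subjects who share the known one, which is one of several possible ways to marginalize.

```python
import math

def predicted_risk(x, severity, smoker):
    """Hypothetical fitted model f(X, W) with a treatment-by-severity interaction."""
    logit = -2.0 + 1.5 * severity + 0.8 * smoker - 0.3 * x - 0.9 * x * severity
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical study covariates as (severity, smoker) pairs.
study = [(0, 0)] * 40 + [(0, 1)] * 20 + [(1, 0)] * 15 + [(1, 1)] * 25

def average_effect(rows):
    """Mean predicted risk difference (X=1 minus X=0) over the given rows."""
    diffs = [predicted_risk(1, s, k) - predicted_risk(0, s, k) for s, k in rows]
    return sum(diffs) / len(diffs)

def effect_given_severity(rows, severity):
    """Condition on known severity; average over smoking status (treated as
    unknown) among study subjects with that severity."""
    matching = [(s, k) for s, k in rows if s == severity]
    return average_effect(matching)

overall = average_effect(study)
sicker = effect_given_severity(study, severity=1)
healthier = effect_given_severity(study, severity=0)

print(f"overall average effect:   {overall:+.3f}")
print(f"known severe patient:     {sicker:+.3f}")
print(f"known non-severe patient: {healthier:+.3f}")
```

With these made-up numbers the overall average sits between the two severity-specific effects, being just their weighted mean, so it describes neither kind of patient well.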

That does kind of answer the “why not just use a black box” question, though. Even if you can get CIs these days with modern causal ML methods like TMLE, if you want to tell a patient why they are predicted to benefit more than average, you might need to know how the covariates and treatment interact in the model. One could also consider bringing in something like SHAP to investigate interactions in an ML model.

No, the best estimate for a new individual is not the average over dissimilar individuals. The best estimate comes from measuring the covariates and conditioning on them. When the treatment interacts with a covariate, marginalization is even more disastrous.

I read those articles and I’ve been thinking about this. In standard RCTs, though, you typically aren’t adjusting for anything at all since it’s not necessary, so the estimate you get is inherently already a marginal effect.

So to me it seems like the same criticism against marginal effects/G-comp can by extension be applied to RCTs. The estimated effect in an RCT is not accounting for heterogeneity either.

That’s not the case. You need to adjust for important covariates in RCTs to get the model right if there are strong prognostic factors. If you fail to adjust for easily adjusted-for outcome heterogeneity, you have the ethical problem of randomizing more patients than you need. Marginal effects are not appropriate and are very sample-dependent. Conditional effects transport outside the RCT, as described in my blog articles.


Is it standard as of now to adjust for (or rather just include, since there is no confounding) other covariates in the primary analysis of RCTs? I thought the primary analysis typically just compares the groups, which would result in a marginal effect.

Often it is said that inference and prediction are different, but if you truly need to address as many sources of heterogeneity as you can, then at that point it’s almost like a prediction problem: find the function that predicts Y best, except maybe while maintaining some interpretability. In modern stats, though, the two do seem to be merging lately.

The only reason that you see more unadjusted RCTs than covariate-adjusted ones is the prevailing thought that unadjusted estimates are easier to interpret. Those who make this mistake never ask themselves the question “Was it worth it to have a 30% larger sample size, needlessly exposing more volunteers to the experimental agent, just so we can dumb down the analysis?”. You won’t see that question asked because it’s too painful.
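One way to see how a sample-size penalty of that order can arise, under an assumption the post does not state: for a continuous outcome analyzed with a linear model, adjusting for baseline covariates that explain a fraction R² of the outcome variance shrinks the variance of the treatment estimate by roughly a factor of (1 − R²). The unadjusted analysis then needs about 1 / (1 − R²) times as many subjects for the same precision. The function name and the R² values below are illustrative only.

```python
def sample_size_inflation(r_squared):
    """Approximate factor by which an unadjusted analysis must enlarge the
    sample to match a covariate-adjusted one (linear model, continuous outcome)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

for r2 in (0.10, 0.23, 0.40):  # hypothetical shares of outcome variance explained
    extra = (sample_size_inflation(r2) - 1.0) * 100
    print(f"R^2 = {r2:.2f}: about {extra:.0f}% more subjects without adjustment")
```

Under this approximation, covariates explaining a little under a quarter of the outcome variance already correspond to roughly a 30% larger unadjusted trial. The corresponding calculation for logistic or Cox models is different and is not attempted here.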

The primary analysis goal is not just to compare groups. It is to estimate the treatment difference on patients who differ (to the extent of measured covariates) only on assigned treatment to answer the question the treating physician, who doesn’t care about marginal estimates, has: were this patient to be given treatment B, how much better a result would I expect than were she given treatment A? The gold standard is the crossover study, and we try to mimic that in parallel group designs by accounting for some outcome variability. See this for details.