Propensity score weights

Question for @ngreifer

Considering ATE, ATT, ATO and ATM weights, it is commonly said that a weighted estimator (ATE/ATT/ATO/ATM) targets a marginal estimand — an average over a specified population (whole population, treated, overlap, matchable). The weights make explicit which marginal population we are averaging over.
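For concreteness, the four weighting schemes can be written as simple functions of the propensity score e(x). A minimal sketch in Python (the function and argument names are mine; the formulas are the standard ones, e.g., overlap weights as in Li, Morgan & Zaslavsky):

```python
def balancing_weight(a, e, estimand="ATE"):
    """Propensity score weight for one unit.

    a: treatment indicator (0/1); e: propensity score P(A=1|X).
    Formulas are the standard definitions; names are illustrative.
    """
    if estimand == "ATE":   # target: the whole population
        return a / e + (1 - a) / (1 - e)
    if estimand == "ATT":   # target: the treated
        return a + (1 - a) * e / (1 - e)
    if estimand == "ATO":   # target: the overlap population
        return a * (1 - e) + (1 - a) * e
    if estimand == "ATM":   # target: the matchable population
        return min(e, 1 - e) / (a * e + (1 - a) * (1 - e))
    raise ValueError(estimand)
```

One design point worth noting: ATO and ATM weights are bounded (never larger than 1), which is why they avoid the extreme-weight instability that ATE weights can suffer when e(x) is near 0 or 1.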

My problem is that all of these estimands average over differences in covariate distributions that are prognostic for the outcome, so they differ only in the extent of what Frank calls risk magnification, which is a form of HTE. Would you agree with this?

2 Likes

I agree, and in general, I agree with Frank’s critiques against marginal estimands, which are all that can be targeted with PS weighting methods. Unless a sample is nationally representative, a marginal estimand averages over a potentially meaningless population. Also, marginal estimands ignore (i.e., average over) effect heterogeneity, and they can’t be used for clinical decision-making because (for binary outcomes at least) they refer only to population effect measures, which do not coincide with individual or conditional effect measures due to the problem of noncollapsibility (even in the absence of effect heterogeneity over a given effect measure).

For measures like the risk difference and risk ratio, marginal estimands are inherently problematic for clinical decision-making because, by definition of the effect measure, there is heterogeneity that is missed by the marginal estimand. A marginal risk ratio of 2 obviously can’t apply to a patient who already has a baseline risk over .5, and a marginal risk difference of .3 obviously can’t apply to a patient who has a baseline risk over .7. Even though a marginal odds ratio doesn’t have this specific problem, it also cannot generalize to any individual because it is not collapsible; the marginal odds ratio will generally be smaller than conditional (or individual) odds ratios, even when there is no effect heterogeneity on the odds ratio scale. Note that all this is in even the most optimistic scenario, in which we have a full representative sample (or census) and not just patients from a given hospital. I will never claim a marginal estimand is always the right or even a remotely useful estimand to target, and if it isn’t, then PS methods (as one’s sole adjustment method) are not the right methods for that application. Similarly, the choice of which marginal estimand to target is meaningless when all marginal estimands suffer from the same problems.
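The impossibility claims above are just arithmetic; a quick sanity check (the baseline risks here are illustrative):

```python
# A marginal risk ratio of 2 cannot describe a patient whose baseline
# risk is 0.6: the implied treated risk would exceed 1.
rr_implied = 0.6 * 2          # 1.2, not a valid probability
# A marginal risk difference of 0.3 cannot describe a patient whose
# baseline risk is 0.8, for the same reason.
rd_implied = 0.8 + 0.3        # 1.1, not a valid probability
print(rr_implied > 1, rd_implied > 1)  # True True
```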

Let’s say you decide you do want a marginal estimand (i.e., for reasons of interpretability or statistical limitations). You have patients from one hospital. How do you decide among the available estimands? Typically (and in my paper on estimands), you might think that you’d choose the ATE if you are interested in a universal policy that generalizes to all units in the population under study. Putting aside the question of whether that is even useful in medical research, there is the question of who the study population is and why it’s so important to generalize to them. If the sample from a hospital represents only that hospital, what is the value of an estimand that generalizes only to patients like those in that hospital, when the result is to be presented as fact to inform practice at large? Given the arbitrariness of the target population, why not just target an estimand that can be estimated with more precision and is closer to the actually clinically useful conditional estimands, such as the ATO?

Here are some of the reasons causal inference scholars give for why we haven't abandoned marginal estimands entirely. First, in order to estimate conditional estimands, whether in an RCT or observational study, you need to get the outcome model right, and you need to include all baseline prognostic factors. If you get the model wrong, the conditional estimates you get are biased for the true conditional effects. That is not the case for marginal estimands. As long as you correctly adjust for just the confounding variables (if any), there are many robust methods to validly estimate the marginal estimand and quantify its uncertainty, such as matching, weighting, doubly robust methods, etc.

Second, the assumption that a conditional effect measure is constant across individuals (or subgroups) is a very strong assumption to make and requires justification that can never come (i.e., accepting the null hypothesis of no effect heterogeneity). Even if it is at least biologically plausible for a relative effect measure like the odds ratio to be constant across individuals, one would be acting with undue omniscience to claim that no factors modify the conditional odds ratio. And yet, such an assumption must be encoded into the outcome model to faithfully represent this fact (unless one performs increasingly narrow subgroup analyses, which I describe below). In contrast, marginal estimands do not require the effect heterogeneity to be modelled to be estimated validly. Frank often calls this a weakness of PS methods (i.e., they average over or ignore heterogeneity), but that is a feature of the estimand itself. Even if you do not model the effect heterogeneity in, for example, a weighting analysis, the weighted estimate can be consistent for the marginal estimand under much weaker modeling conditions than are required for the consistent estimation of the full effect heterogeneity of a conditional estimand across the same population.

Third, even if you decide to move past modeling issues and focus instead on narrower subgroups, eventually you lose so much precision that the inference is meaningless. You may want to say that the effect of the drug vs placebo for Hispanic men between 35 and 45 with a family history of hypertension who concurrently take a statin is some odds ratio X. If the odds ratio is indeed constant across subgroups and you correctly model the outcome with a logistic regression model, you can indeed make this claim (and indeed a much stronger claim, that it is equal to X for all subgroups defined by those factors). But if you want to retain humility about the strength of your modeling assumptions, you need to estimate an odds ratio nonparametrically in this (and each) subgroup. You can do this with weighting methods by restricting your analysis to just this subgroup. Of course doing so would decimate your precision. Targeting a marginal estimand allows you to retain the precision associated with using the entire sample without requiring the extremely strong modeling assumptions that are required for a parametric method to estimate the effect for a given narrow subgroup.

Given the choice between a marginal estimand with limited clinical utility that can at least be supported by modest modeling assumptions and a conditional estimand that may be highly useful but requires extremely strict modeling assumptions (with a high burden of evidence), many causal inference scholars prefer to target marginal estimands. That is not to say these estimands are better, but rather that there is an inherent limitation in the data that must be respected. There is a burgeoning area of heterogeneous treatment effect estimation in causal inference that aims to bring the robustness and flexibility of marginal estimands to estimation of more clinically useful conditional estimands, but these methods often encounter the same issue, which is that conditional effects estimated honestly have such vast uncertainty intervals as to be uninformative. The solution is simply more data to support more specific inferences rather than hoping we get the model right.

I’m sorry if this was more of a ramble than an answer to your question. Let me know if you’d like any clarification. I’m open to disagreements on the facts of all this; this was more my attempt to explain the rationale for different choices of estimands and estimation methods rather than to claim any are more useful or more easily estimable than another with certainty. I also don’t mean to put words in anyone’s mouth, and I would love to hear views from both perspectives.

12 Likes

This was a very interesting read for me as someone who is still fairly new to causal inference, coming from an informatics/prediction background. I made sure to stop and read your paper on selecting the right estimand about halfway through and came back to the post afterwards.

Re: your second point (I assume this is referring to outcome regression), would it be correct to interpret the coefficient on the treatment variable as the average effect across individuals, conditional on the other covariates? If so, it does not really confront the heterogeneity problem, but instead just averages it out over the sample population. Could you not in theory then just examine the other covariates to see where the influences of these potentially heterogeneous effects are coming from?

Is this still true if the data-generating process is, e.g., logistic and your marginal estimand is an RD, or the target marginal is non-collapsible? I thought the marginal RD/OR, etc., would vary based on the joint distribution of all prognostic variables in those cases?

Edit: Oh I see, is the trick that even though the marginal estimand would change in a different population, you can get an unbiased estimate of it in the sample you happen to have? I was jumping straight to marginal estimates in networks of treatments/studies in a network meta-analysis setting.

1 Like

I wish I could have stated everything in your post as well as you did. The only part of it that I take small exception to is the above quote. It is technically true, but in practice failures caused by model misspecification typically represent the limits of our knowledge and force us to average over things we don't know, such as risk factors that were omitted from the model. Other approaches just cover up the problem. See Incorrect Covariate Adjustment May Be More Correct than Adjusted Marginal Estimates – Statistical Thinking

3 Likes

I’m not sure I’m understanding what you’re asking, so I’ll try to rephrase it. Let’s say you fit the following logistic regression model for the outcome on treatment A and covariate X:

P(Y=1|A, X) = \text{expit}(\beta_0 + \beta_1 A + \beta_2 X)

This model encodes that there is no effect heterogeneity on the odds ratio scale by X. Typically, \exp(\beta_1) is interpreted as the conditional odds ratio, i.e., as \text{OR}(x)=\frac{\text{odds}(Y^1 = 1 | X=x)}{\text{odds}(Y^0 = 1 | X=x)}, which is assumed to be constant across all values of X (i.e., \text{OR}(x) = \text{OR}(x') for all \{x, x'\}).

Are you asking whether, in the case where there actually is true unmodeled effect heterogeneity on the odds ratio scale by X (i.e., where \text{OR}(x) \ne \text{OR}(x')), \exp(\beta_1) can instead be interpreted as an average of the conditional odds ratios? Something like E_X\left[ \text{OR}(x)\right]? To my knowledge, no. It may be some complicated variance-weighted geometric mean of the conditional odds ratios, but not one that can be validly interpreted as a meaningful average over the population.

I don’t think there is a way of examining the results of this model to understand effect heterogeneity without fitting a new model that allows the odds ratio explicitly to vary across levels of the covariates. If the original model is incorrect, it’s not clear to me what the value of \exp(\beta_1) is, even in an RCT.

1 Like

Thank you so much for the kind words! I’ve learned so much from reading your materials. I’m definitely interested in understanding the impact of model misspecification for conditional effect models better.

1 Like

I think you’re right. Witness the old Mitch Gail example that I include in 13  Analysis of Covariance in Randomized Studies – Biostatistics for Biomedical Research, where the treatment OR for X=0 = 9.0 = treatment OR for X=1, but without conditioning on X the OR is not between 9 and 9 🙂
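The structure of that example is easy to reproduce numerically (the risks below are illustrative stand-ins, not Gail's original numbers): a conditional OR of 9 in every stratum, yet a smaller OR after collapsing over X.

```python
def odds(p):
    return p / (1 - p)

def or_(p1, p0):
    return odds(p1) / odds(p0)

# Two equal-sized strata; conditional OR = 9 in each.
p0 = {"X=0": 0.1, "X=1": 0.5}               # control risks per stratum
p1 = {x: 9 * odds(p) / (1 + 9 * odds(p))    # treated risks implied by OR 9
      for x, p in p0.items()}
assert all(abs(or_(p1[x], p0[x]) - 9) < 1e-9 for x in p0)

# Collapse over X (equal stratum weights):
marg0 = sum(p0.values()) / 2                # 0.3
marg1 = sum(p1.values()) / 2                # 0.7
print(or_(marg1, marg0))                    # ~5.44, not 9
```

Even with no effect heterogeneity on the OR scale and no confounding (equal stratum weights in both arms), the collapsed OR of about 5.44 sits outside the set of conditional ORs, which is exactly the noncollapsibility being described.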

Thanks for the response - yes, that was the question I was attempting to ask!

@f2harrell @ngreifer I have been mulling over this thread and am curious about the utility of marginal vs conditional effects here. I was hoping you could check my interpretation. I’m a health economist with a background in cost-effectiveness and am interested in estimands for the purpose of simulation modelling, so I may be misunderstanding several of the points being made. Apologies in advance if this is silly or poorly explained.

Let us assume that, in general, nearly all models are misspecified. I do not think this is an unrealistic assumption, given that the reason we usually perform modelling in the first place, especially in complex systems, is that we are unaware of and cannot validate the true data-generating process. We make subjective value judgements about how well our model fits, but ground truth is hidden from us. For what it's worth, I think this also applies to the exogeneity assumption to varying degrees, not least because data are often not collected properly.

That being the case, a strength of marginal estimands is that, given whatever process (e.g., IPW) you have used appears reasonable, you can make inferences about the treatment effect in the sample cohort. For example, if you’re doing a population-level simulation and believe your sample reflects the target population, the ATE is a useful modifier to estimate population-level differences between treatment and control groups (useful for cost-effectiveness, policy analysis, etc). However, it gives you little to no clue about the role of covariates or the individual risk, and you cannot generalise the ATE to individuals or even to different populations. In fact, you aren’t really concerned about correctly modelling the outcome at all - you only want to ensure that you have roughly captured the effect of a given treatment on the outcome. Is that right?
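For what it's worth, the "roughly capture the effect of a given treatment" part is exactly what IPW does mechanically. A minimal simulation sketch under an assumed data-generating process (the DGP, coefficients, and names here are all mine, purely for illustration):

```python
import random
from math import exp

random.seed(0)

def expit(z):
    return 1 / (1 + exp(-z))

# Assumed DGP: one confounder X, true ATE = 1.0 on a continuous outcome.
n = 100_000
sum1 = w1 = sum0 = w0 = 0.0
for _ in range(n):
    x = random.gauss(0, 1)
    e = expit(0.5 * x)                  # true propensity score
    a = 1 if random.random() < e else 0
    y = 1.0 * a + 2.0 * x + random.gauss(0, 1)
    w = 1 / e if a else 1 / (1 - e)     # ATE (inverse-probability) weights
    if a:
        sum1 += w * y; w1 += w
    else:
        sum0 += w * y; w0 += w

ate_hat = sum1 / w1 - sum0 / w0         # Hajek-style weighted difference
print(round(ate_hat, 2))                # close to the true ATE of 1.0
```

Note that no outcome model appears anywhere; only the propensity model does, which is the sense in which you aren't really concerned with modelling the outcome correctly.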

If so, then I can see how this might be useful for examining broader policies, but fairly useless for individual decision-making. Yet, if you do anticipate that there are heterogeneous treatment effects between patients (which, due to factors like genetics, there almost certainly will be), then I'm also struggling to see the use of the conditional treatment effect in observational studies either.

Epistemically, it seems more useful to think of either conditional or marginal estimands as rough guides to whether a treatment is good or not, and to qualitatively assess the size of the treatment effect, rather than believing that we have obtained a “true” unbiased effect size and interpreting the coefficient accordingly. We can make claims about causal effects, but it seems naive to me to believe that your estimate is truly unbiased in virtually any case, only that it is less biased than a more simplistic comparison.

Thanks!

PS – as a tangent – I have seen McElreath advocate for a fully Bayesian specification in a causal framework, but have not come across many resources that describe this process in great detail. I would intuitively assume that it would be easy to jointly model both treatment and outcome to obtain both marginal and conditional effects, and interpret posteriors for individual covariates for both treatment and outcome models in a causal way. If you had any suggested readings, I would appreciate it.

I resonate with just about everything you said. ATE is only for group decision making where the decision maker is not allowed to think about individual cases. Think border closings during epidemics. SATE can be estimated robustly only because of how little it is asking. PATE is usually inestimable because of lack of random sampling from the population. Individualized absolute differences must use model approximations. Risk models do not have to be perfect in order to be the best available tools. Best practice might entail pre-specifying important covariates, being flexible about how they are modeled, allowing all covariate terms to interact with treatment, and penalizing all those interactions to what best cross-validates (or has the lowest possible AIC using the effective d.f. for the shrunken parameters). Often the optimum penalty is \infty because of so little evidence for heterogeneity of treatment effects on a scale for which such heterogeneity can theoretically be absent.

The irrational fear of misspecified models has resulted in an amazing amount of unnecessary work. Models should be thought of as best current approximations and above all should use all relevant information that is in the right part of the data evolutionary path. Instead of needing to be perfect, models just need to be better than other feasible alternatives. SATE is a very poor target for judging competing methods.

1 Like

Thanks for the reply, and great point about treatment interactions with covariates. I also completely agree regarding the fear of misspecification.

However, I do have a question about regularisation in that case. I was under the impression that, when seeking causal estimates, one should be cautious about doing too much variable selection without being blinded to Y. In other words, I see parallels between LASSO, ridge regression, etc. and stepwise selection. Would it not be better to a priori rule out implausible interactions first, then restrict potential interactions to only ones with a theoretical basis?

Yes that would be better in the sense of pre-specifying candidate interactions, then shrinking the effects of all the candidates down to what the information content in the data will support. The best approach, primarily because it provides full exact inference in the presence of shrinkage priors, is Bayesian. The most general way to specify skeptical priors for interaction effects is through contrasts as shown here near the end of the chapter. Note how variable selection plays no role here except for the original model specification.
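In symbols, the shrinkage structure being described might look like the following (a sketch only; the notation, including \tau, is mine and not taken from the chapter):

```latex
% Logistic model: treatment A, covariates X, candidate interactions A*X
\operatorname{logit} P(Y=1 \mid A, X)
  = \beta_0 + \beta_A A + X^{\top}\beta_X + A\,(X^{\top}\gamma)

% Weak priors on main effects; skeptical (shrinkage) priors on the
% interaction terms that encode treatment-effect heterogeneity:
\beta_A,\ \beta_{X,j} \sim \mathrm{N}(0,\, 10^2), \qquad
\gamma_j \sim \mathrm{N}(0,\, \tau^2)\ \text{with}\ \tau\ \text{small}

% As \tau \to 0 the model collapses to a constant conditional odds
% ratio, i.e., the "optimum penalty is infinity" situation above.
```

The Bayesian machinery then gives exact inference for any contrast of interest despite the shrinkage, which is the advantage over penalized MLE noted below.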

If you use penalized MLE you can only get approximate inference, but it’s not the worst thing you can do.

1 Like

Thanks for the detailed insights @ngreifer, and my apologies for vanishing after the post; life got in the way and I had to travel. What you said is exactly what I meant to ask, and those were my thoughts too. We recently had a journal club at my institution where I was asked which estimand to choose and why, and I said something along the lines of your post, finally concluding that if you must use PS weights, then forget the descriptions of the estimands in the literature: given the arbitrariness of the estimand, choosing something like the ATO may be preferable because it has fewer computational limitations, and perhaps for no other reason.

This leads me to the next point you mentioned, which is that if the odds ratio is indeed constant across subgroups and we correctly model the outcome with a logistic regression model, we can make the claim that it is equal to X for all subgroups defined by those factors. But then you say that if we want to retain humility about the modeling assumptions, we need to estimate an odds ratio nonparametrically in this (and each) subgroup. If a subgroup is defined, say, by the prognostic variable sex, then the sex-adjusted OR applies to both men and women, while the marginal OR applies to neither sex and is a meaningless average across their distributions in the sample. If we create subgroups by sex, we are no longer examining the prognostic effect of sex; in addition, we have by default inserted the product term (intervention × sex), and this would mostly be an artifact of the sample. Therefore, by estimating an odds ratio nonparametrically in each subgroup (I assume in this case that means males and females), all we are doing is modelling random error. Adjustment and stratification are therefore not equal, and I cannot see any benefit of restriction or stratification over adjustment even when the regression model is not perfect.

Finally, you say that even though a marginal odds ratio doesn’t have the specific problem of variation dependence, it also cannot generalize to any individual because it is not collapsible. Noncollapsibility is a marker of the absence of variation dependence with risk-based measures, not a problem. Therefore, the fact that the marginal odds ratio is generally smaller than conditional (or individual) odds ratios is solely because it addresses risk-magnification-related heterogeneity when adjusted for a prognostic variable. If the adjustment variable is not prognostic for the outcome, there will be no difference between marginal and conditional odds ratios (even when there is no effect heterogeneity on the odds ratio scale), so why then do you say this effect is seen when there is no effect heterogeneity?

Your thoughts on these points will be appreciated

1 Like

Well stated. To circle back to an earlier point, the nonparametric approach is naive in the sense of overfitting (once you try to estimate meaningful quantities and not crude marginals) and to me any rational approach needs to use either skeptical priors for interaction terms or penalized maximum likelihood estimation.

1 Like