I agree, and in general, I agree with Frank’s critiques of marginal estimands, which are all that can be targeted with PS weighting methods. Unless a sample is nationally representative, a marginal estimand averages over a potentially meaningless population. Marginal estimands also ignore (i.e., average over) effect heterogeneity, and they can’t be used for clinical decision-making because (for binary outcomes, at least) they refer only to population effect measures, which do not coincide with individual or conditional effect measures due to the problem of noncollapsibility (even in the absence of effect heterogeneity on a given effect measure).
For measures like the risk difference and risk ratio, marginal estimands are inherently problematic for clinical decision-making because, by the very definition of the effect measure, there is heterogeneity that the marginal estimand misses. A marginal risk ratio of 2 obviously can’t apply to a patient whose baseline risk is already over .5, and a marginal risk difference of .3 obviously can’t apply to a patient whose baseline risk is over .7. Although the marginal odds ratio doesn’t have this specific problem, it also cannot generalize to any individual because it is not collapsible; the marginal odds ratio will generally be closer to the null than the conditional (or individual) odds ratios, even when there is no effect heterogeneity on the odds ratio scale. Note that all of this holds even in the most optimistic scenario, in which we have a fully representative sample (or census) and not just patients from a given hospital. I would never claim a marginal estimand is always the right or even a remotely useful estimand to target, and if it isn’t, then PS methods (as one’s sole adjustment method) are not the right methods for that application. Similarly, the choice of which marginal estimand to target is meaningless when all marginal estimands suffer from the same problems.
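The noncollapsibility of the odds ratio is easy to verify with arithmetic. Here is a minimal sketch with hypothetical numbers: two equal-size strata, each with the exact same conditional odds ratio of 4, still yield a smaller marginal odds ratio, with no effect heterogeneity anywhere.

```python
# Noncollapsibility of the odds ratio: hypothetical two-stratum population.
def odds(p):
    return p / (1 - p)

def risk_from_odds(o):
    return o / (1 + o)

cond_or = 4.0                        # same conditional OR in both strata
baseline_risks = [0.1, 0.5]          # control-arm risk in each (equal-size) stratum
treated_risks = [risk_from_odds(odds(p) * cond_or) for p in baseline_risks]

# Marginal risks average over the two equal-size strata:
p0 = sum(baseline_risks) / 2         # 0.30
p1 = sum(treated_risks) / 2          # ~0.554

marginal_or = odds(p1) / odds(p0)
print(marginal_or)                   # ~2.90, strictly less than the conditional OR of 4
```

So even in this idealized setting, reporting the marginal odds ratio of about 2.9 to a clinician describes no individual patient, each of whom has a conditional odds ratio of exactly 4.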
Let’s say you decide you do want a marginal estimand (e.g., for reasons of interpretability or because of statistical limitations). You have patients from one hospital. How do you decide among the available estimands? Typically (and in my paper on estimands), the reasoning goes that you’d choose the ATE if you are interested in a universal policy that generalizes to all units in the population under study. Putting aside the question of whether that is even useful in medical research, there is the question of who the study population is and why it is so important to generalize to them. If the sample from a hospital represents only that hospital, what is the value of an estimand that generalizes only to patients like those in that hospital, when the result is to be presented as fact to inform practice at large? Given the arbitrariness of the target population, why not just target an estimand that can be estimated with more precision and is closer to the clinically useful conditional estimands, such as the ATO?
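For reference, the ATO is targeted with overlap weights, which are trivial to compute once you have propensity scores: treated units get weight 1 − e(X), controls get e(X). A minimal sketch with hypothetical propensity scores and treatment indicators:

```python
# Overlap (ATO) weights from propensity scores -- hypothetical values.
ps = [0.05, 0.30, 0.50, 0.70, 0.95]   # estimated propensity scores e(X)
a  = [0,    1,    1,    0,    1]      # treatment indicators (1 = treated)

# Treated units are weighted by 1 - e(X); controls by e(X). Units with
# extreme propensities (near 0 or 1) are heavily down-weighted, so the
# target population concentrates where the two groups genuinely overlap.
w = [(1 - e) if ai == 1 else e for e, ai in zip(ps, a)]

print(w)   # approximately [0.05, 0.7, 0.5, 0.7, 0.05]
```

Note how the two units with propensities of .05 and .95, which would get enormous weights under ATE-targeting inverse probability weights (1/.05 = 20), here get the smallest weights, which is where the precision advantage of the ATO comes from.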
Here are some of the reasons causal inference scholars haven’t abandoned marginal estimands entirely. First, in order to estimate conditional estimands, whether in an RCT or an observational study, you need to get the outcome model right, and you need to include all baseline prognostic factors. If you get the model wrong, the conditional estimates you get are biased for the true conditional effects. That is not the case for marginal estimands: as long as you correctly adjust for just the confounding variables (if any), there are many robust methods to validly estimate the marginal estimand and quantify its uncertainty, such as matching, weighting, and doubly robust methods.
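To make the robustness point concrete, here is a minimal sketch (with hypothetical cell counts and known propensity scores) of inverse probability weighting recovering a marginal risk difference with no outcome model at all: the weighted means match exactly what standardization (the g-formula) over the confounder gives.

```python
# IPW for a marginal risk difference in a tiny hypothetical population
# with one binary confounder x. Each record: (x, a, y, n) = confounder,
# treatment, outcome, cell count.
cells = [
    # x=0: P(A=1|x=0) = 0.2; risks: treated 0.30, control 0.10
    (0, 1, 1,  6), (0, 1, 0, 14),   # 20 treated,  6 events
    (0, 0, 1,  8), (0, 0, 0, 72),   # 80 control,  8 events
    # x=1: P(A=1|x=1) = 0.8; risks: treated 0.50, control 0.30
    (1, 1, 1, 40), (1, 1, 0, 40),   # 80 treated, 40 events
    (1, 0, 1,  6), (1, 0, 0, 14),   # 20 control,  6 events
]
ps = {0: 0.2, 1: 0.8}               # true propensity scores P(A=1|x)

def ipw_mean(arm):
    """Weighted mean outcome under treatment level `arm` (Hajek estimator)."""
    num = den = 0.0
    for x, a, y, n in cells:
        if a != arm:
            continue
        w = n / (ps[x] if arm == 1 else 1 - ps[x])
        num += w * y
        den += w
    return num / den

rd = ipw_mean(1) - ipw_mean(0)
print(rd)   # 0.2 = (0.30 + 0.50)/2 - (0.10 + 0.30)/2, the standardized risk difference
```

No model for P(Y | A, X) was specified anywhere; only the treatment-assignment mechanism was used, and the marginal risk difference of .20 still comes out exactly right.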
Second, the assumption that a conditional effect measure is constant across individuals (or subgroups) is a very strong assumption that requires justification that can never come (it amounts to accepting the null hypothesis of no effect heterogeneity). Even if it is biologically plausible for a relative effect measure like the odds ratio to be constant across individuals, one would be acting with undue omniscience to claim that no factors modify the conditional odds ratio. And yet such an assumption must be encoded into the outcome model (unless one performs increasingly narrow subgroup analyses, which I describe below). In contrast, marginal estimands do not require the effect heterogeneity to be modeled in order to be estimated validly. Frank often calls this a weakness of PS methods (i.e., that they average over or ignore heterogeneity), but that is a feature of the estimand itself, not the method. Even if you do not model the effect heterogeneity in, for example, a weighting analysis, the weighted estimate can be consistent for the marginal estimand under much weaker modeling conditions than are required to consistently estimate the full effect heterogeneity of a conditional estimand across the same population.
Third, even if you decide to move past the modeling issues and focus instead on narrower subgroups, you eventually lose so much precision that the inference is meaningless. You may want to say that the effect of the drug vs. placebo for Hispanic men between 35 and 45 with a family history of hypertension who concurrently take a statin is some odds ratio X. If the odds ratio is truly constant across subgroups and you correctly model the outcome with a logistic regression model, you can indeed make this claim (and a much stronger one: that it is equal to X for all subgroups defined by those factors). But if you want to retain humility about the strength of your modeling assumptions, you need to estimate the odds ratio nonparametrically in this (and each) subgroup. You can do this with weighting methods by restricting your analysis to just this subgroup. Of course, doing so would decimate your precision. Targeting a marginal estimand lets you retain the precision of using the entire sample without the extremely strong modeling assumptions a parametric method needs to estimate the effect for a given narrow subgroup.
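The precision loss is easy to quantify with the usual (Woolf) standard error of a log odds ratio from a 2×2 table, sqrt(1/a + 1/b + 1/c + 1/d). A sketch with hypothetical counts (a 25% event risk in both arms) showing how the 95% interval balloons as the subgroup shrinks:

```python
import math

# Woolf standard error of the log odds ratio from 2x2 cell counts a, b, c, d.
# Halving every cell count inflates the SE by sqrt(2), so nonparametric
# subgroup estimates degrade quickly as subgroups narrow.
def log_or_se(a, b, c, d):
    return math.sqrt(1/a + 1/b + 1/c + 1/d)

for n_per_arm in (1000, 100, 20):
    events = n_per_arm // 4                      # hypothetical 25% event risk
    se = log_or_se(events, n_per_arm - events,
                   events, n_per_arm - events)
    # multiplicative half-width of the 95% CI on the OR scale
    print(n_per_arm, round(se, 3), round(math.exp(1.96 * se), 2))
```

With 1,000 patients per arm the interval spans roughly a factor of 1.2 around the point estimate; at 20 per arm (a realistic size for a subgroup that narrow) it spans more than a factor of 4, i.e., an odds ratio of 2 would come with an interval running from well below 1 to well above 8.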
Given the choice between a marginal estimand with limited clinical utility that can at least be supported by modest modeling assumptions and a conditional estimand that may be highly useful but requires extremely strict modeling assumptions (with a high burden of evidence), many causal inference scholars prefer to target marginal estimands. That is not to say these estimands are better, but rather that there is an inherent limitation in the data that must be respected. There is a burgeoning area of heterogeneous treatment effect estimation in causal inference that aims to bring the robustness and flexibility of marginal estimands to estimation of more clinically useful conditional estimands, but these methods often encounter the same issue, which is that conditional effects estimated honestly have such vast uncertainty intervals as to be uninformative. The solution is simply more data to support more specific inferences rather than hoping we get the model right.
I’m sorry if this was more of a ramble than an answer to your question. Let me know if you’d like any clarification. I’m open to disagreements on the facts of all this; this was more my attempt to explain the rationale for different choices of estimands and estimation methods rather than to claim any are more useful or more easily estimable than another with certainty. I also don’t mean to put words in anyone’s mouth, and I would love to hear views from both perspectives.