Confounding without noncollapsibility

Question for @Sander_Greenland
In your 1999 paper in Statistical Science, on page 39, you discuss confounding without noncollapsibility in reference to the OR and give an example.
If I run a logistic regression on this example I get an unconditional odds ratio of 2.67 (as you did) and a conditional odds ratio of 2.67 as well.
The data below is what I used:

x y z freq
1 1 1 80
1 0 1 20
0 1 1 60
0 0 1 40
1 1 0 40
1 0 0 60
0 1 0 30
0 0 0 120
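
The same numbers can be checked without fitting a model, straight from the cell counts. This is only a sketch in plain Python (the helper names are mine, not from the paper): it computes the OR within each level of z and the crude OR after collapsing over z.

```python
# Cell counts from the table above, keyed by (x, y, z).
freq = {
    (1, 1, 1): 80, (1, 0, 1): 20,
    (0, 1, 1): 60, (0, 0, 1): 40,
    (1, 1, 0): 40, (1, 0, 0): 60,
    (0, 1, 0): 30, (0, 0, 0): 120,
}

def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = events/non-events with x=1; c, d = same with x=0."""
    return (a * d) / (b * c)

def stratum_or(z):
    """OR for x and y within one level of z."""
    return odds_ratio(freq[1, 1, z], freq[1, 0, z], freq[0, 1, z], freq[0, 0, z])

def crude_or():
    """OR for x and y after summing the two z strata cell by cell."""
    cells = [freq[x, y, 0] + freq[x, y, 1] for x, y in ((1, 1), (1, 0), (0, 1), (0, 0))]
    return odds_ratio(*cells)

or_z1, or_z0, or_crude = stratum_or(1), stratum_or(0), crude_or()
print(round(or_z1, 2), round(or_z0, 2), round(or_crude, 2))  # 2.67 2.67 2.67
```

All three come out at 8/3, which is why the crude and conditional logistic fits agree.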

My question is this: the outcome heterogeneity due to Z, which leads to what we nowadays call the noncollapsibility effect, is just masked by the confounding effect and is therefore still there, so why do you conclude that there is collapsibility of the OR here? Does the marginal OR not increase spuriously to 2.67 because of confounding therefore masking noncollapsibility which still exists?


“why do you conclude that there is collapsibility of the OR here?”: Because the example satisfies the definition of strict collapsibility of the odds ratio given in the paper: All conditional ORs equal the marginal OR. This definition goes back at least to the 1970s (e.g., see Bishop, Fienberg & Holland, Discrete Multivariate Analysis, 1975; Whittemore, JRSS 1978) and was subsequently generalized to weighted averages of the conditional OR and other parameters.

“Does the marginal OR not increase spuriously to 2.67 because of confounding therefore masking noncollapsibility which still exists?”:
Not according to any bona fide definition of collapsibility I have seen used in the collapsibility literature. They are all purely numerical properties of distributional features, with no need of causal structure to define or detect. The marginal OR is equal to the conditional OR so the OR is collapsible; that is the end of the story.

Pang et al. (2016) provide a more thorough study of the relation between OR noncollapsibility and confounding in

For yet another study of the defects of odds ratios and hazard ratios in evaluating treatment effects see the article by Liu et al. and its discussion here:


Thanks for your response - the Pang et al. paper was particularly helpful.
Pang et al. use the same example that you used and conclude on page 5 that “the noncollapsibility effect (-0.169) and confounding bias (+0.169) cancel each other out, thus leading to equality of the crude and conditional ORs in the presence of confounding (‘‘confounding without noncollapsibility’’).”

This implies that there is a noncollapsibility effect, equal in magnitude to the confounding effect but opposite in direction, so that “without noncollapsibility” simply means that the noncollapsibility is masked by the confounding.

They also state that

log(OR_crude) - log(OR_c) = [log(OR_m) - log(OR_c)] + [log(OR_crude) - log(OR_m)]

which is where the -0.169 and +0.169 come from (c denotes the conditional OR and m an IPW-weighted marginal OR from a marginal structural model).
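
For the data posted at the top of this thread, the two bracketed terms can be reproduced by hand. The sketch below rests on two assumptions of mine: that Z is the only confounder, and that standardizing to the total sample is an acceptable stand-in for the IPW-weighted marginal OR (with a saturated weight model the two should agree). The variable names are illustrative only.

```python
from math import log

# Cell counts from the data posted earlier in this thread, keyed by (x, y, z).
freq = {
    (1, 1, 1): 80, (1, 0, 1): 20,
    (0, 1, 1): 60, (0, 0, 1): 40,
    (1, 1, 0): 40, (1, 0, 0): 60,
    (0, 1, 0): 30, (0, 0, 0): 120,
}
n_total = sum(freq.values())

def odds(p):
    return p / (1 - p)

def risk(x, z):
    """Observed risk of y=1 among those with exposure x in stratum z."""
    return freq[x, 1, z] / (freq[x, 1, z] + freq[x, 0, z])

def crude_risk(x):
    """Observed risk of y=1 among those with exposure x, ignoring z."""
    events = freq[x, 1, 0] + freq[x, 1, 1]
    return events / (events + freq[x, 0, 0] + freq[x, 0, 1])

def standardized_risk(x):
    """Risk if everyone had exposure x, averaged over the overall z distribution."""
    total = 0.0
    for z in (0, 1):
        p_z = sum(count for (_, _, zz), count in freq.items() if zz == z) / n_total
        total += p_z * risk(x, z)
    return total

log_or_c = log(odds(risk(1, 1)) / odds(risk(0, 1)))  # same value in the z=0 stratum
log_or_crude = log(odds(crude_risk(1)) / odds(crude_risk(0)))
log_or_m = log(odds(standardized_risk(1)) / odds(standardized_risk(0)))

noncollapsibility_effect = log_or_m - log_or_c
confounding_bias = log_or_crude - log_or_m
print(round(noncollapsibility_effect, 3), round(confounding_bias, 3))
```

Under these assumptions the two terms are roughly -0.17 and +0.17 and sum to zero, in line with the values quoted from Pang et al. up to rounding.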

The reason I think this is important is that if Z is prognostic for the outcome, then for an OR there should be no expectation that noncollapsibility is absent, as “without noncollapsibility” implies. So I am surprised that Pang et al. reach the same conclusion despite showing that the noncollapsibility effect exists. In this post I am not arguing about whether noncollapsibility is good or bad, as we have already had that discussion; I just want to confirm that it must be present unless the third variable is non-prognostic, in which case there would be no confounding either.


The odds ratio will also be collapsible if there is no effect in any stratum. It is also in principle possible for a non-prognostic variable to be a confounder. Otherwise I largely agree in spirit, but that doesn’t mean that anything is wrong with the Greenland or Pang papers. Their claims are correct according to the definitions used in those papers.

There are multiple definitions of collapsibility in the literature. These all refer to closely related concepts, but they are different enough to cause real confusion:

  1. The original definition (edit: not the original definition, but an influential definition which influenced a lot of early thinking) of collapsibility from Alice Whittemore’s 1978 paper, which roughly says that a measure of association is strictly collapsible in a specific data set if the marginal measure of association is equal to the conditional measure of association in every stratum
  2. “Collapsibility of a measure of association”. I am not sure where this definition originally came from; one variant of it can be found as definition 1 in my paper “On the collapsibility of measures of effect in the counterfactual causal framework” (Emerging Themes in Epidemiology). I believe this definition is closely related to the one used in the papers discussed so far in this thread. In this definition, collapsibility means that the associational (uncontrolled) marginal measure of association is equal to a weighted average of the stratum-specific associational measures of association
  3. “Collapsibility of a measure of effect” from the same paper, which says that a measure of causal effect is collapsible if the causal marginal effect measure is equal to a weighted average of the conditional causal effect measures
  4. Daniel et al. defined collapsibility based on whether the effect function is linear/affine. This is equivalent to definition 3 (at least when the outcome is binary)

The crucial point here is that your claim that “non-collapsibility is present but masked by confounding” is true when considering definitions 3 and 4, but not when considering definitions 1 or 2. In definitions 1 and 2, you can easily verify that the conditions are met by looking at the data. By definition 2, the odds ratio is collapsible in this data set, and without counterfactuals there is no way to define a property that is “masked by confounding” even if our intuitions tell us that this is what is going on.

This illustrates one of the reasons I think definitions 3 and 4 are often more useful in practice, but it does not invalidate claims that are based on definitions 1 and 2, which may very well be useful in contexts that differ from the ones I have in mind. In an ideal world, perhaps different terms should have been used to differentiate the definitions from each other.


I read this as well and it was very interesting. Liu et al. define logic-respecting mean effects as those where

μ(z-,z+) ∈ [μ(z-),μ(z+)]

Thus this definition includes collapsibility if we narrowly define it as μ(z-,z+) = μ(z-) = μ(z+) in the absence of confounding and interaction.
As Ford et al. have said (quoted in the discussion paper), the important fact is that the treatment effect does not have the same interpretation, and hence there is no reason for it to have the same numerical value, under these groupings. Therefore, when a prognostic covariate is adjusted for (OR or HR), there should be no expectation that there will not be “logic disrespect”, “noncollapsibility” or “inconsistency”, whichever we choose to call this phenomenon in epidemiology.

I think I am now clear that what you meant in your paper was μ(z-,z+) = μ(z-) = μ(z+), while what I thought you meant was that the noncollapsibility effect was absent. This actually came up in a journal club on your paper, where a PhD student raised it. I thought he had a valid point, but now it is clear that this is all just a matter of terminology.


Although the term “collapsibility” may have first appeared in Whittemore 1978, that paper did not originate the concept; rather, it corrected earlier statements about collapsibility in the textbook by Bishop, Fienberg & Holland 1975. Whittemore showed that the sufficient conditions for collapsibility were weaker than BFH stated when collapsing over 3 or more levels, providing example distributions that we would now call “unfaithful” to their graphical representations.

The concept of collapsibility can arguably be traced to Pearson and Yule around 1900, but like many later authors they were not precise about the distinction from the concept of confounding (which can be traced even earlier, at least to Mill in the 1840s; I would bet it was known to some in ancient times). Whittemore avoided discussion of “confounding” but noted that many formal studies of what we now call collapsibility followed Simpson 1951. By 1971 Yvonne Bishop had introduced the term “collapsible” in the sense used in Whittemore 1978 and GRP 1999:

In sum, the statistical and graphical literature on collapsibility goes back far and is quite large. I doubt that it has all been absorbed into the epidemiologic methods and causality literature, although in 2011 Pearl and I discussed some less-known connections in Adjustments and their Consequences—Collapsibility Analysis using Graphical Models on JSTOR


“the important fact is that the treatment effect does not have the same interpretation, and hence there is no reason for it to have the same numerical value, under these groupings.”
-That statement needs to be refined along the following lines: It is a natural intuition to expect that a group effect measure will be an average of effects in its subgroups or among its individuals (subgroups of size 1). This intuitively natural property is obeyed by measures linear or loglinear in the outcome means, such as risk differences and ratios, but not by differences or ratios of odds or of hazard rates. Furthermore, risks typically provide a simpler and more accurate reflection of real-world cost functions (e.g., number of beds needed for patients) than do odds.
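
A small numerical illustration of that averaging property may help. It is a sketch with hypothetical inputs: the stratum risks are borrowed from the example table earlier in the thread, but treatment is assumed randomized and Z is assumed split 50/50, so there is no confounding. The code checks whether each marginal measure falls within the range of its stratum-specific values, as any weighted average must.

```python
# Hypothetical setup: z -> (risk if treated, risk if untreated), with treatment
# randomized so that both arms share the same z distribution.
risks = {1: (0.8, 0.6), 0: (0.4, 0.2)}
z_weight = {1: 0.5, 0: 0.5}

def odds(p):
    return p / (1 - p)

measures = {
    "RD": lambda p1, p0: p1 - p0,
    "RR": lambda p1, p0: p1 / p0,
    "OR": lambda p1, p0: odds(p1) / odds(p0),
}

# Marginal risks in each arm are simple weighted averages of the stratum risks.
marginal_treated = sum(z_weight[z] * risks[z][0] for z in risks)
marginal_untreated = sum(z_weight[z] * risks[z][1] for z in risks)

inside_range = {}
for name, measure in measures.items():
    stratum_values = [measure(*risks[z]) for z in risks]
    marginal_value = measure(marginal_treated, marginal_untreated)
    tol = 1e-9  # guard against floating-point noise at the boundaries
    inside_range[name] = min(stratum_values) - tol <= marginal_value <= max(stratum_values) + tol
    print(name, [round(v, 3) for v in stratum_values], round(marginal_value, 3))

print(inside_range)
```

With these inputs the marginal RD and RR sit inside the range of their stratum values, while the marginal OR (2.25) sits below both stratum ORs (about 2.67 each), even though nothing is confounded.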

On the other hand, loglinear models for the odds or hazard rates provide the most well-behaved and easily computed smoothers for data analysis. They are thus most “natural” for statisticians, as reflected in the jargon in which log odds and log rates equal the “natural parameters” of binomial and Poisson outcomes, respectively. This creates a tension between statistics consumers who want a measure that behaves intuitively and tracks a realistic cost function, and on the other hand the technical suppliers of summary statistics who want a well-behaved easily computed data-processing algorithm.

Suppliers have at times been guilty of trying to sell their immediate product (i.e., odds or hazard ratios) as sufficient to meet consumer goals based on arguments that ignore consumer intuitions, needs, and costs. Yet quite early on, other statisticians (including Cornfield 1971 and Bishop, Fienberg & Holland 1975) recognized that the tension is resolvable by converting the statistically “natural” model outputs into the intuitively natural, collapsible measures needed for consumers to correctly grasp the outputs in context. I should hope we can agree that these long-standing approaches deserve to be implemented whenever we cannot assure that the odds or rate contrasts sufficiently approximate risk contrasts (as when the outcome may be “common” in some subgroups, or the sampling design has not forced an equality).
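
As a rough sketch of that conversion (not the specific procedures in the works cited), one can take logistic coefficients and average the fitted risks over the covariate distribution to obtain marginal risks, then form risk differences and ratios. The coefficients below are the ones implied by the stratum risks in the example table earlier in this thread, where a main-effects logistic model reproduces the four observed risks; the function and variable names are mine.

```python
from math import exp, log

def expit(v):
    return 1 / (1 + exp(-v))

# Model: logit P(Y=1 | x, z) = intercept + b_x * x + b_z * z
intercept = log(0.2 / 0.8)  # log odds at x=0, z=0 (risk 0.2)
b_x = log(8 / 3)            # common conditional log OR for x
b_z = log(6)                # log OR for z

# Number of people at each level of z in that table.
z_counts = {1: 200, 0: 250}
n_total = sum(z_counts.values())

def standardized_risk(x):
    """Average fitted risk if everyone had exposure x, over the observed z distribution."""
    return sum(
        n_z / n_total * expit(intercept + b_x * x + b_z * z)
        for z, n_z in z_counts.items()
    )

risk_treated = standardized_risk(1)
risk_untreated = standardized_risk(0)
risk_difference = risk_treated - risk_untreated
risk_ratio = risk_treated / risk_untreated
print(round(risk_difference, 3), round(risk_ratio, 3))
```

Here the conditional OR of about 2.67 translates into a standardized risk difference of 0.2 and a risk ratio of about 1.53, assuming Z suffices for confounding control.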

Here are a few of my articles on connecting “classical” acausal statistics for data processing (probability modeling) to subsequent consumer goals such as causal inference…
“Summarization, smoothing, and inference”, SJSM 1993:
“Model-based estimation of relative risks and other epidemiologic measures in studies of common outcomes and in case-control studies”, AJE 2004:
“Smoothing observational data: a philosophy and implementation for the health sciences”, ISR 2006:


I get your point but this issue only arises if we implicitly or explicitly think that estimates of effects for one group should also apply to other groups. Statisticians who advocate for the OR and logistic regression, on the other hand, consider covariates Z (whether observed or not) as inherently involved in what the average causal effect of X (assuming randomization of X) in a population means. In other words, they believe that the distribution of Z in that population sets the context in which the effect of X is realized. This means that even if there is no interaction or confounding, the log odds ratio that holds at each level of Z separately cannot be estimated from a logit model given X only but must be estimated from a model that includes Z as well (and vice versa). I don’t think experienced statisticians see any problem with this, and certainly do not see this as not logic respecting as Liu et al might put it given that outcome probability is a nonlinear function of the model. This is often presented as a problem or a limitation of the logit model in the literature because it is thought to compromise group comparisons with such models. The evidence for the latter seems to be lacking.

I don’t think anyone can speak for “experienced statisticians” as if that represents some single-minded entity - as Tukey and others remarked, the sufficient number of statisticians to guarantee conflict over at least some statistical issue is 2; and 1 may suffice if that one is of two or more minds about the topic. I nonetheless believe much of the current conflict stems from some simply failing to appreciate the full body of literature on the topic. [Of course the situation is even worse for statistical testing, where concerns dating back well over a century continue to be raised by many, and then dismissed by those whose entire career of teaching and practice would be called into question if those concerns were validly addressed.]

Many articles have explained how it is indeed a limitation of logit/logistic models that their “natural” parameters (coefficients) are not collapsible. The limitation is a cognitive one and does indeed compromise group comparisons to the extent that the coefficient antilogs are misinterpreted as marginal effects. A related limitation is that the coefficient antilogs frequently get interpreted as risk ratios even when they are far from those, a problem that has been observed ever since odds ratios became common in the medical literature (around the 1970s or so, when logistic-regression procs began appearing in software packages).

The key problems I see in this case and others are psychosocial, not mathematical. That includes attempts to force all discussions into a single mathematical framework when the discussants share neither the same framework nor the same meanings of words and symbols. Here, noncollapsibility operates in a general framework of probability distributions in need of no further structure, whereas those who take care to distinguish that from confounding operate in a framework in which relevant distributions must be generated from causal models, leading to complications in mapping logistic-model parameters into causal effect measures (albeit easily surmountable ones).

The resulting conflicts extend well beyond confounding to scientific inference in general, which some (including me) argue is intrinsically founded on causal thinking; see for example “The causal foundations of applied probability and statistics”, 2022 (arXiv:2011.02677). Standard “statistical inference” hides this fact whenever it assumes a model without explaining what observational or experimental set-ups will and won’t cause the data to follow the model. See also the writings of David A. Freedman on the abuses of regression models, such as “Statistical models and shoe leather”. [Disclosure: It was TA’ing for Freedman in grad school that started me on this train of thought.]


I agree with your paper’s conclusion that statistical science (as opposed to mathematical statistics) involves far more than data – it requires realistic causal models for the generation of that data and the deduction of their empirical consequences. I also agree with you that the key problem may be misinterpretation with one measure compared to another. But surely, neither realistic causal models for data generation, paying attention to research design nor avoiding misinterpretation will help if the effect measure itself is thought to compromise group comparisons independent of cognitive biases and properly addressed data generation mechanisms. I get that your paper suggests that the steps leading up to formal data analysis are far more important so that any issues pertaining to the modeling choices are irrelevant in comparison. In order for the former to be appreciated, the latter needs to be addressed – not dismissed. Psychosocial change can only occur when, rather than dismissal based on theory, real world consequences are shown to exist to justify the unwavering support for collapsible measures by the causal inference community. Perhaps you can point us to some papers from the applied medical literature (original studies rather than methodological) where such a consequence is clearly evident so that we can learn from them.

It is good to see the initial agreement on the general points, and I think it’s fair to ask for real examples. I have seen quite a few in my time but have not been keeping a list, so for the moment the only published examples I can point to are those of sparse-data bias in odds ratios, such as that used in

The bias is a manifestation of the noncollapsibility problem as described in “Noncollapsibility, confounding, and sparse-data bias. Part 2: What should researchers make of persistent controversies about the odds ratio?”
and explored in extensive formal detail in “Randomization Does Not Justify Logistic Regression”,
although a closely related phenomenon can be found earlier in Andersen (1973) as bias in the unconditional MLE of the matched-pair odds ratio.

I have seen a number of real studies that used logistic regression to adjust for many confounders, and attributed estimate change upon adjustment to confounding removal when close inspection of their data and context revealed that it was instead far more likely to be only the aforementioned odds-ratio inflation; e.g., see Table 4 of
in which logistic adjustment for 4 covariates is based on 1 exposed control, leading to an OR estimate of 17 among women (CL 1.5, 182), compared to a much smaller odds ratio imputed from the tabulated numbers when no such regression adjustment is done.

That bias is perhaps better described as an inflationary effect, it being a manifestation of a mathematical noncollapsibility result which says roughly that if a randomized binary exposure X is positively associated with the binary outcome Y within all given Y-predictor (Z) partitionings of a distribution, the summary odds ratios will inflate as the Z stratification increases, rather than distribute around a marginal value as with the RD. Another take on this problem is that the usual expected loss when basing decisions on OR cutoffs is unbounded as stratification increases, whereas it is bounded when using RD cutoffs instead. Such theoretical reasons are enough for many including me to ask: Why bother defending the OR as a final measure of effect, as opposed to a statistically convenient intermediate calculation?
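
The direction of that inflation can be illustrated with a toy calculation. The sketch below is built entirely on hypothetical numbers of my choosing: X is randomized, the fully conditional OR is held at 3, and each of k independent binary outcome predictors has the same log OR. It computes the marginal OR obtained by averaging risks over the k predictors; the more predictors lie behind the conditional OR, the further that conditional value sits above the marginal one.

```python
from itertools import product
from math import exp, log

def expit(v):
    return 1 / (1 + exp(-v))

def odds(p):
    return p / (1 - p)

CONDITIONAL_OR = 3.0    # hypothetical OR for x within every covariate stratum
COVARIATE_LOG_OR = 2.0  # hypothetical log OR of each binary outcome predictor

def marginal_or(k):
    """Marginal OR for randomized x after averaging risks over k independent
    binary predictors, each coded -0.5/+0.5 with probability 1/2."""
    combos = list(product((-0.5, 0.5), repeat=k))

    def mean_risk(x):
        return sum(
            expit(log(CONDITIONAL_OR) * x + COVARIATE_LOG_OR * sum(zs))
            for zs in combos
        ) / len(combos)

    return odds(mean_risk(1)) / odds(mean_risk(0))

marginal_ors = [marginal_or(k) for k in range(5)]
for k, value in enumerate(marginal_ors):
    print(k, round(value, 2))
```

With no predictors the marginal and conditional ORs coincide at 3; as predictors are added the marginal OR drifts toward the null while the conditional OR stays at 3, so the ratio of conditional to marginal keeps growing.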

You wrote
“Psychosocial change can only occur when, rather than dismissal based on theory, real world consequences are shown to exist …”
I wish that sufficed. Massive real-world consequences of using “statistical significance” cutoffs for selection have been published for decades, including serious distortion of entire literatures. Yet defenders have only doubled down on maintaining such damaging practice. The situation brings to mind a general observation by Daniel Kahneman (in Thinking, Fast and Slow, 2011):
“…illusions of validity and skill are supported by a powerful professional culture. We know that people can maintain an unshakeable faith in any proposition, however absurd, when they are sustained by a community of like-minded believers.”
To end on a more upbeat thought: Perhaps, unlike significance testing, the odds-ratio noncollapsibility controversy can be settled because (for most of us at least) it is a side issue and the professional stakes on past errors are not so high.


I still think it is basically impossible to imagine an individual-level data-generating mechanism that, when aggregated across individuals, leads to a non-collapsible effect measure being conditionally stable between populations. We prove some variant of this claim in section 3.6 of , where we show that it is impossible to achieve conditional stability (between two populations) of a non-collapsible effect measure by conditioning on variables that predict individual level effect, unless your conditioning set is also sufficient to obtain stability of every effect measure.


“I still think it is basically impossible to imagine an individual-level data-generating mechanism that, when aggregated across individuals, leads to a non-collapsible effect measure being conditionally stable between populations.”
If you add the proviso “when risk varies across populations (as it always does)”, I would agree and suggest that the open questions are then whether someone disagrees with that, or (even if they agree) still might argue for odds ratios as more transportable. I think those who hold to the latter view are working from a heuristic which I will try to make more precise.

First, I have never encountered a real situation in health or medicine (human or animal) in which I would expect any effect measure to be stable across levels of actual measured outcome predictors or across real populations. The underlying social and biological processes are always far too complex and our measurements too limited to expect anything other than some heterogeneity for every measure.

Recognizing that reality, we then have to face the painful fact that, even when the heterogeneity is estimable in some large-sample sense, our data simply can’t support such high-dimensional estimation. So we have to fall back to biased estimation of effects via smoothing (shrinkage) toward simplified models that either impose some type of parametric homogeneity and smoothness conditions (as in classical regression) or else more opaque nonparametric conditions (as in machine learning). The estimates from parametric model components should be viewed as empirical-Bayes or penalized predictions, not unbiased point estimates as in the conventional maximum-likelihood/estimating-equation fantasy.

We rarely have much guidance from more basic mechanistic sciences to help us select the model family or approach; hence we fall back to technical considerations of simplicity and accuracy (including rapid convergence to asymptotic behavior and automatic obedience of logical range restrictions). For parametric smoothing those considerations dictate using noncollapsible parameterizations like logistic risk and log-linear rate models as the smoother. Adding to that, with uncommon outcomes or certain sampling designs, their coefficient antilogs serve as simple estimates of risk ratios, making the noncollapsibility objection to them of no practical consequence in those cases; and when it is a concern, the conversion of the modeling outputs to collapsible measures is straightforward. On this topic I’ll point back to my earlier cites to papers from 1993, 2004, and 2006, as well as Ch. 21 of Modern Epidemiology 2nd and 3rd ed. (as an historical aside, my first article expressing that view was in AJE 1979,

a view which was derived from the coverage of smoothing given in the standard discrete multivariate analysis textbook of that time, Bishop, Fienberg and Holland 1975).

In making these practical concessions to noncollapsible models, we should not be tempted to go further and adopt inapplicable or fallacious rationalizations for taking odds ratios as preferable measures of effect. Even as data summaries, odds ratios require attention to their noncollapsibility. If in response we turn instead to collapsible measures, we should recognize that collapsibility is far from enough to justify a measure for all purposes. For example, as long noted a risk ratio is inadequate if not misleading for real intervention decisions: a risk ratio of 10 has vastly different implications if concern is with an outcome having background risk of 1 in 100 vs. 1 in 1,000,000,000. Thus, in terms of aiding real harm-benefit evaluations, risk ratios are not much better than odds ratios and are barely different from hazard-rate ratios (which are almost collapsible most of the time); yet all three ratios are standard summaries in medical research papers and reviews, including those used for policy and practice recommendations.
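
The arithmetic behind that example is simple enough to spell out. The sketch below uses a hypothetical exposed population of one million (my choice, purely for scale) and computes the excess cases implied by a risk ratio of 10 at the two background risks mentioned.

```python
def excess_cases(background_risk, risk_ratio, n_exposed):
    """Expected extra cases among n_exposed people, relative to no exposure."""
    return (risk_ratio - 1) * background_risk * n_exposed

N_EXPOSED = 1_000_000  # hypothetical population size, for scale only

common = excess_cases(1 / 100, 10, N_EXPOSED)
rare = excess_cases(1 / 1_000_000_000, 10, N_EXPOSED)
print(round(common), round(rare, 3))  # 90000 0.009
```

The same risk ratio corresponds to about 90,000 extra cases per million in one setting and about 0.009 in the other, which is the sense in which the ratio alone says little about harm or benefit.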


I think I must press this point if we are to have one position dominate. These are still (interesting nonetheless) methodological papers and the NEJM paper was used as an example in your BMJ paper for sparse data bias. This has to be taken a step at a time if the users (and eventually creators - if users gain sufficient expertise) of applied or clinical research are to take notice. What is needed are real world examples of compromised group comparisons due to use of logits in regression or the OR as an effect size otherwise it is unlikely that users of the applied literature (clinicians for example) will be swayed given that there are enough experts on either side of the debate for equipoise. Surely, there must be at least one major clinical study that impacted care which can serve as an example of compromised group comparisons directly due to the use of the OR as an effect measure rather than due to sparse data or similar considerations not uniquely tagged to the interpretation of the effect measure. If not, then we move towards the conclusion that this is not as serious a problem as the causal inference community make it out to be.

“due to sparse data or similar considerations not uniquely tagged to the interpretation of the effect measure.”
I find that comment unhelpful, insofar as Kernan et al. and similar examples were used as evidence in litigation as arguing for causation based on the “strength of association” criterion. Conversely, noncollapsibility is an issue to watch out for regardless of whether the goal is causal inference or mere description.
“we move towards the conclusion that this is not as serious a problem as the causal inference community make it out to be.”
Again, you are writing as if there is a singular mind defined by a heterogeneous grouping, this time “the causal inference community” (last time it was “experienced statisticians”). Such a description is misleading for two reasons:
First, there is disagreement about letting collapsibility drive choice of parametric “analysis” models, with those like myself pointing out that (for example) additive-risk models can behave very poorly in practice compared to logistic or even loglinear risk models, and there is no compelling need for such technically misbehaved models when as usual we lack empirical justifications for any model. In the latter case the only purpose of modeling I can see is noise reduction for an acceptable bias cost (the old standard-deviation vs. bias trade-off), and logistic models are a central tool in that task.

Second, I have not seen where anyone wrote that this is one of the most important problems facing statistical analysis. It has been more of a very telling educational example displaying how traditional statistical mathematics can leave many people confused about a simple distinction. That happens when we fail to extend modeling beyond ordinary acausal probability to incorporate a causal component, in order to properly reflect confounding concerns in research.

What happened historically is that by the 1970s some authors formally identified confounding with noncollapsibility, subject to some causal-order restrictions on the variables. That partial formalization limped along adequately for most practice (and I used it a lot for decades) but soon revealed itself as inadequate in general once logistic regression and odds ratio presentation became standard practice. The mistake was to cling to that formalization of confounding long after its deficiencies were pointed out and better formalizations were provided. Whether this mistake was “important” is not a valid issue here or in logic in general - after all, nothing is important in a broad enough perspective (recall Keynes commenting on the absurdity of naive frequentism: “in the long run we shall all be dead”).

My view is that our first order of affairs as theoreticians and methodologists should be to rectify logical and semantic mis-steps when we recognize them, not defend them because we made them, or because they are traditions handed down by “great men”, or because they are not “important” (which alas I have done at one time or another and anyone around long enough has done too).

Another example of a cognitive-psychosocial problem in statistical theory persists today in the form of insistence that properly randomized trials can’t be confounded. It also illustrates another split in the causal-inference community: I see the position as a fallacy stemming from reification of a hypothetical long run of allocations, leading to formalizing the old commonsense notion of “no confounding” as randomization. The defenses of this formal redefinition illustrate once more the Kahneman quote I gave earlier in this thread. The old informal notions correctly intuited that commonsense confounding (mixing up of effects in the unadjusted association of treatment with outcome) is signaled whenever the treatment is associated with baseline outcome predictors in the actual observed cohort, even in randomized studies. Randomization only enforces independence of treatment and baseline risk over a counterfactual long run that provides a foundation for mathematical uncertainty assessments (which only have force if the study conduct was nearly perfect). The fact that randomization makes some degree of dependence or greater unlikely over a nonexistent long run becomes irrelevant once we see the dependence in our data. For further discussion see


Effect measures play a very significant role in how clinical medicine works in 2022:

  • The Cochrane Handbook, also known as the bible of Evidence-Based Medicine, specifically instructs authors of meta-analyses to obtain an estimate of the patient’s risk under the control condition, and combine this with an estimate of the risk ratio in order to predict the risk under treatment (and then convert this to a risk difference scale, which is believed to reflect information that is more appropriate for the needs of the decision maker). The GRADE framework contains identical advice, and this is often also repeated by leading thinkers in Evidence-Based Medicine on Twitter.

  • In cost-effectiveness analyses (including those used to decide whether a drug will be covered by insurance or similar government funded programs), health economists often use decision tree models with parameters for the effect of the medication taken from the empirical literature. The people who provide such reports to the Norwegian Health Authorities have explicitly told me that they rely on the risk ratio for this purpose because they have been told that it is stable, and that if it was not stable, they would have to rethink how they do their models (presumably sometimes leading to different conclusions about whether a drug is considered sufficiently cost-effective)

These are both examples of high-stakes decisions which may have different outcomes if the analysis relied on a different effect parameter. You can in principle argue that this problem will be “solved” by machine learning and semi-parametric methods that are hoped to be invariant to parametrization. But these methods do not lend themselves to heuristics that can be used at the point of decision, because the results from such analyses cannot be memorized by any human decision maker. To be usable, it will be necessary for a computer to provide individual-level predictions for every patient, so that the doctor’s role will be to tell the patient what the AI predicted, presumably without being able to give much insight about why the machine made that prediction.
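
To make the dependence on the effect parameter concrete, here is a toy version of the calculation described in the first bullet, done twice: once treating the risk ratio as the stable quantity and once treating the odds ratio as stable. All numbers are hypothetical, chosen only to make the arithmetic clean.

```python
def odds(p):
    return p / (1 - p)

def risk_from_odds(o):
    return o / (1 + o)

# Hypothetical trial result.
trial_control_risk = 0.20
trial_treated_risk = 0.10
risk_ratio = trial_treated_risk / trial_control_risk
odds_ratio = odds(trial_treated_risk) / odds(trial_control_risk)

# Hypothetical patient whose baseline risk is higher than in the trial.
patient_baseline_risk = 0.60

# Predicted risk under treatment, depending on which measure is carried over.
risk_if_rr_stable = patient_baseline_risk * risk_ratio
risk_if_or_stable = risk_from_odds(odds(patient_baseline_risk) * odds_ratio)

rd_if_rr_stable = risk_if_rr_stable - patient_baseline_risk
rd_if_or_stable = risk_if_or_stable - patient_baseline_risk
print(round(rd_if_rr_stable, 2), round(rd_if_or_stable, 2))  # -0.3 -0.2
```

The same trial yields a predicted absolute risk reduction of 30 percentage points under one assumption and 20 under the other, which could plausibly move a cost-effectiveness threshold.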

It is possible (but contestable) that this is the future we are hoping for, but until we get there, it seems that we can pick some low-hanging fruit by coming to a principled agreement over how to choose between effect measures. Moreover, it is not at all clear to me that this non-parametric individualization will work well when there is need for generalization, i.e. when the patient differs in important ways from the participants in the randomized studies. My view is that the choice between effect measures will always be crucial in such settings.

While I agree in principle that we will rarely be able to rely on mechanistic sciences to select a perfect model, I think there is significant value to having a stylized toy model which would theoretically lead to stability of a particular effect measure, such that we can think about all the biologically interpretable ways that reality might deviate from this toy model, and interpret heterogeneity as such deviation. Such a toy model would then give guidance for when there is a need for a product term, when the analysis is expected to fail completely due to heterogeneity, etc.

I want to note that (unlike some economists) I am not arguing for additive models for binary outcomes. Linear probability models are not the only alternative to logistic regression, and some of these other alternatives are able to retain some of the attractive features of logistic regression yet avoid non-collapsibility. For example, I am working with Rhian Daniel, Daniel Farewell and Mats Stensrud on a new framework for regression models called Regression by Composition, which we believe will be a very attractive option. I know I have mentioned this model before, and that from the outside perspective, we must be close to the stage where it is starting to sound like vaporware. But from looking at the state of the manuscript today, I am convinced we will be able to share an early working paper with trusted readers by the end of this year.

I found your comments largely agreeable, subject to some cautionary footnotes:

  • The Cochrane Handbook should not be taken as a bible (no science text should, including Modern Epidemiology). In particular, I hold that the excerpted advice needs serious technical cautions (reflecting concerns raised by Doi, Harrell and Senn among others) because I have seen it lead to unnecessary inaccuracies and effect obscuration in meta-analyses. One can obtain more accurate yet equally simple risk predictions by filtering through the logistic-fitted odds ratio rather than a loglinear-fitted risk ratio: Yes, one should start with the fitted baseline risk R0 and end with the treated risk R1, but the accuracy of the conversion via an estimated risk ratio RR to R1 = RR(R0) can be badly compromised if RR is taken directly from a random-effects loglinear-risk model, compared to the more accurate indirect RR estimate from a logistic risk model. [The conversion from covariate-specific logistic-fitted odds ratios can be made quite simple by rewriting it in odds form O1 = R1/S1 = OR(R0/S0) = OR(O0) where Sj = 1-Rj, then taking the final step R1 = O1/(1+O1). None of this assumes that these covariate-specific OR are constant, as the logit-linear predictor may contain higher-order terms, nor is it hampered by OR noncollapsibility because the OR is only used as an intermediate between the risks.]
  • “leading thinkers in Evidence-Based Medicine on Twitter” - At the risk of offending the new owner of Twitter, I hope that mention is not meant to point to those Twitter sources as supportive evidence of validity. Much bad methodologic advice became enforced practice simply because the advice came from those celebrated as “leading thinkers” by an academic faction with much power over publication - a faction which then ignored and sometimes suppressed warnings from method critics and even the method originators. Witness again the history and ongoing saga of significance testing.
  • It would seem that the Norwegian Health Authorities (like most) ought to rethink how they do their models, because (again) in practice no measure has valid evidence of stability when health effects are present, and tests of stability have too little sensitivity (power) to reliably detect that problem. Thus stability assumptions can only be justified as devices for average-loss reduction via statistical stabilization. Indeed, perhaps claims of measure stability (whether for OR or RR) represent only a misinterpretation of statistical stabilization, much the way clinical and statistical significance gets confused.
  • “it will be necessary for a computer to provide individual-level predictions for every patient”: This necessity goes back at least to the dawn of biostatistical computer science in the 1960s with Cornfield’s work on the application of logistic regression to clinical prediction. It soon achieved some level of operationalization using cardiovascular risk scores generated from the Framingham Study. This was a primitive form of machine learning, the only form practical then given the extraordinarily limited computers of the day (which had less power and far less speed than our smartphones, and which had to be laboriously programmed by computer specialists using punch cards). Its routine adoption today is enabled by the fact that any physician could obtain these predictions online in an instant given access to packaged prediction functions and software to transfer patient information into the functions.
  • “so that the doctor’s role will be to tell the patient what the AI predicted, presumably without being able to give much insight about why the machine made that prediction”. From what I have seen, I strongly doubt that most medical practitioners have ever been able to provide much valid insight about their detailed prognostications, especially with regard to critical analysis of the literature and methods from which the predictions originated. We should hope that personalized risk scores represent a more reliable synthesis of available information than the raw clinical intuitions they supplement or replace, but of course that needs empirical study in each application.
  • Regarding mechanistic models, I used to argue similarly and still agree that working through toy models is an instructive academic exercise (e.g., to show why logistic risk models are “natural” only in a statistical sense rather than a biomedical one). Nonetheless, slowly, through experience and observation, I came to agree with those who argued that toy models are also potentially misleading because the underlying biology and study complexities are usually far too extensive for such exercises to provide more than speculations about actual disease mechanisms. There are exceptions of course (e.g., studies of the age distribution of carcinomas) so I would not discourage the exercise in basic teaching, as long as it is also explained why it is not so helpful for routine use in real analyses, and hazardous to the extent it invites presentation of mechanistic speculations as if those were reliable inferences.
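
The odds-form conversion described in the first footnote above (O1 = OR(O0), then R1 = O1/(1+O1)) can be written out as a minimal Python sketch, next to the direct conversion R1 = RR(R0). The function names and the numeric inputs are hypothetical illustrations, not values from any study, and the sketch says nothing about how the OR or RR was estimated. It also shows one well-known arithmetic difference between the two routes: the odds route always returns a risk below 1, while RR(R0) can exceed 1 when the baseline risk is high.

```python
def risk_via_odds_ratio(baseline_risk: float, odds_ratio: float) -> float:
    """Treated risk R1 from baseline risk R0 and an odds ratio OR.

    Follows the odds form: O0 = R0/S0 with S0 = 1 - R0,
    O1 = OR * O0, and finally R1 = O1 / (1 + O1).
    """
    if not 0.0 <= baseline_risk < 1.0:
        raise ValueError("baseline_risk must lie in [0, 1)")
    if odds_ratio < 0.0:
        raise ValueError("odds_ratio must be non-negative")
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = odds_ratio * baseline_odds
    return treated_odds / (1.0 + treated_odds)


def risk_via_risk_ratio(baseline_risk: float, risk_ratio: float) -> float:
    """Treated risk from the direct conversion R1 = RR * R0.

    Nothing here keeps the result inside [0, 1].
    """
    if not 0.0 <= baseline_risk <= 1.0:
        raise ValueError("baseline_risk must lie in [0, 1]")
    if risk_ratio < 0.0:
        raise ValueError("risk_ratio must be non-negative")
    return risk_ratio * baseline_risk


# Hypothetical inputs, chosen only to illustrate the arithmetic.
for r0 in (0.2, 0.6):
    via_or = risk_via_odds_ratio(r0, odds_ratio=2.0)
    via_rr = risk_via_risk_ratio(r0, risk_ratio=2.0)
    print(f"R0={r0:.1f}  R1 via OR=2: {via_or:.3f}  R1 via RR=2: {via_rr:.3f}")
```

With these hypothetical inputs, a baseline risk of 0.2 and OR = 2 give baseline odds 0.25, treated odds 0.5, and a treated risk of 1/3. At a baseline risk of 0.6, the direct conversion with RR = 2 returns 1.2, which is not a valid risk.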

Thank you Sander. If I can make just a few additional points, I want to point out that by necessity, the cost-effectiveness analyses rely on summary information extracted from the published literature. If we want to make them invariant to parameter choice, I believe the analysts would require access to individual-level data. That is not realistic in the current situation: it would require not just a change to the methodology and workflow of the cost-effectiveness analysts themselves, but a large-scale disruption to how scientific information is made available, and to the privacy laws that make access difficult.

When I refer to leading EBM thinkers, I am not invoking their views as an appeal to authority; I don’t even always agree with their preference for the relative risk. Rather, I am trying to establish that their views reflect the social consensus among the subset of experts whose views are relevant to clinical medicine: The committees that write clinical practice guidelines essentially have no choice other than deferring to their methodological recommendations, and individual clinicians have no choice other than deferring to the clinical practice guidelines. In many countries including Norway (and I believe the UK?), this is essentially official government policy: large teams of analysts are employed in the public sector to analyze data and make clinical recommendations, and these teams would always defer to the methodological consensus among people like Guyatt and Peto.

For this reason, the role that the EBM framework plays within medicine is almost comparable to the standard model in physics (I am not making the claim that EBM is as well justified as the standard model, but rather that it is almost as well established among thought leaders within the academic discipline, and that it has implications that are almost as wide-reaching). It is the framework we teach in medical schools, and we expect clinicians to rely on it in the practice of medicine. Any update to the standard model would be practically important, even if the possibility remains that it will one day be made irrelevant by a unified field theory.

To give a somewhat exaggerated and self-aggrandizing metaphor for how this discussion feels from my perspective: I believe that I am providing a significant patch to the standard model. This patch would lead to data analyses that are better justified and correspond better to reality. It would enable users to have better intuition about what their models are doing, and hopefully lead to better clinical decisions without having to change any other aspects of how EBM is practiced.

The response has been negative on two fronts: Those who are operating within the standard model often refuse to acknowledge that the argument is even coherent, perhaps because they are too unused to thinking in counterfactual terms. Whereas some who are operating outside the standard model consider my patch to be methodologically correct (in the sense that the underlying mechanism is valid if not reflective of reality, and the conclusions follow from the premises) but irrelevant, because they have a strong sense that it simplifies reality too much, and because they hope that the unified field theory that they are working on will be able to solve the underlying problem without having to make such simplifications.

Perhaps they are right that we are close to a clean break from having to rely on any effect measures. But if so, I feel fairly certain that this would require a massive change to the basics of how we think about medicine. In particular, I think that in this hypothetical future, anyone who makes a decision would need access to individual-level trial data, or at least to an artificial intelligence that has access to such individual data and can use it to make individualized predictions.

The debate and dialog here seem to have died down compared to the earlier threads touching on this topic. Perhaps that indicates some convergence, even if not complete agreement on details.

First, I should perhaps note that I was never a reviewer for any of your (Anders’) papers. From what I have seen of them, had I been a reviewer I would have approved publication, even though we are not perfectly aligned on all details.

Second, I don’t believe all-encompassing methodologic theories are realistic goals for soft fields like health and medical science (by “soft” I mean lacking in fundamentally reliable, general, precise and accurate laws based on vast corroborating data). Even less do I think sweeping philosophies or theories can encompass all of scientific methodology or ensure progress. While some methodology may be sufficient for some specific task or even be usually necessary as part of larger strategies, claims or insinuations of universal sufficiency or completeness are in my mind antiscientific.

Instead, at best we get some tools whose limitations are as important to understand as their proper use and abilities. At worst we get promotion of narrow toolkits as if they are applicable and must be applied to everything. That leads to disasters like the naive frequentism that warped soft sciences in the second half of the 20th century and continues today in many quarters. I say that while also thinking that naive Bayesianism is as bad and would have done similar damage had it been in power instead of being the underdog. And even if frequentism and Bayesianism are fused together (as many, myself included, argue can and should be done) that syncretism is far from sufficient, as witnessed by the need for causal extension of all statistical models.

Third, you said “I think that in this hypothetical future, anyone who makes a decision would need access to individual-level trial data, or at least to an artificial intelligence that has access to such individual data and can use it to make individualized predictions.” Again, that’s already been happening for generations: All exercises in statistical prediction are exercises in artificial intelligence, to the extent we let general programmed algorithms take in the data and allow our decisions to follow the outputs. Without realizing it perhaps, “statistical inference” is about allowing an artificial algorithm to at least partially judge for us who is high or low risk, or a good or poor treatment candidate, etc. Modern “machine-learning” algorithms are simply based on more complex (usually nonparametric) models than are traditional parametric and semi-parametric models. And yet those models will still fail to address unmodeled bias sources, and thus mislead us whenever we rely on them without recognizing these omissions. This is a limitation and hazard of all modeling exercises.

There have been many tasks in which statistical prediction models developed from individual data have been used to provide individual risks. This is done routinely for example in predicting credit default, consumer purchases, and now also prescription adherence. That clinical prediction seems slower on the uptake might be attributed to certain social factors that dominate health care and health research, rather than intrinsic difficulties of individualized prediction. Not that individualized prediction will entirely replace all use of measures of association and effect; but were I investing in the growth and successes of these approaches in the future, my money would be on individualized prediction.