My premise is that you, Senn, and Frank are in broad agreement that the logistic model is useful in a wide range of scenarios; not that any of them agree with Doi, although they might be more sympathetic to using the OR for modelling and doing the statistics than for presenting the results.

Would the following be a reasonable summary of your position? I posted this above:

So this dispute seems related to context, with RCT statisticians having fewer problems with OR, but epidemiologists finding them troublesome unless used with care.

There is more heat than light in this thread when it comes to mapping the goals of the analysis and the context to the mathematics. This is what I’m trying to get at.

The closest was @f2harrell when he mentioned a pragmatic attempt at non-additivity via \chi^2 for competing models:

Blockquote Frank: We need a measure of non-additivity. If overfitting were not a problem such a measure might be the likelihood ratio χ2 statistic for an additive model divided by the χ2 for a model that includes all interactions with the exposure variable.

It seems most of the problems of OR can be avoided if used with care.
I’ll post later and break this down into a more explicit decision analysis matrix if necessary.

Okay, I get your point, so I will not address this quantitatively, as that has been done in that post. As you are a physician, let me ask a slightly unrelated question. We have a patient with diabetes insipidus (lacks ADH), and thus serum Na is elevated and serum osmolality is elevated, say 310. If we give the patient 0.9% NaCl (osmolality about 280), should that not bring the serum osmolality down (say, to between 280 and 310)? If not, why not?

Addendum: There cannot be broad agreement on logistic regression and disagreement with Doi at the same time - in other words if the main issue is non-collapsibility then why not broad agreement on log-risk and if that fails to converge then other options like modified Poisson?

I’d convert your
“Do the statistics on the (additive) log odds scale, but convert baseline risks to the probability scale”
to something more like
“Do the modeling on the log odds (logit) scale, but convert the final results to the probability scale.”
because we shouldn’t want to limit the logit model to additive or linear (for example “machine-learning” logit models may become arbitrarily nonadditive and nonlinear). The use of the logit link is strictly a technical move to ensure proper range restriction of risks, variation independence of parameters, well-behaved fitting algorithms, and optimal calibration.

I think the problem is that some statisticians seem to think the very desirable technical advantages of the logit link lead to some broader causal or utilitarian argument for using odds ratios as final measures, which is simply a mistake. My impression is that, having seen some statisticians messing up causal inference with idiocies like significance (p<0.05) selection of confounders back in the 1970s, epidemiologic culture became more skeptical of statistical culture than did clinical culture. Some epidemiologists recognized then that not only should we discard “significance” but we should separate choice of the smoothing model from final summarization; the latter should be dictated by the utilities and goals of the context, nonparametrically, not by the statistical model. Here are a few of my writings on the topic since 1979:
Greenland S. Summarization, smoothing, and inference. Scand J Soc Med 1993;21:227–232.
Greenland S. Smoothing observational data: a philosophy and implementation for the health sciences. Int Stat Rev 2006;74:31–46.
As usual if you want to look them over and can’t access them online, e-mail me and I’ll send you a PDF. See also p. 438-442 in Ch. 21 of Modern Epidemiology 3rd ed. 2008.

I see the argument you are trying to make and it helps me understand better the source of the confusion. The diabetes insipidus numbers are explainable by causal mechanisms. They are not related to the problem of non-collapsibility.

In a similar manner I’ve anecdotally discussed non-collapsibility with outstanding mathematical statisticians who regularly publish wonderful methodological work in top statistical journals yet have never heard of the term “non-collapsible” before. This is a good example of how siloed some of these discussions can be. What I found to work in these situations is using the term “Simpson’s paradox”. But even then the first reaction is (similar to the diabetes insipidus example) to think of the Simpson’s paradox versions that can be explained causally by confounding. Non-collapsibility though is different than confounding. @AndersHuitfeldt’s point, which I agree with, is that he is unable to reconcile this with any causal mechanisms found in the natural world.

Any thoughts on this post by @f2harrell on the value of building up marginal effect estimates from reasonably adjusted conditional ones?

I’m somewhat surprised this issue hasn’t been approached from an information theoretic or Bayesian decision theoretic experimental design perspective.

Awhile back, I posted a number of links that describe Bayesian experimental design. In that framework, maximizing expected utility is equivalent to maximizing the information from the experiment.

It seems clear to me that the methods described in BBR and RMS entail maximizing information from the entire design, collection, and analysis.

It also seems clear that observational research generally has more noise in the channel, and must take more effort at increasing the information rate.

I don’t see why it isn’t possible to borrow the framework of information theory to evaluate the channel capacity of the various proposals, and then rank them in terms of information rate and ease of implementation.

Great points. I find @f2harrell’s approaches to be elegantly practical and well-thought. My statistical methodology work has been tremendously influenced by his philosophy. This includes an emphasis on risk modification in clinical decision-making that we have incorporated into our utility-based framework. However, there are aspects of research into what we can perhaps call effect modification where non-collapsibility is problematic. The table discussed above makes that point clear. And of course challenges with transportability, meta-analysis etc. It helps to discuss and be mindful of these issues.

I agree with you and indeed this has been my struggle too over the years! Coming to your last point, perhaps causal inference experts are trying to imagine theoretical ideas of mechanisms for the natural world rather than infer those mechanisms from observation of the natural world. The reason the latter is very difficult to do is that the mechanisms in the natural world are obscured by random and systematic error and thus we end up with this discussion since we are unsure whether probabilities in the real world follow a particular pattern. This is not new and Ronald Fisher has said “it has often been imagined that probabilities a priori can be set up axiomatically [i.e. in a way that seems obviously true]; this has led to many inconsistencies and paradoxes…”

Any regression model is a claim that the natural world follows certain patterns. The saturated, non-parametric analysis will be much too high dimensional, so statisticians use models as a dimension-reduction strategy. Dimension-reduction will require assumptions about how the natural world works. Those assumptions can be good or bad. If they are bad, the model will lead to bad predictions.

The assumptions are always there, regardless of whether you acknowledge them or not. By using a logistic model, you are choosing to analyze the data as if you have biological reasons for believing that the odds ratio is stable between strata. If that is true, you will get the right results. If it is not true, you will get the wrong results. The assumption is not empirically tested by the estimation procedure, which just gives you the best fit given the assumption that the odds ratio is stable.

I agree that finding the correct mechanisms is difficult (often impossible), and may not even lead to stability of any identifiable effect measure. What I don’t agree with is your implicit next step: That therefore, we should just give up, and proceed as if we knew that the true biological mechanism is one that leads to stability of the odds ratio. You don’t have a solution to the problem that the mechanism is unknowable, you just want to sweep the problem under the rug and pretend you know that the true mechanism leads to stability of the odds ratio.

Finally, let me point out something that may or may not be obvious: Suppose a decision maker is choosing between option A and option B. Any regression model that is used to inform the decision by providing a prediction for what happens under option A, and a prediction for what happens under option B, is being used as a causal model. It will just be a hopelessly bad causal model unless it is constructed in accordance with principles of causal inference and in accordance with your best understanding of the mechanism/data generating function. You don’t get to use a model for decision making and then claim you aren’t doing causal inference and therefore don’t need to worry about causality or mechanisms.

Great thoughts. Our understanding of causal mechanisms has been and likely always will be subjective. But we can still make plausible causal assumptions. As a medical example, take Figure 3 from our article here. It is pretty clear from our current understanding of oncology that it is not the “number of metastases” that causes the kidney cancer histology. It is the opposite: more aggressive histologies are more likely to produce metastases.

The quote by Fisher has wisdom in warning us about the challenges of prior elicitation and use. This is in many ways orthogonal to the current discussion. I use Bayesian models (priors and hyperpriors) and utilities in many of my apps (the ones that are actually most congruent with @f2harrell’s philosophy) and find them to be quite efficient and valuable. Relevant to the current discussion, the null hypothesis often used to generate the p-values that Fisher popularized, is rarely true in nature. Even more so, the more global null hypothesis assumption used during multiplicity correction (e.g., when looking at multiple gene pathways) is extremely implausible. But p-values and their multiplicity corrections have been successfully used in many scientific applications.

I acknowledge that non-collapsibility is even more unnatural (to the point of appearing more consistent with magic than with the natural world). This is why some people have such a viscerally strong reaction against non-collapsible metrics. The fact that it is a mathematical property of certain metrics we have found to be useful does not make non-collapsibility a property of nature. A good parallel is infinity. Infinity comes up in the formulas of many valuable equations successfully used in the natural sciences. But as far as we can tell to date, infinity does not actually exist in nature. Every time we get infinity in a calculation we know that the story is incomplete.

On the broad discussion, first of all I want to say that this is a truly excellent discussion that we should tweet out from time to time. Second, I want to return to the pragmatic approach. First, a few background observations:

If there is only one variable in need of consideration and that variable represents mutually exclusive categories, all link functions (logit, probit, etc.) will yield the same estimated category probabilities.

If there is only one variable and it is continuous, fitting the variable flexibly (e.g., using a cubic spline with 4-5 knots) will result in virtually the same estimated probabilities no matter what the link function.

Things get more interesting when there is > 1 predictor.

For the > 1 predictor case I think pragmatically about what is a good default starting point. I don’t think for a minute that a default approach will work for all situations/datasets. Data analysis is hard and prior knowledge about data patterns is not frequently available. So we need help. If a particular approach worked best in 8 out of 10 representative datasets I would continue to use it in many cases as a default approach. So what do we mean by “worked best?” Here is the way I think about that. Ignoring nonlinearity for the moment, I would choose 10 moderately large “representative” datasets and four or more link functions such as the identity function (additive risk model), logit (logistic model), log probability (relative risk model), and probit. For each dataset and each link function compute A = likelihood ratio \chi^2 statistic for a model containing all the predictors and all their two-way interactions, and B = the same but omitting all interactions. Compute the ratio B/A, the so-called “adequacy index” in the maximum likelihood chapter of Regression Modeling Strategies. B/A is the relative extent to which interaction terms were needed to explain outcome variation (high values indicating less need for interactions). Display B/A ratios for all datasets and all link functions. See what you learn. I postulate that link functions that carry no restrictions, i.e., logit and probit, will fare the best most of the time. I seek models in which additivity tends to explain the great bulk of outcome variation.
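For anyone who wants to try the B/A comparison, here is a minimal sketch in plain Python. It is not the RMS implementation: the grouped sex-by-treatment counts are hypothetical, the Fisher-scoring fitter is a bare-bones stand-in for a proper GLM routine, and only the logit, probit, and identity links are included (the log link is left out because unguarded Fisher scoring can step outside the 0-1 range). With two binary predictors the interaction model is saturated, so A is the same for every link and only B differs.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

# link name -> (inverse link mu(eta), derivative dmu/deta, link eta(mu))
LINKS = {
    "logit": (lambda e: 1 / (1 + math.exp(-e)),
              lambda e: math.exp(-e) / (1 + math.exp(-e)) ** 2,
              lambda m: math.log(m / (1 - m))),
    "probit": (STD_NORMAL.cdf, STD_NORMAL.pdf, STD_NORMAL.inv_cdf),
    "identity": (lambda e: e, lambda e: 1.0, lambda m: m),
}

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def clip(p):
    """Keep fitted probabilities strictly inside (0, 1)."""
    return min(max(p, 1e-9), 1 - 1e-9)

def max_loglik(X, deaths, totals, link, iters=50):
    """Binomial log-likelihood (up to a constant) after Fisher scoring."""
    inv, dmu, g = LINKS[link]
    eta = [g(sum(deaths) / sum(totals))] * len(deaths)  # start at the null fit
    k = len(X[0])
    for _ in range(iters):
        mu = [clip(inv(e)) for e in eta]
        d = [dmu(e) for e in eta]
        w = [n * di ** 2 / (m * (1 - m)) for n, di, m in zip(totals, d, mu)]
        z = [e + (y / n - m) / di
             for e, y, n, m, di in zip(eta, deaths, totals, mu, d)]
        xtwx = [[sum(wi * x[a] * x[b] for wi, x in zip(w, X))
                 for b in range(k)] for a in range(k)]
        xtwz = [sum(wi * x[a] * zi for wi, x, zi in zip(w, X, z))
                for a in range(k)]
        beta = solve(xtwx, xtwz)
        eta = [sum(bj * xj for bj, xj in zip(beta, x)) for x in X]
    mu = [clip(inv(e)) for e in eta]
    return sum(y * math.log(m) + (n - y) * math.log(1 - m)
               for y, n, m in zip(deaths, totals, mu))

# Hypothetical grouped data: cells are (male, mask) indicator pairs.
cells = [(1, 0), (1, 1), (0, 0), (0, 1)]
deaths = [60, 90, 20, 60]
totals = [100, 100, 200, 200]

X_null = [[1] for _ in cells]
X_add = [[1, s, t] for s, t in cells]          # main effects only
X_full = [[1, s, t, s * t] for s, t in cells]  # plus the two-way interaction

def adequacy(link):
    """Return (A, B, B/A) for one link function."""
    ll0 = max_loglik(X_null, deaths, totals, link)
    A = 2 * (max_loglik(X_full, deaths, totals, link) - ll0)
    B = 2 * (max_loglik(X_add, deaths, totals, link) - ll0)
    return A, B, B / A

results = {link: adequacy(link) for link in LINKS}
for link, (A, B, ratio) in results.items():
    print(f"{link:8s} A={A:8.2f} B={B:8.2f} B/A={ratio:.4f}")
```

On real data one would use a tested GLM fitter and several data sets; the point here is only to make the definition of A, B, and B/A concrete.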

Then whatever link function you prefer, state estimates and uncertainty/compatibility intervals for the parameters and then use the model to estimate absolute risk reductions and relative risks when one predictor is changed and the rest are held at representative values.

This proposal is very similar to what @Sander wrote in his 1979 paper on logistic regression limitations:

Blockquote
A general paradigm can be derived … The necessity of adding complex terms (eg. product terms) to a multiplicative (additive) statistical model in order to achieve a “good fit” may indicate that the exposures under study do not influence the disease risk through a process following the multiplicative (additive) biological model; the form of the best fitting, parsimonious statistical model may in turn suggest a form of the biological model in action… Results of such generalized model searches may be combined with physiologic observations and data from animal experiments to test old and generate new models…

I also don’t understand this preference for “Marginal” models over conditional ones. Direct reference to the Nelder paper I mentioned above:

Blockquote
An advantage of a random effects model is they allow conditional inferences in addition to marginal inferences.… The conditional model is a basic model, which leads to a specific marginal model.

Anders, there is a point of correction or much greater precision needed here: You should say “By using a main-effect only (first-order) logistic model…”. Remember that any good modeler with enough data will consider product terms (“interactions” or second-order terms) that will allow the odds ratio to vary with covariates and may also investigate which link function provides the simplest linear term, as Frank discusses in his comment. Then too, with our modern computing capabilities we can use “machine learners” that start with a highly saturated model and penalize for complexity in a manner that need not shrink toward constancy of the odds ratio.

We need more attention to the distinction between the statistical task of smoothing or data stabilization (which arises because we have noisy, finite data) and the broader task of causal explanation that is part of the search for patterns stable enough for generalization to settings beyond the one that generated the data under analysis, which is what decisions demand (we make decisions for patients that were not in the study). Causal considerations are important for both tasks but become central for decisions.

I agree with this, I definitely should have been clearer. I am not sure this is central to the point I was trying to make, but the sentence on its own should have been clearer.

I prefer to think of the need for interaction terms as arising from the fact that you can often make a more plausible argument for conditional effect measure homogeneity (conditional on some subset of covariates in the model) than for complete effect measure homogeneity. Conditional effect measure homogeneity is still a homogeneity assumption, and subject to the same kind of reasoning about whether the biology is consistent with the choice of link function.

Here is what I mean by that: Suppose you have a model with predictors X, Y and Z, and a product term is needed between X and Z. You can parametrize the logistic model with separate odds ratio parameters for the effect of Z conditional on X=0, and the effect of Z conditional on X=1. This gives your model a structure where the effect of Z conditional on (X=0, Y=0) is equal to the effect of Z conditional on (X=0, Y=1) on the odds ratio scale, and that the effect of Z conditional on (X=1, Y=0) is equal to the effect of Z conditional on (X=1, Y=1) on the odds ratio scale.

This is not the standard parametrization, but the parameters in the standard parametrization are just deterministic functions of the parameters in my parametrization. The advantage of this parametrization is that it makes it easier to appreciate the nature of the conditional effect homogeneity assumptions made by a model with interactions.
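Written out, the two parametrizations might look as follows. This is a sketch under stated assumptions: X and Z are taken to be binary, and the outcome symbol D and the coefficient names are introduced here only for illustration.

$$
\begin{aligned}
\text{standard:}\quad & \operatorname{logit} P(D=1 \mid X,Y,Z) = \beta_0 + \beta_1 X + \beta_2 Y + \beta_3 Z + \beta_4 XZ \\
\text{alternative:}\quad & \operatorname{logit} P(D=1 \mid X,Y,Z) = \beta_0 + \beta_1 X + \beta_2 Y + \gamma_0 (1-X) Z + \gamma_1 XZ \\
\text{with}\quad & \gamma_0 = \beta_3, \qquad \gamma_1 = \beta_3 + \beta_4 \\
\text{so that}\quad & \mathrm{OR}_{Z \mid X=0} = e^{\gamma_0}, \qquad \mathrm{OR}_{Z \mid X=1} = e^{\gamma_1}
\end{aligned}
$$

Neither odds ratio depends on Y, which is the conditional homogeneity assumption that remains in the model even after the product term is added.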

In a few months at most, I will share a new preprint with significantly more detail about my thoughts on regression models. I look forward to sharing that manuscript with this forum; hopefully it will make my perspective clearer to everyone.

This is for me a strange question given that there are now 40 years of articles trying to explain why it is bad. Even more strange is that this is a very subtle numeric phenomenon, yet in my experience it seems more easily appreciated as a problem by clinicians than by statisticians.

I’ll give an illustrative example based on the math in Greenland (“Interpretation and choice of effect measures in epidemiologic analysis”. Am J Epidemiol 1987;125:761–8): Suppose (quite hypothetically and simplified of course) an attending physician at an ICU has just been given the abstract from a large RCT done on a random sample of all covid ICU admissions, with the treatment being oxygen via mask alone vs. intubation+sedation, and the outcome being 60-day mortality. The abstract says the average age of participants was 65, that 38% of them died within 60 days, and the observed overall odds ratio was 2.75, but results are not given for subgroups.

The physician is then confronted with a new covid ICU admission of a 65-year-old man accompanied by his wife, who is concerned about the long-term cognitive effects of intubation+sedation and asks for mask alone. Should the physician tell her that this will increase the patient’s odds of death by a factor of 2.75? If so, that would be wrong because the risk of death varies tremendously across those known covariates, and consequently the increase in odds is likely far higher than 2.75 for this and every single specific patient! But given only the OR, the physician would be unable to say anything more precise such as how much higher.

In contrast, given that the RD was 23%, the physician could at least say that on average the individual increases were 23%, and so if this patient is not visibly atypical of trial participants then 23% would be a reasonable expectation for the risk increase from refusing intubation. In this way, the OR is much less informative about subgroups and individuals than is the RD.

Frank: Why are you taking ratios of LR statistics? They are already logs of the LRs so have been transformed from the ratio scale of the likelihoods to the additive scale of the loglikelihoods. The A-B loglikelihood difference D is the Kullback-Leibler information divergence of B from A (a measure of how much information B loses compared to A); 2D is an LR statistic (with df equal to the number of product terms in A) which provides a P-value for testing B against A.

I can compare the D, their P-values, and their S-values (surprisals) -log2(p) across links to get comparisons scaled in ways I am familiar with from 70 years of literature on model checking and selection. What is added by the B/A ratio? What does it mean in frequentist, Bayesian, likelihoodist or information terms?
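To make the quantities in this comparison concrete, here is a small sketch that turns a loglikelihood difference D into the LR statistic 2D, a P-value, and an S-value. The two loglikelihood values are hypothetical, and the P-value helper assumes a single product term (1 df), where the chi-square tail has a closed form via the complementary error function.

```python
import math

def lr_p_value_1df(stat):
    """Upper-tail P-value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(stat / 2))

def s_value(p):
    """Surprisal in bits: -log2(p)."""
    return -math.log2(p)

# Hypothetical maximized loglikelihoods for models A (with the product
# term) and B (without it).
loglik_a = -250.0
loglik_b = -252.5

d = loglik_a - loglik_b      # loglikelihood difference D
lr_stat = 2 * d              # LR statistic 2D
p = lr_p_value_1df(lr_stat)
s = s_value(p)

print(f"D={d:.2f}  2D={lr_stat:.2f}  P={p:.4f}  S={s:.2f} bits")
```

For more than one product term the 1-df shortcut no longer applies and a general chi-square tail function is needed.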

I agree with Frank – we urgently need to answer this question: What about non-collapsibility is bad?

Let’s take the same example given by @Sander (which is from the data in the JCE paper that I have tabulated above). The overall mortality is 38.33% and the OR is 2.75 for masks versus intubation (assuming such a trial was allowed). When a new patient 65 years old comes in and we only have this information, then all we know is that in this population, given that we have a M:F ratio of 1:2, and we refuse to know the gender of the new patient (or any other demographic information), the observed mortality odds increase 2.75-fold. The wife may not know what odds are, so a quick calculation can be made and the patient told that 26.7% of the intubated and 50% of the mask patients died and that the risk difference is 23.3% (for every 5 treated with a mask compared to intubation, 1 more patient died at 60 days) or the fold increase in risk was 1.875 (RR), which is almost two-fold. All this is correct and deduced from the OR (note all risk related information was derived solely from the OR using the equations in Doi et al), so I do not see any “fault” of the OR here. Perhaps @Sander can explain further why the OR was bad and the RD was not bad so we understand what we are missing here. In particular, please explain why the OR is far higher for this and every single specific patient but the RD was okay even if it is a direct derivation from the OR?
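The arithmetic in this post can be checked with a few lines. This is a sketch, not the equations from Doi et al: it uses the standard odds-to-risk conversion, and it assumes the quoted 26.7% intubation mortality is exactly 4/15. Note that the conversion takes a baseline risk as an input in addition to the OR.

```python
def risk_from_or(baseline_risk, odds_ratio):
    """Risk implied by applying an odds ratio to a baseline risk."""
    odds = baseline_risk / (1 - baseline_risk) * odds_ratio
    return odds / (1 + odds)

p_intubation = 4 / 15                      # assumed exact value behind "26.7%"
p_mask = risk_from_or(p_intubation, 2.75)  # risk under mask alone

risk_difference = p_mask - p_intubation
risk_ratio = p_mask / p_intubation
number_needed_to_harm = 1 / risk_difference

print(f"mask risk = {p_mask:.3f}")
print(f"RD = {risk_difference:.3f}, RR = {risk_ratio:.3f}, "
      f"NNH = {number_needed_to_harm:.2f}")
```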

I suspect @Sander will say that in males the OR is increased 6 fold for masks, so if we had this information it is not congruent with the earlier 2.75 increase in odds overall. But why is that a problem? The male gender is a very strong risk factor for death and if we refuse to know how males were distributed at baseline would we not expect the discrimination of deaths by mask to weaken? Again from the OR of 6 (Doi et al), I can tell the wife that death risk went from 60% to 90%, RD was 30% and for every 4 male patients treated with a mask compared to intubation, 1 more patient died at 60 days. This is a risk increase of 1.5 fold. Note the earlier risk increase was 1.875 fold. Obviously the RR dropped when the OR, NNT and RD worsened and given that the OR is a transformation of the C statistic we can clearly deduce that there is no monotonic relationship of the RR with death discrimination.

Your response is remarkable for slipping in information where none was given: Nowhere did I state that the M:F ratio was 1:2 or that the male OR was 6, nor was any other sex specific information given or needed to make the point. Now in the example from which I took the numbers, the male OR was 6.0 and the female OR was 3.9 - both much higher than the overall causal OR of 2.75. Which only further proves the point: Without specific information about prognostic factors in the trial population, we cannot from the total OR alone provide a specific bet or even a betting range for the impact of treatment on the new patient’s odds or risk, even if that patient is a typical patient from the same population on which the trial was conducted. In sharp contrast, the RD alone provides a coherent bet for patient’s absolute risk increase.
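The numbers being argued over can be reproduced in a short sketch. The male risks (60% and 90%), the 1:2 M:F ratio, and the overall risks come from the posts above; the female risks (10% and 30%) are back-calculated from those figures under the assumption that sex is balanced across arms, so treat them as illustrative.

```python
def odds(p):
    return p / (1 - p)

def odds_ratio(p_exposed, p_reference):
    return odds(p_exposed) / odds(p_reference)

# weight = share of the trial population in each sex stratum (M:F = 1:2)
strata = {
    "male":   {"weight": 1 / 3, "intubation": 0.60, "mask": 0.90},
    "female": {"weight": 2 / 3, "intubation": 0.10, "mask": 0.30},  # back-calculated
}

# Marginal (whole-population) risk in each arm
marginal = {
    arm: sum(s["weight"] * s[arm] for s in strata.values())
    for arm in ("intubation", "mask")
}

stratum_or = {name: odds_ratio(s["mask"], s["intubation"])
              for name, s in strata.items()}
stratum_rd = {name: s["mask"] - s["intubation"] for name, s in strata.items()}

marginal_or = odds_ratio(marginal["mask"], marginal["intubation"])
marginal_rd = marginal["mask"] - marginal["intubation"]
weighted_mean_rd = sum(s["weight"] * stratum_rd[name]
                       for name, s in strata.items())

print("stratum ORs:", {k: round(v, 2) for k, v in stratum_or.items()})
print("marginal OR:", round(marginal_or, 2))
print("stratum RDs:", {k: round(v, 2) for k, v in stratum_rd.items()})
print("marginal RD:", round(marginal_rd, 4),
      "weighted mean of stratum RDs:", round(weighted_mean_rd, 4))
```

Both stratum ORs sit above the marginal OR even though sex is balanced across arms, while the marginal RD equals the weighted average of the stratum RDs.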

If you don’t like the betting formulation, read the stochastic potential-outcome formulation in my 1987 AJE paper, where I show that every single individual can have an OR higher than the group OR (rendering the group OR misleading when applied to individuals) whereas this defect is not present with RRs or RDs, and group RDs are always simple averages of individual RDs. Read also Greenland Robins Pearl (“Confounding and collapsibility in causal inference”. Statistical Science 1999;14:29–46.) [For completeness I add again that hazard rate ratios suffer the problem, but in practice to a much lesser extent than ORs do.]

Again, the problem with ORs has been obvious to many people and explained in many ways over some 40 years, and does not require deep math to see; simple numeric examples suffice. So, as I said way before in this long thread, I cannot ascribe your continuing failure to see it and your responses (which as Anders noted keep slipping in assumptions or irrelevancies) to anything other than a cognitive blockage. Again, I suspect that blockage arises from finding it impossible to believe that you overlooked something so obvious, plus a commitment to defend ORs as something special beyond the technical advantages (which we all recognize). Those technical advantages are countered by the amply documented interpretational disadvantage of severe noncollapsibility, which in the end makes the OR a classic example of one of the most fundamental laws of methodology: The NFLP (No-Free-Lunch Principle).

I read the first of Sander’s two new J Clin Epi papers, and like the replies above this paper comes across as “non-collapsibility is bad because it is bad”. In the medical example there is nothing bad about non-collapsibility. What is bad is that the paper the clinician is quoting did not provide a patient-specific effect estimate. If we start by treating “non-collapsibility” as pejorative we tend to end a discussion with it being pejorative.

In the history of development of methods for a 2x2 table the assumption of homogeneity is made (e.g., probability of outcome of everyone on treatment i is p_i). The simplest forms of risk differences, risk ratios, and odds ratios were designed to apply to this situation, not to the case where there is within-group outcome heterogeneity. So in my personal view, indexes such as unadjusted odds ratios were designed for the homogeneous case. In the face of heterogeneity they are far less useful.

For the current purpose I could have used the classic likelihood ratio \chi^2 statistic \Delta = A-B. That statistic is proportional to the sample size. I am seeking an index that is sample-size independent. B/A is that. Another approach might be the difference in pseudo R^2; one of the terms would be \exp(-\frac{A}{n}). One could also argue for \exp(-\frac{\Delta}{n}) or just \frac{\Delta}{n} but interpretation depends on the nature of Y. For example, in time-to-event analysis the literature has shown us that the proper n to use in the denominator is somewhere between the number of events and the number of subjects.
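A sketch of the candidate indexes side by side, with A and B the LR \chi^2 statistics for the models with and without interactions as defined earlier in the thread. The pseudo R^2 form 1 - \exp(-\mathrm{LR}/n) is an assumption about which pseudo R^2 is meant.

$$
\begin{aligned}
\Delta &= A - B \\
\text{adequacy index} &= B / A \\
R^2_A - R^2_B &= \left[1 - \exp\!\left(-\tfrac{A}{n}\right)\right] - \left[1 - \exp\!\left(-\tfrac{B}{n}\right)\right] = \exp\!\left(-\tfrac{B}{n}\right) - \exp\!\left(-\tfrac{A}{n}\right)
\end{aligned}
$$

The sample-size behaviour follows from the loglikelihood being a sum over observations: if every cell count is multiplied by k while the observed proportions stay fixed, the fitted probabilities are unchanged and

$$
A \to kA, \qquad B \to kB, \qquad \Delta \to k\Delta, \qquad \frac{B}{A} \to \frac{B}{A}, \qquad \frac{\Delta}{n} \to \frac{\Delta}{n}.
$$

So the ratio of the two statistics (B/A or its reciprocal) and \Delta/n are both free of n in this idealized sense, whereas \Delta itself grows linearly with it.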