Take for example the ORs in the table from @Sander that you posted above. If you prefer, replace Male and Female with any biological biomarker. The behavior of the subgroup ORs compared with the whole-population OR is very problematic when we are looking at the data trying to understand how "gender" (or a "biomarker") interacts with a therapy. It is less of a problem for those who are aware of these mathematical artifacts, but it is still an issue.
I suspect that part of what is fueling these arguments is a difference in focus and domains of interest. There is an aspect of me that genuinely appreciates ORs and HRs for what they are. But non-collapsibility can seem more like magic, in the sense that the whole is not the sum of its parts. It acts like a strongly emergent phenomenon, inconsistent with what we have observed in the natural world to date.
Sorry, but I see little agreement overall, and like Huitfeldt I regard Doi's ongoing responses as deeply confused about what the math does and does not mean. That I prefer logistic over log-risk regression does not at all mean I like odds ratios as measures of association or effect! In particular I have many articles going all the way back to 1979 criticizing odds ratios and advocating conversion of logistic results to risks or rates. For example, from the discussion section of Greenland Am J Epidemiol 1979, Limitations of the Logistic Analysis of Epidemiologic Data:
“The major analytic step for the epidemiologist is explanatory rather than predictive: what is sought is an explanation of the observed relationships in terms of various sources of bias in the study structure and in terms of causation in an underlying biological structure. When the data are limited or when our biological theories are vague, the features of the data requiring epidemiologic explanation may best be represented by direct comparative displays of the [logistic] smoothed rates, risk ratios, or risk differences. Attempts to represent the same features using direct display of the estimated parameters of an arbitrary statistical model can lead to the interpretational difficulties discussed [above].”
My point is that the present controversy is very old and about both the math and its interpretation. Doi’s “proof” is irrelevant not because of any math error but because there is no such thing as “the” (singular, as he uses it) strength of an association any more than there is “the” singular measure of the strength of an economy or any other multifaceted phenomenon. At a minimum (for two binary variables alone) an association involves two parameters and thus two dimensions; once we bring in covariates and multiple-level variables the number of dimensions explodes exponentially. It’s been known since at least the 19th century that you cannot capture multidimensional relations with one scalar without introducing simplifications that are absurd for our purposes. This problem applies no matter what is being touted as “the” association, whether OR, RR, RD, AUC, etc.
Then, when we bring in causality we must open up another dimension of complexity through potential outcomes or some equally rich representation. When we bring in utilities (loss functions) it gets even more complicated (although in real public-health examples RDs look better than ratios in paralleling real utilities like caseloads). But no single measure can begin to capture all of what is going on; that is why Pearl (2000, 2009) chose to call effects the full collection of outcomes defined by his "do" operator rather than use some single comparative measure.
On top of this complexity issue, some statisticians continue to misunderstand the noncollapsibility problem of ORs, which I've already explained at great length in many articles and posts going back to 1986, and which Myra Samuels explained in Biometrika 1981 (showing exactly why and how ORs are noncollapsible: an inevitable consequence of comparing differently weighted average odds). And yet some continue to exhibit no comprehension of what has been written by me, my coauthors, and other critics of odds ratios (such as Altman, Hernan, Pearl, Robins, Rubin, etc.).
Sometimes we have to face that once someone has published a fundamentally mistaken article (as did Doi et al.), they may nonetheless stand by it to the grave, invoking endless specious arguments in its defense. This happened with some "greats" of physics and statistics, including Sir Harold Jeffreys and Sir Ronald Fisher, and continues today in topics of greater breadth than this one. We lesser folk cannot expect better of our peers and can only hope to do better ourselves (or get lucky in not falling into blunders and publishing them).
My premise is that you, Senn, and Frank are in broad agreement that the logistic model is useful in a wide range of scenarios; not that any of you agree with Doi, although you might be more sympathetic to using the OR for modelling and doing the statistics, versus presenting the results.
Would the following be a reasonable summary of your position? I posted this above:
So this dispute seems related to context, with RCT statisticians having fewer problems with ORs, but epidemiologists finding them troublesome unless used with care.
There is more heat than light in this thread when it comes to mapping the goals of the analysis, and the context, onto the mathematics. That mapping is what I'm trying to get at.
The closest was @f2harrell, when he mentioned a pragmatic way to assess non-additivity via \chi^2 statistics for competing models:
Frank: "We need a measure of non-additivity. If overfitting were not a problem such a measure might be the likelihood ratio \chi^2 statistic for an additive model divided by the \chi^2 for a model that includes all interactions with the exposure variable."
It seems most of the problems of OR can be avoided if used with care.
I’ll post later and break this down into a more explicit decision analysis matrix if necessary.
Okay, I get your point, so I will not address this quantitatively as that has been done in that post. As you are a physician, let me ask a slightly unrelated question. We have a patient with diabetes insipidus (lacking ADH), and thus the serum Na is elevated and the serum osmolality is elevated, say 310. If we give the patient 0.9% NaCl (osmolality about 280), should that not bring the serum osmolality down (say to somewhere between 310 and 280)? If not, why not?
Addendum: There cannot be broad agreement on logistic regression and disagreement with Doi at the same time. In other words, if the main issue is non-collapsibility, then why not broad agreement on log-risk regression, and, if that fails to converge, on other options such as modified Poisson?
I’d convert your
“Do the statistics on the (additive) log odds scale, but convert baseline risks to the probability scale”
to something more like
“Do the modeling on the log odds (logit) scale, but convert the final results to the probability scale.”
because we shouldn't want to limit the logit model to additive or linear forms (for example, "machine-learning" logit models may become arbitrarily nonadditive and nonlinear). The use of the logit link is strictly a technical move to ensure proper range restriction of risks, variation independence of parameters, well-behaved fitting algorithms, and optimal calibration.
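Here is a minimal sketch of that workflow, with the data, variable names, and coefficients all invented purely for illustration: model on the logit scale, then report the treatment contrast on the probability scale at representative covariate values.

```python
# Minimal, purely hypothetical sketch of "model on the logit scale, report on
# the probability scale". The data, variable names, and coefficients below are
# invented for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({"treat": rng.integers(0, 2, n),
                   "age": rng.uniform(40, 90, n)})
true_logit = -6 + 1.0 * df["treat"] + 0.07 * df["age"]
df["death"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

fit = smf.logit("death ~ treat + age", data=df).fit()

# Contrast the two treatment values at a representative age, on the risk scale
ref = pd.DataFrame({"treat": [0, 1], "age": [65, 65]})
risk = np.asarray(fit.predict(ref))

rd = risk[1] - risk[0]   # absolute risk difference at age 65
rr = risk[1] / risk[0]   # risk ratio at age 65
```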
I think the problem is that some statisticians seem to think the very desirable technical advantages of the logit link lead to some broader causal or utilitarian argument for using odds ratios as final measures, which is simply a mistake. My impression is that, having seen some statisticians messing up causal inference with idiocies like significance (p<0.05) selection of confounders back in the 1970s, epidemiologic culture became more skeptical of statistical culture than did clinical culture. Some epidemiologists recognized then that not only should we discard “significance” but we should separate choice of the smoothing model from final summarization; the latter should be dictated by the utilities and goals of the context, nonparametrically, not by the statistical model. Here are a few of my writings on the topic since 1979:
Greenland S. Summarization, smoothing, and inference. Scand J Soc Med 1993;21:227–232.
Greenland S. Smoothing observational data: a philosophy and implementation for the health sciences. Int Stat Rev 2006;74:31–46.
As usual if you want to look them over and can’t access them online, e-mail me and I’ll send you a PDF. See also p. 438-442 in Ch. 21 of Modern Epidemiology 3rd ed. 2008.
I see the argument you are trying to make and it helps me understand better the source of the confusion. The diabetes insipidus numbers are explainable by causal mechanisms. They are not related to the problem of non-collapsibility.
In a similar manner I've anecdotally discussed non-collapsibility with outstanding mathematical statisticians who regularly publish wonderful methodological work in top statistical journals yet have never heard the term "non-collapsible" before. This is a good example of how siloed some of these discussions can be. What I found to work in these situations is using the term "Simpson's paradox". But even then the first reaction (similar to the diabetes insipidus example) is to think of the versions of Simpson's paradox that can be explained causally by confounding. Non-collapsibility, though, is different from confounding. @AndersHuitfeldt's point, with which I agree, is that he is unable to reconcile it with any causal mechanism found in the natural world.
Any thoughts on this post by @f2harrell on the value of building up marginal effect estimates from reasonably adjusted conditional ones?
I'm somewhat surprised this issue hasn't been approached from an information-theoretic or Bayesian decision-theoretic experimental design perspective.
A while back, I posted a number of links that describe Bayesian experimental design. In that framework, maximizing expected utility is equivalent to maximizing the information from the experiment.
It seems clear to me that the methods described in BBR and RMS entail maximizing information from the entire design, collection, and analysis.
It also seems clear that observational research generally has more noise in the channel, and must put more effort into increasing the information rate.
I don’t see why it isn’t possible to borrow the framework of information theory to evaluate the channel capacity of the various proposals, and then rank them in terms of information rate and ease of implementation.
Great points. I find @f2harrell's approaches to be elegantly practical and well thought out. My statistical methodology work has been tremendously influenced by his philosophy, including an emphasis on risk modification in clinical decision-making that we have incorporated into our utility-based framework. However, there are aspects of research into what we might call effect modification where non-collapsibility is problematic. The table discussed above makes that point clear, as do the challenges with transportability, meta-analysis, etc. It helps to discuss and be mindful of these issues.
I agree with you, and indeed this has been my struggle too over the years! Coming to your last point, perhaps causal inference experts are trying to imagine theoretical mechanisms for the natural world rather than infer those mechanisms from observation of the natural world. The reason the latter is so difficult is that the mechanisms in the natural world are obscured by random and systematic error, and thus we end up with this discussion because we are unsure whether probabilities in the real world follow a particular pattern. This is not new; Ronald Fisher said, "it has often been imagined that probabilities a priori can be set up axiomatically [i.e. in a way that seems obviously true]; this has led to many inconsistencies and paradoxes…"
Any regression model is a claim that the natural world follows certain patterns. The saturated, non-parametric analysis will be much too high dimensional, so statisticians use models as a dimension-reduction strategy. Dimension-reduction will require assumptions about how the natural world works. Those assumptions can be good or bad. If they are bad, the model will lead to bad predictions.
The assumptions are always there, regardless of whether you acknowledge them or not. By using a logistic model, you are choosing to analyze the data as if you have biological reasons for believing that the odds ratio is stable between strata. If that is true, you will get the right results. If it is not true, you will get the wrong results. The assumption is not empirically tested by the estimation procedure, which just gives you the best fit given the assumption that the odds ratio is stable.
I agree that finding the correct mechanisms is difficult (often impossible), and may not even lead to stability of any identifiable effect measure. What I don’t agree with is your implicit next step: That therefore, we should just give up, and proceed as if we knew that the true biological mechanism is one that leads to stability of the odds ratio. You don’t have a solution to the problem that the mechanism is unknowable, you just want to sweep the problem under the rug and pretend you know that the true mechanism leads to stability of the odds ratio.
Finally, let me point out something that may or may not be obvious: Suppose a decision maker is choosing between option A and option B. Any regression model that is used to inform the decision by providing a prediction for what happens under option A, and a prediction for what happens under option B, is being used as a causal model. It will just be a hopelessly bad causal model unless it is constructed in accordance with principles of causal inference and with your best understanding of the mechanism/data-generating function. You don't get to use a model for decision making and then claim you aren't doing causal inference and therefore don't need to worry about causality or mechanisms.
Great thoughts. Our understanding of causal mechanisms has been, and likely always will be, subjective. But we can still make plausible causal assumptions. As a medical example, take Figure 3 from our article here. It is pretty clear from our current understanding of oncology that it is not the "number of metastases" that causes the kidney cancer histology; it is the opposite: more aggressive histologies are more likely to produce metastases.
The quote by Fisher has wisdom in warning us about the challenges of prior elicitation and use. This is in many ways orthogonal to the current discussion. I use Bayesian models (priors and hyperpriors) and utilities in many of my apps (the ones that are actually most congruent with @f2harrell's philosophy) and find them to be quite efficient and valuable. More relevant to the current discussion: the null hypothesis often used to generate the p-values that Fisher popularized is rarely true in nature. Even more so, the global null hypothesis assumed during multiplicity correction (e.g., when looking at multiple gene pathways) is extremely implausible. But p-values and their multiplicity corrections have been used successfully in many scientific applications.
I acknowledge that non-collapsibility is even more unnatural (to the point of appearing more consistent with magic than with the natural world). This is why some people have such a viscerally strong reaction against non-collapsible metrics. The fact that it is a mathematical property of certain metrics we have found to be useful does not make non-collapsibility a property of nature. A good parallel is infinity. Infinity comes up in the formulas of many valuable equations successfully used in the natural sciences. But as far as we can tell to date, infinity does not actually exist in nature. Every time we get infinity in a calculation we know that the story is incomplete.
On the broad discussion, first I want to say that this is a truly excellent discussion that we should tweet out from time to time. Second, I want to return to the pragmatic approach, starting with a few background observations:
If there is only one variable in need of consideration and that variable represents mutually exclusive categories, all link functions (logit, probit, etc.) will yield the same estimated category probabilities.
If there is only one variable and it is continuous, fitting the variable flexibly (e.g., using a cubic spline with 4-5 knots) will result in virtually the same estimated probabilities no matter what the link function.
Things get more interesting when there is > 1 predictor.
For the > 1 predictor case I think pragmatically about what is a good default starting point. I don’t think for a minute that a default approach will work for all situations/datasets. Data analysis is hard and prior knowledge about data patterns is not frequently available. So we need help. If a particular approach worked best in 8 out of 10 representative datasets I would continue to use it in many cases as a default approach. So what do we mean by “worked best?” Here is the way I think about that. Ignoring nonlinearity for the moment, I would choose 10 moderately large “representative” datasets and four or more link functions such as the identity function (additive risk model), logit (logistic model), log probability (relative risk model), and probit. For each dataset and each link function compute A = likelihood ratio \chi^2 statistic for a model containing all the predictors and all their two-way interactions, and B = the same but omitting all interactions. Compute the ratio B/A, the so-called “adequacy index” in the maximum likelihood chapter of Regression Modeling Strategies. B/A is the relative extent to which interaction terms were needed to explain outcome variation (high values indicating less need for interactions). Display B/A ratios for all datasets and all link functions. See what you learn. I postulate that link functions that carry no restrictions, i.e., logit and probit, will fare the best most of the time. I seek models in which additivity tends to explain the great bulk of outcome variation.
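Here is one way such a comparison might be coded (a rough sketch using statsmodels GLMs; y, x1, x2, x3 are placeholders for a real dataset, and the log and identity fits may fail to converge or give out-of-range fitted risks on some datasets):

```python
# Rough sketch of the adequacy-index comparison described above.
import statsmodels.api as sm
import statsmodels.formula.api as smf

links = {
    "logit":    sm.families.links.Logit(),
    "probit":   sm.families.links.Probit(),
    "log":      sm.families.links.Log(),       # relative-risk model
    "identity": sm.families.links.Identity(),  # additive-risk model
}

null_f = "y ~ 1"
main_f = "y ~ x1 + x2 + x3"            # additive model (B)
full_f = "y ~ (x1 + x2 + x3) ** 2"     # adds all two-way interactions (A)

def adequacy_by_link(df):
    """Return the B/A adequacy index for each link function."""
    out = {}
    for name, link in links.items():
        fam = sm.families.Binomial(link=link)
        ll0 = smf.glm(null_f, df, family=fam).fit().llf
        llB = smf.glm(main_f, df, family=fam).fit().llf
        llA = smf.glm(full_f, df, family=fam).fit().llf
        A = 2 * (llA - ll0)   # LR chi-square: predictors + all two-way interactions
        B = 2 * (llB - ll0)   # LR chi-square: main effects only
        out[name] = B / A     # near 1: additivity explains most of the variation
    return out
```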
Then, whatever link function you prefer, state estimates and uncertainty/compatibility intervals for the parameters, and then use the model to estimate absolute risk reductions and relative risks when one predictor is changed and the rest are held at representative values.
This proposal is very similar to what @Sander wrote in his 1979 paper on logistic regression limitations:
"A general paradigm can be derived … The necessity of adding complex terms (e.g., product terms) to a multiplicative (additive) statistical model in order to achieve a 'good fit' may indicate that the exposures under study do not influence the disease risk through a process following the multiplicative (additive) biological model; the form of the best fitting, parsimonious statistical model may in turn suggest a form of the biological model in action… Results of such generalized model searches may be combined with physiologic observations and data from animal experiments to test old and generate new models…"
I also don't understand this preference for "marginal" models over conditional ones. Here is a direct quote from the Nelder paper I mentioned above:
"An advantage of a random effects model is they allow conditional inferences in addition to marginal inferences. … The conditional model is a basic model, which leads to a specific marginal model."
Anders, there is a point of correction or much greater precision needed here: You should say “By using a main-effect only (first-order) logistic model…”. Remember that any good modeler with enough data will consider product terms (“interactions” or second-order terms) that will allow the odds ratio to vary with covariates and may also investigate which link function provides the simplest linear term, as Frank discusses in his comment. Then too, with our modern computing capabilities we can use “machine learners” that start with a highly saturated model and penalize for complexity in a manner that need not shrink toward constancy of the odds ratio.
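As one illustration of the kind of learner I mean (a sketch only; this is just one of many possible choices, X and y are placeholders, and the tuning values are arbitrary), the following fits a flexible, complexity-penalized risk model whose implied odds ratios are free to vary with covariates:

```python
# Hedged sketch: a flexible learner penalized for complexity (tree depth,
# learning rate, early stopping) with no shrinkage toward a constant odds
# ratio. X (covariates including treatment) and y (binary outcome) are
# placeholders for real data.
from sklearn.ensemble import HistGradientBoostingClassifier

learner = HistGradientBoostingClassifier(
    max_depth=3,          # caps interaction order as a complexity penalty
    learning_rate=0.05,
    early_stopping=True,
)
# learner.fit(X, y)
# learner.predict_proba(X_new)[:, 1] gives covariate-specific risk estimates
```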
We need more attention to the distinction between the statistical task of smoothing or data stabilization (which arises because we have noisy, finite data) and the broader task of causal explanation. The latter is part of the search for patterns stable enough to generalize to settings beyond the one that generated the data under analysis, which is what decisions demand (we make decisions for patients who were not in the study). Causal considerations are important for both tasks but become central for decisions.
I agree with this; I definitely should have been clearer. I am not sure it is central to the point I was trying to make, but the sentence on its own needed to be more precise.
I prefer to think of the need for interaction terms as arising from the fact that you can often make a more plausible argument for conditional effect measure homogeneity (conditional on some subset of covariates in the model) than for complete effect measure homogeneity. Conditional effect measure homogeneity is still a homogeneity assumption, and subject to the same kind of reasoning about whether the biology is consistent with the choice of link function.
Here is what I mean by that: Suppose you have a model with predictors X, Y and Z, and a product term is needed between X and Z. You can parametrize the logistic model with separate odds ratio parameters for the effect of Z conditional on X=0, and the effect of Z conditional on X=1. This gives your model a structure where the effect of Z conditional on (X=0, Y=0) is equal to the effect of Z conditional on (X=0, Y=1) on the odds ratio scale, and that the effect of Z conditional on (X=1, Y=0) is equal to the effect of Z conditional on (X=1, Y=1) on the odds ratio scale.
This is not the standard parametrization, but the parameters in the standard parametrization are just deterministic functions of the parameters in my parametrization. The advantage of this parametrization is that it makes it easier to appreciate the nature of the conditional effect homogeneity assumptions made by a model with interactions.
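For concreteness, here is one way to write the two parametrizations (a sketch only, with D denoting the binary outcome). The standard parametrization is

\text{logit}\, P(D=1 \mid X, Y, Z) = \beta_0 + \beta_1 X + \beta_2 Y + \beta_3 Z + \beta_4 XZ,

so the Z odds ratio is e^{\beta_3} at X=0 and e^{\beta_3+\beta_4} at X=1. The parametrization I describe replaces \beta_3 Z + \beta_4 XZ with \theta_0 (1-X)Z + \theta_1 XZ, where \theta_0 = \beta_3 and \theta_1 = \beta_3 + \beta_4 are the two conditional log odds ratios for Z, each assumed constant across levels of Y.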
In a few months at most, I will share a new preprint with significantly more detail about my thoughts on regression models. I look forward to sharing that manuscript with this forum; hopefully it will make my perspective clearer to everyone.
This is for me a strange question given that there are now 40 years of articles trying to explain why it is bad. Even more strange is that this is a very subtle numeric phenomenon, yet in my experience it seems more easily appreciated as a problem by clinicians than by statisticians.
I’ll give an illustrative example based on the math in Greenland (“Interpretation and choice of effect measures in epidemiologic analysis”. Am J Epidemiol 1987;125:761–8): Suppose (quite hypothetically and simplified of course) an attending physician at an ICU has just been given the abstract from a large RCT done on a random sample of all covid ICU admissions, with the treatment being oxygen via mask alone vs. intubation+sedation, and the outcome being 60-day mortality. The abstract says the average age of participants was 65, that 38% of them died within 60 days, and the observed overall odds ratio was 2.75, but results are not given for subgroups.
The physician is then confronted with a new covid ICU admission of a 65-year-old man, accompanied by a wife who is concerned about the long-term cognitive effects of intubation+sedation and asks for mask alone. Should the physician tell her that this will increase the patient's odds of death by a factor of 2.75? If so, that would be wrong, because the risk of death varies tremendously across known covariates, and consequently the increase in odds is likely far higher than 2.75 for this and every single specific patient! But given only the OR, the physician would be unable to say anything more precise, such as how much higher.
In contrast, given that the RD was 23%, the physician could at least say that on average the individual increases were 23%, and so if this patient is not visibly atypical of trial participants then 23% would be a reasonable expectation for the risk increase from refusing intubation. In this way, the OR is much less informative about subgroups and individuals than is the RD.
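To see the arithmetic behind that contrast, here is a small, purely hypothetical illustration (the numbers are invented and are not from the trial sketched above): both strata share the same OR and the same RD, there is no confounding, yet only the RD carries over to the whole trial unchanged.

```python
# Hypothetical numbers illustrating non-collapsibility: both strata share
# OR = 6 and RD = 0.30, there is no confounding, yet the marginal OR is well
# below 6 while the marginal RD is simply the average of the stratum RDs.

def odds(p):
    return p / (1 - p)

# stratum-specific 60-day mortality risks: (mask, intubation)
men   = (0.90, 0.60)   # OR = (0.9/0.1)/(0.6/0.4) = 6, RD = 0.30
women = (0.40, 0.10)   # OR = (0.4/0.6)/(0.1/0.9) = 6, RD = 0.30

# equal numbers of men and women in each arm of the trial
mask_risk  = (men[0] + women[0]) / 2   # 0.65
intub_risk = (men[1] + women[1]) / 2   # 0.35

marginal_or = odds(mask_risk) / odds(intub_risk)   # ~3.44, not 6
marginal_rd = mask_risk - intub_risk               # 0.30, same as each stratum

print(round(marginal_or, 2), round(marginal_rd, 2))
```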
Frank: Why are you taking ratios of LR statistics? They are already logs of the LRs so have been transformed from the ratio scale of the likelihoods to the additive scale of the loglikelihoods. The A-B loglikelihood difference D is the Kullback-Leibler information divergence of B from A (a measure of how much information B loses compared to A); 2D is an LR statistic (with df equal to the number of product terms in A) which provides a P-value for testing B against A.
I can compare the Ds, their P-values, and their S-values (surprisals, -log2(p)) across links to get comparisons scaled in ways I am familiar with from 70 years of literature on model checking and selection. What is added by the B/A ratio? What does it mean in frequentist, Bayesian, likelihoodist, or information terms?
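For concreteness, the computation I have in mind looks like this (a sketch; the log-likelihood values and degrees of freedom are invented placeholders):

```python
# Minimal sketch: D, its LR statistic 2D, the P-value, and the S-value.
import numpy as np
from scipy import stats

ll_A, ll_B, k = -480.2, -486.9, 5   # hypothetical fitted log-likelihoods and df

D = ll_A - ll_B              # KL divergence of B from A (information lost by B)
lr_stat = 2 * D              # LR chi-square statistic with k df
p = stats.chi2.sf(lr_stat, df=k)
s_value = -np.log2(p)        # surprisal: bits of information against B
```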
I agree with Frank – we urgently need to answer this question: What about non-collapsibility is bad?
Let's take the same example given by @Sander (which is from the data in the JCE paper that I have tabulated above). The overall mortality is 38.33% and the OR is 2.75 for masks versus intubation (assuming such a trial were allowed). When a new 65-year-old patient comes in and we only have this information, then all we know is that in this population, given an M:F ratio of 1:2, and given that we refuse to know the gender of the new patient (or any other demographic information), the observed mortality odds increase 2.75-fold. The wife may not know what an odds is, so a quick calculation can be made and the patient told that 26.7% of the intubated and 50% of the mask patients died, that the risk difference is 23.3% (for every 5 treated with a mask rather than intubation, about 1 more patient died at 60 days), and that the fold increase in risk was 1.875 (RR), which is almost two-fold. All this is correct and deduced from the OR (note that all risk-related information was derived solely from the OR using the equations in Doi et al), so I do not see any "fault" of the OR here. Perhaps @Sander can explain further why the OR was bad and the RD was not, so we understand what we are missing here. In particular, please explain why the OR is far higher for this and every single specific patient but the RD is okay even though it is a direct derivation from the OR.
I suspect @Sander will say that in males the OR is increased 6-fold for masks, so if we had this information it is not congruent with the earlier 2.75-fold increase in odds overall. But why is that a problem? Male gender is a very strong risk factor for death, and if we refuse to know how males were distributed at baseline, would we not expect the discrimination of deaths by mask to weaken? Again, from the OR of 6 (Doi et al), I can tell the wife that the death risk went from 60% to 90%, the RD was 30%, and for every 4 male patients treated with a mask rather than intubation, about 1 more patient died at 60 days. This is a risk increase of 1.5-fold. Note that the earlier risk increase was 1.875-fold. Obviously the RR dropped while the OR, NNT, and RD worsened, and given that the OR is a transformation of the C statistic we can clearly deduce that there is no monotonic relationship of the RR with death discrimination.
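For readers checking the arithmetic in the last two posts, the standard back-calculation from a baseline risk p_0 and an odds ratio OR to the other risk is

p_1 = \frac{OR \cdot p_0}{1 - p_0 + OR \cdot p_0},

so p_0 = 0.267 with OR = 2.75 gives p_1 \approx 0.50, and p_0 = 0.60 with OR = 6 gives p_1 = 0.90; the RD and RR then follow as p_1 - p_0 and p_1 / p_0.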