Should one derive risk difference from the odds ratio?

I am not a fan of the log link in binomial regression either, perhaps even more so than @Sander, so no counter-arguments from me here.

I will cast my vote as well and say that although I do appreciate the practical advantages of HRs and ORs in many contexts, as a physician-scientist connecting laboratory research with clinical observations, I find the non-collapsibility of these metrics very troubling. There is certainly space and value in my line of work for metrics that are more directly anchored to causal mechanisms.

Perhaps you could post a real-life example of how non-collapsibility of the OR led to a problem in the case you mentioned. That would allow us to discuss this more in depth, grounded in the problem itself.

My claim is that it will be impossible for you to provide a causal model, i.e., a set of statements describing biological reality (which you can justify with reference only to the biology), such that those statements together imply stability of a non-collapsible effect measure.

In other words, there is no way nature works in the way that is assumed by a model using a non-collapsible parameter. You can still “use” the model in the sense that you can estimate its parameters and make predictions. This may be mathematically convenient. But the justification for the procedure depends on nature working in a specific way, and there is just no way that nature works that way.

Take, for example, the ORs in the table from @Sander that you posted above; replace Male and Female, if you prefer, with any biological biomarker. The behavior of the subgroup ORs compared with the whole-population OR is very problematic when we are looking at the data trying to understand how “gender” (or the “biomarker”) interacts with a therapy. It is less of a problem for those who are aware of these mathematical artifacts, but it is still an issue.

I suspect that part of what is fueling these arguments is a difference in focus and domains of interest. There is an aspect of me that genuinely appreciates ORs and HRs for what they are. But non-collapsibility can seem more like magic, in the sense that the whole is not the sum of its parts. It acts like a strongly emergent phenomenon that is inconsistent with what we have observed in the natural world to date.

Sorry, but I see little agreement overall, and like Huitfeldt I regard Doi’s ongoing responses as deeply confused about what the math does and does not mean. That I prefer logistic over log-risk regression does not at all mean I like odds ratios as measures of association or effect! In particular, I have many articles going all the way back to 1979 criticizing odds ratios and advocating conversion of logistic results to risks or rates. For example, from the discussion section of Greenland, Am J Epidemiol 1979, “Limitations of the Logistic Analysis of Epidemiologic Data”:

  • “The major analytic step for the epidemiologist is explanatory rather than predictive: what is sought is an explanation of the observed relationships in terms of various sources of bias in the study structure and in terms of causation in an underlying biological structure. When the data are limited or when our biological theories are vague, the features of the data requiring epidemiologic explanation may best be represented by direct comparative displays of the [logistic] smoothed rates, risk ratios, or risk differences. Attempts to represent the same features using direct display of the estimated parameters of an arbitrary statistical model can lead to the interpretational difficulties discussed [above].”

My point is that the present controversy is very old and concerns both the math and its interpretation. Doi’s “proof” is irrelevant, not because of any math error, but because there is no such thing as “the” (singular, as he uses it) strength of an association, any more than there is “the” singular measure of the strength of an economy or any other multifaceted phenomenon. At a minimum (for two binary variables alone), an association involves two parameters and thus two dimensions; once we bring in covariates and multiple-level variables, the number of dimensions explodes exponentially. It has been known since at least the 19th century that you cannot capture multidimensional relations with one scalar without introducing simplifications that are absurd for our purposes. This problem applies no matter what is being touted as “the” association, whether OR, RR, RD, AUC, etc.
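To make the two-dimensions point concrete, here is a minimal notational sketch (the symbols p_0 and p_1 are illustrative additions, not from the original post): for a binary exposure X and binary outcome D, the association is the pair of risks, and each familiar summary is a different one-dimensional compression of that pair.

```latex
(p_1, p_0) = \bigl( P(D=1 \mid X=1),\; P(D=1 \mid X=0) \bigr)

\mathrm{RD} = p_1 - p_0, \qquad
\mathrm{RR} = \frac{p_1}{p_0}, \qquad
\mathrm{OR} = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}
```

No single one of RD, RR, or OR recovers the pair (p_1, p_0); each discards a dimension, which is the sense in which any scalar “strength of association” is a simplification.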

Then, when we bring in causality, we must open up another dimension of complexity through potential outcomes or some equally rich representation. When we bring in utilities (loss functions) it gets even more complicated (although in real public-health examples RDs look better than ratios at paralleling real utilities like caseloads). But no single measure can begin to capture all of what is going on; that is why Pearl (2000, 2009) chose to call effects the full collection of outcomes defined by his “Do” operator rather than use some single comparative measure.

On top of this complexity issue, some statisticians continue to misunderstand the noncollapsibility problem of ORs, which I have explained at great length in many articles and posts going back to 1986, and which Myra Samuels analyzed in Biometrika 1981 (showing exactly why and how ORs are noncollapsible, that being an inevitable consequence of comparing differently weighted average odds). And yet some continue to exhibit no comprehension of what has been written by me, my coauthors, and other critics of odds ratios (like Altman, Hernán, Pearl, Robins, Rubin, etc.).
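For readers new to the phenomenon, here is a minimal numeric sketch of that “differently weighted average odds” point (the numbers are hypothetical, not from Samuels’ paper): two equal-sized strata with identical exposure prevalence, so there is no confounding by stratum, yet the marginal OR does not equal the shared stratum-specific OR.

```python
# Two equal-sized strata, identical exposure prevalence in each,
# hence no confounding by stratum; all numbers are hypothetical.
def odds(p):
    return p / (1 - p)

# risks[stratum] = (risk if unexposed, risk if exposed)
risks = {"stratum 1": (0.5, 0.8), "stratum 2": (0.2, 0.5)}

for name, (p0, p1) in risks.items():
    print(f"{name}: OR = {odds(p1) / odds(p0):.2f}, RD = {p1 - p0:.2f}")

# Collapse over the two equal-sized strata by averaging the risks:
m0 = sum(p0 for p0, _ in risks.values()) / 2   # marginal risk, unexposed
m1 = sum(p1 for _, p1 in risks.values()) / 2   # marginal risk, exposed
print(f"marginal : OR = {odds(m1) / odds(m0):.2f}, RD = {m1 - m0:.2f}")
```

Both stratum-specific ORs are 4.00, yet the marginal OR comes out near 3.45, while the RD collapses exactly (0.30 in each stratum and marginally): averaging risks is not the same as averaging odds.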

Sometimes we have to face that once someone has published a fundamentally mistaken article (as did Doi et al.), they may nonetheless stand by it to the grave, invoking endless specious arguments in its defense. This happened with some “greats” of physics and statistics, including Sir Harold Jeffreys and Sir Ronald Fisher, and continues today in topics of greater breadth than this one. We lesser folk cannot expect better of our peers and can only hope to do better ourselves (or to get lucky in not falling into blunders and publishing them).

4 Likes

My premise is that you, Senn, and Frank are in broad agreement that the logistic model is useful in a wide range of scenarios; not that any of you agree with Doi, although you might be more sympathetic to using the OR for modelling and doing the statistics, versus presenting the results.

Would the following be a reasonable summary of your position? I posted this above:

So this dispute seems related to context, with RCT statisticians having fewer problems with ORs, but epidemiologists finding them troublesome unless they are used with care.

There is more heat than light in this thread when it comes to mapping the goals of the analysis, and the context, onto the mathematics. That is what I’m trying to get at.

The closest was @f2harrell, when he mentioned a pragmatic attempt at quantifying non-additivity via \chi^2 statistics for competing models:

> Frank: We need a measure of non-additivity. If overfitting were not a problem, such a measure might be the likelihood ratio \chi^2 statistic for an additive model divided by the \chi^2 for a model that includes all interactions with the exposure variable.

It seems most of the problems with the OR can be avoided if it is used with care.
I’ll post later and break this down into a more explicit decision-analysis matrix if necessary.

Okay, I get your point, so I will not address this quantitatively, as that has been done in that post. Since you are a physician, let me ask a slightly unrelated question. We have a patient with diabetes insipidus (lacking ADH), and thus serum Na is elevated and serum osmolality is elevated, say 310. If we give the patient 0.9% NaCl (osmolality about 280), should that not bring the serum osmolality down, say to somewhere between 310 and 280? If not, why not?
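A minimal sketch of the arithmetic intuition behind the question (the volumes below are hypothetical additions; only the osmolalities 310 and 280 come from the post): the osmolality of a simple mixture is a volume-weighted average, so it must land between the two inputs.

```python
# Volume-weighted average of two solutions; the volumes are made up.
v_plasma, osm_plasma = 3.0, 310.0   # liters, mOsm/kg (illustrative)
v_saline, osm_saline = 1.0, 280.0
mixed = (v_plasma * osm_plasma + v_saline * osm_saline) / (v_plasma + v_saline)
print(f"{mixed:.1f}")  # 302.5, between 280 and 310, as any weighted average must be
```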

1 Like

That would certainly be my position :grinning:

Addendum: there cannot be broad agreement on logistic regression and disagreement with Doi at the same time. In other words, if the main issue is non-collapsibility, then why not broad agreement on log-risk regression, and, if that fails to converge, on other options like modified Poisson?
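For concreteness, a minimal sketch of that fallback (df, y, treat, and age are hypothetical names; “modified Poisson” meaning Poisson regression with robust standard errors, per Zou 2004):

```python
# Try log-binomial for the risk ratio; fall back to modified Poisson
# (Poisson regression with robust/HC standard errors) if it errors out.
# In practice also inspect fit.converged on the log-binomial fit.
import statsmodels.api as sm
import statsmodels.formula.api as smf

try:
    fit = smf.glm("y ~ treat + age", data=df,
                  family=sm.families.Binomial(sm.families.links.Log())).fit()
except Exception:
    fit = smf.glm("y ~ treat + age", data=df,
                  family=sm.families.Poisson()).fit(cov_type="HC0")

print(fit.params["treat"])  # log risk ratio under either model
```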

I’d convert your
“Do the statistics on the (additive) log odds scale, but convert baseline risks to the probability scale”
to something more like
“Do the modeling on the log odds (logit) scale, but convert the final results to the probability scale.”
because we shouldn’t want to limit the logit model to being additive or linear (for example, “machine-learning” logit models may become arbitrarily nonadditive and nonlinear). The use of the logit link is strictly a technical move to ensure proper range restriction of risks, variation independence of parameters, well-behaved fitting algorithms, and optimal calibration.
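For instance, here is a minimal sketch of that workflow (df, y, treat, and age are hypothetical names), using marginal standardization as one common way to convert the final results to the probability scale:

```python
# Model on the logit scale, report on the probability scale.
import statsmodels.formula.api as smf

fit = smf.logit("y ~ treat + age", data=df).fit()

# Predict each subject's risk with treat set to 1 and then to 0, and average:
risk1 = fit.predict(df.assign(treat=1)).mean()
risk0 = fit.predict(df.assign(treat=0)).mean()
print(f"standardized risks: {risk1:.3f} vs {risk0:.3f}")
print(f"risk difference {risk1 - risk0:.3f}; risk ratio {risk1 / risk0:.3f}")
```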

I think the problem is that some statisticians seem to think the very desirable technical advantages of the logit link lead to some broader causal or utilitarian argument for using odds ratios as final measures, which is simply a mistake. My impression is that, having seen some statisticians messing up causal inference with idiocies like significance (p<0.05) selection of confounders back in the 1970s, epidemiologic culture became more skeptical of statistical culture than did clinical culture. Some epidemiologists recognized then that not only should we discard “significance” but we should separate choice of the smoothing model from final summarization; the latter should be dictated by the utilities and goals of the context, nonparametrically, not by the statistical model. Here are a few of my writings on the topic since 1979:
Greenland S. Summarization, smoothing, and inference. Scand J Soc Med 1993;21:227–232.
Greenland S. Smoothing observational data: a philosophy and implementation for the health sciences. Int Stat Rev 2006;74:31–46.
As usual if you want to look them over and can’t access them online, e-mail me and I’ll send you a PDF. See also p. 438-442 in Ch. 21 of Modern Epidemiology 3rd ed. 2008.

2 Likes

I see the argument you are trying to make, and it helps me better understand the source of the confusion. The diabetes insipidus numbers are explainable by causal mechanisms; they are not related to the problem of non-collapsibility.

In a similar manner, I’ve anecdotally discussed non-collapsibility with outstanding mathematical statisticians who regularly publish wonderful methodological work in top statistical journals yet have never heard the term “non-collapsible” before. This is a good example of how siloed some of these discussions can be. What I have found to work in these situations is using the term “Simpson’s paradox”. But even then, the first reaction (similar to the diabetes insipidus example) is to think of the versions of Simpson’s paradox that can be explained causally by confounding. Non-collapsibility, though, is different from confounding. @AndersHuitfeldt’s point, with which I agree, is that he is unable to reconcile it with any causal mechanism found in the natural world.

Any thoughts on this post by @f2harrell on the value of building up marginal effect estimates from reasonably adjusted conditional ones?

I’m somewhat surprised this issue hasn’t been approached from an information theoretic or Bayesian decision theoretic experimental design perspective.

A while back, I posted a number of links that describe Bayesian experimental design. In that framework, maximizing expected utility is equivalent to maximizing the information from the experiment.

It seems clear to me that the methods described in BBR and RMS entail maximizing information from the entire design, collection, and analysis.

It also seems clear that observational research generally has more noise in the channel, and must take more effort at increasing the information rate.

I don’t see why it isn’t possible to borrow the framework of information theory to evaluate the channel capacity of the various proposals, and then rank them in terms of information rate and ease of implementation.

1 Like

Great points. I find @f2harrell’s approaches to be elegantly practical and well thought out. My statistical methodology work has been tremendously influenced by his philosophy, including an emphasis on risk modification in clinical decision-making that we have incorporated into our utility-based framework. However, there are aspects of research into what we can perhaps call effect modification where non-collapsibility is problematic. The table discussed above makes that point clear. And of course there are challenges with transportability, meta-analysis, etc. It helps to discuss and be mindful of these issues.

1 Like

I agree with you, and indeed this has been my struggle too over the years! Coming to your last point: perhaps causal inference experts are trying to imagine theoretical mechanisms for the natural world rather than infer those mechanisms from observation of the natural world. The reason the latter is very difficult is that the mechanisms in the natural world are obscured by random and systematic error, and thus we end up with this discussion, since we are unsure whether probabilities in the real world follow a particular pattern. This is not new; Ronald Fisher said, “it has often been imagined that probabilities a priori can be set up axiomatically [i.e. in a way that seems obviously true]; this has led to many inconsistencies and paradoxes…”

1 Like

Any regression model is a claim that the natural world follows certain patterns. The saturated, non-parametric analysis will be much too high-dimensional, so statisticians use models as a dimension-reduction strategy. Dimension reduction requires assumptions about how the natural world works. Those assumptions can be good or bad; if they are bad, the model will lead to bad predictions.

The assumptions are always there, regardless of whether you acknowledge them or not. By using a logistic model, you are choosing to analyze the data as if you have biological reasons for believing that the odds ratio is stable between strata. If that is true, you will get the right results. If it is not true, you will get the wrong results. The assumption is not empirically tested by the estimation procedure, which just gives you the best fit given the assumption that the odds ratio is stable.
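A minimal simulation sketch of this point (all numbers and variable names are hypothetical): below, the true mechanism has a constant risk ratio of 2 across strata, so the true odds ratio differs by stratum, and a main-effects-only logistic fit smooths this into a single OR with somewhat miscalibrated stratum-specific risks.

```python
# True mechanism: risk ratio of 2 in both strata => the stratum ORs differ
# (2.25 vs 3.5), so no constant-OR (main-effects-only) logistic model
# can reproduce all four true risks.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, n)                    # binary stratum
x = rng.integers(0, 2, n)                    # binary exposure
p = np.where(z == 0, 0.10, 0.30) * (1 + x)   # RR = 2 in both strata
y = rng.binomial(1, p)

X = sm.add_constant(np.column_stack([x, z]))  # main effects only
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
for zi in (0, 1):
    for xi in (0, 1):
        pred = fit.predict(np.array([[1.0, xi, zi]]))[0]
        true = (0.10 if zi == 0 else 0.30) * (1 + xi)
        print(f"z={zi} x={xi}: true {true:.2f}, fitted {pred:.3f}")
```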

I agree that finding the correct mechanism is difficult (often impossible), and that the true mechanism may not even lead to stability of any identifiable effect measure. What I don’t agree with is your implicit next step: that therefore we should just give up and proceed as if we knew that the true biological mechanism is one that leads to stability of the odds ratio. You don’t have a solution to the problem that the mechanism is unknowable; you just want to sweep the problem under the rug and pretend you know that the true mechanism leads to stability of the odds ratio.

Finally, let me point out something that may or may not be obvious: suppose a decision maker is choosing between option A and option B. Any regression model that is used to inform the decision, by providing a prediction for what happens under option A and a prediction for what happens under option B, is being used as a causal model. It will just be a hopelessly bad causal model unless it is constructed in accordance with principles of causal inference and with your best understanding of the mechanism/data-generating function. You don’t get to use a model for decision making and then claim you aren’t doing causal inference and therefore don’t need to worry about causality or mechanisms.

2 Likes

Great thoughts. Our understanding of causal mechanisms has been, and likely always will be, subjective. But we can still make plausible causal assumptions. As a medical example, take Figure 3 from our article here. It is pretty clear from our current understanding of oncology that it is not the “number of metastases” that causes the kidney cancer histology; it is the opposite: more aggressive histologies are more likely to produce metastases.

The quote by Fisher has wisdom in warning us about the challenges of prior elicitation and use, but this is in many ways orthogonal to the current discussion. I use Bayesian models (priors and hyperpriors) and utilities in many of my apps (the ones that are actually most congruent with @f2harrell’s philosophy) and find them to be quite efficient and valuable. Relevant to the current discussion, the null hypothesis often used to generate the p-values that Fisher popularized is rarely true in nature. Even more so, the global null hypothesis assumed during multiplicity correction (e.g., when looking at multiple gene pathways) is extremely implausible. But p-values and their multiplicity corrections have been successfully used in many scientific applications.

I acknowledge that non-collapsibility is even more unnatural (to the point of appearing more consistent with magic than with the natural world). This is why some people have such a viscerally strong reaction against non-collapsible metrics. The fact that it is a mathematical property of certain metrics we have found to be useful does not make non-collapsibility a property of nature. A good parallel is infinity. Infinity comes up in the formulas of many valuable equations successfully used in the natural sciences. But as far as we can tell to date, infinity does not actually exist in nature. Every time we get infinity in a calculation we know that the story is incomplete.

1 Like

What about non-collapsibility is bad?


On the broad discussion: first of all, I want to say that this is a truly excellent discussion that we should tweet out from time to time. Second, I want to return to the pragmatic approach, starting with a few background observations:

  • If there is only one variable in need of consideration and that variable represents mutually exclusive categories, all link functions (logit, probit, etc.) will yield the same estimated category probabilities (see the quick check after this list).
  • If there is only one variable and it is continuous, fitting the variable flexibly (e.g., using a cubic spline with 4-5 knots) will result in virtually the same estimated probabilities no matter what the link function.
  • Things get more interesting when there is > 1 predictor.
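As a quick check of the first bullet (the data below are made up): with a single binary predictor the model is saturated, so any link function reproduces the observed category proportions exactly.

```python
# One binary predictor => saturated model => identical fitted probabilities
# under any link; here logit vs probit on hypothetical counts.
import numpy as np
import statsmodels.api as sm

x = np.repeat([0, 1], 100)                  # 100 subjects per category
y = np.r_[np.repeat([0, 1], [70, 30]),      # 30% events when x = 0
          np.repeat([0, 1], [45, 55])]      # 55% events when x = 1
X = sm.add_constant(x)

for link in (sm.families.links.Logit(), sm.families.links.Probit()):
    fit = sm.GLM(y, X, family=sm.families.Binomial(link)).fit()
    print(np.unique(np.round(fit.predict(X), 3)))  # [0.3 0.55] both times
```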

For the > 1 predictor case I think pragmatically about what is a good default starting point. I don’t think for a minute that a default approach will work for all situations/datasets. Data analysis is hard, and prior knowledge about data patterns is infrequently available, so we need help. If a particular approach worked best in 8 out of 10 representative datasets, I would continue to use it in many cases as a default approach.

So what do we mean by “worked best”? Here is the way I think about that. Ignoring nonlinearity for the moment, I would choose 10 moderately large “representative” datasets and four or more link functions, such as the identity (additive risk model), logit (logistic model), log probability (relative risk model), and probit. For each dataset and each link function, compute A = the likelihood ratio \chi^2 statistic for a model containing all the predictors and all their two-way interactions, and B = the same but omitting all interactions. Compute the ratio B/A, the so-called “adequacy index” from the maximum likelihood chapter of Regression Modeling Strategies. B/A is the relative extent to which the additive terms alone explain outcome variation (high values indicating less need for interactions). Display the B/A ratios for all datasets and all link functions, and see what you learn. I postulate that the link functions that carry no range restrictions, i.e., logit and probit, will fare best most of the time. I seek models in which additivity tends to explain the great bulk of outcome variation.
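Here is a minimal sketch of that computation for one dataset (df and the predictor names x1, x2, x3 are hypothetical; the log and identity links may simply fail to converge, which is itself informative):

```python
# Adequacy index B/A across link functions for a single dataset.
# df is assumed to be a pandas DataFrame with binary outcome y and
# predictors x1, x2, x3; repeat over your 10 representative datasets.
import itertools
import statsmodels.api as sm
import statsmodels.formula.api as smf

links = {
    "logit":    sm.families.links.Logit(),
    "probit":   sm.families.links.Probit(),
    "log":      sm.families.links.Log(),       # relative risk model
    "identity": sm.families.links.Identity(),  # additive risk model
}

predictors = ["x1", "x2", "x3"]
main = " + ".join(predictors)
pairs = " + ".join(f"{a}:{b}" for a, b in itertools.combinations(predictors, 2))

for name, link in links.items():
    try:
        family = sm.families.Binomial(link)
        null = smf.glm("y ~ 1", df, family=family).fit()
        add  = smf.glm(f"y ~ {main}", df, family=family).fit()
        full = smf.glm(f"y ~ {main} + {pairs}", df, family=family).fit()
        B = 2 * (add.llf - null.llf)   # LR chi-square, additive model
        A = 2 * (full.llf - null.llf)  # LR chi-square, all two-way interactions
        print(f"{name:9s} B/A = {B / A:.3f}")
    except Exception as err:           # log/identity may not converge
        print(f"{name:9s} failed: {err}")
```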

Then, whatever link function you prefer, state estimates and uncertainty/compatibility intervals for the parameters, and use the model to estimate absolute risk reductions and relative risks when one predictor is changed and the rest are held at representative values.

4 Likes

This proposal is very similar to what @Sander wrote in his 1979 paper on the limitations of logistic regression:

> A general paradigm can be derived … The necessity of adding complex terms (e.g., product terms) to a multiplicative (additive) statistical model in order to achieve a “good fit” may indicate that the exposures under study do not influence the disease risk through a process following the multiplicative (additive) biological model; the form of the best-fitting, parsimonious statistical model may in turn suggest a form of the biological model in action… Results of such generalized model searches may be combined with physiologic observations and data from animal experiments to test old and generate new models…

I also don’t understand the preference for “marginal” models over conditional ones. A direct quote from the Nelder paper I mentioned above:

> An advantage of a random effects model is they allow conditional inferences in addition to marginal inferences. … The conditional model is a basic model, which leads to a specific marginal model.

2 Likes

Anders, there is a point of correction, or at least of much greater precision, needed here: you should say “By using a main-effects-only (first-order) logistic model…”. Remember that any good modeler with enough data will consider product terms (“interactions”, or second-order terms) that allow the odds ratio to vary with covariates, and may also investigate which link function provides the simplest linear term, as Frank discusses in his comment. Then too, with our modern computing capabilities, we can use “machine learners” that start with a highly saturated model and penalize for complexity in a manner that need not shrink toward constancy of the odds ratio.

We need more attention to the distinction between the statistical task of smoothing or data stabilization (which arises because we have noisy, finite data) and the broader task of causal explanation, which is part of the search for patterns stable enough to generalize to settings beyond the one that generated the data under analysis; that is what decisions demand (we make decisions for patients who were not in the study). Causal considerations are important for both tasks but become central for decisions.

5 Likes

I agree with this; I definitely should have been clearer. I am not sure this is central to the point I was trying to make, but the sentence on its own should have been clearer.

I prefer to think of the need for interaction terms as arising from the fact that you can often make a more plausible argument for conditional effect measure homogeneity (conditional on some subset of covariates in the model) than for complete effect measure homogeneity. Conditional effect measure homogeneity is still a homogeneity assumption, and subject to the same kind of reasoning about whether the biology is consistent with the choice of link function.

Here is what I mean by that: suppose you have a model with predictors X, Y, and Z, and a product term is needed between X and Z. You can parametrize the logistic model with separate odds ratio parameters for the effect of Z conditional on X=0 and for the effect of Z conditional on X=1. This gives your model a structure in which the effect of Z conditional on (X=0, Y=0) is equal to the effect of Z conditional on (X=0, Y=1) on the odds ratio scale, and the effect of Z conditional on (X=1, Y=0) is equal to the effect of Z conditional on (X=1, Y=1) on the odds ratio scale.

This is not the standard parametrization, but the parameters in the standard parametrization are just deterministic functions of the parameters in my parametrization. The advantage of this parametrization is that it makes it easier to appreciate the nature of the conditional effect homogeneity assumptions made by a model with interactions.
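In symbols (the notation is mine, with D denoting the outcome, which is not named in the original post):

```latex
\text{standard:}\quad
\operatorname{logit} P(D=1 \mid X,Y,Z)
  = \beta_0 + \beta_X X + \beta_Y Y + \beta_Z Z + \beta_{XZ}\,XZ

\text{alternative:}\quad
\operatorname{logit} P(D=1 \mid X,Y,Z)
  = \beta_0 + \beta_X X + \beta_Y Y + \theta_0\,(1-X)Z + \theta_1\,XZ

\text{so that}\quad \theta_0 = \beta_Z, \qquad \theta_1 = \beta_Z + \beta_{XZ}
```

Here \theta_0 and \theta_1 are the log odds ratios for Z within X=0 and X=1 respectively; each is assumed constant across the levels of Y, which is exactly the conditional homogeneity assumption described above.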

Within a few months at most, I will share a new preprint with significantly more detail about my thoughts on regression models. I look forward to sharing that manuscript with this forum; hopefully it will make my perspective clearer to everyone.

1 Like