Question arising from JAMA Guide article "Odds Ratios—Current Best Practice and Use"

In their paper “Odds Ratios – Current Best Practice and Use”, Norton, Dowd, and Maciejewski argue that one of the lesser-known limitations of the odds ratio from a logistic regression is that it is

…scaled by an arbitrary factor (equal to the square root of the variance of the unexplained part of binary outcome).4 This arbitrary scaling factor changes when more or better explanatory variables are added to the logistic regression model because the added variables explain more of the total variation and reduce the unexplained variance. Therefore, adding more independent explanatory variables to the model will increase the odds ratio of the variable of interest (eg, treatment) due to dividing by a smaller scaling factor.

The implication being

Different odds ratios from the same study cannot be compared when the statistical models that result in odds ratio estimates have different explanatory variables because each model has a different arbitrary scaling factor.4-6 Nor can the magnitude of the odds ratio from one study be compared with the magnitude of the odds ratio from another study, because different samples and different model specifications will have different arbitrary scaling factors. A further implication is that the magnitudes of odds ratios of a given association in multiple studies cannot be synthesized in a meta-analysis.4

(Reference 4 listed as #2 below.)
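
One way to read the "scaling factor" in the quoted passage is through the latent-variable form of the logit model. The following is only a sketch under assumptions: the symbols $y^*$, $\sigma$, $\gamma$ and $z$ are introduced here for illustration and do not come from the quoted text.

```latex
\begin{aligned}
y^* &= \beta_0 + \beta_1 x + \gamma z + \varepsilon,
  \qquad y = \mathbf{1}[y^* > 0],
  \qquad \varepsilon \sim \mathrm{Logistic}(0, \sigma) \\
\Pr(y = 1 \mid x, z) &= \Lambda\!\left(\frac{\beta_0 + \beta_1 x + \gamma z}{\sigma}\right),
  \qquad \Lambda(u) = \frac{1}{1 + e^{-u}}
\end{aligned}
```

Under this reading only the ratio $\beta_1/\sigma$ is identified from binary data. If $z$ is omitted (and is independent of $x$), the unexplained part becomes $\gamma z + \varepsilon$, which has a larger spread, so the fitted coefficient on $x$ is approximately $\beta_1/\sigma'$ with $\sigma' > \sigma$. The relation is approximate because $\gamma z + \varepsilon$ is generally not exactly logistic.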

I was surprised by this given that I am taught to treat ORs as transportable. Inspired by all you fine folks, I looked for an example of a simple simulated logistic regression to check for myself.

In R

 sims <- 100
 n <- 1000
 out <- data.frame(treat_1 = rep(NA, sims),
                   treat_2 = rep(NA, sims),
                   treat_3 = rep(NA, sims))
 for (i in 1:sims) {
   x1 <- rbinom(n, 1, 0.5)          # Treatment variable
   x2 <- rnorm(n)                   # Arbitrary continuous variable
   x3 <- rnorm(n)                   # Another arbitrary continuous variable
   z <- 1 + 2*x1 + 3*x2 + 4*x3      # Linear combination
   pr <- 1/(1 + exp(-z))            # Pass through an inv-logit function
   y <- rbinom(n, 1, pr)            # Bernoulli response variable
   # Now feed it to glm:
   df <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
   model1 <- glm(y ~ x1, data = df, family = "binomial")
   model2 <- glm(y ~ x1 + x2, data = df, family = "binomial")
   model3 <- glm(y ~ x1 + x2 + x3, data = df, family = "binomial")
   # Store the treatment coefficient (log odds ratio) from each model
   out$treat_1[i] <- model1$coefficients[[2]]
   out$treat_2[i] <- model2$coefficients[[2]]
   out$treat_3[i] <- model3$coefficients[[2]]
 }
 colMeans(out)                      # Average treatment coefficient per model

True enough, my column means (treatment coefficients, i.e. log odds ratios) were

  treat_1   treat_2   treat_3 
0.5980825 0.7585841 1.9987460 

So without any interaction between variables but all three being prognostic of outcome, I get three different results for my treatment effect.
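
A minimal sketch of one way to see where the gap comes from, assuming the same data-generating model as the simulation above (the seed, sample size, and object names are arbitrary choices for illustration). It fits the unadjusted and fully adjusted models once on a larger sample, then averages the adjusted model's predicted risks over the covariates (standardization) to get a marginal log odds ratio.

```r
set.seed(2018)
n  <- 20000
x1 <- rbinom(n, 1, 0.5)                    # treatment
x2 <- rnorm(n)                             # prognostic covariate
x3 <- rnorm(n)                             # another prognostic covariate
y  <- rbinom(n, 1, plogis(1 + 2 * x1 + 3 * x2 + 4 * x3))
df <- data.frame(y, x1, x2, x3)

fit_marg <- glm(y ~ x1, data = df, family = binomial)
fit_full <- glm(y ~ x1 + x2 + x3, data = df, family = binomial)

# Standardize: predict every subject's risk as treated and as untreated,
# then average those risks over the observed covariate distribution.
p1 <- mean(predict(fit_full, transform(df, x1 = 1), type = "response"))
p0 <- mean(predict(fit_full, transform(df, x1 = 0), type = "response"))
std_log_or <- qlogis(p1) - qlogis(p0)

round(c(conditional  = coef(fit_full)[["x1"]],
        unadjusted   = coef(fit_marg)[["x1"]],
        standardized = std_log_or), 2)
```

If this behaves as expected, the conditional coefficient should sit near the true value of 2, while the unadjusted and standardized values should agree closely with each other. That would suggest the three models are estimating different (conditional versus marginal) quantities rather than one quantity with varying error.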

So from this, my questions are:

  1. Have I made a mistake or misinterpretation that easily explains these results?
  2. I have been taught that when trialists provide adjusted odds ratios, those should be used for meta-analysis, but wouldn’t this imply that I would be extracting different odds ratios depending on the number of variables that were adjusted for?
  3. Is meta-analysis of observational trials using odds ratios entirely hopeless since, as stated by Norton et al: “different samples and different model specifications will have different arbitrary scaling factors”?


  1. Odds Ratios—Current Best Practice and Use
    EC Norton, BE Dowd, ML Maciejewski - JAMA, 2018
  2. Log odds and the interpretation of logit models
    EC Norton, BE Dowd - Health services research, 2018 - Wiley Online Library

hi @timdisher - fyi the link to the Norton, Dowd, and Maciejewski paper you included is routed through your university portal, so it won’t work for those without an account there :slight_smile:

You might find this interesting. Although it is a bit over my head, I know Anders addresses problems with ORs in his work: Effect modification and choice of effect measure

Thank you! I will edit just to attach them.


Really nice work @timdisher. I think you are right, but the implications are a bit more subtle. It is not the case that unadjusted odds ratios are comparable across studies either, because these unadjusted ORs are functions of subject characteristics that are hidden from the analysis.

I think the biggest mistake people tend to make in this area is criticizing adjusted estimates because you can never measure all the things you really need to adjust for. To that I say that we need to adjust for the most information that is available to us, given the absence of collider bias, looking into the future, etc.

It is true that for hazard ratios, odds ratios, etc. you can’t compare effect ratios across different sets of adjustors. The odds ratios have a fundamentally different meaning when adjustors change.


@timdisher, Thank you for drawing my attention to this article. I read Norton and Dowd’s earlier paper (your reference # 2) with a journal club last year and created a Shiny app to help illustrate the simulation they discuss on pp. 868–870. The only place I have used the app is in that journal club, so it doesn’t have any documentation, but I hope it might still be an interesting companion to that article: Shiny app.


Thank you, this is great!

Totally agree that unadjusted are no better (and almost certainly worse). My impression from these data is that synthesizing data from observational and randomized trials is even harder than the current literature base makes it out to be. Not sure about the fields you work in most commonly, but everyone in my end of neonatology seems to have their own pet characteristics to adjust for. Could this be addressed at all by working from an assumed baseline risk and then converting ORs to RRs or absolute risk?

Seeing you say this so matter-of-factly, juxtaposed against the all too common narrative of univariate adjustments followed by adding predictors to show how odds ratios change in response to “adjustment”, is a weird combination of funny and frustrating. I wonder if you would mind expanding on the last sentence. Is this a marginal vs conditional type scenario?


This is covered in the ANCOVA chapter in BBR where I reference an excellent paper on identifiability problems of hazard ratios for the Cox model.

You can work from an assumed baseline risk if the risk is declared to represent a single subject.
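
A minimal sketch of that conversion, assuming a hypothetical helper `or_to_risk()` and an illustrative odds ratio of 3 (neither comes from the thread). The baseline risk `p0` is taken to describe a single subject, as stated above.

```r
# Convert an odds ratio into absolute-risk quantities for one subject
# whose risk without the exposure is p0.
or_to_risk <- function(or, p0) {
  odds1 <- or * p0 / (1 - p0)      # odds with the exposure
  p1 <- odds1 / (1 + odds1)        # risk with the exposure
  c(p0 = p0, p1 = p1, rr = p1 / p0, rd = p1 - p0)
}

# Same OR, three different assumed baseline risks
t(sapply(c(0.05, 0.20, 0.50), or_to_risk, or = 3))
```

With an OR of 3, a subject at 50% baseline risk moves to 75% (RR 1.5), while a subject at 20% moves to about 43% (RR about 2.1). A single OR therefore maps to different RRs and risk differences depending on whose baseline risk is plugged in.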

Thank you for bringing this up.

Yes, ORs (and HRs) change as we change the number and type of covariates.
What I find striking in this JAMA paper is their assertion about meta-analysis. I would think that at least a random-effects meta-analysis would be acceptable (assuming we have reasonable homogeneity, i.e. studies used similar target populations, adjusted for the same covariates, etc.). Thoughts?

I had the same initial reaction. If they cannot be combined in meta-analysis under any scenario, then how can they be transported anywhere, e.g. using prediction models in clinical practice, developing economic models, etc.? The authors seem to suggest that you can really only rely on ORs for direction and statistical significance, which seems like a strong statement. Off to read BBR!


Keep in mind that if you adjust for 5 “big” predictors of outcome in one study and a different 5 predictors in another study, but the predicted risk that comes from one set is concordant with the predicted risk that comes from the other predictor set, you will have explained a large bulk of the easily explainable outcome variation, and done so in a way that makes the exposure odds ratios almost comparable.
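
A rough simulated sketch of this idea, under strong assumptions that are not from the thread: a single unmeasured prognostic score (`latent`), and two hypothetical adjustment sets that are each five noisy markers of it. For simplicity both sets are attached to the same simulated subjects.

```r
set.seed(97)
n <- 20000
treat  <- rbinom(n, 1, 0.5)
latent <- rnorm(n)                          # unmeasured underlying prognosis
y <- rbinom(n, 1, plogis(-1 + 2 * treat + 3 * latent))

# Each adjustment set holds its own five noisy markers of the same prognosis
noisy <- function(prefix) {
  m <- replicate(5, latent + rnorm(n, sd = 0.5))
  colnames(m) <- paste0(prefix, 1:5)
  as.data.frame(m)
}
study_a <- data.frame(y = y, treat = treat, noisy("a"))
study_b <- data.frame(y = y, treat = treat, noisy("b"))

log_or <- c(
  unadjusted = coef(glm(y ~ treat, family = binomial))[["treat"]],
  set_a = coef(glm(y ~ ., data = study_a, family = binomial))[["treat"]],
  set_b = coef(glm(y ~ ., data = study_b, family = binomial))[["treat"]]
)
round(log_or, 2)
```

In this setup the two adjusted log odds ratios should land close to each other and well above the unadjusted one, even though the two models share no covariates. How close they get depends on how much of the outcome variation each set explains.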


There is a very old paper by Gail et al that discusses numerical differences in effect measures between adjusted and unadjusted models for GLM-like regressions.
This “bias” (which is not really bias in the sense of bias in estimation, just a difference in the numerical values of adjusted and unadjusted models) underlines the non-collapsibility of ORs, which is the question indirectly asked here.

Because of this non-collapsibility, one cannot combine ORs from models with different structures (different predictors). The situation for the HR is somewhat more interesting, since in many cases the adjusted and unadjusted ratios don’t differ by much, depending on the extent of censoring.
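
Non-collapsibility can be checked with plain arithmetic. The sketch below uses hypothetical risks (not taken from Gail et al or any study) in two equally sized strata of a covariate Z that is balanced across exposure groups, so there is no confounding.

```r
odds <- function(p) p / (1 - p)

# Hypothetical outcome risks by exposure within two equally sized strata of Z
risk <- rbind(z0 = c(unexposed = 0.2, exposed = 0.5),
              z1 = c(unexposed = 0.5, exposed = 0.8))

# Stratum-specific (conditional) odds ratios
or_within <- odds(risk[, "exposed"]) / odds(risk[, "unexposed"])

# Z is split 50/50 in both exposure groups, so marginal risks are simple means
marg <- colMeans(risk)
or_marginal <- odds(marg[["exposed"]]) / odds(marg[["unexposed"]])

or_within    # 4 in each stratum
or_marginal  # about 3.45
```

The odds ratio is 4 in both strata, yet the odds ratio ignoring Z is about 3.45, even though Z is neither a confounder nor an effect modifier here.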

The link to the paper

I show some of Gail’s key examples in the ANCOVA chapter of BBR.

This is 2 years too late, but I had a discussion with Norton via email a year or two ago, and essentially he used the Hauck et al example to make a case for scaling factors. Hauck et al report a dataset with risky behavior as the outcome, HIV status as the exposure, and nyc (New York City versus San Francisco) as a third variable (all binary).

The OR for the effect of hiv status on risky behavior was 2.33
When nyc was added to the model, the OR for the effect of HIV status on risky behavior became 3.0

Norton stated in the email that:
the example in Hauck et al. perfectly illustrates their point and our main point. To review, our main point is that if you add or exclude variables (Hauck et al. call these variables mavericks) from a logistic regression, and those variables are correlated with the outcome but uncorrelated with the exposure, then the coefficient on the exposure will change due to the arbitrary scaling factor. This is exactly what happens when you first run model to predict risky behavior as a function of hiv, and then predict risky behavior as a function of both hiv and nyc (hiv and nyc have correlation of zero). The coefficient on hiv is higher – as expected – when nyc is included for reasons we discuss in our paper and for reasons Hauck discuss in their paper. This is a nice simple example that perfectly illustrates this important point. The two different magnitudes of the OR for hiv cannot be compared, because they are estimated with two different models.

The reality is that there are no scaling factors and no mavericks, and this was simply a faulty line of inference. If nyc was prognostic for risky, the OR change from 2.33 to 3 would have been because, as prognostic covariates are added, the OR for HIV status gets closer to its real value (i.e. less systematic error induced by not accounting for the effect of nyc). Norton advises not to use the OR for meta-analysis, and this is incorrect. ORs from different studies will have varying degrees of systematic error (inducing heterogeneity) depending on the covariates adjusted for in the individual studies, but the OR remains firmly the effect size of choice in meta-analysis.

I am not an R user so cannot decipher your simulation, but the same logic should apply to your data.

I have had to track back and edit the post above. Norton said that HIV and NYC were not correlated but actually it was nyc and risky behavior that were not correlated. Therefore nyc was NOT prognostic for the outcome and by this logic the adjusted effects should not have changed.

So if we look at the data:
Effect of nyc on risky OR=1
Effect of hiv on risky OR=2.33
Effect of both OR(nyc)=0.6 OR(hiv)=3

The only explanation seems to be that hiv is a common effect of both risky and city and the two variable regression gives spurious results due to conditioning on a collider (the collider being hiv not nyc). Thus OR of 2.33 is the right one unless someone can come up with a different explanation.

I think the biggest problem with the implications of the JAMA paper is that readers will resort to something much worse than partially adjusted ORs. People who write guidelines need to always keep alternatives in mind. There are no perfect methods, but there are plenty of methods that are competitive.


Agree fully. One such implication (for example) is people resorting to a GLM with a log link instead of the logit link … :pensive:

Great example of a disaster. Most of the people using a log link don’t bother to even look for the myriad of fake interactions induced by the model to make it fit right: interactions that make it give estimates that are (properly) similar to estimates from a model with an unrestricted link function (logistic, probit).

Looks like I have to backtrack again!
The third variable is nyc and it is not associated with risky, so it cannot be either a collider or a confounder. The ORs of 0.6 and 3 (nyc and hiv respectively) are correct, and so is the marginal OR of 2.33 when nyc is ignored. In this dataset the hiv prevalence varies by city such that the negative effect of city on risky and the positive effect of hiv on risky balance perfectly, so city itself has no association with risky in this dataset, but its effect after controlling for hiv is 0.6 and not null.

The proportions are the proportions of risky and numbers are the cell sizes. There is no confounding, colliding nor effect modification and of course no mavericks.

I have done some more work on this topic and the Hauck et al paper so I am replacing my last post with this one. Their data example is below:

First let us consider the conclusion by Hauck et al:

a) They say that city is not a confounder (classical associational criterion) and call it a maverick because it meets the change-in estimate criterion
b) They are surprised that the OR at each level of Z (city) is 3 but OR(XY) is 2.33

In response to this, first, Z(city) is indeed a confounder in the classical sense - thus the unconfounded (PS weighted) OR(XY) is 2.95 which is still of course different from the adjusted OR of 3.

Second, there is no reason why, in the absence of confounding and effect modification, OR(XY) should equal OR(XY) | Z. Where the patient resides gives more information about the patient, and learning more about the patient should change our assessment of the chance that there will be risky behaviour. In this case the chance is conditional on what we know about that patient, and not an absolute property of the patient (see Buis).

As Buis says "We are not uncertain about the persons in our dataset, as we have already observed whether or not they experienced the event we are interested in. Instead, we imagine a person with the same characteristics as a person in the dataset but for whom it is unknown whether or not he or she experienced the event. We use the predicted proportion of persons with those characteristics that experienced the event as our measurement of the degree of plausibility that this new person experiences the event. So the chances are not a property of the units that are being studied, but a property of the researcher: it is her or his assessment of how likely it is that someone with similar characteristics will experience the event. If we look at chance in this way, then it is necessarily conditional on the available information. In logistic regression the “available knowledge” translates to the set explanatory variables included in the model. So the degree of plausibility is defined by the choice of control variables rather than something that exists outside the model. … This is now a desirable property of the logistic model in the sense that the chances we get out of a logistic regression react to new information (new variables) as one would expect. In particular, as one adds new variables, and thus becomes surer about the outcome, the chances will become more extreme".

Thus what Hauck consider a maverick behavior is really non-collapsibility and the latter is desirable because it indicates a true measure of association that is able to discriminate between three groups of patients: NYC alone, SFC alone and the combination of NYC & SFC. As Kuha says (paraphrased), “…the NYC and the SFC patients are different groups of individuals, and different again from the pooled group of both of them together. … we have every reason to expect that the effects in all of these groups are different in magnitude”.