Should one derive risk difference from the odds ratio?

Sounds like you may get into coding options for regressors. If so, I’ll refer you to Ch. 20 of Modern Epidemiology, 3rd ed. (2008), pp. 401–408.

1 Like

This is, for me, a strange question given that there are now 40 years of articles trying to explain why it is bad. Even stranger is that, although this is a very subtle numeric phenomenon, in my experience it seems more easily appreciated as a problem by clinicians than by statisticians.

I’ll give an illustrative example based on the math in Greenland (“Interpretation and choice of effect measures in epidemiologic analysis”. Am J Epidemiol 1987;125:761–8): Suppose (quite hypothetically and simplified of course) an attending physician at an ICU has just been given the abstract from a large RCT done on a random sample of all covid ICU admissions, with the treatment being oxygen via mask alone vs. intubation+sedation, and the outcome being 60-day mortality. The abstract says the average age of participants was 65, that 38% of them died within 60 days, and the observed overall odds ratio was 2.75, but results are not given for subgroups.

The physician is then confronted with a new covid ICU admission of a 65-year-old man who is accompanied by a wife concerned about the long-term cognitive effects of intubation+sedation; she asks for mask alone. Should the physician tell her that this will increase the patient’s odds of death by a factor of 2.75? If so, that would be wrong, because the risk of death varies tremendously across covariates, and consequently the increase in odds is likely far higher than 2.75 for this and every single specific patient! But given only the OR, the physician would be unable to say anything more precise, such as how much higher.

In contrast, given that the risk difference (RD) was 23%, the physician could at least say that on average the individual increases were 23%, and so if this patient is not visibly atypical of trial participants then 23% would be a reasonable expectation for the risk increase from refusing intubation. In this way, the OR is much less informative about subgroups and individuals than is the RD.
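To see where those marginal numbers can come from, here is a minimal check (my own sketch, assuming 1:1 allocation; the arm-level risks of 26.7% and 50% are the ones quoted later in this thread) that they are consistent with the abstract’s 38% overall mortality, the OR of 2.75, and the RD of 23%:

```python
# Check that intubation risk ~26.7% and mask risk 50% reproduce the abstract's
# marginal numbers under an assumed 1:1 allocation (illustrative sketch only).
r_intub, r_mask = 0.267, 0.50

def odds(p):
    return p / (1 - p)

overall_mortality = (r_intub + r_mask) / 2      # ~0.383, i.e. ~38%
odds_ratio = odds(r_mask) / odds(r_intub)       # ~2.75
risk_difference = r_mask - r_intub              # ~0.233, i.e. ~23%
print(round(overall_mortality, 3), round(odds_ratio, 2), round(risk_difference, 3))
```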

For further numeric discussion see the last two articles (“Noncollapsibility, confounding, and sparse-data bias” parts 1 and 2) on this JCE special issue page: Journal of Clinical Epidemiology | Key Concepts in Clinical Epidemiology | ScienceDirect.com by Elsevier

BTW our response to the latest from Doi et al. is posted at [2106.02673] Odds Ratios are far from "portable": A call to use realistic models for effect variation in meta-analysis

2 Likes

Frank: Why are you taking ratios of LR statistics? They are already logs of the LRs so have been transformed from the ratio scale of the likelihoods to the additive scale of the loglikelihoods. The A-B loglikelihood difference D is the Kullback-Leibler information divergence of B from A (a measure of how much information B loses compared to A); 2D is an LR statistic (with df equal to the number of product terms in A) which provides a P-value for testing B against A.

I can compare the D values, their P-values, and their S-values (surprisals, -log2(p)) across links to get comparisons scaled in ways I am familiar with from 70 years of literature on model checking and selection. What is added by the B/A ratio? What does it mean in frequentist, Bayesian, likelihoodist, or information terms?
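For readers following along, a minimal sketch of the quantities listed here (D and the degrees of freedom are hypothetical placeholders, not values from any model in this thread):

```python
# D is a loglikelihood difference between a larger model A and a nested model B;
# 2D is the LR chi-square statistic, which yields a P-value and an S-value (-log2 p).
import numpy as np
from scipy.stats import chi2

D = 4.2    # loglik(A) - loglik(B); hypothetical value
df = 3     # number of product terms added in A; hypothetical value

lr_stat = 2 * D               # LR chi-square statistic for testing B against A
p = chi2.sf(lr_stat, df)      # P-value
s = -np.log2(p)               # S-value (surprisal), in bits
print(f"2D = {lr_stat:.1f}, P = {p:.3f}, S = {s:.2f} bits")
```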

I agree with Frank – we urgently need to answer this question: What about non-collapsibility is bad?

Let’s take the same example given by @Sander (which is from the data in the JCE paper that I have tabulated above). The overall mortality is 38.33% and the OR is 2.75 for masks versus intubation (assuming such a trial were allowed). When a new 65-year-old patient comes in and we have only this information, all we know is that in this population, given an M:F ratio of 1:2 and refusing to know the gender of the new patient (or any other demographic information), the observed mortality odds increase 2.75-fold. The wife may not know what odds are, so a quick calculation can be made and the patient told that 26.7% of the intubated and 50% of the mask patients died, that the risk difference is 23.3% (for every 5 treated with a mask rather than intubation, 1 more patient died at 60 days), and that the fold increase in risk was 1.875 (RR), which is almost two-fold. All of this is correct and deduced from the OR (note that all risk-related information was derived solely from the OR using the equations in Doi et al), so I do not see any “fault” of the OR here. Perhaps @Sander can explain further why the OR was bad and the RD was not, so we understand what we are missing here. In particular, please explain why the OR is far higher for this and every single specific patient but the RD is okay even though it is a direct derivation from the OR.

I suspect @Sander will say that in males the OR for masks is 6, so if we had this information it would not be congruent with the earlier 2.75-fold increase in odds overall. But why is that a problem? Male gender is a very strong risk factor for death, and if we refuse to know how males were distributed at baseline, would we not expect the discrimination of deaths by mask to weaken? Again, from the OR of 6 (Doi et al), I can tell the wife that the death risk went from 60% to 90%, that the RD was 30%, and that for every 4 male patients treated with a mask rather than intubation, 1 more patient died at 60 days. This is a risk increase of 1.5-fold; note that the earlier risk increase was 1.875-fold. Obviously the RR dropped while the OR, NNT, and RD worsened, and given that the OR is a transformation of the C statistic, we can clearly deduce that there is no monotonic relationship of the RR with death discrimination.
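For completeness, a minimal sketch of the conversion used in both paragraphs above (the standard baseline-risk-plus-OR formula; the inputs are the risks and ORs already quoted):

```python
# Convert the control (intubation) risk r0 and an OR into the mask risk, RD, RR and NNH,
# applied to the overall and male figures quoted above.
def risks_from_or(r0, odds_ratio):
    odds1 = odds_ratio * r0 / (1 - r0)      # mask odds
    r1 = odds1 / (1 + odds1)                # mask risk
    return r1, r1 - r0, r1 / r0             # risk, RD, RR

for label, r0, orr in [("overall", 0.267, 2.75), ("males", 0.60, 6.0)]:
    r1, rd, rr = risks_from_or(r0, orr)
    print(f"{label}: risk {r0:.1%} -> {r1:.1%}, RD = {rd:.1%}, RR = {rr:.2f}, NNH = {1/rd:.1f}")
```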

Your response is remarkable for slipping in information where none was given: nowhere did I state that the M:F ratio was 1:2 or that the male OR was 6, nor was any other sex-specific information given or needed to make the point. Now, in the example from which I took the numbers, the male OR was 6.0 and the female OR was 3.9, both much higher than the overall causal OR of 2.75. Which only further proves the point: without specific information about prognostic factors in the trial population, we cannot from the total OR alone provide a specific bet, or even a betting range, for the impact of treatment on the new patient’s odds or risk, even if that patient is a typical patient from the same population on which the trial was conducted. In sharp contrast, the RD alone provides a coherent bet for the patient’s absolute risk increase.

If you don’t like the betting formulation, read the stochastic potential-outcome formulation in my 1987 AJE paper, where I show that every single individual can have an OR higher than the group OR (rendering the group OR misleading when applied to individuals), whereas this defect is not present with RRs or RDs, and group RDs are always simple averages of individual RDs. Read also Greenland, Robins & Pearl (“Confounding and collapsibility in causal inference,” Statistical Science 1999;14:29–46). [For completeness I add again that hazard rate ratios suffer the problem too, but in practice to a much lesser extent than ORs do.]
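A minimal numeric illustration of that phenomenon (stratum risks chosen for simplicity, not taken from the 1987 paper): two equal-sized strata with identical conditional ORs yield a smaller marginal OR, while the marginal RD is exactly the average of the stratum RDs.

```python
# Two equal-sized strata, treatment randomized 1:1 within each, same conditional OR in both.
# The collapsed (marginal) OR falls below every stratum OR; the marginal RD is the simple
# average of the stratum RDs. (Illustrative risks, not data from any study in this thread.)
def odds(p):
    return p / (1 - p)

r0 = [0.20, 0.80]                  # untreated risks in the two strata
conditional_or = 5.0
r1 = [odds(p) * conditional_or / (1 + odds(p) * conditional_or) for p in r0]

marg_r0, marg_r1 = sum(r0) / 2, sum(r1) / 2        # equal stratum sizes
stratum_rds = [a - b for a, b in zip(r1, r0)]

print("stratum ORs :", [round(odds(a) / odds(b), 2) for a, b in zip(r1, r0)])  # [5.0, 5.0]
print("marginal OR :", round(odds(marg_r1) / odds(marg_r0), 2))                # ~3.07 < 5
print("stratum RDs :", [round(d, 3) for d in stratum_rds])
print("marginal RD :", round(marg_r1 - marg_r0, 3),
      "= mean of stratum RDs:", round(sum(stratum_rds) / 2, 3))
```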

Again, the problem with ORs has been obvious to many people and explained in many ways over some 40 years, and does not require deep math to see; simple numeric examples suffice. So, as I said well before in this long thread, I cannot ascribe your continuing failure to see it, and your responses (which, as Anders noted, keep slipping in assumptions or irrelevancies), to anything other than a cognitive blockage. Again, I suspect that blockage arises from finding it impossible to believe that you overlooked something so obvious, plus a commitment to defend ORs as something special beyond their technical advantages (which we all recognize). Those technical advantages are countered by the amply documented interpretational disadvantage of severe noncollapsibility, which in the end makes the OR a classic example of one of the most fundamental laws of methodology: the NFLP (No-Free-Lunch Principle).

2 Likes

I read the first of Sander’s two new J Clin Epi papers, and like the replies above, the paper comes across as “non-collapsibility is bad because it is bad”. In the medical example there is nothing bad about non-collapsibility; what is bad is that the paper the clinician is quoting did not provide a patient-specific effect estimate. If we start by treating “non-collapsibility” as pejorative, we tend to end the discussion with it being pejorative.

In the history of the development of methods for a 2x2 table, the assumption of homogeneity is made (e.g., the probability of the outcome for everyone on treatment i is p_i). The simplest forms of risk differences, risk ratios, and odds ratios were designed to apply to this situation, not to the case where there is within-group outcome heterogeneity. So in my personal view, indexes such as unadjusted odds ratios were designed for the homogeneous case; in the face of heterogeneity they are far less useful.

For the current purpose I could have used the classic likelihood ratio \chi^2 statistic \Delta = A - B. That statistic is proportional to the sample size, and I am seeking an index that is sample-size independent; A/B is that. Another approach might be the difference in pseudo R^2, in which one of the terms would be \exp(-\frac{A}{n}). One could also argue for \exp(-\frac{\Delta}{n}) or just \frac{\Delta}{n}, but the interpretation depends on the nature of Y. For example, in time-to-event analysis the literature has shown us that the proper n to use in the denominator is somewhere between the number of events and the number of subjects.
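To make this concrete, here is a minimal sketch with simulated data, under my reading that A and B are the LR \chi^2 statistics (against the intercept-only model) of a logistic model with a treatment-by-covariate product term and of the additive model; the data, variable names, and formulas are hypothetical:

```python
# A and B are LR chi-square statistics, against the intercept-only model, of a logistic
# model with a treatment-by-age product term (A) and of the additive model (B).
# Delta = A - B grows with n; the ratio B/A (or its reciprocal A/B) is a sample-size-stable
# index of how much of the achievable LR chi-square the additive model captures.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 4000
age = rng.normal(65, 10, n)
tx = rng.integers(0, 2, n)
lp = -5.5 + 0.07 * age + 0.5 * tx + 0.03 * tx * (age - 65)   # non-additive true log-odds
y = rng.binomial(1, 1 / (1 + np.exp(-lp)))
d = pd.DataFrame({"y": y, "tx": tx, "age": age})

null = smf.logit("y ~ 1", d).fit(disp=0)
additive = smf.logit("y ~ tx + age", d).fit(disp=0)
interaction = smf.logit("y ~ tx * age", d).fit(disp=0)

A = 2 * (interaction.llf - null.llf)
B = 2 * (additive.llf - null.llf)
print("Delta (A - B) =", round(A - B, 1))          # classic LR chi-square, 1 df here
print("B / A =", round(B / A, 3))                  # fraction of achievable LR chi-square
print("pseudo R^2 terms:", round(1 - np.exp(-A / n), 3), round(1 - np.exp(-B / n), 3))
```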

3 Likes

First, I agree with @Sander that the OR debate does not require deep math to see; simple numeric examples clearly should suffice and everyone on this blog should be able to follow easily, though I am not sure why the debate has gone on for 40 years!

In summary, the main argument from @Sander (in this post) is that subgroup ORs are wrong because they do not collapse onto the overall OR, and subgroup RDs or RRs are right because they collapse onto the overall RD or RR. The implication is that even if we get the outcome model wrong by refusing to account for easily accounted-for outcome heterogeneity, the RR/RD result from the wrong model can provide a coherent bet for a patient’s relative/absolute risk increase because the two subgroups collapse onto the overall.

Let’s see if this holds up to further scrutiny. If we consider masks as an index test and death as the gold standard, the overall Se and Sp for masks in relation to death are 65.2% and 59.5% respectively; in the male subgroup the Se and Sp are 60% and 80% respectively. The AUC, computed as (Se + Sp)/2, is 0.62 overall and 0.7 in males (a worked check of these numbers is sketched at the end of this post). Thus there is greater discrimination of death through masks in the male subgroup than overall, and the OR reflects this, increasing from 2.75 to 6. The numerical RR value, however, declines from 1.875 to 1.5. Granted, without specific information about prognostic factors in the trial population we did not know the gender information at baseline; what we did know at baseline was that the RR is collapsible and hence will not reflect within-subgroup death discrimination accurately, while the OR is non-collapsible and therefore will reflect within-subgroup death discrimination accurately. Now what shall I tell the wife of the new patient? I could say one of the following:

a. There is at least a 2.75-fold increase in death odds (converted to an NNH for her) with masks in this population. Since the trialists refused to account for any other patient-related factors, we have no additional information for your husband, but if we did know subgroup values for people more like him, those subgroup values would likely indicate the same death discrimination or greater for people like your husband.

b. There is a 1.875-fold increase in death risk (converted to an NNH for her) with masks in this population. Since the trialists refused to account for any other patient-related factors, we have no additional information for your husband, but if we did know subgroup values for people more like him, those subgroup values may be inconsistent with the actual death discrimination for people like your husband.

What do I mean by inconsistent? The RR of 1.5 in men is not comparable to the 1.875 overall and lacks true comparative meaning; in short, it does not imply that the risk increase from masks is smaller in men than in the overall population. On the other hand, the OR does have this interpretational consistency. I and everyone else would be keen to see your arguments on this.
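For anyone who wants to verify the Se/Sp/AUC arithmetic above, a minimal sketch using the arm-level risks already quoted and assuming 1:1 treatment allocation within each group:

```python
# Treat mask as the "index test" and death as the "gold standard", with 1:1 allocation.
# Inputs are the mask and intubation death risks quoted above (overall and males).
def se_sp_auc(r_mask, r_intub):
    se = r_mask / (r_mask + r_intub)                       # P(mask | died)
    sp = (1 - r_intub) / ((1 - r_intub) + (1 - r_mask))    # P(intubation | survived)
    return se, sp, (se + sp) / 2

print("overall:", [round(x, 3) for x in se_sp_auc(0.50, 0.267)])   # ~0.652, 0.595, 0.62
print("males:  ", [round(x, 3) for x in se_sp_auc(0.90, 0.60)])    # 0.60, 0.80, 0.70
```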

At last, Frank, you have stated your perception in a way that looks to me truly colorblind to explicit statements of why noncollapsibility is considered by so many to be bad: it’s bad because it means the subgroup ORs (which is all clinicians know about, because that is what is highlighted in reports and quoted by editorials and medical media) can be arbitrarily far from every individual effect, even if all the individual effects are identical and there is no uncontrolled study bias (such as confounding). Furthermore, this phenomenon should be expected in practice because there is always unmeasured baseline risk variation (as shown by the fact that individual outcomes vary within our finest measured subgroupings). So the clinician can never supply a patient-specific OR estimate; she can only supply the OR estimate for the finest subgroup she’s seen reported (which is rarely anything finer than sex plus some very crude age grouping like “above/below 60”).

I am thus forced to conclude that you think it not bad that a reported OR will always mislead the user when applied to individual decisions, unless it is combined with other information to transform it into a collapsible measure like the RD or component risks. You need to confirm or deny that impression.

Mathematically, what is going on is that the OR is variation (logically) independent of the baseline risk, which is a technical advantage for model simplicity and fitting but which also means that it conveys less information about individual effects than an RD does. The RD has exactly the opposite pairing: it is severely dependent on the baseline risk, which is a technical disadvantage for model simplicity and fitting, but which also means it conveys more information about individual effects (specifically, the individual RDs must average to the group RD). The RR falls somewhere between these extremes: not as technically well behaved as the OR, yet providing some information about individual effects; not as technically ill behaved as the RD, while providing only less simple information about individual effects.

If after a few generations there is no agreement about noncollapsibility, then we should not expect any in your and my remaining lifetimes. What we might agree on is that, for clinical applications, the focus of reporting should be not on effect measures but on full reporting of risk functions. I have been advocating that since the beginning of this controversy in the 1970s, as have others.

This resolution would, however, require concessions on your part: (1) that people should stop touting the OR as some kind of magic special measure (which is what the papers by Doi et al. and Doi’s comments here come across as doing), and (2) that for those who (unlike you and Doi) do want a measure that represents average individual effects, the OR simply will not suffice if the risk is not always “low” (in the sense explained in my papers, e.g., if 10% deviation from the average is the maximum tolerable error then the OR shouldn’t be used if the odds can exceed 10%).

1 Like

Nobody has ever claimed that collapsibility is sufficient for being a good effect measure (let alone being the generally “correct” effect measure, which is an idea that you need to let go of).

You will get different predictions depending on whether you use RR or RD, and they are both collapsible, so clearly collapsibility is not sufficient.

I do however believe that collapsibility is a necessary criterion for being a plausible effect measure, for reasons that have been explained in detail, both in this thread and in the literature, going back half a century.

1 Like

Is it possible that what is happening is that clinical researchers, who are simultaneously placed in the individual patient decision role as well as the analyst/modelling role, believe that for individual risk, modelling must be done on the risk scale, and do not understand the technical advantages of a transformation to the OR for statistical modelling and smoothing?

Frank gave an example (which I posted above) where he combines the RD and OR in a logistic model to come up with a nice nomogram showing how individual risk interacts with treatment (in terms of absolute risk reduction).

Is it also possible that statisticians, only finding themselves in the modelling portion of clinical research, focus more on the technical advantages of the OR, and simply assume clinicians will know how to transform it into a useful metric?

And epidemiologists, being focused on observational research and risk, find problems with the OR due to many factors in study design they cannot control?

1 Like

Doi: I think Anders already commented on the assumptions you slip in here to save the OR with what looks to OR critics like circular reasoning (or worse), e.g., that the clinician can immediately access the assumed information from available research reports, and also has the facility to combine the OR with that information to obtain more comprehensible measures like NNH, and that AUC is a desirable target. One key point is that violation of all these assumptions is the norm, so slipping them in looks to us like cheating to win.

The AUC angle is, however, a bizarre distraction in another sense: anyone who knows screening knows that sensitivity and specificity information (which is what AUC is derived from) is woefully inadequate for clinical prediction; predictive values are what is needed, and those are exquisitely sensitive to the baseline risk information missing from AUC, just as it is missing from the OR. Hence the OR’s correlation with AUC does nothing to enhance the credibility of the OR for us; it just looks like another way to miss our points. The comments about RR keep missing the same points. Those points are explained once more in my reply to Frank, so please read that carefully before replying.

1 Like

R-cubed: Clinicians should not be assumed to be in an analyst/modeling role. They usually have no meaningful training or understanding of data analysis or modeling, and have no time for that. And in terms of medical topics, they rarely have time for reading more than brief summaries given in editorials in their journals and medical newsletters such as Medscape and MedPage, and in continuing education materials (which look like they are summaries of summaries). They are thus critically dependent on what is shown in those summaries, which is not much. Because of lack of time and sheer volume of literature, few can go further and most of the tiny minority that do (usually med faculty) still retain only what is highlighted in abstracts.

Yet I find many statisticians write as if typical clinicians are somehow sophisticated enough to appreciate technicalities that few could begin to comprehend. Perhaps this is because the statisticians are insulated in medical schools dealing with research faculty, who are typically far more conversant in statistics than the vast majority of clinicians (although, as many can attest, clinical faculty are still far more vulnerable to errors of their statistical authorities, as witnessed by continuing use of correlations and “standardized” coefficients as effect measures in some reports).

Finally, I’ll repeat that the OR controversy has absolutely nothing to do with uncontrolled factors in study design (potential bias sources or validity problems). It would arise even if every study was a perfect randomized trial on a random sample of the clinical target population (as in my example).

1 Like

OK then, correct me if I’m wrong but any debate about your choice seems pretty far removed from the current OR controversy and so warrants a separate topic thread. My question about what your A/B index is measuring and how it would perform and be interpreted under various perspectives (frequentist, informationalist, causalist, likelihoodist, Bayesian, etc.) remains unanswered, so if you can provide coverage addressing this admittedly extensive question, please do.

Thank you Sander, I will certainly read your RR comments carefully before replying, but this post needs a response too, to clarify what you said. It is not just the AUC that is derived from Se and Sp information but also the likelihood ratios (which are themselves odds ratios, though not the classical odds ratio – see Doi et al). The predictive values are nothing more than posterior probabilities, so how is a posterior probability more adequate than the likelihood ratio and prior probability from which it is derived? Of course posterior probabilities depend on prior probabilities (aka baseline risk), the link being the likelihood ratio (aka the non-classical odds ratio). This is like saying that the risk difference is woefully inadequate because r1 is what is needed and r1 is exquisitely sensitive to r0 (the baseline risk), which I am sure you would not agree with.

Every single measure is inadequate in some way for some purposes. This goes back to my comment that the universal problem here is one of highly lossy compression of complex data information down to single summary measures. I saw no one here claiming a posterior probability is more adequate than an LR and prior probability together; in fact, that would be absurd, since one could derive the posterior probability from the latter two but not the other way around. I for one would always want to see the LR and prior before seeing the posterior. But that is tangential to my points to Frank and R-cubed about what clinicians see and can digest properly, which are pivotal to this controversy.

Okay, great. So let’s change the scenario a bit: r0 is now the risk when the test is negative, the likelihood ratio of interest is now the diagnostic odds ratio, and from these two we can derive the posterior probability when the test is positive. This I am sure you will agree with, based on your quote above.

Then why are you unhappy with the same thing in the example? Under no mask the probability is 0.267 (odds 0.3636), the likelihood ratio is 2.75, and we can then compute a posterior odds (1.0) and probability (0.5). The link between the probabilities under no-mask and mask is thus the OR (aka the likelihood ratio). Why is this a “bizarre distraction”?
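Spelled out, the calculation above is just the odds form of Bayes’ rule, with the OR playing the role of the likelihood ratio:

$$
\text{posterior odds} = \text{prior odds} \times LR = \frac{0.267}{1-0.267} \times 2.75 \approx 0.364 \times 2.75 \approx 1.0,
\qquad
\text{posterior probability} = \frac{1.0}{1+1.0} = 0.5 .
$$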

What I was referring to as a bizarre distraction is the AUC. Again, our problem is that you keep introducing information that was not given in my example, followed by computations that are not performable under what was actually given. BTW, I often see summaries where your assumptions are not met, or where one could not reasonably expect a clinician or even most researchers to extract them from what was given.

Sander, I second your comments relating to modeling. And I concur that the RD is a highly clinically useful quantity, one that most patients want; note that in the majority of situations the RD is best estimated from the OR and the control risk.

The granularity argument is singularly unconvincing. I am not interested in a situation at the limit where information content is infinite; I am interested in the pragmatic situation of having the maximum information that is feasible to measure. At a minimum we seek age, extent of disease, and a baseline measurement of the outcome when sensible. Ideally we want more than that, but if we only have age, at least we use that effectively (unlike the majority of clinical trials, which provide only crude marginal estimates even with a whopping age effect and wide age variation in the sample). Related to your first new JCE paper, I note that the definition of odds is the ratio of a probability to its complement. Odds are not defined for an individual 0/1 occurrence, so neither is the OR.

The fact that an odds ratio is based on incomplete conditioning makes it neither misleading nor useless, just as a logistic model that includes 5 of the 7 really important patient factors is still useful.

I will never criticize a measure because it can’t be compared to crude marginal estimates that should not have been published.

Non-collapsible measures such as odds ratios and hazard ratios are indeed plausible effect measures, and the only ones that are halfway capable of being taken out of context.

Fair enough. I got into this because it directly relates to the issue of the “best default modeling starting point” in terms of quantifying non-additivity on the assumed link scale. I’ll just leave it that we need a non-sample-size-dependent quantification of the lack of fit of additive models. If you don’t like relative LR \chi^2, there are relative R^2 measures, including relative explained-variation measures that are functions only of X\hat{\beta} or \hat{P}. Related to that, one might consider the relative median absolute prediction error and similar indexes.

2 Likes

Maybe if you keep saying it, it will some day become true.

Let’s keep the discussion at a high level, as requested before.

4 Likes