Should one derive risk difference from the odds ratio?

Frank: Why are you taking ratios of LR statistics? They are already logs of the LRs so have been transformed from the ratio scale of the likelihoods to the additive scale of the loglikelihoods. The A-B loglikelihood difference D is the Kullback-Leibler information divergence of B from A (a measure of how much information B loses compared to A); 2D is an LR statistic (with df equal to the number of product terms in A) which provides a P-value for testing B against A.

I can compare the D values, their P-values, and their S-values (surprisals, -log2(p)) across links to get comparisons scaled in ways I am familiar with from 70 years of literature on model checking and selection. What is added by the B/A ratio? What does it mean in frequentist, Bayesian, likelihoodist, or information terms?

I agree with Frank – we urgently need to answer this question: What about non-collapsibility is bad?

Let’s take the same example given by @Sander (from the data in the JCE paper that I have tabulated above). The overall mortality is 38.33% and the OR is 2.75 for masks versus intubation (assuming such a trial was allowed). When a new 65-year-old patient comes in and we have only this information, then all we know is that in this population, given a M:F ratio of 1:2, and given that we refuse to know the gender of the new patient (or any other demographic information), the observed mortality odds increase 2.75-fold. The wife may not know what odds are, so a quick calculation can be made and the patient told that 26.7% of the intubated and 50% of the mask patients died, that the risk difference is 23.3% (for every 5 treated with a mask compared to intubation, 1 more patient died at 60 days), and that the fold increase in risk was 1.875 (RR), which is almost two-fold. All of this is correct and deduced from the OR (note that all risk-related information was derived solely from the OR using the equations in Doi et al), so I do not see any “fault” of the OR here. Perhaps @Sander can explain further why the OR was bad and the RD was not bad, so we understand what we are missing here. In particular, please explain why the OR is far higher for this and every single specific patient but the RD was okay, even though it is a direct derivation from the OR.

I suspect @Sander will say that in males the OR for masks is increased 6-fold, so if we had this information it would not be congruent with the earlier 2.75-fold increase in odds overall. But why is that a problem? Male gender is a very strong risk factor for death, and if we refuse to know how males were distributed at baseline, would we not expect the discrimination of deaths by mask to weaken? Again, from the OR of 6 (Doi et al), I can tell the wife that death risk went from 60% to 90%, that the RD was 30%, and that for every 4 male patients treated with a mask compared to intubation, 1 more patient died at 60 days. This is a risk increase of 1.5-fold; note that the earlier risk increase was 1.875-fold. Obviously the RR dropped while the OR, NNT, and RD worsened, and given that the OR is a transformation of the C statistic, we can clearly deduce that there is no monotonic relationship between the RR and death discrimination.
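The derivations quoted in this post and the one above (treated risk, RD, RR, and NNH recovered from a baseline risk plus an OR) can be sketched as follows; the numbers are exactly those quoted (overall: baseline risk 0.267, OR 2.75; males: baseline risk 0.60, OR 6.0), and `risk_from_or` is just a hypothetical helper name for the odds-scale conversion, not a function from Doi et al:

```python
def risk_from_or(r0, odds_ratio):
    """Treated risk implied by baseline risk r0 and an odds ratio."""
    odds1 = (r0 / (1 - r0)) * odds_ratio   # baseline odds scaled by the OR
    return odds1 / (1 + odds1)             # back-transform odds to risk

# Overall and male-subgroup figures quoted above
for label, r0, orr in [("overall", 0.267, 2.75), ("males", 0.60, 6.0)]:
    r1 = risk_from_or(r0, orr)
    rd, rr = r1 - r0, r1 / r0
    print(f"{label}: r1={r1:.3f}  RD={rd:.3f}  RR={rr:.3f}  NNH={1/rd:.1f}")
```

Running this reproduces the quoted values: overall r1 ≈ 0.50, RD ≈ 0.233, RR ≈ 1.87 (NNH about 4–5); males r1 = 0.90, RD = 0.30, RR = 1.5 (NNH about 3–4).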

Your response is remarkable for slipping in information where none was given: nowhere did I state that the M:F ratio was 1:2 or that the male OR was 6, nor was any other sex-specific information given or needed to make the point. Now, in the example from which I took the numbers, the male OR was 6.0 and the female OR was 3.9 – both much higher than the overall causal OR of 2.75. Which only further proves the point: without specific information about prognostic factors in the trial population, we cannot from the total OR alone provide a specific bet, or even a betting range, for the impact of treatment on the new patient’s odds or risk, even if that patient is a typical patient from the same population on which the trial was conducted. In sharp contrast, the RD alone provides a coherent bet for the patient’s absolute risk increase.

If you don’t like the betting formulation, read the stochastic potential-outcome formulation in my 1987 AJE paper, where I show that every single individual can have an OR higher than the group OR (rendering the group OR misleading when applied to individuals), whereas this defect is not present with RRs or RDs, and group RDs are always simple averages of individual RDs. Read also Greenland, Robins, and Pearl (“Confounding and collapsibility in causal inference.” Statistical Science 1999;14:29–46). [For completeness I add again that hazard rate ratios suffer the same problem, but in practice to a much lesser extent than ORs do.]
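A minimal numeric sketch of the noncollapsibility being described (hypothetical numbers, not the figures from the AJE or JCE examples): two equal-sized strata share the same conditional OR of 4, yet the marginal OR computed from the collapsed risks is strictly smaller, while the marginal RD is exactly the average of the stratum RDs:

```python
def odds(p):
    return p / (1 - p)

r0 = [0.2, 0.5]                                       # untreated risks in strata 1, 2
r1 = [odds(p) * 4 / (1 + odds(p) * 4) for p in r0]    # treated risks, conditional OR = 4

m0, m1 = sum(r0) / 2, sum(r1) / 2                     # marginal (collapsed) risks
print("stratum ORs:", [round(odds(b) / odds(a), 3) for a, b in zip(r0, r1)])
print("marginal OR:", round(odds(m1) / odds(m0), 3))  # smaller than 4
print("marginal RD:", round(m1 - m0, 3),
      "= mean stratum RD:", round(sum(b - a for a, b in zip(r0, r1)) / 2, 3))
```

Here both stratum ORs are 4 but the marginal OR is about 3.45, whereas both stratum RDs and the marginal RD are 0.30: the OR fails to collapse even with identical conditional effects and no confounding, while the RD averages exactly.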

Again, the problem with ORs has been obvious to many people and explained in many ways over some 40 years, and does not require deep math to see; simple numeric examples suffice. So, as I said way back in this long thread, I cannot ascribe your continuing failure to see it, and your responses (which, as Anders noted, keep slipping in assumptions or irrelevancies), to anything other than a cognitive blockage. Again, I suspect that blockage arises from finding it impossible to believe that you overlooked something so obvious, plus a commitment to defend ORs as something special beyond the technical advantages (which we all recognize). Those technical advantages are countered by the amply documented interpretational disadvantage of severe noncollapsibility, which in the end makes the OR a classic example of one of the most fundamental laws of methodology: the NFLP (No-Free-Lunch Principle).


I read the first of Sander’s two new J Clin Epi papers, and like the replies above, this paper comes across as “non-collapsibility is bad because it is bad”. In the medical example there is nothing bad about non-collapsibility. What is bad is that the paper the clinician is quoting did not provide a patient-specific effect estimate. If we start by treating “non-collapsibility” as pejorative, we tend to end the discussion with it being pejorative.

In the history of the development of methods for a 2x2 table, the assumption of homogeneity is made (e.g., the probability of the outcome for everyone on treatment i is p_i). The simplest forms of risk differences, risk ratios, and odds ratios were designed to apply to this situation, not to the case where there is within-group outcome heterogeneity. So in my personal view, indexes such as unadjusted odds ratios were designed for the homogeneous case. In the face of heterogeneity they are far less useful.

For the current purpose I could have used the classic likelihood ratio \chi^2 statistic \Delta = A-B. That statistic is proportional to the sample size. I am seeking an index that is sample-size independent. A/B is that. Another approach might be the difference in pseudo R^2; one of the terms would be \exp(-\frac{A}{n}). One could also argue for \exp(-\frac{\Delta}{n}) or just \frac{\Delta}{n} but interpretation depends on the nature of Y. For example, in time-to-event analysis the literature has shown us that the proper n to use in the denominator is somewhere between the number of events and the number of subjects.
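The arithmetic of the candidate indexes in this post can be sketched as follows, reading A and B (as here) as likelihood-ratio \chi^2 statistics for two rival models against the null; the numeric values of A, B, and n below are hypothetical, chosen only to show the scaling:

```python
import math

n = 1000                 # sample size (hypothetical)
A, B = 180.0, 150.0      # LR chi-square statistics for models A and B (hypothetical)

delta = A - B            # classic LR statistic for B vs A; grows with n
ratio = A / B            # sample-size-independent index discussed above
r2_A = 1 - math.exp(-A / n)   # Cox–Snell-type pseudo R^2 term for model A
r2_B = 1 - math.exp(-B / n)   # ... and for model B
print(f"delta={delta}  A/B={ratio:.3f}  R2_A={r2_A:.3f}  R2_B={r2_B:.3f}")
```

Doubling n while the models fit equally well would roughly double A, B, and delta, but leave A/B unchanged, which is the sense in which the ratio is sample-size independent.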


First, I agree with @Sander that the OR debate does not require deep math to see; simple numeric examples clearly should suffice and everyone on this blog should be able to follow easily, though I am not sure why the debate has gone on for 40 years!

In summary, the main argument from @Sander (in this post) is that subgroup ORs are wrong because they do not collapse onto the overall OR, and subgroup RDs or RRs are right because they collapse onto the overall RD or RR. The implication is that even if we get the outcome model wrong by refusing to account for easily accounted-for outcome heterogeneity, the RR/RD result from the wrong model can provide a coherent bet for the patient’s relative/absolute risk increase because the two subgroups collapse onto the overall.

Let’s see if this holds up to further scrutiny. If we consider masks as an index test and death as the gold standard, the overall Se and Sp for masks in relation to death are 65.2% and 59.5%, respectively. In the male subgroup the Se and Sp are 60% and 80%, respectively. The AUC, (Se+Sp)/2, is 0.62 overall and 0.7 in males. Thus there is greater discrimination of death through masks in the male subgroup compared to overall, and the OR reflects this, increasing from 2.75 to 6. The numerical RR value, however, declines from 1.875 to 1.5. Granted, without specific information about prognostic factors in the trial population we did not know the gender information at baseline; what we did know at baseline was that the RR is collapsible, hence it will not reflect within-subgroup death discrimination accurately, while the OR is non-collapsible and therefore will reflect within-subgroup death discrimination accurately. Now what shall I tell the wife of the new patient? I could say one of the following:

a. There is at least a 2.75-fold increase in death odds (convert to NNH for her) with masks in this population and, since the trialists refused to account for any other patient-related factors, we have no additional information for your husband; but if we did know subgroup values for people more like him, the subgroup values would likely indicate the same death discrimination or greater for people like your husband.

b. There is a 1.875-fold increase in death risk (convert to NNH for her) with masks in this population and, since the trialists refused to account for any other patient-related factors, we have no additional information for your husband; but if we did know subgroup values for people more like him, the subgroup values may be inconsistent with actual death discrimination for people like your husband.

What do I mean by inconsistent? The RR of 1.5 in men is not comparable to the 1.875 overall and lacks true comparative meaning – in short, it does not imply that the increase in death risk is smaller in men than in the overall population. The OR, on the other hand, does have this interpretational consistency. I and everyone else would be keen to see your arguments on this.
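The screening-test computations in the post above can be sketched as follows: from a single (Se, Sp) pair, the AUC quoted is (Se+Sp)/2, and the diagnostic odds ratio is (Se/(1-Se)) / ((1-Sp)/Sp); the values are the ones quoted (overall Se 65.2%, Sp 59.5%; males Se 60%, Sp 80%):

```python
def auc(se, sp):
    """AUC from a single sensitivity/specificity pair, as used above."""
    return (se + sp) / 2

def dor(se, sp):
    """Diagnostic odds ratio from sensitivity and specificity."""
    return (se / (1 - se)) / ((1 - sp) / sp)

for label, se, sp in [("overall", 0.652, 0.595), ("males", 0.60, 0.80)]:
    print(f"{label}: AUC={auc(se, sp):.2f}  OR={dor(se, sp):.2f}")
```

This reproduces the figures in the post: AUC 0.62 and OR 2.75 overall, AUC 0.70 and OR 6.0 in males, which is the sense in which the OR is said here to track discrimination.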

At last, Frank, you have stated your perception in a way that looks to me truly colorblind to explicit statements of why noncollapsibility is considered by so many to be bad: it’s bad because it means the subgroup ORs (which are all clinicians know about, because that is what is highlighted in reports and quoted by editorials and medical media) can be arbitrarily far from every individual effect, even if all the individual effects are identical and there is no uncontrolled study bias (such as confounding). Furthermore, this phenomenon should be expected in practice because there is always unmeasured baseline-risk variation (as shown by the fact that individual outcomes vary within our finest measured subgroupings). So the clinician can never supply a patient-specific OR estimate; she can only supply the OR estimate for the finest subgroup she’s seen reported (which is rarely anything finer than sex plus some very crude age grouping like “above/below 60”).

I am thus forced to conclude that you think it not bad that a reported OR will always mislead the user when applied to individual decisions, unless it is combined with other information to transform it into a collapsible measure like the RD or component risks. You need to confirm or deny that impression.

Mathematically, what is going on is that the OR is variation (logically) independent of the baseline risk, which is a technical advantage for model simplicity and fitting but which also means that it conveys less information about individual effects than an RD. The RD has exactly the opposite pairing: It is severely dependent on the baseline risk, which is a technical disadvantage for model simplicity and fitting, but which also means it conveys more information about individual effects (specifically the individual RD must average to the group RD). The RR falls somewhere between these extremes in being not as technically well behaved as the OR but providing some information about individual effects; also not as ill-behaved technically as the RD while providing some less simple information about individual effects.

If after a few generations there is no agreement about noncollapsibility, then we should not expect any in your and my remaining lifetimes. What we might agree on is that, for clinical applications, the focus of reporting should be not on effect measures but on full reporting of risk functions. I have been advocating that from the beginning of this controversy in the 1970s, as have others.

This resolution would however require concessions on your part that (1) people should stop touting the OR as some kind of magic special measure (which is what the papers by Doi et al. and Doi’s comments here come across as doing) and that (2) for those who (unlike you and Doi) do want a measure that represents average individual effects, the OR simply will not suffice if the risk is not always “low” (in the sense explained in my papers, e.g., if 10% deviation from the average is the maximum tolerable error then the OR shouldn’t be used if the odds can exceed 10%).
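The “low risk” caveat in point (2) can be sketched numerically: holding an OR of 2 fixed, the RR it implies drifts away from 2 as baseline risk grows, so reading an OR as if it were an average relative effect is only safe when risks (odds) are small. The helper name and the grid of baseline risks below are illustrative choices, not from the papers:

```python
def implied_rr(r0, odds_ratio):
    """RR implied by a fixed odds ratio at baseline risk r0."""
    odds1 = r0 / (1 - r0) * odds_ratio
    return (odds1 / (1 + odds1)) / r0

for r0 in (0.01, 0.05, 0.10, 0.30, 0.50):
    print(f"r0={r0:.2f}  RR implied by OR=2: {implied_rr(r0, 2):.2f}")
```

At a baseline risk of 1% the implied RR is about 1.98, essentially the OR; at 50% it has fallen to about 1.33, illustrating why the tolerable-error bound depends on how large the odds can get.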


Nobody has ever claimed that collapsibility is sufficient for being a good effect measure (let alone being the generally “correct” effect measure, which is an idea that you need to let go of).

You will get different predictions depending on whether you use RR or RD, and they are both collapsible, so clearly collapsibility is not sufficient.

I do however believe that collapsibility is a necessary criterion for being a plausible effect measure, for reasons that have been explained in detail, both in this thread and in the literature, going back half a century.


Is it possible that what is happening is that clinical researchers, who are placed simultaneously in the individual-patient decision role and the analyst/modelling role, believe that modelling for individual risk must be done on the risk scale, and do not understand the technical advantages of a transformation to the OR for statistical modelling and smoothing?

Frank gave an example I posted above where he combines the RD and OR in a logistic model to come up with a nice nomogram showing how individual risk interacts with treatment (in terms of absolute risk reduction).

Is it also possible that statisticians, only finding themselves in the modelling portion of clinical research, focus more on the technical advantages of the OR, and simply assume clinicians will know how to transform it into a useful metric?

And epidemiologists, being focused on observational research and risk, find problems with the OR due to many factors in study design they cannot control?


Doi: I think Anders already commented on the assumptions you slip in here to save the OR with what looks to OR critics like circular reasoning (or worse), e.g., that the clinician can immediately access the assumed information from available research reports, that she also has the facility to combine the OR with that information to obtain more comprehensible measures like NNH, and that AUC is a desirable target. One key point is that violation of all these assumptions is the norm, so slipping them in looks to us like cheating to win.

The AUC angle is however a bizarre distraction in another sense: Anyone who knows screening knows that sensitivity and specificity information (which is what AUC is derived from) is woefully inadequate for clinical prediction; predictive values are what is needed and those are exquisitely sensitive to the baseline risk information missing from AUC just as it is missing from OR. Hence the OR correlation with AUC does nothing to enhance the credibility of the OR for us, it just looks like another way to miss our points. The comments about RR keep missing the same points. Those points are explained once more in my reply to Frank, so please read those carefully before replying.


R-cubed: Clinicians should not be assumed to be in an analyst/modeling role. They usually have no meaningful training or understanding of data analysis or modeling, and have no time for that. And in terms of medical topics, they rarely have time for reading more than brief summaries given in editorials in their journals and medical newsletters such as Medscape and MedPage, and in continuing education materials (which look like they are summaries of summaries). They are thus critically dependent on what is shown in those summaries, which is not much. Because of lack of time and sheer volume of literature, few can go further and most of the tiny minority that do (usually med faculty) still retain only what is highlighted in abstracts.

Yet I find many statisticians write as if typical clinicians are somehow sophisticated enough to appreciate technicalities that few could begin to comprehend. Perhaps this is because the statisticians are insulated in medical schools dealing with research faculty, who are typically far more conversant in statistics than the vast majority of clinicians (although, as many can attest, clinical faculty are still far more vulnerable to errors of their statistical authorities, as witnessed by continuing use of correlations and “standardized” coefficients as effect measures in some reports).

Finally, I’ll repeat that the OR controversy has absolutely nothing to do with uncontrolled factors in study design (potential bias sources or validity problems). It would arise even if every study was a perfect randomized trial on a random sample of the clinical target population (as in my example).


OK then, correct me if I’m wrong but any debate about your choice seems pretty far removed from the current OR controversy and so warrants a separate topic thread. My question about what your A/B index is measuring and how it would perform and be interpreted under various perspectives (frequentist, informationalist, causalist, likelihoodist, Bayesian, etc.) remains unanswered, so if you can provide coverage addressing this admittedly extensive question, please do.

Thank you Sander, I will certainly read your RR comments carefully before replying but this post needs a response too to clarify what you said. It is not just the AUC that is derived from Se and Sp information but also the likelihood ratios (which themselves are odds ratios but not the classical odds ratio – see Doi et al). The predictive values are nothing more than posterior probabilities so how is a posterior probability more adequate than the likelihood ratio and prior probability from which it is derived? Of course posterior probabilities are dependent on prior probabilities (aka baseline risk) the link being the likelihood ratio (aka non classical odds ratio). This is like saying that the risk difference is woefully inadequate because r1 is what is needed and r1 is exquisitely sensitive to r0 (baseline risk) which I am sure you would not agree with.

Every single measure is inadequate in some way for some purposes. This goes back to my comment that the universal problem here is one of highly lossy compression of complex data information down to single summary measures. I saw no one here claiming a posterior probability is more adequate than an LR and prior probability together; in fact that would be absurd, since one could derive the posterior probability from the latter two but not the other way around. I for one would always want to see the LR and prior before seeing the posterior. But that is tangential to my points to Frank and R-cubed about what clinicians see and can digest properly, which are pivotal to this controversy.

Okay, great. So let’s change the scenario a bit: r0 is now the risk when the test is negative, the likelihood ratio of interest is now the diagnostic odds ratio, and from these two we can derive the posterior probability under a positive test. This I am sure you will agree with, based on your quote above.

Then why are you unhappy with the same in the example? Under no-masks the probability is 0.267 (odds 0.3636), the likelihood ratio is 2.75 and we can then compute a posterior odds (1) and probability (0.5). The link between the probabilities under no-mask and mask is thus the OR (aka likelihood ratio). Why is this a “bizarre distraction”?
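The odds-updating arithmetic in this post can be sketched as follows (a minimal sketch using the quoted numbers: probability 0.267 under no-masks and a “likelihood ratio” of 2.75):

```python
prior_p = 0.267                         # probability under no-masks
prior_odds = prior_p / (1 - prior_p)    # ~0.364
post_odds = prior_odds * 2.75           # posterior odds ~1.00
post_p = post_odds / (1 + post_odds)    # posterior probability ~0.50
print(f"prior odds={prior_odds:.4f}  posterior odds={post_odds:.4f}  "
      f"posterior p={post_p:.4f}")
```

This is Bayes’ rule on the odds scale: posterior odds = prior odds × likelihood ratio, recovering the 0.5 probability under masks quoted above.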

What I was referring to as a bizarre distraction is the AUC. Again, our problem is that you keep introducing information that was not given in my example followed by computation that is not performable under what was actually given. BTW I often see summaries where your assumptions are not met, or where one could not reasonably expect a clinician or even most researchers to extract them from what was given.

Sander I second your comments relating to modeling. And I concur that RD is a highly clinically useful quantity—a quantity that most patients want, and note that in the majority of situations RD is best estimated from the OR and the control risk.

The granularity argument is singularly unconvincing. I am not interested in a situation at the limit where information content is infinite. I am interested in the pragmatic situation of having the maximum information that is feasible to measure. At a minimum we seek age, extent of disease, and a baseline measurement of the outcome when sensible. Ideally we want more than that, but if we only have age, at least we use that effectively (unlike the majority of clinical trials, which provide only crude marginal estimates even with a whopping age effect and wide age variation in the sample). Related to your first new JCE paper, I note that the definition of odds is the ratio of a probability to its complement. Odds are not defined for an individual 0/1 occurrence, so neither is the OR.

The fact that an odds ratio is based in incomplete conditioning makes it neither misleading nor unuseful, just as a logistic model that included 5 out of the 7 really important patient factors is still useful.

I will never criticize a measure because it can’t be compared to crude marginal estimates that should not have been published.

Non-collapsible measures such as odds ratios and hazard ratios are indeed plausible effect measures, and the only ones that are halfway capable of being taken out of context.

Fair enough. I got into this because it directly relates to the issue of “best default modeling starting point” in terms of quantifying non-additivity on the assumed link scale. I’ll just leave it that we need a non-sample-size-dependent quantification of lack of fit of additive models. If you don’t like the relative LR \chi^2, there are relative R^2 measures, including relative explained-variation measures that are functions only of X\hat{\beta} or \hat{P}. Related to that, one might consider relative median absolute prediction error and similar indexes.


Maybe if you keep saying it, it will some day become true

Let’s keep the discussion at a high level, as requested before.


Then Frank you still don’t understand how the granularity argument weighs against the OR. Given we all have to settle for subgroup measures, we need measures that at least represent an average within each subgroup, so we have some assurance of errors averaging out over the remaining unobserved within-subgroup effect variation. For this goal the OR fails spectacularly (read my second paper and my 1987 paper), the RR does not do too badly, and the RD has no problem at all.

“The fact that an odds ratio is based in incomplete conditioning makes it neither misleading nor unuseful, just as a logistic model that included 5 out of the 7 really important patient factors is still useful.”
You are here confusing two distinct issues: (1) whether the logistic model is useful; I never disagreed. (2) whether the odds ratio pulled from a logistic model is a good summary of the effects conditional on the covariates if the outcome is common within covariate levels; it isn’t – it’s misleading, for reasons many have explained over and over again, including right here in this discussion. It seems those unable to grasp this problem are inevitably among those whose training omitted causal models as distinct from statistical ones – which includes most statisticians who graduated in the last century. Well, as has been said – of physics, no less – “science progresses funeral by funeral”; as has been noted regarding “statistical significance”, progress in statistics is unfortunately not as rapid.

“I will never criticize a measure because it can’t be compared to crude marginal estimates that should not have been published.” This sentence illustrates some leading confusions in OR defenses: (1) it confuses crude unadjusted estimates with marginal (population-averaged) adjusted estimates (others have noted this confusion is embedded throughout your discussions). The debate is about marginal adjusted estimates vs. conditional adjusted estimates and how marginal adjusted ORs are not averages of the conditional ORs, whereas marginal adjusted RRs and RDs are averages of conditional RRs and RDs. No one in this debate has called for not adjusting for measured predictive covariates so your use of “crude” here is just another mistake (although many RCTs are published with no such adjustment in the misguided belief that randomization means they don’t have to adjust). (2) Your comment fails to grasp that every conditional estimate is also marginal and unadjusted and thus “crude” with respect to the covariates left out of the model, and there are always many outcome-predictive covariates (unadjusted potential confounders), even in RCTs; so every estimate we have is crude to some degree. This means again we had better use a collapsible measure if we are to have any hope of getting near the average effects within covariate levels; in other words, we need average conditional effects, which ORs do not supply.


I have carefully read Frank’s comment and your response but first a clarification: You said that I keep introducing information that was not given in my example followed by computation that is not performable under what was actually given. All computations are from your JCE data tabulated above and all can be verified as coming from that data. You used that for the masks example and I completed the analysis using all the data (not just what you reported in the post).

Coming back to my response, Frank has already responded but I am adding my two cents since some of the queries were about Doi et al. First, I have to agree with Frank, that both papers come across as “non-collapsibility is bad because it is bad” and I say this because the main argument given is that somehow lack of mathematical collapsibility is a horrendous characteristic of an effect measure and all arguments flow from there. JCE is aimed at clinicians who are experts at inferential decision making and so this may not be an acceptable argument to them. I agree with Frank that non-collapsibility could actually be a desirable property of an effect measure if this property means that it retains a monotonic relationship with indices of discrimination for a binary outcome and that is why I brought in diagnostic tests and the AUC as both have a direct mathematical parallel to the OR that cannot be refuted.

In your response to Frank, you have said that non-collapsibility is bad because it means the subgroup ORs can be arbitrarily far from every individual effect, even if all the individual effects are identical and there is no uncontrolled study bias (such as confounding). This interpretation is true, but it is not bad. It simply follows from the fact that without accounting for outcome heterogeneity, discrimination for an outcome is at its lowest level, and so is the OR; as we continue to account for heterogeneity, both improve. You will have to demonstrate to us that the latter is not the case to convince us otherwise.

You said that, furthermore, this phenomenon (I assume you mean non-collapsibility) should be expected in practice because there is always unmeasured baseline-risk variation (as shown by the fact that individual outcomes vary within our finest measured subgroupings). There is no disagreement here at all, and we are not saying otherwise, but as Frank said, the only reason you then link this to a problem for the OR is that it is non-collapsible and therefore bad. This is the main point of contention – what is bad about non-collapsibility in this scenario is what we do not know; please tell us. If you mean that non-collapsibility implies an ever-increasing OR that does not collapse as we account for sources of heterogeneity, then to us that is good, because we expect that to happen, as I demonstrated for your example on masks.

Then you say that the clinician can never supply a patient-specific OR estimate, that she can only supply the OR estimate for the finest subgroup she’s seen reported (which is rarely anything finer than sex plus some very crude age grouping like “above/below 60”), and that is true – no disagreement at all. The question is what the implication is, because you left it at that. Perhaps you mean that further consideration of heterogeneity of treatment effects will give us a different OR, and we agree – but why is that bad?

You then say that you are thus forced to conclude that we think it not bad that a reported OR will always mislead the user when applied to individual decisions. No, this is not what we think. We think that it is not bad that an OR is non-collapsible, and this is where we differ. Why should a reported OR always mislead a user – are you suggesting that it will mislead because it is non-collapsible? Then you are back to the same point Frank mentioned: non-collapsibility is bad because it is bad. Why are larger subgroup ORs that do not collapse bad and misleading? I have demonstrated that they have to be larger if the factor is prognostic for the outcome, because discrimination is incrementally improved. Is that wrong? If so, why?

Then you say that, mathematically, what is going on is that the OR is variation (logically) independent of the baseline risk, which is a technical advantage for model simplicity and fitting but also means that it conveys less information about individual effects than an RD. Why does the OR convey less information? If it is because it is non-collapsible, then we are back to the same point. My previous analysis of your data in the JCE paper (re-interpreted in terms of masks above) demonstrates that the OR conveys as much information as the RD, because both the overall and the male-specific OR could return both the RR and RD of interest. The reason we do not prefer the RR and RD is that their numerical values can change across subgroups with different baseline risks independently of their impact on outcome (Doi et al), and this does not happen with non-collapsible measures. As an example, the RR for males went lower even though the OR increased in the mask example.

You say that the RD has exactly the opposite pairing: it is severely dependent on the baseline risk, which is a technical disadvantage for model simplicity and fitting, but which also means it conveys more information about individual effects (specifically, the individual RDs must average to the group RD). Again, you are back to the same point, namely that it is good because it is collapsible and therefore more informative because it must average to the group RD – we believe the opposite is true, and the mathematical evidence stacks up to this (Doi et al), unless you can point out a flaw in the mathematical expressions we report. One flaw you have suggested previously is the bias that McNutt suggested, but as pointed out in a letter by Karp, there was no bias – we need to pair the right baseline risk with the right OR.

Coming to the resolutions suggested:

people should stop touting the OR as some kind of magic special measure – we did not give it any special status in the papers; we just demonstrated that it gives the correct posterior probabilities because its numerical value is not impacted by baseline risk, and it therefore conveys direct information and is less misleading than the RR/RD. We did, however, suggest converting to risks for interpretation; a primary analysis on the risk scale is misleading because of the baseline-risk problem.

for those who do want a measure that represents average individual effects, the OR simply will not suffice if the risk is not always “low” (in the sense explained in my papers). This we disagree with; as pointed out above, the OR produces posterior odds that are accurate in subgroups. You say that 10% deviation from the average is the maximum tolerable error, but this only applies if we are interested in a numerical interpretation of the OR as the RR, and that is not the case we are arguing for.
