Should one derive risk difference from the odds ratio?

First, I agree with @Sander that the OR debate does not require deep math to understand; simple numeric examples should suffice, and everyone on this blog should be able to follow easily, though I am not sure why the debate has then gone on for 40 years!

In summary, the main argument from @Sander (in this post) is that subgroup ORs are wrong because they do not collapse onto the overall OR, while subgroup RDs or RRs are right because they collapse onto the overall RD or RR. The implication is that even if we get the outcome model wrong by refusing to account for easily accounted-for outcome heterogeneity, the RR/RD result from the wrong model can still provide a coherent bet for a patient’s relative/absolute risk increase, because the subgroup values collapse onto the overall value.

Let’s see if this holds up to further scrutiny. If we consider masks as an index test and death as the gold standard, the overall Se and Sp for masks in relation to death are 65.2% and 59.5%, respectively. In the male subgroup the Se and Sp are 60% and 80%, respectively. The AUC, computed as (Se + Sp)/2, is 0.62 overall and 0.70 in males. Thus there is greater discrimination of death through masks in the male subgroup than overall, and the OR reflects this, increasing from 2.75 to 6. The numerical RR value, however, declines from 1.875 to 1.5. Granted, without specific information about prognostic factors in the trial population we did not know the gender information at baseline; what we did know at baseline was that the RR is collapsible and hence will not reflect within-subgroup death discrimination accurately, while the OR is non-collapsible and therefore will reflect within-subgroup death discrimination accurately. Now what shall I tell the wife of the new patient? I could say one of the following:
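
For anyone who wants to check this arithmetic, here is a quick sketch. The per-arm death risks are backed out from the quoted Se and Sp under the added assumption (mine, purely for illustration) of 1:1 mask allocation, so the recovered risks are only as exact as that assumption:

```python
# Reproduce the Se/Sp -> AUC/OR/RR arithmetic quoted above.
# Assumption (added for illustration): equal numbers in the mask and no-mask
# arms, so per-arm death risks can be recovered from Se and Sp alone.

def risks_from_se_sp(se, sp):
    """Back out (no-mask risk, mask risk) assuming 1:1 mask allocation."""
    r0 = (1 - 2 * sp) * (1 - se) / (1 - se - sp)   # death risk, no-mask arm
    r1 = r0 * se / (1 - se)                        # death risk, mask arm
    return r0, r1

odds = lambda p: p / (1 - p)

for label, se, sp in [("overall", 0.652, 0.595), ("males", 0.60, 0.80)]:
    r0, r1 = risks_from_se_sp(se, sp)
    auc = (se + sp) / 2
    or_, rr = odds(r1) / odds(r0), r1 / r0
    print(f"{label}: r0={r0:.3f} r1={r1:.3f} AUC={auc:.2f} OR={or_:.2f} RR={rr:.2f}")

# overall: r0=0.268 r1=0.502 AUC=0.62 OR=2.75 RR=1.87  (rounding in Se/Sp)
# males:   r0=0.600 r1=0.900 AUC=0.70 OR=6.00 RR=1.50
```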

a. There is at least a 2.75-fold increase in death odds (convert to NNH for her) with masks in this population, and since the trialists refused to account for any other patient-related factors, we have no additional information for your husband; but if we did know subgroup values for people more like him, those subgroup values would likely indicate the same or greater death discrimination for people like your husband.

b. There is a 1.875-fold increase in death risk (convert to NNH for her) with masks in this population, and since the trialists refused to account for any other patient-related factors, we have no additional information for your husband; but if we did know subgroup values for people more like him, those subgroup values may be inconsistent with the actual death discrimination for people like your husband.

What do I mean by inconsistent? The RR of 1.5 in men is not comparable to the 1.875 overall and lacks true comparative meaning – in short, it does not imply that the increase in death risk is less in men than in the overall population. The OR, on the other hand, does have this interpretational consistency. I and everyone else would be keen to see your arguments on this.

At last, Frank, you have stated your perception in a way that looks to me truly colorblind to explicit statements of why noncollapsibility is considered by so many to be bad: It’s bad because it means the subgroup ORs (which is all clinicians know about, because that is what is highlighted in reports and quoted by editorials and medical media) can be arbitrarily far from every individual effect, even if all the individual effects are identical and there is no uncontrolled study bias (such as confounding). Furthermore, this phenomenon should be expected in practice because there is always unmeasured baseline risk variation (as shown by the fact that individual outcomes vary within our finest measured subgroupings). So the clinician can never supply a patient-specific OR estimate; she can only supply the OR estimate for the finest subgroup she’s seen reported (which is rarely anything finer than sex plus some very crude age grouping like “above/below 60”).

I am thus forced to conclude that you think it not bad that a reported OR will always mislead the user when applied to individual decisions, unless it is combined with other information to transform it into a collapsible measure like the RD or component risks. You need to confirm or deny that impression.

Mathematically, what is going on is that the OR is variation-independent (logically independent) of the baseline risk, which is a technical advantage for model simplicity and fitting but which also means that it conveys less information about individual effects than an RD. The RD has exactly the opposite pairing: it is severely dependent on the baseline risk, which is a technical disadvantage for model simplicity and fitting, but which also means it conveys more information about individual effects (specifically, the individual RDs must average to the group RD). The RR falls somewhere between these extremes: it is not as technically well behaved as the OR but provides some information about individual effects, and it is not as technically ill behaved as the RD while providing some less simple information about individual effects.
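
To make the contrast concrete, here is a toy calculation (the numbers are invented purely for illustration): two equal-sized risk strata share the same conditional OR of 3, yet the collapsed OR falls outside that common value, while the collapsed RD is exactly the average of the stratum RDs and the collapsed RR sits between the stratum RRs:

```python
# Toy illustration of the collapsibility contrast described above.
# Two equally sized strata have the SAME conditional OR, but the collapsed
# (marginal) OR differs from it; the marginal RD equals the average of the
# stratum RDs, and the marginal RR stays between the stratum RRs.

strata = [(0.1, 3.0), (0.5, 3.0)]   # (untreated risk, conditional OR)

def treated_risk(r0, or_):
    odds1 = or_ * r0 / (1 - r0)
    return odds1 / (1 + odds1)

odds = lambda p: p / (1 - p)
r0s = [r0 for r0, _ in strata]
r1s = [treated_risk(r0, or_) for r0, or_ in strata]

# Marginal risks are simple averages because the strata are equal-sized.
R0, R1 = sum(r0s) / 2, sum(r1s) / 2

print("stratum RDs:", [round(r1 - r0, 3) for r0, r1 in zip(r0s, r1s)])              # [0.15, 0.25]
print("marginal RD:", round(R1 - R0, 3))                                            # 0.2 = their mean
print("stratum ORs:", [round(odds(r1) / odds(r0), 2) for r0, r1 in zip(r0s, r1s)])  # [3.0, 3.0]
print("marginal OR:", round(odds(R1) / odds(R0), 3))                                # 2.333, below both
print("marginal RR:", round(R1 / R0, 3))                                            # 1.667, between 1.5 and 2.5
```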

If after a few generations there is no agreement about noncollapsibility, then we should not expect any in your and my remaining lifetimes. What we might agree on is that, for clinical applications, the focus of reporting should be not on effect measures but on full reporting of risk functions. I have been advocating that from the beginning of this controversy in the 1970s, as have others.

This resolution would however require concessions on your part that (1) people should stop touting the OR as some kind of magic special measure (which is what the papers by Doi et al. and Doi’s comments here come across as doing) and that (2) for those who (unlike you and Doi) do want a measure that represents average individual effects, the OR simply will not suffice if the risk is not always “low” (in the sense explained in my papers, e.g., if 10% deviation from the average is the maximum tolerable error then the OR shouldn’t be used if the odds can exceed 10%).


Nobody has ever claimed that collapsibility is sufficient for being a good effect measure (let alone being the generally “correct” effect measure, which is an idea that you need to let go of).

You will get different predictions depending on whether you use RR or RD, and they are both collapsible, so clearly collapsibility is not sufficient.

I do however believe that collapsibility is a necessary criterion for being a plausible effect measure, for reasons that have been explained in detail, both in this thread and in the literature, going back half a century.


Is it possible that what is happening is that clinical researchers, who are simultaneously placed in the individual-patient decision role as well as the analyst/modelling role, believe that modelling for individual risk must be done on the risk scale, and do not understand the technical advantages of a transformation to the OR for statistical modelling and smoothing?

Frank gave an example, which I posted above, where he combines the RD and OR in a logistic model to come up with a nice nomogram showing how individual risk interacts with treatment (in terms of absolute risk reduction).

Is it also possible that statisticians, finding themselves only in the modelling portion of clinical research, focus more on the technical advantages of the OR and simply assume clinicians will know how to transform it into a useful metric?

And epidemiologists, being focused on observational research and risk, find problems with the OR due to many factors in study design they cannot control?


Doi: I think Anders already commented on the assumptions you slip in here to save the OR with what looks to OR critics like circular reasoning (or worse), e.g., that the clinician can immediately access the assumed information from available research reports, that she also has the facility to combine the OR with that information to obtain more comprehensible measures like the NNH, and that AUC is a desirable target. One key point is that violation of all these assumptions is the norm, so slipping them in looks to us like cheating to win.

The AUC angle is however a bizarre distraction in another sense: anyone who knows screening knows that sensitivity and specificity information (which is what AUC is derived from) is woefully inadequate for clinical prediction; predictive values are what is needed, and those are exquisitely sensitive to the baseline risk information missing from AUC, just as it is missing from the OR. Hence the OR correlation with AUC does nothing to enhance the credibility of the OR for us; it just looks like another way to miss our points. The comments about RR keep missing the same points. Those points are explained once more in my reply to Frank, so please read those carefully before replying.
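
To see the point numerically, a tiny illustration (numbers invented, not taken from the masks example): hold Se and Sp fixed and the positive predictive value still swings wildly with the baseline risk that AUC- and OR-style summaries omit:

```python
# Illustration: with Se and Sp held fixed, the positive predictive value
# depends heavily on baseline risk (prevalence); the numbers are invented.

se, sp = 0.9, 0.9

def ppv(prev, se, sp):
    tp = se * prev              # true positives per person screened
    fp = (1 - sp) * (1 - prev)  # false positives per person screened
    return tp / (tp + fp)

for prev in (0.01, 0.10, 0.50):
    print(f"baseline risk {prev:.0%}: PPV = {ppv(prev, se, sp):.1%}")
# 1%  -> PPV = 8.3%
# 10% -> PPV = 50.0%
# 50% -> PPV = 90.0%
```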


R-cubed: Clinicians should not be assumed to be in an analyst/modeling role. They usually have no meaningful training or understanding of data analysis or modeling, and have no time for that. And in terms of medical topics, they rarely have time for reading more than brief summaries given in editorials in their journals and medical newsletters such as Medscape and MedPage, and in continuing education materials (which look like they are summaries of summaries). They are thus critically dependent on what is shown in those summaries, which is not much. Because of lack of time and sheer volume of literature, few can go further and most of the tiny minority that do (usually med faculty) still retain only what is highlighted in abstracts.

Yet I find many statisticians write as if typical clinicians are somehow sophisticated enough to appreciate technicalities that few could begin to comprehend. Perhaps this is because the statisticians are insulated in medical schools dealing with research faculty, who are typically far more conversant in statistics than the vast majority of clinicians (although, as many can attest, clinical faculty are still far more vulnerable to errors of their statistical authorities, as witnessed by continuing use of correlations and “standardized” coefficients as effect measures in some reports).

Finally, I’ll repeat that the OR controversy has absolutely nothing to do with uncontrolled factors in study design (potential bias sources or validity problems). It would arise even if every study was a perfect randomized trial on a random sample of the clinical target population (as in my example).


OK then, correct me if I’m wrong but any debate about your choice seems pretty far removed from the current OR controversy and so warrants a separate topic thread. My question about what your A/B index is measuring and how it would perform and be interpreted under various perspectives (frequentist, informationalist, causalist, likelihoodist, Bayesian, etc.) remains unanswered, so if you can provide coverage addressing this admittedly extensive question, please do.

Thank you Sander, I will certainly read your RR comments carefully before replying, but this post needs a response too, to clarify what you said. It is not just the AUC that is derived from Se and Sp information but also the likelihood ratios (which themselves are odds ratios, though not the classical odds ratio – see Doi et al). The predictive values are nothing more than posterior probabilities, so how is a posterior probability more adequate than the likelihood ratio and prior probability from which it is derived? Of course posterior probabilities depend on prior probabilities (aka baseline risk), the link being the likelihood ratio (aka non-classical odds ratio). This is like saying that the risk difference is woefully inadequate because r1 is what is needed and r1 is exquisitely sensitive to r0 (baseline risk) – which I am sure you would not agree with.

Every single measure is inadequate in some way for some purposes. This goes back to my comment that the universal problem here is one of highly lossy compression of complex data information down to single summary measures. I saw no one here claiming a posterior probability is more adequate than an LR and prior probability together; in fact that would be absurd, since one could derive the posterior probability from the latter two but not the other way around. I for one would always want to see the LR and prior before seeing the posterior. But that is tangential to my points to Frank and R-cubed about what clinicians see and can digest properly, which are pivotal to this controversy.

Okay, great. So let’s change the scenario a bit. r0 is now the risk when the test is negative, the likelihood ratio of interest is now the diagnostic odds ratio, and from these two we can derive the posterior probability under a positive test. This I am sure you will agree with, based on your quote above.

Then why are you unhappy with the same in this example? Under no masks the probability is 0.267 (odds 0.3636), the likelihood ratio is 2.75, and we can then compute a posterior odds (1) and probability (0.5). The link between the probabilities under no-mask and mask is thus the OR (aka likelihood ratio). Why is this a “bizarre distraction”?
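
For concreteness, the arithmetic in that paragraph can be replayed directly (these are simply the numbers already quoted, with the OR of 2.75 used as the linking ratio):

```python
# Replay of the arithmetic above: posterior odds = prior odds x linking ratio,
# where the ratio plugged in is the OR of 2.75 from the masks example and the
# prior probability is the death risk under no masks.

prior_prob = 0.267                             # death risk under no masks
prior_odds = prior_prob / (1 - prior_prob)     # ~0.364
ratio = 2.75                                   # OR used as the linking ratio

post_odds = prior_odds * ratio                 # ~1.0
post_prob = post_odds / (1 + post_odds)        # ~0.5
print(round(prior_odds, 3), round(post_odds, 3), round(post_prob, 3))
```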

What I was referring to as a bizarre distraction is the AUC. Again, our problem is that you keep introducing information that was not given in my example followed by computation that is not performable under what was actually given. BTW I often see summaries where your assumptions are not met, or where one could not reasonably expect a clinician or even most researchers to extract them from what was given.

Sander, I second your comments relating to modeling. And I concur that the RD is a highly clinically useful quantity, one that most patients want; note that in the majority of situations the RD is best estimated from the OR and the control risk.
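
For readers who want the mechanics, here is a minimal sketch of that conversion; the OR and the control risks below are illustrative values only:

```python
# Minimal sketch: given a (conditional) OR and a patient's control risk r0,
# recover the treated risk and hence the risk difference (and NNH = 1/RD).
# The OR value and control risks are illustrative only.

def rd_from_or(or_, r0):
    odds1 = or_ * r0 / (1 - r0)   # treated odds
    r1 = odds1 / (1 + odds1)      # treated risk
    return r1 - r0

or_ = 2.75  # e.g., the treatment OR from the masks example above
for r0 in (0.05, 0.10, 0.27, 0.50):
    rd = rd_from_or(or_, r0)
    print(f"control risk {r0:.2f}: treated risk {r0 + rd:.3f}, RD {rd:.3f}")
# The RD varies strongly with the control risk even though the OR is held
# constant; this is the nomogram idea mentioned earlier in the thread.
```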

The granularity argument is singularly unconvincing. I am not interested in a situation at the limit where information content is infinite. I am interested in the pragmatic situation of having the maximum information that is feasible to measure. At a minimum we seek age, extent of disease, and a baseline measurement of the outcome when sensible. Ideally we want more than that, but if we only have age, we at least use that effectively (unlike the majority of clinical trials, which provide only crude marginal estimates even with a whopping age effect and wide age variation in the sample). Related to your first new JCE paper, I note that the definition of odds is the ratio of a probability to its complement. Odds are not defined for an individual 0/1 occurrence, so neither is the OR.

The fact that an odds ratio is based in incomplete conditioning makes it neither misleading nor unuseful, just as a logistic model that included 5 out of the 7 really important patient factors is still useful.

I will never criticize a measure because it can’t be compared to crude marginal estimates that should not have been published.

Non-collapsible measures such as odds ratios and hazard ratios are indeed plausible effect measures, and the only ones that are halfway capable of being taken out of context.

Fair enough. I got into this because it directly relates to the issue of “best default modeling starting point” in terms of quantifying non-additivity on the assumed link scale. I’ll just leave it at this: we need a non-sample-size-dependent quantification of lack of fit of additive models. If you don’t like the relative LR \chi^2, there are relative R^2 measures, including relative explained-variation measures, that are functions only of X\hat{\beta} or \hat{P}. Related to that, one might consider the relative median absolute prediction error and similar indexes.
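
To make this concrete, here is a rough sketch on simulated data; the variable names, effect sizes, and the statsmodels-based implementation are purely illustrative:

```python
# Rough sketch: quantify adequacy of an additive logistic model by the ratio
# of its LR chi-square to that of a model that adds product (interaction)
# terms. Simulated data; all settings here are illustrative only.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(size=n)
trt = rng.integers(0, 2, size=n)
# Simulated truth includes a modest treatment-by-age product term on the logit scale.
lin = -0.5 + 0.8 * trt + 0.6 * age + 0.3 * trt * age
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))

X_add = sm.add_constant(np.column_stack([trt, age]))             # additive model
X_int = sm.add_constant(np.column_stack([trt, age, trt * age]))  # + product term

fit_add = sm.Logit(y, X_add).fit(disp=0)
fit_int = sm.Logit(y, X_int).fit(disp=0)

lr_chi2 = lambda fit: 2 * (fit.llf - fit.llnull)   # LR chi-square vs. intercept-only
A, B = lr_chi2(fit_add), lr_chi2(fit_int)
print(f"additive LR chi2 = {A:.1f}, with product term = {B:.1f}, A/B = {A / B:.3f}")
# A ratio near 1 suggests little is gained by the product terms on this scale.
```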


Maybe if you keep saying it, it will someday become true.

Let’s keep the discussion at a high level, as requested before.


Then, Frank, you still don’t understand how the granularity argument weighs against the OR. Given we all have to settle for subgroup measures, we need measures that at least represent an average within each subgroup, so that we have some assurance of errors averaging out over the remaining unobserved within-subgroup effect variation. For this goal the OR fails spectacularly (read my second paper and my 1987 paper), the RR does not do too badly, and the RD has no problem at all.

“The fact that an odds ratio is based in incomplete conditioning makes it neither misleading nor unuseful, just as a logistic model that included 5 out of the 7 really important patient factors is still useful.”
You are here confusing two distinct issues: (1) whether the logistic model is useful; I never disagreed. (2) whether the odds ratio pulled from a logistic model is a good summary of the effects conditional on the covariates if the outcome is common within covariate levels; it isn’t – it’s misleading, for reasons many have explained over and over again, including right here in this discussion. It seems those unable to grasp this problem are inevitably among those whose training omitted causal models as distinct from statistical ones – which includes most statisticians who graduated in the last century. Well, as has been said – of physics, no less – “science progresses funeral by funeral”; and as has been noted regarding “statistical significance”, progress in statistics is unfortunately not as rapid.

“I will never criticize a measure because it can’t be compared to crude marginal estimates that should not have been published.” This sentence illustrates some leading confusions in OR defenses: (1) it confuses crude unadjusted estimates with marginal (population-averaged) adjusted estimates (others have noted this confusion is embedded throughout your discussions). The debate is about marginal adjusted estimates vs. conditional adjusted estimates and how marginal adjusted ORs are not averages of the conditional ORs, whereas marginal adjusted RRs and RDs are averages of conditional RRs and RDs. No one in this debate has called for not adjusting for measured predictive covariates so your use of “crude” here is just another mistake (although many RCTs are published with no such adjustment in the misguided belief that randomization means they don’t have to adjust). (2) Your comment fails to grasp that every conditional estimate is also marginal and unadjusted and thus “crude” with respect to the covariates left out of the model, and there are always many outcome-predictive covariates (unadjusted potential confounders), even in RCTs; so every estimate we have is crude to some degree. This means again we had better use a collapsible measure if we are to have any hope of getting near the average effects within covariate levels; in other words, we need average conditional effects, which ORs do not supply.


I have carefully read Frank’s comment and your response, but first a clarification: you said that I keep introducing information that was not given in your example, followed by computation that is not performable under what was actually given. All the computations are from your JCE data tabulated above, and all can be verified as coming from those data. You used those data for the masks example, and I completed the analysis using all of the data (not just what you reported in the post).

Coming back to my response: Frank has already responded, but I am adding my two cents since some of the queries were about Doi et al. First, I have to agree with Frank that both papers come across as “non-collapsibility is bad because it is bad”, and I say this because the main argument given is that lack of mathematical collapsibility is somehow a horrendous characteristic of an effect measure, and all arguments flow from there. JCE is aimed at clinicians, who are experts at inferential decision making, and so this may not be an acceptable argument to them. I agree with Frank that non-collapsibility could actually be a desirable property of an effect measure if this property means that it retains a monotonic relationship with indices of discrimination for a binary outcome, and that is why I brought in diagnostic tests and the AUC, as both have a direct mathematical parallel to the OR that cannot be refuted.

In your response to Frank, you said that non-collapsibility is bad because it means the subgroup ORs can be arbitrarily far from every individual effect, even if all the individual effects are identical and there is no uncontrolled study bias (such as confounding). This interpretation is true, but it is not bad. It simply follows from the fact that without accounting for outcome heterogeneity, discrimination for an outcome is at its lowest level, and so too is the OR; as we continue to account for heterogeneity, both improve. You will have to demonstrate to us that the latter is not the case in order to convince us otherwise.

You said, furthermore, that this phenomenon (I assume you mean non-collapsibility) should be expected in practice because there is always unmeasured baseline risk variation (as shown by the fact that individual outcomes vary within our finest measured subgroupings). There is no disagreement here at all, and we are not saying otherwise; but, as Frank said, the only reason you then link this to a problem for the OR is that it is non-collapsible and therefore bad. This is the main point of contention – what is bad about non-collapsibility in this scenario is what we do not know – please tell us. If you mean that non-collapsibility implies an ever-increasing OR that does not collapse as we account for sources of heterogeneity, then to us that is good, because we expect that to happen, as I demonstrated for your example on masks.

Then you say that the clinician can never supply a patient-specific OR estimate, that she can only supply the OR estimate for the finest subgroup she’s seen reported (which is rarely anything finer than sex plus some very crude age grouping like “above/below 60”), and that is true – no disagreement at all. The question is what the implication is, because you left it at that. Perhaps you mean that further consideration of heterogeneity of treatment effects will give us a different OR; we agree, but why is that bad?

You then say that you are thus forced to conclude that we think it not bad that a reported OR will always mislead the user when applied to individual decisions. No, this is not what we think. We think that it is not bad that an OR is non-collapsible, and this is where we differ. Why should a reported OR always mislead a user – are you suggesting that it will mislead because it is non-collapsible? If so, you are back to the same point Frank mentioned: non-collapsibility is bad because it is bad. Why are larger subgroup ORs that do not collapse bad and misleading? I have demonstrated that they have to be larger if the factor is prognostic for the outcome, because discrimination is incrementally improved. Is that wrong? If so, why?

Then you say that, mathematically, what is going on is that the OR is variation-independent (logically independent) of the baseline risk, which is a technical advantage for model simplicity and fitting but which also means that it conveys less information about individual effects than an RD. Why does the OR convey less information? If it is because it is non-collapsible, then we are back to the same point. My previous analysis of your data in the JCE paper (re-interpreted in terms of masks above) demonstrates that the OR conveys as much information as the RD, because both the overall and the male-specific OR could return the RR and RD of interest. The reason we do not prefer the RR and RD is that their numerical values can change across subgroups with different baseline risks independently of their impact on the outcome (Doi et al), and this does not happen with non-collapsible measures. As an example, the RR for males went lower even though the OR increased in the mask example.

You say that the RD has exactly the opposite pairing: it is severely dependent on the baseline risk, which is a technical disadvantage for model simplicity and fitting, but which also means it conveys more information about individual effects (specifically, the individual RDs must average to the group RD). Again, you are back to the same point, namely that it is good because it is collapsible and is therefore more informative because it must average to the group RD – we believe the opposite is true, and the mathematical evidence stacks up to this (Doi et al), unless you can point out a flaw in the mathematical expressions we report. One flaw you have suggested previously is the bias that McNutt described, but as pointed out in a letter by Karp, there was no bias – we need to pair the right baseline risk with the right OR.

Coming to the resolutions suggested:

people should stop touting the OR as some kind of magic special measure – we did not give it any special status in the papers; we just demonstrated that it gives the correct posterior probabilities because its numerical value is not impacted by baseline risk, and it therefore conveys direct information and is less misleading than the RR/RD. We did, however, suggest converting to risks for interpretation; it is the primary analysis on the risk scale that is misleading, because of the baseline-risk problem.

for those who do want a measure that represents average individual effects, the OR simply will not suffice if the risk is not always “low” (in the sense explained in my papers). This we disagree with: as pointed out above, the OR produces posterior odds that are accurate in subgroups. You say that 10% deviation from the average is the maximum tolerable error, but this only applies if we are interested in a numerical interpretation of the OR as the RR, and that is not the case we are arguing for.


Doi: In the example I gave above, the information in my JCE part 2 paper was not given; that was the point – without that information the clinician can make no good sense of the OR but could have made good use of the RD. That you introduce the information is stepping outside the example I gave in this thread and guarantees missing the point. You also miss the point that we are advocating that analyses can get around these problems and disputes by focusing on estimation of full risk or survival functions rather than on summary measures. Instead it seems you are addicted to the OR as a centerpiece of analysis because of how you prefer to model the relation among risks.

The rest of your response points out some items of agreement; we can agree to agree on those. As for the remaining disagreements, your response looks to me like Frank’s: repeating over and over what is, for us critics, only a blind catechism of how noncollapsibility is not bad, with no comprehension of why, in the face of unobservable risk variation, we need to at least estimate average conditional effects, not just the marginal conditional effects that conditional ORs represent. I thus feel that no further progress is possible, and so I will now exit this thread to deal with other issues.

I agree. Thanks for sharing your thoughts, and I am sure we all got a better understanding of each other’s positions; this will help us forge a better path as this debate continues in other forums.


I’ll respond to Sander’s last long reply probably tomorrow. I hope Sander doesn’t exit the thread. There are multiple possible reasons I don’t find his arguments convincing, such as (1) I’m dense, (2) the arguments are wrong, or (3) the arguments are right and could be better stated. Sander can’t help with the dense part, but it is entirely possible that there are better ways to make the argument than what I’ve seen to this point. The first JCE paper certainly did not work for me, and I have to agree with Suhail that it seemed to be circular. If Sander can convince me, I think that his arguments will get better for others also.


Frank: The first JCE paper is only a baby introduction to the concept of noncollapsibility, not written with you or even this OR controversy in mind, so it could not possibly “work” for you. Neither was the second paper written with you in mind, but you need to carefully read it, along with my 1987 paper, Greenland, Robins & Pearl 1999, and then my latest series of responses here, to stand a chance of getting the points we (and it is now a large number of “we”, young and old) have been making for decades. At this point my impression from your responses is that neither you nor Doi has done this kind of long, thorough, careful study, probably because it is a lot of work and the reward might only be unpleasant realizations about what you’ve been missing; still, my reading of your responses convinces me that no progress will occur without you biting the bullet and doing the work with no goal other than understanding our concerns (rather than simply refuting us).

Unrelatedly, I think I see one way to view your A/B ratio: in normal-additive-error (not necessarily linear) regressions with n >> k = the number of model parameters, it looks like an approximate variance ratio, specifically of the ML residual-variance estimates for the all-product and no-product models. In our usual GLMs it would be an approximate overdispersion-ratio estimate. Its calibration might be improved by using residual df = n - k divisors for the LR statistics instead of n; that would amount to giving your original ratio a finite-sample correction of dfB/dfA. One could study how this ratio behaves, absolutely and in your comparative role, using simulations.