Should one derive risk difference from the odds ratio?

Then, Frank, you still don’t understand how the granularity argument weighs against the OR. Given that we all have to settle for subgroup measures, we need measures that at least represent an average within each subgroup, so that we have some assurance of errors averaging out over the remaining unobserved within-subgroup effect variation. For this goal the OR fails spectacularly (read my second paper and my 1987 paper), the RR does not do too badly, and the RD has no problem at all.

“The fact that an odds ratio is based in incomplete conditioning makes it neither misleading nor unuseful, just as a logistic model that included 5 out of the 7 really important patient factors is still useful.”
You are here confusing two distinct issues: (1) whether the logistic model is useful; I never disagreed with that. (2) whether the odds ratio pulled from a logistic model is a good summary of the effects conditional on the covariates if the outcome is common within covariate levels; it isn’t - it’s misleading, for reasons many have explained over and over again, including right here in this discussion. It seems those unable to grasp this problem are inevitably among those whose training omitted causal models as distinct from statistical ones - which includes most statisticians who graduated in the last century. Well, as has been said of physics, no less, “science progresses funeral by funeral”; as has been noted regarding “statistical significance”, unfortunately progress in statistics is not as rapid.

“I will never criticize a measure because it can’t be compared to crude marginal estimates that should not have been published.” This sentence illustrates some leading confusions in OR defenses: (1) It confuses crude unadjusted estimates with marginal (population-averaged) adjusted estimates (others have noted that this confusion is embedded throughout your discussions). The debate is about marginal adjusted estimates vs. conditional adjusted estimates, and about how marginal adjusted ORs are not averages of the conditional ORs, whereas marginal adjusted RRs and RDs are averages of conditional RRs and RDs. No one in this debate has called for not adjusting for measured predictive covariates, so your use of “crude” here is just another mistake (although many RCTs are published with no such adjustment, in the misguided belief that randomization means they don’t have to adjust). (2) Your comment fails to grasp that every conditional estimate is also marginal and unadjusted, and thus “crude”, with respect to the covariates left out of the model, and there are always many outcome-predictive covariates (unadjusted potential confounders), even in RCTs; so every estimate we have is crude to some degree. This means, again, that we had better use a collapsible measure if we are to have any hope of getting near the average effects within covariate levels; in other words, we need average conditional effects, which ORs do not supply.
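As a concrete illustration of the averaging point (a made-up sketch, not numbers from any of the cited papers): take two equal-sized strata of a randomized trial with the same conditional OR of 4 in each and no confounding. The marginal RD is exactly the stratum-weighted average of the stratum RDs, the marginal RR is a weighted average of the stratum RRs, but the marginal OR falls below both stratum ORs and so cannot be any average of them.

```python
# Two equal-sized strata, identical conditional OR = 4, treatment randomized.
# Each tuple: (stratum weight, control risk, treated risk implied by OR = 4).
strata = [
    (0.5, 0.10, (4 * (0.10 / 0.90)) / (1 + 4 * (0.10 / 0.90))),  # treated risk ~0.308
    (0.5, 0.50, (4 * 1.00) / (1 + 4 * 1.00)),                    # treated risk 0.800
]

p0 = sum(w * r0 for w, r0, r1 in strata)   # marginal control risk
p1 = sum(w * r1 for w, r0, r1 in strata)   # marginal treated risk

marginal_rd = p1 - p0
marginal_rr = p1 / p0
marginal_or = (p1 / (1 - p1)) / (p0 / (1 - p0))
weighted_rd = sum(w * (r1 - r0) for w, r0, r1 in strata)  # population-weighted mean RD

print(round(marginal_rd, 3), round(weighted_rd, 3))  # 0.254 0.254 -> the RD collapses
print(round(marginal_rr, 3))  # 1.846, a weighted average of the stratum RRs (weights ~ w*r0)
print(round(marginal_or, 3))  # 2.897 < 4 in both strata -> not any average of the stratum ORs
```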

1 Like

I have carefully read Frank’s comment and your response, but first a clarification: You said that I keep introducing information that was not given in my example, followed by computation that is not performable under what was actually given. All computations are from your JCE data tabulated above and all can be verified as coming from those data. You used those data for the masks example and I completed the analysis using all of them (not just what you reported in the post).

Coming back to my response, Frank has already responded but I am adding my two cents since some of the queries were about Doi et al. First, I have to agree with Frank that both papers come across as “non-collapsibility is bad because it is bad”, and I say this because the main argument given is that somehow lack of mathematical collapsibility is a horrendous characteristic of an effect measure, and all arguments flow from there. JCE is aimed at clinicians who are experts at inferential decision making, so this may not be an acceptable argument to them. I agree with Frank that non-collapsibility could actually be a desirable property of an effect measure if this property means that it retains a monotonic relationship with indices of discrimination for a binary outcome; that is why I brought in diagnostic tests and the AUC, as both have a direct mathematical parallel to the OR that cannot be refuted.
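To spell out the parallel I am referring to (my own illustrative check on a hypothetical 2×2 table, not a calculation from the papers): treating exposure X as the “test” and outcome Y as the “disease”, the table OR is exactly [sens/(1−sens)] × [spec/(1−spec)], i.e. the ratio of the positive to the negative likelihood ratios.

```python
# Hypothetical 2x2 table: exposure X as the "test", outcome Y as the "disease".
a, b = 90, 30    # X = 1: counts with Y = 1 and Y = 0
c, d = 60, 120   # X = 0: counts with Y = 1 and Y = 0

sens = a / (a + c)   # P(X = 1 | Y = 1)
spec = d / (b + d)   # P(X = 0 | Y = 0)

or_from_table = (a * d) / (b * c)                          # usual cross-product OR
or_from_test = (sens / (1 - sens)) * (spec / (1 - spec))   # = LR+ / LR-

print(or_from_table, or_from_test)  # both 6.0
```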

In your response to Frank, you have said that non-collapsibility is bad because it means the subgroup ORs can be arbitrarily far from every individual effect, even if all the individual effects are identical and there is no uncontrolled study bias (such as confounding). This interpretation is true, but it is not bad. It just follows from the fact that, without accounting for outcome heterogeneity, discrimination for the outcome is at its lowest level, and so too is the OR, and as we continue to account for heterogeneity both improve. You will have to demonstrate to us that the latter is not the case to convince us otherwise.

You said that, furthermore, this phenomenon (I assume you mean non-collapsibility) should be expected in practice because there is always unmeasured baseline risk variation (as shown by the fact that individual outcomes vary within our finest measured subgroupings). There is no disagreement here at all, and we are not saying otherwise either, but as Frank said, the only reason you then link this to a problem for the OR is because it is non-collapsible and therefore it is bad. This is the main point of contention – what is bad about non-collapsibility in this scenario is what we do not know – please tell us. If you mean that non-collapsibility implies an ever-increasing OR that does not collapse as we account for sources of heterogeneity, then to us that is good, because we expect that to happen, as I demonstrated for your example on masks.

Then you say that the clinician can never supply a patient-specific OR estimate, she can only supply the OR estimate for the finest subgroup she’s seen reported (which is rarely anything finer than sex plus some very crude age grouping like “above/below 60”), and that is true – no disagreement at all. The question is what the implication is, because you left it at that. Perhaps you mean that further consideration of heterogeneity of treatment effects will give us a different OR, and we agree, but why is that bad?

You then say that I am thus forced to conclude that you think it not bad that a reported OR will always mislead the user when applied to individual decisions. No, this is not what we think. We think that it is not bad that an OR is non-collapsible, and this is where we differ. Why should a reported OR always mislead a user – are you suggesting that it will mislead because it is non-collapsible, and therefore you are back to the same point Frank mentioned – non-collapsibility is bad because it is bad? Why are larger subgroup ORs that do not collapse bad and misleading? I have demonstrated that they have to be larger if the factor is prognostic for the outcome, because discrimination is incrementally improved. Is that wrong? If so, why?

Then you say that, mathematically, what is going on is that the OR is variation independent (logically independent) of the baseline risk, which is a technical advantage for model simplicity and fitting but which also means that it conveys less information about individual effects than an RD. Why does the OR convey less information? If it is because it is non-collapsible then we are back to the same point. My previous analysis of your data in the JCE paper (re-interpreted in terms of masks above) demonstrates that the OR conveys as much information as the RD, because both the overall and the male-specific OR could return both the RR and RD of interest. The reason we do not prefer the RR and RD is that their numerical values can change across subgroups with different baseline risks independently of their impact on the outcome (Doi et al), and this does not happen with non-collapsible measures. As an example, the RR for males decreased even though the OR increased in the mask example.
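As a sketch of what I mean by the OR “returning” the RR and RD of interest (hypothetical numbers, not the mask data): given the baseline risk of the subgroup at hand, the OR converts back to the treated risk, and from there to the subgroup-specific RR and RD.

```python
# Convert a (conditional) OR back to risks using the subgroup baseline risk.
def risks_from_or(baseline_risk, odds_ratio):
    odds0 = baseline_risk / (1 - baseline_risk)
    odds1 = odds0 * odds_ratio
    risk1 = odds1 / (1 + odds1)
    return risk1, risk1 / baseline_risk, risk1 - baseline_risk  # risk, RR, RD

for p0 in (0.10, 0.40):   # two subgroups with different baseline risks, same OR
    r1, rr, rd = risks_from_or(p0, 2.5)
    print(f"baseline {p0:.2f}: treated risk {r1:.3f}, RR {rr:.2f}, RD {rd:.3f}")
# The same OR of 2.5 yields different RRs and RDs at different baseline risks.
```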

You say that the RD has exactly the opposite pairing: it is severely dependent on the baseline risk, which is a technical disadvantage for model simplicity and fitting, but which also means it conveys more information about individual effects (specifically, the individual RDs must average to the group RD). Again, you are back to the same point, which is that the RD is good because it is collapsible and therefore more informative because it must average to the group RD – we believe the opposite is true and the mathematical evidence stacks up behind this (Doi et al), unless you can point out a flaw in the mathematical expressions we report. One flaw you have suggested previously is the bias that McNutt suggested, but as pointed out in a letter by Karp, there was no bias – we need to pair the right baseline risk with the right OR.

Coming to the resolutions suggested:

people should stop touting the OR as some kind of magic special measure – we did not give it any special status in the papers – we just demonstrated that it gives the correct posterior probabilities because its numerical value is not impacted by baseline risk, and it therefore conveys direct information and is less misleading than the RR/RD. However, we did suggest converting to risks, but only for interpretation; a primary analysis on the risk scale is misleading because of the baseline risk problem.

for those who do want a measure that represents average individual effects, the OR simply will not suffice if the risk is not always “low” (in the sense explained in my papers). This we disagree with, and as pointed out above, the OR produces posterior odds that are accurate in subgroups. You say that a 10% deviation from the average is the maximum tolerable error, but this only applies if we are interested in a numerical interpretation of the OR as the RR, and that is not the case we are arguing for.

1 Like

Doi: In the example I gave above, the information in my JCE part 2 paper was not given; that was the point - without that information the clinician can make no good sense of the OR, but could have made good use of the RD. That you introduce the information is stepping outside the example I gave in this thread and guarantees missing the point. You also miss the point that we are advocating that analyses can get around these problems and disputes by focusing on estimation of full risk or survival functions rather than on summary measures. Instead it seems you are addicted to the OR as a centerpiece of analysis because of how you prefer to model the relation among risks.

The rest of your response points out some items of agreement; we can agree to agree on those. As for the remaining disagreements, your response looks to me like Frank’s: repeating over and over what is, for us critics, only a blind catechism of how noncollapsibility is not bad, with no comprehension of why, in the face of unobservable risk variation, we need to at least estimate average conditional effects, not just the marginal conditional effects that conditional ORs represent. I thus feel that no further progress is possible and so will now exit this thread to deal with other issues.

I agree. Thanks for sharing your thoughts. I am sure we all got a better understanding of each other’s positions, and this will help us forge a better path as this debate continues in other forums.

2 Likes

I’ll respond to Sander’s last long reply probably tomorrow. I hope Sander doesn’t exit the thread. There are multiple possible reasons I don’t find his arguments convincing, such as (1) I’m dense, (2) the arguments are wrong, or (3) the arguments are right and could be better stated. Sander can’t help with the dense part, but it is entirely possible that there are better ways to make the argument than what I’ve seen to this point. The first JCE paper certainly did not work for me, and I have to agree with Suhail that it seemed to be circular. If Sander can convince me, I think that his arguments will get better for others also.

3 Likes

Frank: The first JCE paper is only a baby introduction to the concept of noncollapsibility, not at all written with you in mind or even this OR controversy, so it could not possibly “work” for you. Neither was the second paper written with you in mind, but you need to carefully read it, my 1987 paper, Greenland, Robins and Pearl 1999, and then my latest series of responses here, to stand a chance of getting the points we (and it is now a large number of “we”, young and old) have been making for decades. At this point my impression from your responses is that neither you nor Doi has done this kind of long, thorough, careful study, probably because it is a lot of work and the reward might only be unpleasant realizations about what you’ve been missing; still, my reading of your responses convinces me that no progress will occur without you biting the bullet and doing the work without any goal other than understanding our concerns (rather than simply refuting us).

Unrelatedly, I think I see one way to view your A/B ratio: In normal-additive-error (not necessarily linear) regressions with n>>k = no. of model parameters, it looks like an approximate variance ratio, specifically of the ML residual-variance estimates for the all-product and no-product models. In our usual GLMs it would be an approximate overdispersion-ratio estimate. Its calibration may be improved by using residual df = n-k divisors for LR statistics instead of n; that would amount to giving your original ratio a finite-sample correction of dfB/dfA. One could study how this ratio behaves absolutely and in your comparative role using simulations.

This is an incredibly long thread that has technical details beyond my pay grade. But I would like to add a different perspective and some history.

First, I would like to thank those that cited some of our work and make some clarifications. Our JCE paper showing that ORs overestimate risk ratios was a teaching example. Suggesting that we were proposing constancy of an effect across different baseline risks, in general or even in the condition we used, is a misinterpretation that surprised me.

Second, Doi suggests that the Jon Deeks paper showing the OR to be more homogeneous supports his view. The paper with Charlie Poole and Tyler VanderWeele explaining the limitations of Jon Deeks’s analysis is extremely important. The paper came about because I asked Charlie a question at a conference when he said the RD was the most appropriate measure. I suggested the OR and RR were more homogeneous, based on Jon’s study that I was aware of. Tyler became interested because heterogeneity is just a form of “interaction”. Tyler’s expertise in understanding “interactions” allowed him to see what the rest of us had missed - Jon’s results were expected because of power issues, even if the OR was not truly more homogeneous. But the paper has more implications if you read carefully, along with the accompanying editorial by Panagiotou and Trikalinos (On Effect Measures, Heterogeneity, and the Laws of Nature). What does it actually mean to be more or less homogeneous? How are we measuring this when the scales are different? Until we address this philosophical question, the issue of differences in “homogeneity” with baseline risk across different scales cannot really be answered. As one example of the associated dangers, the I-squared (a measure of relative heterogeneity often used in meta-analysis) is so often misinterpreted as a “measure of absolute heterogeneity” that several papers have been written trying to correct this (see Borenstein, Higgins, Hedges and Rothstein, Res Synth Methods 2017, DOI: 10.1002/jrsm.1230 for an excellent review).
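For readers less familiar with it, I-squared is defined from Cochran’s Q and its degrees of freedom,

I^2 = \max\{0,\ (Q - \mathrm{df})/Q\},

i.e. an estimated proportion of the observed variation attributable to between-study heterogeneity; the absolute amount of between-study variation is instead captured by \tau^2, which is why treating I-squared as an absolute measure is so misleading.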

Third, “homogeneity” on any scale will be dependent on the true data generating process (DGP) mechanisms. We just don’t know what that is for any particular context. As Sander suggested, it is likely different for different contexts or subpopulations. Examining effect heterogeneity in a single study by including interaction terms sounds like a good idea, but these analyses are usually grossly underpowered and therefore do not provide much insight. Relying on p-values to draw conclusions will likely lead to many incorrect inferences.

In the end, we conduct studies to help inform decision makers, who may be public health officials, clinicians, or patients. We need to communicate effectively to these groups. I found Doi’s example above with subgroups actually quite difficult to follow, but maybe others found it easy. That said, I want to highlight that Doi suggests converting the increase in death odds to a number needed to harm (NNH) for the patient. I am not a fan of NNT/NNH because my experience is that patients do not understand it well, notwithstanding claims by some others. That is a separate discussion. Regardless, the NNH is simply the inverse of the RD. If Doi believes we should be conveying the NNH, he is saying that the RD, rather than the OR, is the more relevant measure to convey to patients.

My own perspective is that for a dichotomous treatment and outcome, patients/decision makers must understand the risk with treatment and the risk without treatment (see Stovitz SD, Shrier I. Medical decision making and the importance of baseline risk. Br J Gen Pract 2013;63:795-7 for more detailed discussion). The risk (not the risk ratio or risk difference) is preferable to the odds (not the odds ratio) because it is more easily understood. I would suggest you work through the following example. Do you want to have early surgery based on the following:

The odds of having a good outcome without treatment is 1:1, and the odds of having a good outcome with treatment is 3:1. However, if you have surgery, the odds are also 1:9 that you will have scarring and permanent damage due to the surgery. If you delay surgery, the results are not as good and the odds of having a good outcome are 2:1.

How long has it taken you to think about the numbers? And this is a site for statisticians and epidemiologists. Now imagine you are talking to a patient or decision maker who was never any good at math. Why not do your own study and try it with some of your family members?

Now try the same exercise with risks (probabilities).

The probability of a good outcome (risk) is 50% without treatment and 75% with treatment. However, the risk of having scarring and permanent damage due to surgery is 10%. If you wait to see if you heal without surgery but don’t, and therefore decide for late surgery, the probability of a good outcome is 67%.

From the risks, we can easily see that if surgery were routinely performed, we would be operating on 50% of the people who don’t need it (because 50% do well without surgery). Of those who would benefit from surgery (an additional 25%), 2/3 (17 of those 25 percentage points) would have done just as well if they had surgery later. With a little more math that can be explained to patients (preferably with iconograms so they can visualize the differences), early surgery helps 8/100 more people (75/100 vs 67/100, after including those who heal without surgery and with late surgery). However, 10/100 will have scarring and permanent damage. Do you want the surgery?
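For anyone who wants to check the arithmetic, here is a small sketch of the odds-to-probability conversions behind the numbers above (same figures as in the example):

```python
# Odds of a:b in favour of an event correspond to probability a / (a + b).
def odds_to_prob(a, b):
    return a / (a + b)

scenarios = {
    "good outcome, no surgery":    (1, 1),   # 1:1 -> 0.50
    "good outcome, early surgery": (3, 1),   # 3:1 -> 0.75
    "scarring from surgery":       (1, 9),   # 1:9 -> 0.10
    "good outcome, late surgery":  (2, 1),   # 2:1 -> 0.67
}
for label, (a, b) in scenarios.items():
    print(f"{label}: {odds_to_prob(a, b):.2f}")

# On the probability scale: early vs late surgery helps 0.75 - 0.67 = 0.08 more
# people per person treated, at a 0.10 risk of scarring and permanent damage.
```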

Some will decide for early surgery and some against early surgery but that is not important. You may even come up with the same answer using both methods. If you find it easier to use odds instead of probabilities, then you should use odds. If you find it easier to use risks/probabilities, you should use risks/probabilities. If you are communicating to someone else, you should use whatever they find easiest to understand. My own experience is that probabilities are understood much better by my patients than other measures.

5 Likes

I think your post finally makes it clear what the confusion is about in this thread. When we talk about “portability across baseline risk” we do not mean that subgroups somehow have the same baseline risk, or that the OR (or any other effect) must be equal across subgroups (covariate present versus absent). We mean that the effect is portable in the sense that the change in baseline risk due to sub-grouping does not contribute to the numerical value of the effect independently of its discrimination of the outcome in the subgroup of interest. What then follows is that, for this property to be present, the effect measure should be expected to be non-collapsible.

However, you have not addressed the point I raised about your paper, which is to question whether diabetes modifies the effect of treatment on mortality, for which you concluded “the RR results across the strata in Table 1 are consistent with absence of causal effect modification, in that the decreased risk with treatment on the multiplicative scale is independent of the baseline risk.” First, the RR is collapsible, and our point of view (not necessarily that of the others in this thread) is that we are unable to say anything about its numerical independence from baseline risk vis-a-vis outcome discrimination, so it is unlikely that we can agree with your conclusion regarding the RR. Can we conclude this from the OR? This is difficult to say without biological input, because an apparent product term in a small data-set may just be an artifact of the sample (even with a P value of 0.01) and not a feature of the population, and thus inclusion of a product term may (or may not) amount to over-fitting. There will of course be far more need for product terms in log-risk models simply to keep risks bounded between 0 and 1 – an argument Frank has raised several times in this thread.
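The boundedness point can be seen with one line of arithmetic (hypothetical numbers): with a stratum baseline risk of 0.7, imposing a common RR of 1.6 gives an implied treated risk of

0.7 \times 1.6 = 1.12 > 1,

so a log-risk model has to depart from the common RR (e.g., via product terms) to keep fitted risks within [0, 1], whereas a logistic model keeps them bounded automatically.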

1 Like

Fair enough. I’ve done 1/2 the study I need to do. If you had to pick one paper that would be most beneficial for someone coming from my point of view, which paper would it be?

Yes about the variance ratio and n-k; the latter is along the lines of some analyses I’ve done where I “chance correct” the LR \chi^2 before computing the ratio, by subtracting its d.f. (the number of parameters estimated besides intercept(s); the expected value of \chi^2 under the null). When the corrected \chi^2 < 0 I just set it to 0.
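For concreteness, here is one way such a chance-corrected ratio could be computed - this is a reconstruction for illustration only and may not match the exact A/B definition used earlier; it takes the corrected LR \chi^2 of the no-product model over that of the all-product model, on simulated data:

```python
# Sketch: chance-corrected LR chi-square ratio, no-product vs all-product logistic model.
# Simulated data with no true interaction; model/variable names are placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
x = rng.binomial(1, 0.5, n)                          # treatment
z = rng.normal(size=n)                               # covariate
p = 1 / (1 + np.exp(-(-1 + 0.8 * x + 0.5 * z)))      # true model: no x*z interaction
y = rng.binomial(1, p)

null = sm.Logit(y, np.ones((n, 1))).fit(disp=0)                                  # intercept only
main = sm.Logit(y, sm.add_constant(np.column_stack([x, z]))).fit(disp=0)         # no product
full = sm.Logit(y, sm.add_constant(np.column_stack([x, z, x * z]))).fit(disp=0)  # with product

lr_main = 2 * (main.llf - null.llf)   # LR chi-square on 2 d.f.
lr_full = 2 * (full.llf - null.llf)   # LR chi-square on 3 d.f.

ratio = max(lr_main - 2, 0) / max(lr_full - 3, 0)    # subtract d.f., floor at 0
print(round(ratio, 3))   # close to 1 when the product term adds nothing beyond chance
```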

Sander while I don’t agree with everything you said, there is much to agree with here and clearly a lot of food for thought.

I have yet to be convinced that marginal adjusted odds ratios are interpretable or applicable to individual patient decision making. In my world view, all non-conditional effect measures were originally designed for the (never occurring) homogeneous outcome case (within treatment groups).

But some things you said earlier were in this morning’s waking thoughts:

In normal linear models without treatment interactions we can summarize treatment effects with only two parameters: the difference in means and the patient-to-patient variance within treatment. In nonlinear models the root cause of the confusion is the desire to have a single-number summary. ORs are an attempt to arrive at a single number, but clinical decision making needs to supplement the OR with an estimate of base risk in order to compute the estimated probability of success under either treatment, or to compute the RD. For a single patient, the best outputs of an analysis of binary Y are the two probabilities, and short of that, their difference (RD). Because even in the absence of interaction the RD varies tremendously from patient to patient, a clinical trial summary needs to be a distribution. And we can use a simulation-extrapolation (SIMEX)-like analysis to learn about the effect of incomplete conditioning (omission of important prognostic variables from the model) on these RD distributions. Focusing on the entire distribution is what allows us to get around the difficulty clinicians have in interpreting odds ratios.
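As a minimal sketch of the kind of RD-distribution summary I have in mind, on simulated data rather than GUSTO-I (variable names are just placeholders; statsmodels for the logistic fit):

```python
# Fit a logistic model, then compute each patient's predicted risk under treatment
# and under control; the patient-specific RDs form a distribution, not one number.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5000
age = rng.normal(60, 10, n)
severity = rng.normal(0, 1, n)
tx = rng.binomial(1, 0.5, n)
lin = -5 + 0.06 * age + 0.8 * severity - 0.7 * tx    # constant OR, no interaction
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))

X = sm.add_constant(np.column_stack([age, severity, tx]))
fit = sm.Logit(y, X).fit(disp=0)

X1 = X.copy()
X1[:, 3] = 1          # everyone treated
X0 = X.copy()
X0[:, 3] = 0          # everyone untreated
rd = fit.predict(X1) - fit.predict(X0)

# Even with a constant OR, the patient-specific RDs vary widely with baseline risk.
print(np.percentile(rd, [5, 25, 50, 75, 95]).round(3))
```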

So you gave me inspiration for a short paper, the outline of which I had finished before the coffee was brewed. Going to work on it now. I’ll use the 40,000 patient GUSTO-I dataset for my examples.

2 Likes

This does sound appealing. Looking forward to the paper. Note that our interest is not only in clinical decision making but also in understanding how the treatments interact with biological subgroups for research / causal discovery purposes.

For the latter, RD distributions will not help because they represent variation in patients, not variation in effects (unless strong interactions are present). For mechanistic heterogeneity in treatment effects you need to start with a scale on which such heterogeneity is allowed to be absent (otherwise we’d be able to write a new paper every hour). Sander earlier pointed to machine-learning types of analyses where the underlying model is flexible, which you can then use to estimate almost assumption-free ORs (if the sample size is huge); or you can fit regression models with pre-specified possible interactions.

2 Likes

Yup. My sense is that this is what @AndersHuitfeldt is trying to do: come up with scales that are invariant when the putative causal processes do not change. And it is this difference in goals/focus that has fueled at least some of the current debate. It is quite possible that no approach will perfectly cover both purposes. Which is OK. Both sides of this methodological argument are very valuable, and we can choose approaches based on our purposes, as we do for so many other tasks.

2 Likes

There will probably still be disagreement between the statistically inclined and medical practitioners on this issue.

It seems like, in any field of medicine where the main effect fails to demonstrate predictive accuracy in future studies, subtypes and/or interactions are postulated. Just look at the threads on sepsis as an example.

The statistician will point out that the information-theoretic complexity of a main-effects + interactions model is much higher than that of a simple main-effects model. High-complexity models have lower prior probabilities.

Any attempts by the practitioners to use “excess heterogeneity” arguments from RCTs will be dismissed as artifacts of modelling, unless done with very large samples with unobjectionable models.

2 Likes

This does not in any sense explain why people disagree with you.

If I understand correctly, your claim is that the odds ratio in a stratum is a deterministic function of the conditional sensitivity/specificity/discrimination in that stratum, i.e. the test characteristics when using the exposure variable to predict the outcome. I have not double checked this mathematical claim, but it seems plausible, so for the purposes of this discussion, I am going to assume that it is true.

But so what? What makes this relevant to the choice of effect measure? I could see it being relevant if you had some reason to expect the test characteristics to be relatively constant. But an immediate consequence of what you are saying is that the discrimination itself would change arbitrarily depending on what you have conditioned on. So if you use the discrimination to measure the strength of association, you end up with a measure whose numerical value is an arbitrary function of your conditioning set. How is that useful to a physician, who may not have the ability to condition on exactly the same covariates as you?

I will also note that I have philosophical misgivings about framing effect measures in terms of diagnostic tests, particularly for the observed outcome Y (rather than the counterfactual outcomes Y(a=0) and Y(a=1)). I suspect if you thought through everything in terms of causality, you would end up with a very different conclusion.

Note that in contrast to standard diagnostic tests, in which test characteristics are often reasonably assumed to be stable across strata, we are talking about a situation where the “gold standard” abstraction is inherently caused by the test itself. This breaks the standard rationale for expecting test characteristics to be stable.

I used to think there was such a dichotomy but not anymore. Interactions are scale-dependent, and even when they are added as “additive” in a main-effects model, there is (or should be) biology behind them.

1 Like

Interesting point and counter-point. I have in different contexts shifted between your view and that of R_cubed. This has given me an idea for a very simple simulation that can shed light on this; I will run it and post here soon - perhaps it will also shed more light on the effect measure debate.

1 Like

Prediction: if you did a meta-analysis to try to classify differential treatment effects as biological/structural vs. not, there would be a strong correlation between that classification and whether there is interaction on an unrestricted scale.

What do you mean by classifying treatment effects as biological/structural or not? All treatment effects are determined by biology. Differential treatment effect is the default case. An unsaturated model makes no restrictions on biology. A homogeneous treatment effect is a special case of the unsaturated model.

If there is equal treatment effects on some scale (and if it did not happen because of chance/insufficient sample size), then I can assure you that it happened for some biological/structural reason. In every single case.

Frank, I just completed the simulation - not sure it makes sense but let’s see.
I took the following hypothetical data:

x \ y        0      1   Total
0          120     80     200
1           80    120     200
Total      200    200     400

Then randomly sampled 200 observations and assigned Z=0 to them and Z=1 to the remainder

Ran the logit risk model with (_a) and without (_b) the X×Z product term. Obviously Z has no prognostic value for Y, so the lnOR for X should be ln(2.25) = 0.8109.

I ran this a thousand times and plotted _a and _b
Here are the results

Addendum
Context: The contention here is that all modifiers in the natural world are constructed by our thoughts about what should be and then driven by what we observe from the data. Thus they cannot be simulated. What we can simulate is the absence of modification/interaction - which is what this simulation does.
Interpretation: Most product terms in clinical research will be artifacts of the sample: the lnOR(XY) in model _a varied grossly, whereas in model _b it was spot on, so postulating a prior mechanism to be tested by the data, or generating a data-driven hypothesis about biological interactions, does not seem possible. Keep in mind this was only random error; in the real world we have systematic error too.
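For anyone who wants to check this, here is a sketch of the simulation in Python (a reconstruction of what I described above, not the exact code I ran; statsmodels is used for the logit fits):

```python
# Fixed 2x2 X-Y table, a randomly assigned binary Z unrelated to Y, and logit
# models with (_a) and without (_b) the X*Z product term, repeated 1000 times.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Expand the 2x2 table above into 400 observations
x = np.repeat([0, 0, 1, 1], [120, 80, 80, 120])
y = np.repeat([0, 1, 0, 1], [120, 80, 80, 120])
true_lnor = np.log((120 * 120) / (80 * 80))          # 0.8109

lnor_a, lnor_b = [], []
for _ in range(1000):
    z = rng.permutation(np.repeat([0, 1], 200))            # random half assigned Z = 1
    Xa = sm.add_constant(np.column_stack([x, z, x * z]))   # model _a: with product term
    Xb = sm.add_constant(np.column_stack([x, z]))          # model _b: no product term
    lnor_a.append(sm.Logit(y, Xa).fit(disp=0).params[1])   # X coefficient (within Z = 0)
    lnor_b.append(sm.Logit(y, Xb).fit(disp=0).params[1])   # X coefficient adjusted for Z

print(true_lnor)
print(np.mean(lnor_a), np.std(lnor_a))   # varies widely around 0.81
print(np.mean(lnor_b), np.std(lnor_b))   # tightly centred near 0.81
```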