Should one derive risk difference from the odds ratio?

I’m speaking of validated findings where the differential treatment effect has an accepted pathophysiologic explanation. A lot of papers have been published in the medical literature based on naive subgroup analysis, where a differential treatment effect has been claimed but can be shown to be due merely to risk magnification from selecting patients with a different average baseline risk than the trial patients as a whole.

As a simple mathematical fact, there will always be effect measure heterogeneity on every scale except at most one. Therefore, heterogeneity is not something that requires an explanation or a mechanism. It is just how we expect the world to look. Instead, the absence of heterogeneity is what requires an explanation.
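To make the arithmetic concrete, here is a minimal sketch with hypothetical numbers (mine, not from any study): hold the odds ratio fixed across two strata with different baseline risks, and the risk ratio and risk difference are forced to differ between the strata.

```python
# Minimal sketch with hypothetical numbers: a constant odds ratio of 2
# across two strata with different baseline risks forces the risk ratio
# and risk difference to vary between the strata.

def risk_from_or(p0, odds_ratio):
    """Treated risk implied by baseline risk p0 and a fixed odds ratio."""
    odds1 = odds_ratio * p0 / (1 - p0)
    return odds1 / (1 + odds1)

for p0 in (0.10, 0.50):
    p1 = risk_from_or(p0, 2.0)
    print(f"baseline {p0:.2f}: RR = {p1 / p0:.3f}, RD = {p1 - p0:.3f}")
# baseline 0.10: RR = 1.818, RD = 0.082
# baseline 0.50: RR = 1.333, RD = 0.167
# The OR is homogeneous by construction, yet RR and RD both vary.
```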

The mindset in which heterogeneity requires a mechanistic explanation falls apart completely unless we have strong reason to believe that nature tends to generate data using a specific, known link function across most biological settings.

I’m sorry to move into psychoanalysis here, but I wonder if part of the attachment to the odds ratio comes from an inability to let go of the mindset in which homogeneity is the default case, leading to a leap of faith to the conclusion that nature always uses a logit function when it generates data. Of course, I understand the practical need to act as if homogeneity is the default case - otherwise, we would never have sufficient statistical power to estimate anything. But hopefully, it is still possible to recognize that homogeneity is a heroic assumption and that we can make better decisions if we try to understand how it might and might not arise?

3 Likes

Over 5 decades of working in medical research I have definitely taken homogeneity to be the default assumption. I’ve been involved in hundreds of clinical studies, and the fraction of them in which heterogeneity of effect was demonstrated is strikingly small. There are even meta-analyses in the literature for continuous Y showing evidence for anti-heterogeneity of treatment effects (lower variance of Y in treated than in placebo groups).

3 Likes

And this has all been on the odds ratio scale? Did this not strike you as remarkable? Did you wonder about why nature would work in such a mysterious way? Or question whether your tests for heterogeneity were sufficiently powerful?

1 Like

No one paper can give you everything. At a bare minimum: finish reading the one JCE paper that is “part 1 & 2”, which should be easy, bearing in mind that it was written, as the editors requested, for clinicians who are concerned about but baffled by the algebraic arguments; then read my 1987 AJE paper for some algebra.

[JCE parts 1 and 2 are really one brief paper that I was forced to cut into 2 short commentaries because of the publisher-imposed word limit on invited commentaries. That you read and reacted to part 1 without part 2 shows the problem I feared from the split; the points relevant here are all in part 2.]

While my space allotment was very tight, in JCE part 2 I tried to cite as many of the papers on noncollapsibility as I could from recent (2010 on) literature including those from upcoming generations who have delved more deeply into the issue. I hope that my papers will give you a good entry to this newer literature.

2 Likes

The question seems to boil down to: who has the burden of proof here? Which prior is more reasonable: no treatment × patient interaction, or a treatment × patient interaction?

The predictive value of any patient × treatment interaction, after adjusting for the obvious factors that @f2harrell recommends, is likely to be small enough that spending resources to discover such interactions will have zero or negative net present value, due to the rapidly diminishing marginal return of information.

I’d say Frank is on solid ground here. You may have a (reasonable) context specific prior, but the burden of proof remains on those asserting a specific interaction. There is a vast family of possible interaction models.

The argument made by David Hand on the illusion of progress in ML classifier technology appears to me to be closely related to this issue of treatment interaction.

Related:
David J. Hand. “Classifier Technology and the Illusion of Progress.” Statist. Sci. 21 (1) 1 - 14, February 2006. https://doi.org/10.1214/088342306000000060

Blockquote
The essence of the argument is that the improvements attributed to the more advanced and recent developments are small, and that aspects of real practical problems often render such small differences irrelevant, or even unreal, so that the gains reported on theoretical grounds, or on empirical comparisons from simulated or even real data sets, do not translate into real advantages in practice. That is, progress is far less than it appears.

There is a large number of possible scales on which to measure effects, and at most one of them can be homogeneous.

If Dr. Rando claims that “the default case is homogeneity on risk difference scale, the burden of proof is on you to show heterogeneity”, how is that different from Frank’s claim that “the default case is homogeneity on odds ratio scale, the burden of proof is on you to show heterogeneity”?

If you want to place the burden of proof on those who claim heterogeneity, you really need to have a very good explanation for why nature always uses a logit function to generate data.

That would support @f2harrell – since an arbitrarily chosen scale will show heterogeneity even when none exists, one should use a scale that can express homogeneity.

1 Like

Oh Frank! Anders is dead right about this: You have fallen prey to confusing absence of evidence with evidence of absence, exactly as clinicians and med journals do with NHST, in a way I’ve found rampant among med statisticians who should know better. It’s circularity in this sense: You assume a homogeneous model for a study with almost no chance of detecting any reasonable degree of heterogeneity; you only test homogeneity, failing to look at interval estimates for the heterogeneity to see how little information you have about homogeneity violations. You then conclude that heterogeneity was “not demonstrated”, as if OR homogeneity were privileged by nature when in reality it is only privileged as a default model by statisticians (for the technical reasons I listed way above). So as with NHST claims, I see your comment as a classic example of the mind-projection fallacy.

I tried to call attention to the problem nearly 40 years ago in
Greenland S. “Tests for interaction in epidemiologic studies: a review and a study of power”, Stat Med 1983;2:243–251, as did Smith & Day in “The design of case-control studies: the influence of confounding and interaction effects”, Int J Epidemiol 1984;13:356–365. More recently, Poole, Shrier & VanderWeele discussed the problem in “Is the risk difference really a more heterogeneous measure?”, Epidemiology 2015;26:714–718.

As for heterogeneity in meta-analyses, see Ch. 33 of Modern Epidemiology, 3rd ed. (2008), which explains that we need more than just tests for heterogeneity to make inferences about it - we need interval estimates of it too. I will bet that if you go back over the hundreds of studies you mentioned, few if any of them would be seen as ruling out important OR heterogeneity based on simple inspection of the interval estimates for the heterogeneity (which is measured by the ratios of the covariate-specific ORs, or equivalently by the antilogs of the product-term coefficients). Also, your analogy with studies of continuous Y is misleading here, because a continuous Y offers far more information per observation than a binary Y does, and thus far more power to detect heterogeneity.
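To illustrate the kind of interval estimate meant here, a sketch with hypothetical counts (the numbers are mine, not from any cited study), using Woolf’s variance formula for the log odds ratio:

```python
# Hypothetical sketch of an interval estimate for OR heterogeneity:
# the ratio of two stratum-specific ORs, with a Wald CI on the log
# scale using Woolf variances (sum of reciprocal cell counts).
import math

def log_or_and_var(a, b, c, d):
    """Log odds ratio and its Woolf variance for a 2x2 table."""
    return math.log(a * d / (b * c)), 1/a + 1/b + 1/c + 1/d

lor1, v1 = log_or_and_var(30, 70, 20, 80)   # stratum 1 (made-up counts)
lor2, v2 = log_or_and_var(15, 85, 12, 88)   # stratum 2 (made-up counts)

diff = lor1 - lor2                          # log of the ratio of ORs
se = math.sqrt(v1 + v2)
lo, hi = math.exp(diff - 1.96 * se), math.exp(diff + 1.96 * se)
print(f"ratio of ORs = {math.exp(diff):.2f}, 95% CI {lo:.2f} to {hi:.2f}")
# -> ratio of ORs = 1.32, 95% CI 0.47 to 3.76: an interval this wide is
#    compatible with both strong heterogeneity and near-homogeneity.
```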

1 Like

No that’s wrong. It may be that no scale can “demonstrate” homogeneity in the sense of ruling out important heterogeneity by some conventional criterion; in fact that has to be the case for all criteria when effect estimates reverse across strata according to clinical expectations (which is often seen in situations involving dangerous treatments). Then too, when we examine the interval estimates for heterogeneity it is usually the case that homogeneity cannot be “demonstrated” for any easily interpretable scale (including ORs as well as RRs and RDs).

2 Likes

No one is saying heterogeneity is exactly zero. That is a straw man. The question is between simple predictive models and more complex ones that incorporate treatment × patient interactions. I think it is reasonable to have a prior that is skeptical of the latter.

Well, yes, if you know of a scale on which the effect of exposure is a priori known to be homogeneous, you should choose it. The question is how you would know that you have the right scale.

So this brings us back to my working paper, from almost 100 posts back in the thread. In that paper, we show that for a class of problems that occurs with some regularity in pharmacoepidemiology, the best approximation to a stable effect measure is the switch risk ratio. This is not a perfect solution, but it works in some cases. In some cases where it doesn’t fully work, the approach gives you a framework for reasoning about what effect modifiers you would need to control for, or how you might set up sensitivity analyses or partial identification strategies. In some other cases, it clearly doesn’t work at all, but at least the framework will then tell you that, because it will then be impossible to construct a plausible causal model of the type we propose.

I do not know of anyone who has provided a biological narrative (“causal model”) that results in stability of the odds ratio across strata, for any exposure-outcome relationship. This makes it hard to imagine that nature can work in a way that is best approximated by a model based on a logit link.

Most arguments for stability on the odds ratio focus on the fact that, in contrast to the risk ratio or the risk difference, it is mathematically possible to have stability for any baseline risk. This is not contested. I like to think of this property in terms of whether what we call the “effect function” is closed on (0,1). Such closure is not a unique attribute of the odds ratio. This property is therefore not a sufficient criterion for stability (though it may well be a necessary one). The effect function for the switch risk ratio also has the closure property.
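To make the closure idea concrete, here is a sketch in my own notation (the piecewise form for the switch risk ratio is my reconstruction from the description later in the thread, not a quotation from the manuscript). Write the effect function as the map from baseline risk \(p_0\) to treated risk \(p_1\) at a fixed value of the measure:

\[ \text{OR } \theta: \quad p_1 = \operatorname{expit}\bigl(\operatorname{logit}(p_0) + \log\theta\bigr) \in (0,1) \text{ for all } p_0 \in (0,1); \]

\[ \text{switch RR } s \in (-1,1): \quad p_1 = \begin{cases} p_0(1+s), & s \le 0 \\ p_0 + s(1-p_0), & s > 0 \end{cases} \quad \in (0,1) \text{ always}; \]

\[ \text{RD } \delta: \quad p_1 = p_0 + \delta, \text{ which leaves } (0,1) \text{ whenever } \delta > 1 - p_0 \text{ or } \delta < -p_0. \]

So the OR and the switch risk ratio both have the closure property, while the risk difference does not.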

Blockquote
Well, yes, if you know of a scale on which the effect of exposure is a priori known to be homogeneous, you should choose it. The question is how you would know that you have the right scale.

You never know that the specific model is correct; you only act as if it is sufficiently close (after adjustment for known sources of heterogeneity) for the purpose at hand.

Just to be clear, I see the dispute between you and Frank as one between two competing models with the same prognostic covariates, except that your model has at least one additional term for interaction.

Is that a reasonable representation of your position?

No. This is not correct. My model has the same number of parameters.

(This manuscript only proposes an effect measure, not a model. A later manuscript will generalize it to regression models, which is what I am referring to here.)

1 Like

Fair enough. I need to think about this a bit more. But I don’t see any reason to see Frank’s prior as unreasonable just yet. I’ll let him speak for himself before I chime in again.

1 Like

That does not address why we should be “skeptical” of product terms on one scale but not others, since it is often the case that the terms won’t look “necessary” on any of the common scales (simply because of inadequate estimation precision on all the scales). The answer is one of logic + convenience: the logit scale allows variation independence of parameters and that can lead to better fit of first-order models. This is not a reason to be skeptical of product terms: nature could not care less what we think about them. It is only an argument for defaulting to the simplest case when no other consideration is available for model selection. That is a parsimony heuristic; to instead treat the default no-product model as if it were a principled scientific theory to stand until falsified (as the use of “skeptical” suggests) is yet another variation on the nullistic fallacy.
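As a one-line formalization of the variation-independence point (notation mine):

\[ \text{logit link: } p_1 = \operatorname{expit}(\alpha + \beta) \text{ is a valid risk for every } (\alpha, \beta) \in \mathbb{R}^2; \qquad \text{identity link: } p_0 = \alpha,\; p_1 = \alpha + \beta \text{ forces } \beta \in (-\alpha,\, 1-\alpha). \]

On the logit scale the baseline and effect parameters can be estimated without constraining each other; on the identity (risk-difference) scale the admissible effect shrinks as the baseline risk approaches 0 or 1.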

This also reveals a problem raised by post 148, which seemed to recommend picking the link function based on the data and then using statistics (e.g., predictions) from the resulting model as if one had picked that link a priori. If so, that is in effect cheating in the statistical sense, because it fails to account for estimating an implicit and highly nonlinear parameter. Consequently, the out-of-sample predictive performance will typically be no better than that of the more complex-looking model we get from sticking with a simple link function (like the logit) and using a product term as needed; at least then we can put an interval estimate on the extra term. That will also make it easier to estimate causal effects, which is what I am most concerned about and which I thought was the topic of this thread.

The problem here is that the simplicity of a data-selected model is largely deceptive, as has been documented by a half-century of analytic and simulation studies; by 1978 there was even an entire book addressing such model-selection problems, the still conceptually superb Specification Searches by EE Leamer, available free at
https://www.anderson.ucla.edu/faculty_pages/edward.leamer/books/specification_searches.htm
I would bet Frank covers the predictive issue in his book; I’ll leave it to him to point to that. [Robins and I wrote a summary of some of Leamer’s key points for causal logistic modeling, “The role of model selection in causal inference from nonexperimental data”, Am J Epidemiol 1986;123:392-402, and some basic updates in light of 30 years of developments were reviewed in Greenland, Daniel & Pearce, “Outcome modelling strategies in epidemiology: traditional methods and basic alternatives”, Int J Epidemiol 2016;45:565-575.]

As you stated, the challenge is that assumptions → simulations → results consistent with assumptions. I prefer to do things like what was done here, where odds ratios, risk ratios, and risk differences were given every chance to vary in a real, large RCT dataset, and we showed how each varied over patients.
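For readers who want the mechanics of such an analysis, here is a rough sketch on simulated data (not the RCT dataset referred to above): fit a logistic model, predict each patient’s risk with treatment set off and on, and inspect how the derived effect measures vary across patients.

```python
# Rough sketch on simulated data: per-patient effect measures derived
# from a fitted logistic model. In this toy version the OR is constant
# by construction, which makes the induced variation in RD and RR easy
# to see; a real analysis would add interaction terms so that the OR
# itself is also given a chance to vary.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(60, 10, n)
tx = rng.integers(0, 2, n)
lin = -4 + 0.05 * age + 0.7 * tx              # constant log-odds effect
y = (rng.random(n) < 1 / (1 + np.exp(-lin))).astype(int)

X = sm.add_constant(np.column_stack([age, tx]))
fit = sm.Logit(y, X).fit(disp=0)

X0, X1 = X.copy(), X.copy()
X0[:, 2], X1[:, 2] = 0, 1                     # treatment off / on
p0, p1 = fit.predict(X0), fit.predict(X1)

print(f"per-patient RD: {np.min(p1 - p0):.3f} to {np.max(p1 - p0):.3f}")
print(f"per-patient RR: {np.min(p1 / p0):.2f} to {np.max(p1 / p0):.2f}")
```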

I for one would never use that metric and see no convincing argument for using a non-prospectively-defined measure, as we’ve discussed before.

I’ve looked at this in many medical settings and with many link functions, including the identity link in approximately Gaussian data, where there are fewer issues. The number of replicable heterogeneity-of-treatment-effect examples in the literature is extremely small. You are correct that in a given situation strong evidence of absence is not available, but taking the literature as a whole, and especially the cardiovascular literature where my collaborations primarily reside, the examples of differential treatment effect found, claimed, and validated are few in number. Some of the best examples involve heterogeneity in metabolism, but even those have often not translated into differential effects on clinical outcomes.

1 Like

Yes, you keep saying that. Since you are just stating that you don’t like the idea (instead of making an argument that I can respond to), it hasn’t really turned into much of a discussion.

And to be clear: The switch risk ratio is prospectively defined in every sense. It takes two probabilities as inputs and outputs a real number in the interval (-1,1), in a prespecified way. That number is a sufficient representation of the effect if the data are generated by a certain class of sufficient-component cause models. Those models have a clear biological interpretation, so you can evaluate whether you believe the biological model. Since no such model can exist for the odds ratio, there is no way to do anything similar for a logistic model.
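For concreteness, here is a small sketch of the mapping as I understand it from the description above (two risks in, one number in (-1,1) out); the piecewise form is my reconstruction and should be checked against the manuscript:

```python
# Sketch of the switch risk ratio: maps (p0, p1) to a number in (-1, 1).
# Piecewise form reconstructed from the description above; illustrative only.

def switch_risk_ratio(p0, p1):
    """Baseline risk p0 and treated risk p1 -> effect measure in (-1, 1)."""
    if p1 <= p0:                          # protective: risk ratio minus 1
        return p1 / p0 - 1.0
    return (p1 - p0) / (1.0 - p0)         # harmful: fraction of the gap to 1

print(switch_risk_ratio(0.40, 0.20))      # -0.5  (risk is halved)
print(switch_risk_ratio(0.40, 0.70))      #  0.5  (half the remaining risk gap)
```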

I’m not clear on how you define the ratio in a situation where you don’t know in advance on which side of 0.5 the event’s risk falls. And what if males have risk < 0.5 and females have risk > 0.5?