Should one derive risk difference from the odds ratio?

Anders: While I’m all for exploring the practical limits of simple causal models and their implications, right now it looks to me that you are also doing what you complained Doi did to defend his claims: adding unnecessary and even irrelevant diversions that only obscure your mistake in issuing too sweeping a denial of the existence of simple mechanistic causal population structures under which the causal RD can be stable (unmodified) despite variation in background risk (risk under absence of both factors).

For those not having a copy of Modern Epidemiology (Ch. 18 in 2nd ed, Ch. 5 in 3rd ed) where the results at issue are covered, they were taken from Greenland S, Poole C, “Invariants and noninvariants in the concept of interdependent effects”, Scand J Work Environ Health 1988;14:125–129 (as usual, available on request). These citations delineate an entire class of mechanisms which can produce constant, additive RDs across a large range of baseline risk variation. In these mechanisms, for every individual the x effect does not depend on the z level and the z effect does not depend on the x level. A consequence is that there would be no modification of the RD (no effect-measure modification) for either x across z or z across x, i.e., perfect additivity of separate-treatment RDs to get joint treatment effects.

Variation in the baseline risk across populations need not destroy this additivity. I understand that such mechanistic RD additivity seems “paradoxical” given the range restrictions, but that just shows how your intuition suffers the same sort of limits as Doi’s and Harrell’s do (mine would too, but for the fact that I encountered these results 40 years ago). As with any causal model, whether such noninteractive, non-modifying structures are plausible or realistically transportable is context dependent and hence largely in the eye of the beholder, so that is a separate topic we won’t resolve here.
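
For readers without the book or the 1988 paper to hand, here is a minimal arithmetic sketch of such a mechanism. The response-type proportions are hypothetical (my own illustration, not numbers taken from the cited sources):

```python
# Hypothetical population of potential-outcome response types for two binary
# factors x and z, with NO interaction response types: each person's x effect
# is the same at both z levels, and each person's z effect is the same at both
# x levels.
p = {
    "doomed":   0.05,  # Y = 1 under every (x, z) combination
    "x_causal": 0.10,  # Y = 1 iff x = 1, regardless of z
    "z_causal": 0.30,  # Y = 1 iff z = 1, regardless of x
    "immune":   0.55,  # Y = 0 under every (x, z) combination
}

def risk(x, z):
    """Population risk Pr(Y = 1) if everyone is set to exposure combination (x, z)."""
    return p["doomed"] + p["x_causal"] * x + p["z_causal"] * z

for z in (0, 1):
    rd_x = risk(1, z) - risk(0, z)   # causal RD for x at this z level
    print(f"z = {z}: baseline risk {risk(0, z):.2f}, causal RD for x {rd_x:.2f}")
# z = 0: baseline risk 0.05, causal RD for x 0.10
# z = 1: baseline risk 0.35, causal RD for x 0.10
```

The causal RD for x is 0.10 at both z levels even though the risk in the absence of x is 0.05 versus 0.35, which is exactly the kind of stability being described.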

It’s fine to pursue the source of an incorrect initial intuition to see what can be learned from it, but randomization and prediction are irrelevant to the present case: The above cites and my points are about the true causal RDs computed directly from the full x*z potential-outcome vector of each population member under every exposure combination. Whether we are accurately estimating effects or predicting risks are vast topics that do not bear on the results.

3 Likes

I’ll get back to you in a week’s time after having thought through this thoroughly. For now, I will just point out that your read on my psychology is wrong. I try very hard to always acknowledge my mistakes when someone convincingly points them out. Moreover, I am confident that I have a track record to back up that claim.

I don’t know whether what’s happening is that I’m not smart enough to understand your argument, or that I’m not communicating precisely enough for you to understand what I’m saying. But in either case, it is not because I am making any attempts to obscure my mistakes or defend a lost position.

2 Likes

If we indicate different baseline risks by a, b, c, … then the implication is that:
RD = a(RR_a − 1) = b(RR_b − 1) = c(RR_c − 1) = …
where RR_a is the risk ratio at baseline risk a, and so on. Which means that, by your own admission, in essence there is an entire class of mechanisms suggesting non-constancy of the RR.

“By your own admission”? Where in this thread has anyone claimed the RR is constant or nearly so outside of special models? You and coauthors are the ones who made erroneous assertions about constancy of the OR based on a confusion of statistical models with mechanistic causal models, combined with a misanalysis of meta-analytic data; those mistakes are what kicked off this thread and the exchanges in JCE. What is your point now? Have you finally learned that the OR is not constant under simple mechanistic models?

My position all along has been that in epidemiologic practice any assumption of constancy is just a statistical convenience which rarely has any credible basis in data or biology (I can’t speak for other fields, like bioassay which has a century-long literature on the topic). It’s been known for generations and obvious from the math that (with rare exceptions) constancy of one measure means nonconstancy of the rest - so if the RD is constant, we get nonconstancy of the OR, RR, relative difference, AF, and so on. And it’s obvious from correct meta-analyses that it’s rare indeed that one can correctly claim even one of them is constant (as opposed to mistakenly claiming constancy based on failure to “reject” using nearly powerless statistical tests).
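
A quick numerical illustration of that algebraic point, with arbitrary baseline risks and a constant RD of 0.2:

```python
def odds(r):
    return r / (1 - r)

rd = 0.2                               # risk difference held constant
for r0 in (0.1, 0.3, 0.5):             # arbitrary baseline risks
    r1 = r0 + rd
    print(f"baseline {r0:.1f}: RR = {r1 / r0:.2f}, OR = {odds(r1) / odds(r0):.2f}")
# baseline 0.1: RR = 3.00, OR = 3.86
# baseline 0.3: RR = 1.67, OR = 2.33
# baseline 0.5: RR = 1.40, OR = 2.33
```

Hold the RD fixed and the RR and OR are forced to move with the baseline risk; the same exercise run with a fixed RR or OR forces the other measures to vary.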

When (as usual) the direction of heterogeneity is unclear, the best we can hope from assuming constancy is that it removes enough noise to more than compensate for the bias it creates - the old “bias vs. variance” tradeoff in minimizing mean-squared error. But that hope is not guaranteed and depends entirely on the target of estimation. Better still is to allow some heterogeneity controlled by penalties (priors) that can average over the extremes between unconstrained heterogeneity and complete homogeneity, as allowed by hierarchical modeling of product terms. That became computationally feasible 40 years ago and by the 1990s could be found in textbooks, e.g. p. 431 of Modern Epidemiology 2nd ed. 1998 (p. 438 of the 3rd ed.).

6 Likes

A few days ago, I wrote this in this thread:

Sander then correctly pointed out an example from Modern Epidemiology page 76, table 5.2, in which they show that if there are no “interaction response types” (which roughly means that for every individual, at least one of the interacting factors A and V must have no effect, for both values of the other interacting factor) then there will be stability of the causal risk difference for A, between the setting where we intervene to set V to 0, and the setting where we intervene to set V to 1. If we additionally assume exchangeability (independence between V and Y(a,v)), the causal risk difference for A will be equal between groups defined conditional on the observed value of V.

I do not contest this, and I am sorry about the imprecision in my claim. I do, however, maintain that this result is not very relevant to the kinds of decision problems that I am considering, in which a clinical decision maker needs to individualize the causal effect of the intervention (A) to a patient with V=v.

Just so everyone is on the same page, let us differentiate between three subtly different phenomena:

  • Statistical interaction/Product term in observational model: Both interacting factors are observational
  • Effect modification: Primary intervention variable is counterfactual, the other factor is observational
  • Causal interaction: Both intervention variables are counterfactual

The reason I consider effect modification more relevant to the decision problem than causal interaction is that, in order to individualize treatment to a person with V=v, doctors cannot intervene on V and therefore need to find some aspect of reality that is invariant over observed V.

I am fairly sure you will agree that if there exists a common cause of V and Y, which we will call U, then even if there are no interaction response types between V and A, there is no guarantee that there will be no additive effect modification. To see why, you can stratify table 5.2 by U and work out Pr(Y^(a) | V=v) for all a and v (using the law of total probability over U). I am not going to type up the maths, but I would be very surprised if this claim is incorrect.
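
To make this concrete, here is a small numerical sketch of my own (the probabilities are hypothetical, and this is not Table 5.2 itself): U is a common cause of V and Y, V itself has no effect on Y (so there are trivially no interaction response types between A and V), and yet the causal RD for A differs across strata of observed V, simply because V is a marker of U.

```python
# U is a common cause of V and Y; V itself has no effect on Y, so there are
# no A-V interaction response types. The individual-level A effect (an RD)
# differs by U. All numbers are hypothetical.
p_u = 0.5                                  # Pr(U = 1)
p_v_given_u = {0: 0.2, 1: 0.8}             # Pr(V = 1 | U = u)
risk = {                                   # Pr(Y^a = 1 | U = u) for a = 0, 1
    0: {0: 0.1, 1: 0.2},                   # U = 0: RD for A is 0.1
    1: {0: 0.2, 1: 0.5},                   # U = 1: RD for A is 0.3
}

def pr_y(a, v):
    """Pr(Y^a = 1 | V = v), averaging over U via Bayes' rule."""
    num = sum(risk[u][a] * (p_v_given_u[u] if v else 1 - p_v_given_u[u])
              * (p_u if u else 1 - p_u) for u in (0, 1))
    den = sum((p_v_given_u[u] if v else 1 - p_v_given_u[u])
              * (p_u if u else 1 - p_u) for u in (0, 1))
    return num / den

for v in (0, 1):
    print(f"V = {v}: RD for A = {pr_y(1, v) - pr_y(0, v):.2f}")
# V = 0: RD for A = 0.14
# V = 1: RD for A = 0.26
```

So the observed-V strata show additive effect modification of the A effect even though V modifies nothing at the individual level.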

I note that if we are considering causal interaction, then any differences in baseline risk are due to the causal effect of V. This model therefore cannot give you a reason to expect additive stability between groups whose baseline risks differ because of common causes of V and Y (at least not without expanding the model to also claim no interaction response types between A and all predictors of V).

So, in summary, what I should have written is something like this:

“Would you be able to formalize the biological story for a data generating process that leads to absence of effect modification, i.e. stability of the causal risk difference across groups constructed based on observational variables, and whose baseline risk differs arbitrarily?”

3 Likes

Thanks Anders for the concession and clarifications. I certainly agree that there is no guarantee of observing additivity if there exists a common cause U of V and Y, because we now have confounding of the V effect on Y by U. To be more precise, we should not expect additivity of the observed RD(AY|V) and RD(VY|A) even if the causal AY and VY RDs do add, because in your set-up the observed RD(VY|A) is confounded by U.

It appears that you want to use V to guide decisions about treatment A without concern about confounding of V effects; that’s fine. But as far as I can see, all you are saying is that you are concerned with a case in which we think RD(AY|V) is unbiased enough for the AY effect to provide to clinicians, but we may have failed to control enough confounding of RD(VY|A) to make valid inferences about the V effect on Y. In that case it should be no surprise that we cannot make valid inferences about the interaction of A and V effects.

In sum, the one point I see coming out of your arguments is that if you want to study the causal interactions of A and V on Y, you have to control confounding of both A and V. More precisely and generally: To study causal interactions among components Xj of a vector of exposures X = (X1,…,XJ), you have to control confounding of X, e.g., by blocking all back-door paths from X to Y. A corollary is that to ease correct deductions about confounding control for studying causal interactions from a DAG, we ought to examine the graph that replaces the potentially interacting exposures with a single vector of them.

Given that your concern translates into a higher-order confounding problem, my answer to your query would be Yes: I can easily formalize a biological story for the data generating process that leads to absence of RD modification (i.e. stability of the causal risk difference across groups constructed from observational variables, with baseline risks differing arbitrarily within logical constraints): All I need for that is (1) no AV-interaction response types and (2) sufficient control of confounding of effects of the exposure vector (A,V) on Y; then there will be no modification of the observed RD(AY|V) across V within confounder levels and also after marginal adjustment (or after averaging using any shared weighting scheme). Note that (2) is no more stringent a requirement than that for mediation analysis, in which we replace the baseline variable V with a mediator M between A and Y.
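
Reusing the hypothetical response-type proportions from the sketch earlier in this thread (now written in terms of A and V rather than x and z), here is what conditions (1) and (2) buy us when V is assigned independently of response type, i.e., when there is no confounding of the joint (A,V) effect:

```python
# Hypothetical response-type proportions; no A-V interaction response types
# (condition 1), and V independent of response type (condition 2).
p_type = {"doomed": 0.05, "a_causal": 0.10, "v_causal": 0.30, "immune": 0.55}

def y(t, a, v):
    """Potential outcome Y(a, v) for response type t."""
    return {"doomed": 1, "a_causal": a, "v_causal": v, "immune": 0}[t]

def pr_y_given_v(a, v):
    # With V independent of response type, Pr(Y^a = 1 | V = v) is just the
    # type-weighted average of Y(a, v).
    return sum(p_type[t] * y(t, a, v) for t in p_type)

for v in (0, 1):
    rd = pr_y_given_v(1, v) - pr_y_given_v(0, v)
    print(f"V = {v}: baseline risk {pr_y_given_v(0, v):.2f}, RD for A = {rd:.2f}")
# V = 0: baseline risk 0.05, RD for A = 0.10
# V = 1: baseline risk 0.35, RD for A = 0.10
```

The observed RD(AY|V) is 0.10 in both V strata even though the baseline risks differ, in contrast to the confounded-U sketch above.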

Do you agree (in which case this subthread should be done) or can you exhibit a mistake in my reasoning?

2 Likes

This discussion on causal effect modification seems interesting, so let’s take an example. If we run some GLMs on Sander’s example in the recent non-collapsibility paper in JCE and accept that there is neither confounding nor a sample artifact, the exponentiated product term in the log-risk model is 0.5 (sub-multiplicative) and in the logistic model is 1.6 (super-multiplicative). Thus male gender (M) modifies the treatment (Rx) effect on death risk (the risk ratio for Rx is 50% lower in M than in females (F)), while simultaneously M also modifies the treatment effect on the odds of death (the odds ratio for Rx is 55% higher in M than in F). However, in both M and F, Pr(death) increases with Rx (from 0.6 to 0.9 in M and from 0.1 to 0.3 in F). In summary, the association measures (compared to the baseline of noRxF) are:

| Group (vs noRxF) | RR | OR |
|---|---|---|
| Rx | 3 | 3.9 |
| M | 6 | 13.5 |
| RxM | 9 ↓ | 81 ↑ |

Obviously the results are conflicting because these measures are measuring the association differently, and our point has been that no meaningful heterogeneity of treatment effects should be derived from measures of association that measure that association poorly.
The inequality of the association of treatment with death across strata of gender (or the non-null product term) can be called association modification. This, of course, happens commonly even if gender is not a cause of death or is not associated with a cause of death, e.g. because of artifacts of the sample. Because association modification is mathematically reciprocal - if gender modifies the Rx-death association then Rx modifies the gender-death association - it suffices to just call this association modification, or a statistical product term, and to avoid use of the term interaction.
The unfortunate reality is that even when Rx and gender are both causes of death, even when confounders of their marginal associations with death are absent, and even when there is no artifact of the sample, there is still no guarantee that the association modification between Rx and gender with respect to death corresponds to causal effect modification if the measure of association is poor (i.e. it measures the association poorly due to interference by baseline risk).
The last point above is what we are discussing in this thread. We have three common choices of association measure for binary outcomes in medical decision making - RD, RR or OR. If any of these measures the intended association poorly, then association modification will correspond poorly to effect modification, aka heterogeneity of treatment effects, even if confounding and artifacts are absent. This is why we see different results from different measures. In the example above, it is quite clear that the RR is spurious for the RxM comparison to baseline (noRxF), as examination of the data clearly demonstrates that belonging to the RxM group gives the greatest increase in mortality. This sort of non-monotone relationship of measures with the true association (which can be quantified in many ways) seems to be a hallmark of collapsible measures of association.
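
For anyone who wants to verify these numbers, here is a small sketch; the four cell risks are those quoted above, and everything else is computed from them (the script and its variable names are mine, not taken from the paper):

```python
# Pr(death) by gender (F/M) and treatment (0 = no Rx, 1 = Rx), as quoted above.
risk = {("F", 0): 0.1, ("F", 1): 0.3, ("M", 0): 0.6, ("M", 1): 0.9}
odds = {k: r / (1 - r) for k, r in risk.items()}

ref_r, ref_o = risk[("F", 0)], odds[("F", 0)]        # baseline: noRxF
for k in [("F", 1), ("M", 0), ("M", 1)]:
    print(f"{k}: RR = {risk[k] / ref_r:.1f}, OR = {odds[k] / ref_o:.1f}")
# ('F', 1): RR = 3.0, OR = 3.9
# ('M', 0): RR = 6.0, OR = 13.5
# ('M', 1): RR = 9.0, OR = 81.0

# Exponentiated product terms: the ratio of the Rx effect in M to the Rx effect in F.
rr_term = (risk[("M", 1)] / risk[("M", 0)]) / (risk[("F", 1)] / risk[("F", 0)])
or_term = (odds[("M", 1)] / odds[("M", 0)]) / (odds[("F", 1)] / odds[("F", 0)])
print(f"log-risk product term = {rr_term:.2f}, logistic product term = {or_term:.2f}")
# log-risk product term = 0.50, logistic product term = 1.56

# Differences in odds (Rx vs no Rx) within each gender:
print(round(odds[("F", 1)] - odds[("F", 0)], 2), round(odds[("M", 1)] - odds[("M", 0)], 2))
# 0.32 7.5
```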

Anders, I see you posted this:

Your site finally let me post the following comment which seems relevant here as well:
Tell us Anders, where did you find out about Abbott’s paper and why did you obtain and read it? After all, here at the Datamethods blog on July 4 you repeatedly dismissed my comments that mechanistic models and risk results go back to the 1920s, replying
“It therefore isn’t at all obvious to me that reading the old literature is generally the best use of time” and “I wouldn’t try to learn calculus from Newton or Leibniz’ original writings, and likewise, I wouldn’t try to learn causal inference from Robins (1986)”,
to which I replied at length, ending with:
“For those with a historical bent, in a paper invited by an engineering toxicology journal (“Elementary models for biological interaction”, Journal of Hazardous Materials 1985;10:449-454) I attempted to provide a connection between bioassay and epidemiologic models for interaction.” Among other things this 1985 paper cites W.S. Abbott, A method of computing the effectiveness of an insecticide, J. Econ. Entomol., 18 (1925) 265–267.

Also, you never responded to my deduction of risk additivity from the assumptions of no interaction response types and of no confounding of either single or joint factor effects.

To take this one step further, let’s take the same example again (Sander’s JCE paper example). Let’s denote treatment and gender by X and Z, and denote death by Y. If we were to interpret these results in terms of marginal effects, we would typically look at the effect of the explanatory variables on the probability of death. However, let us start with the logistic model; in the table below, odds are presented rather than probabilities - they are simply the expected number of people who die for every person who lives. For a male the ratio is 9/1.5 = 6 and for a female it is 0.43/0.11 = 3.9. The interaction effect between X and Z is 1.55, which tells us (on the multiplicative scale) how much the effect of treatment differs between males and females - the effect of treatment for men is 6/3.9 = 1.55 times that for women.

We can compute the marginal effect as the difference in odds (which is what the fuss is all about) and from the table the marginal effect of treatment for women is 0.43-0.11 = 0.32 and for men is 9-1.5 = 7.5. As soon as we change to differences in odds we see that this is different from the multiplicative effect.

Predictive margins (Number of obs = 300), Model VCE: OIM, Expression: exp(xb()), over: z x, delta-method standard errors:

| z | x | Margin | Std. Err. | z stat. | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.1111111 | 0.037037 | 3 | 0.003 | 0.0385199 | 0.1837024 |
| 0 | 1 | 0.4285714 | 0.093522 | 4.58 | 0 | 0.2452718 | 0.6118711 |
| 1 | 0 | 1.5 | 0.4330127 | 3.46 | 0.001 | 0.6513107 | 2.348689 |
| 1 | 1 | 9 | 4.242641 | 2.12 | 0.034 | 0.6845771 | 17.31542 |

We can do the same for the log-risk model, and here we get RR 3 in females and 1.5 in males - the effect of treatment for men is 1.5/3, i.e. half of that for women. Obviously the discussion in this thread uses this as the standard against which the OR is judged, and on that basis the OR results are deemed somehow biased. Of course, one could easily argue the opposite, but in order to resolve this debate we need to understand which argument is correct.

Predictive margins (Number of obs = 300), Model VCE: OIM, Expression: exp(xb()), over: z x, delta-method standard errors:

| z | x | Margin | Std. Err. | z stat. | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.1 | 0.03 | 3.33 | 0.001 | 0.0412011 | 0.1587989 |
| 0 | 1 | 0.3 | 0.0458258 | 6.55 | 0 | 0.2101832 | 0.3898168 |
| 1 | 0 | 0.6 | 0.069282 | 8.66 | 0 | 0.4642097 | 0.7357903 |
| 1 | 1 | 0.9 | 0.0424264 | 21.21 | 0 | 0.8168458 | 0.9831542 |

In support of the log-risk argument there has been one simple observation: non-collapsibility. This is not limited to medical researchers; if we look at social science (Mood (2010), cited 2665 times), the same argument is given, and it was echoed by Edward Norton (discussed in one of the threads on datamethods):
a) Somehow the OR reflects unobserved heterogeneity, so it is not a substantive effect
b) Models with different independent variables cannot be compared with the OR, because the unobserved heterogeneity varies across models
c) Even with the same variables, unobserved heterogeneity means ORs are not comparable across samples, groups or time
As Buis has pointed out, one characteristic of logistic models that many of those who have discussed the OR (e.g. Norton) find problematic is that if we add a variable to a logistic regression model, the effects of the other variable(s) are very likely to change, even if this additional variable is uncorrelated with the other explanatory variables. This is essentially the argument laid out by Sander in the JCE paper: the OR(XY) was 2.75, and addition of the gender variable to the model (OR(XY)=3 with the product term, OR(XY)=4.5 without it) changes OR(XY) despite gender and treatment being uncorrelated. All that is required is that this new variable influences the outcome; it does not need to be correlated with the other explanatory variables, and of course this is what Sander means by non-collapsibility.

This does not happen in a log-risk model (with the product term), and hence a conclusion was drawn: non-collapsibility is bad, because control variables are added to a model to adjust for possible confounders, and if a variable is not a confounder we would like the effect of our explanatory variable of interest not to change when non-confounders are added. This has been the argument for the last 40 years - in a logistic regression the scale of the latent dependent variable changes when a new variable is added, even one uncorrelated with the others, or when the same variables are modeled in different samples, and so on.
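
To make the collapsibility contrast concrete, here is a small sketch using the example’s cell risks; the cell sizes (100 per treatment arm for women, 50 per arm for men) are an assumption of this sketch, inferred from the n = 300 and the standard errors in the margins output above rather than stated figures:

```python
# Risks by gender and treatment as in the example; cell sizes are an assumption
# (inferred from the margins output above), with gender balanced across arms,
# so gender is prognostic but not a confounder here.
risk = {("F", 0): 0.1, ("F", 1): 0.3, ("M", 0): 0.6, ("M", 1): 0.9}
n    = {("F", 0): 100, ("F", 1): 100, ("M", 0): 50,  ("M", 1): 50}

def odds(r):
    return r / (1 - r)

# Gender-specific (conditional) ORs and RRs for treatment:
for g in ("F", "M"):
    print(g, "OR:", round(odds(risk[(g, 1)]) / odds(risk[(g, 0)]), 2),
          "RR:", round(risk[(g, 1)] / risk[(g, 0)], 2))
# F OR: 3.86 RR: 3.0
# M OR: 6.0 RR: 1.5

# Marginal (gender-collapsed) risks in each treatment arm:
def marginal(x):
    num = sum(risk[(g, x)] * n[(g, x)] for g in ("F", "M"))
    den = sum(n[(g, x)] for g in ("F", "M"))
    return num / den

r0, r1 = marginal(0), marginal(1)
print("marginal OR:", round(odds(r1) / odds(r0), 2), "marginal RR:", round(r1 / r0, 2))
# marginal OR: 2.75 (outside the range of the conditional ORs: non-collapsibility)
# marginal RR: 1.88 (between the conditional RRs)
```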
What Buis has suggested previously, what Frank has been pointing out for a decade, and what the Doi et al paper suggests, is that a logistic regression models linear effects of explanatory variables on the ln-odds of experiencing the event, i.e. the log of the expected number of failures per success (or vice versa). This is a fixed scale, and nothing changes when uncorrelated variables are added to the model. So the effect on death in this example is influenced by many variables (even if we only know two), and in any model this effect is conditional on what we know about the patient; it is not an absolute property of the patient, and learning more about the patient (gender) must change our perception of the chance that the patient dies attributable to treatment.

As Buis has said, we are not uncertain about the patients in our dataset, as we have already observed whether or not they died; we are using the model to make an assessment of how likely it is that someone with these characteristics will die. This is conditional on everything mentioned before - sample, variable choice, etc. - and this non-collapsibility is a desirable property of a logistic model because it reacts to new information as we would expect it to. So when we add gender, it is not just the plausibility of the outcome that changes through adding gender, but also the effect of treatment, through adding extra information. Of course we can use any metric to measure this “expectation” - RR, RD, OR - but only the OR will have this desirable property, because it is non-collapsible. This means that adding non-confounding prognostic variables should increase the effect of other variables, which is exactly what happens in logistic regression.

As I have demonstrated in one of the posts above, if Z is a random variable not correlated with Y then the logistic model will still act as expected (no change in OR(XY) unless we add the product term), because in that case we are more likely than not creating an artifact of the sample, and therefore we have to be very careful to start from the position of homogeneity unless we have strong reasons not to.
Finally, yes, models with different variables added will have different OR(XY), as described above, and yet Doi et al suggested that the OR is the preferred association measure for meta-analysis across different studies where the variables in the model and the samples differ. This in my view is akin to systematic error - so long as the different ORs vary on a magnitude less than the treatment effect of interest, the meta-analytical OR will have clinical relevance; otherwise the outcome is largely a function of these other variables. Can we use the RR in meta-analysis? Probably not, and apart from the reasons above, there are other reasons suggested by Bakbergenuly, Hoaglin & Kulinskaya.

2 Likes

For those interested in a part of the current statistical literature that is far better informed about causality, and much more relevant to patient-centered treatment-effect estimation than some of the comments still being posted here, have a look at Gao, Hastie, and Tibshirani, “Assessment of heterogeneous treatment effect estimation accuracy via matching”:
https://onlinelibrary.wiley.com/doi/full/10.1002/sim.9010
It’s good to see that some veteran senior statisticians are capable of incorporating modern causal concepts into their models and methodology.

1 Like

I’d be more at ease with that approach if someone developed a way to calculate the sample size needed for reliable results, and if applications had at least that sample size.

The point is to see what kind of model it takes to approach such problems. Besides, they had to start somewhere, and no one paper can do everything. Also, it’s 2021, so why not e-mail them your concern? It wouldn’t take long, and it’s not likely they will see it here.

I may do that. I’ve been corresponding with them on another matter. Since most clinical trials are undersized for estimating average treatment effects, I’d be somewhat surprised if differential treatment effect estimation is at all possible, most of the time.

1 Like

Collapsibility when randomising to different diagnostic tests

I waited for this pre-print of mine to be made public today (https://arxiv.org/pdf/1808.09169) before responding to the above discussion. I prove in section 6 of the pre-print that ORs are collapsible if and only if the likelihoods conditional on both dichotomous outcomes (e.g. sensitivities and false positive rates) are the same for both control and active intervention. I included this discussion in the preprint because of comments made by Judea Pearl about me not addressing issues of causality in a previous version.

I had regarded this condition regarding ORs as self-evident in section 10 of my 2018 paper: The scope and conventions of evidence‐based medicine need to be widened to deal with “too much medicine” (https://onlinelibrary.wiley.com/doi/abs/10.1111/jep.12981). I too suggested in this paper that an assumption of constant ORs should replace the assumption of constant RRs when modelling conditional probabilities on control and intervention, as part of a reform of EBM’s conventions. This is because, amongst other things, we can get probabilities greater than one in our models with RRs. Suhail Doi, Frank Harrell et al. referred to this point in my paper in your rebuttal of Chu’s arguments and those of Leuch.

It does not seem possible for the strict collapsibility of ORs, and the associated condition of equivalent likelihoods, to be demonstrated precisely using stochastically variable data; they can only be assumed for the purpose of modelling. The resulting probabilities created by the model then have to be calibrated. The calibrated curves may then not demonstrate an OR or RR relationship, and are unlikely to do so precisely. However, as new data are collected and used to modify the model and recalibrate it, things may change. It is possible that little calibration will be necessary for many data sets. In the meantime I think we should base our models on ORs rather than RRs and calibrate.

I would like to make another point. Marginal RRs always seem to be collapsible (but not marginal ORs). This is the basis of the simultaneous equations in section 2.1 of my preprint. The marginal priors giving rise to the RR (represented by ‘x’ in the equations) are the proportions of ‘outcome with test positive’ and the proportion of ‘outcome with test negative’ in the control and intervention sets. The ratios of these proportions collapse to the RR of the overall outcome proportions in the control and intervention sets. This collapsibility of all marginal RRs applies irrespective of issues of causality and the subsequent collapsibility of conditional RRs and ORs as shown in Figures 3, 5 and 6 of the preprint.

I would be grateful for your comments.

I don’t think we should be surprised that the OR is collapsible only if the likelihood ratios for the control and intervention sets are identical. Essentially we should perhaps look at collapsibility as the abnormal state and noncollapsibility as the desired state of affairs and then we can put your observation into perspective as follows:
A good effect measure can only deviate from its expected behaviour (noncollapsibility) if (and only if) the covariate added to the model is not prognostic for the outcome

In case anyone is still interested in this, I am posting the link to a talk I recently gave at the London School of Hygiene and Tropical Medicine, where I explain the ideas behind this preprint:
https://www.lshtm.ac.uk/research/centres/centre-statistical-methodology/seminar-recordings
(Go to talk 4 and click on “slides and audio”)

2 Likes

Not just women in science - I was also one who went down this path a while back but have since realized that the right solution is the odds ratio

2 Likes

Thank you for pointing that out, I will add a reference to this paper, as another instance of someone who independently came up with this idea. I am very curious: What made you change your mind?

5 Likes

Thank you Suhail for your comment. However, in order for me to understand your point better, can you explain what you mean by ‘abnormal’ (e.g. does it mean very rare or unusual)? Also, could you explain what you mean by ‘not prognostic for the outcome’ with an example? For instance, if M is the value of a dichotomous covariate (e.g. male), F is its complement (e.g. female) and O is the outcome, does it mean that p(O|M) = p(O|F) = p(O)? My understanding is that collapsibility of the OR means, for example, that it can be assumed to be constant for all numerical values of an albumin excretion rate, so that the odds of nephropathy with treatment can be ‘calculated’ from the odds of nephropathy on control by multiplying or dividing the latter by the OR at any value of the AER. Does this understanding concur with yours?

Thanks Huw, I have posted a data table below that has y as the binary outcome (dead/alive), x as the intervention (Y/N) and gender as the third variable (M/F). This explains what I mean by not prognostic for the outcome, i.e. no association between gender and y (independent of x). The table contains the proportion with the outcome Y=1 and the cell frequency.

By abnormal I also mean an unusual property for an effect measure. For example, if we see an airplane that does not fly, that is an unusual property of an airplane, so we question whether it is indeed an airplane. Similarly, when we see an effect measure that demonstrates collapsibility, that (in my view) should be flagged as unusual, so I then question whether it is indeed an effect measure.