To take this one step further, let's take the same example again (Sander's JCE paper example). Let's denote X and Z as treatment and gender, and denote death as Y. If we were to interpret these results in terms of marginal effects, we would typically look at the effect of the explanatory variables on the probability of death. But let us start with the logistic model: in the table below, odds (not probabilities) are presented, and they are simply the expected number of people who die for every person who lives. For a male the ratio is 9/1.5 = 6 and for a female it is 0.43/0.11 = 3.9. The interaction effect between X and Z is 1.55, which tells us (on the multiplicative scale) how much the effect of treatment differs between males and females: the effect of treatment for men is 6/3.9 = 1.55 times that for women.
We can also compute the marginal effect as the difference in odds (which is what the fuss is all about): from the table, the marginal effect of treatment is 0.43 − 0.11 = 0.32 for women and 9 − 1.5 = 7.5 for men. As soon as we change to differences in odds, we see that this is different from the multiplicative effect.
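As a quick arithmetic check, here is a minimal Python sketch (my own; the four death probabilities 0.1, 0.3, 0.6 and 0.9 are the ones underlying the predictive-margins output below):

```python
# Probability of death by gender (z: F/M) and treatment (x: 0/1),
# as in Sander's JCE example
p = {("F", 0): 0.1, ("F", 1): 0.3,
     ("M", 0): 0.6, ("M", 1): 0.9}

def odds(prob):
    """Expected deaths per survivor."""
    return prob / (1 - prob)

or_f = odds(p[("F", 1)]) / odds(p[("F", 0)])  # 0.43/0.11 ~ 3.9
or_m = odds(p[("M", 1)]) / odds(p[("M", 0)])  # 9/1.5 = 6
print(or_m / or_f)  # multiplicative interaction ~1.56 (1.55 above, after rounding)

# Marginal effects as differences in odds:
print(odds(p[("F", 1)]) - odds(p[("F", 0)]))  # ~0.32 for women
print(odds(p[("M", 1)]) - odds(p[("M", 0)]))  # 7.5 for men
```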
```
Predictive margins                         Number of obs = 300
Model VCE  : OIM
Expression : exp(xb())
over       : z x

                        Delta-method
     z#x      Margin    Std. Err.      z    P>|z|    [95% Conf. Interval]
     0 0   0.1111111     0.037037      3    0.003    0.0385199   0.1837024
     0 1   0.4285714     0.093522   4.58        0    0.2452718   0.6118711
     1 0         1.5    0.4330127   3.46    0.001    0.6513107    2.348689
     1 1           9     4.242641   2.12    0.034    0.6845771    17.31542
```
We can do the same for the log-risk model: here we get a RR of 3 in females and 1.5 in males, so the effect of treatment for men is 1.5/3 = half of that for women. The discussion in this thread has used this as the standard against which the OR is judged, and on that basis the OR results are deemed somehow biased. Of course, one could easily argue the opposite, but in order to resolve this debate we need to understand which argument is correct.
```
Predictive margins                         Number of obs = 300
Model VCE  : OIM
Expression : exp(xb())
over       : z x

                        Delta-method
     z#x      Margin    Std. Err.      z    P>|z|    [95% Conf. Interval]
     0 0         0.1         0.03   3.33    0.001    0.0412011   0.1587989
     0 1         0.3    0.0458258   6.55        0    0.2101832   0.3898168
     1 0         0.6     0.069282   8.66        0    0.4642097   0.7357903
     1 1         0.9    0.0424264  21.21        0    0.8168458   0.9831542
```
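The same check for the log-risk margins above, using the fitted probabilities directly:

```python
p = {("F", 0): 0.1, ("F", 1): 0.3,
     ("M", 0): 0.6, ("M", 1): 0.9}

rr_f = p[("F", 1)] / p[("F", 0)]  # RR 3 in females
rr_m = p[("M", 1)] / p[("M", 0)]  # RR 1.5 in males
print(rr_m / rr_f)                # 0.5: for men, half the effect for women
```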
One simple observation has been offered in support of the log-risk argument: non-collapsibility. This is not limited to medical researchers: in the social sciences the same argument is given (Mood (2010), cited 2665 times), and it was echoed by Edward Norton (discussed in one of the threads on datamethods):
a) Somehow the OR reflects unobserved heterogeneity so it is not a substantive effect
b) Models with different independent variables cannot be compared with the OR because the unobserved heterogeneity varies across models
c) Even with the same variables, unobserved heterogeneity means ORs are not comparable across samples, groups or time.
As Buis has pointed out, one characteristic of logistic models that many of those who have discussed the OR (e.g. Norton) find problematic is that if we add a variable to a logistic regression model, the effects of the other variable(s) are very likely to change, even if the additional variable is uncorrelated with the other explanatory variables. This is essentially the argument laid out by Sander in the JCE paper: OR(XY) was 2.75, and adding the gender variable changes OR(XY) (to 3 with the product term, or 4.5 without it) despite gender and treatment being uncorrelated. All that is required is that the new variable influences the outcome; it does not need to be correlated with the other explanatory variables, and this is what Sander means by non-collapsibility. This does not happen in a log-risk model (with the product term), and hence a conclusion was drawn: non-collapsibility is bad, because control variables are added to a model to adjust for possible confounders, and if a variable is not a confounder we would like the effect of our explanatory variable of interest not to change when that variable is added. This has been the argument for the last 40 years: in a logistic regression, the scale of the latent dependent variable changes when a new variable is added, even one uncorrelated with the others, or when the same variables are modeled in different samples, and so on.

What Buis has suggested previously, what Frank has been pointing out for a decade, and what the Doi et al paper suggests, is that a logistic regression models linear effects of explanatory variables on the ln-odds of experiencing the event, i.e. the log of the expected failures per success (or vice versa). This is a fixed scale, and nothing changes when uncorrelated variables are added to the model. The effect on death in this example is influenced by many variables (even if we only know two), and in any model this effect is conditional on what we know about the patient; it is not an absolute property of the patient, and learning more about the patient (gender) must change our perception of the chance that the patient dies attributable to treatment. As Buis has said, we are not uncertain about the patients in our dataset, as we have already observed whether or not they die; we are using the model to assess how likely it is that someone with these characteristics will die. This is conditional on everything mentioned before (sample, variable choice, etc.), and this non-collapsibility is a desirable property of a logistic model because it reacts to new information as we would expect it to. So when we added gender, it is not just the plausibility of the outcome that changes, but also the effect of treatment, because of the extra information. Of course we can use any metric to measure this "expectation" (RR, RD, OR), but only the OR has this desirable property, because it is non-collapsible. This means that adding non-confounding prognostic variables should increase the effect of other variables, which is exactly what happens in logistic regression. As I have demonstrated in one of the posts above, if Z is a random variable not correlated with Y then the logistic model will still act as expected (no change in OR(XY)) unless we add the product term, because in that case we are more likely than not creating an artifact of the sample; we therefore have to be very careful to start from the position of homogeneity unless we have strong reasons not to.
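To see non-collapsibility in isolation, here is a small numerical sketch (illustrative probabilities of my own, not Sander's exact numbers): the conditional OR is 6 in both gender strata, gender is balanced 50/50 in both arms (so it is uncorrelated with treatment) and is prognostic for death, yet the marginal OR is smaller.

```python
def odds(prob):
    return prob / (1 - prob)

def inv_odds(o):
    return o / (1 + o)

# p(death | x, z): baseline risks 0.1 (F) and 0.6 (M); within each
# stratum, treatment multiplies the odds of death by exactly 6
p = {("F", 0): 0.1, ("M", 0): 0.6}
p[("F", 1)] = inv_odds(6 * odds(p[("F", 0)]))  # 0.4
p[("M", 1)] = inv_odds(6 * odds(p[("M", 0)]))  # 0.9

# Marginal risks with a 50/50 gender mix in both arms (no confounding)
m0 = 0.5 * (p[("F", 0)] + p[("M", 0)])  # 0.35
m1 = 0.5 * (p[("F", 1)] + p[("M", 1)])  # 0.65
print(odds(m1) / odds(m0))  # ~3.45, not 6: the OR does not collapse
print(m1 - m0)              # 0.30 = the common within-stratum risk difference
```

Note that in this particular table the risk difference happens to be 0.3 within each stratum and 0.3 marginally, which is what collapsibility looks like; the OR moves from about 3.45 to 6 purely because gender carries prognostic information.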
Finally, yes, models with different variables will have different OR(XY), as described above, and yet Doi et al suggested that the OR is the preferred association measure for meta-analysis across different studies, where the variables in the model and the sample differ. This, in my view, is akin to systematic error: so long as the different ORs vary by a magnitude less than the effect of the treatment of interest, the meta-analytical OR will have clinical relevance; otherwise the outcome is largely a function of these other variables. Can we use the RR in meta-analysis? Probably not; apart from the reasons above, there are other reasons suggested by Bakbergenuly, Hoaglin & Kulinskaya.
For those interested, a part of the current statistical literature that is far better informed about causality, and much more relevant for patient-centered treatment-effect estimation than some of the comments still being posted here: have a look at Gao, Hastie & Tibshirani, Assessment of heterogeneous treatment effect estimation accuracy via matching https://onlinelibrary.wiley.com/doi/full/10.1002/sim.9010
It's good to see that some veteran senior statisticians are capable of incorporating modern causal concepts into their models and methodology.
I'd be more at ease with that approach if someone develops a way to calculate the sample size needed for reliable results, and the applications have at least that sample size.
The point is to see what kind of model it takes to approach such problems. Besides, they had to start somewhere, and no one paper can do everything. Also, it's 2021, so why not e-mail them your concern? It wouldn't take long, and it's not likely they will see it here.
I may do that. I've been corresponding with them on another matter. Since most clinical trials are undersized for estimating average treatment effects, I'd be somewhat surprised if differential treatment effect estimation is at all possible most of the time.
Collapsibility when randomising to different diagnostic tests
I waited for this preprint of mine to be made public today (https://arxiv.org/pdf/1808.09169) before responding to the above discussion. I prove in section 6 of the preprint that ORs are collapsible if and only if the likelihoods conditional on both dichotomous outcomes (e.g. sensitivities and false positive rates) are the same for both control and active intervention. I included this discussion in the preprint because of comments made by Judea Pearl about me not addressing issues of causality in a previous version.
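A numerical sketch of that section-6 condition (toy numbers of my own, not those in the preprint): if the sensitivity p(T+|outcome) and false positive rate p(T+|no outcome) are the same under control and intervention, then by Bayes' rule the posterior odds in each arm are the prior odds times the same likelihood ratio, so the OR conditional on either test result equals the marginal OR.

```python
def odds(p):
    return p / (1 - p)

def posterior(prior, sens, fpr, test_pos):
    """p(outcome | test result) via Bayes' rule on the odds scale."""
    lr = sens / fpr if test_pos else (1 - sens) / (1 - fpr)
    post_odds = odds(prior) * lr
    return post_odds / (1 + post_odds)

sens, fpr = 0.8, 0.2              # assumed identical in both arms
prior_ctrl, prior_int = 0.4, 0.1  # marginal outcome risks per arm

print(odds(prior_int) / odds(prior_ctrl))  # marginal OR ~0.167
for test_pos in (True, False):
    o_ctrl = odds(posterior(prior_ctrl, sens, fpr, test_pos))
    o_int = odds(posterior(prior_int, sens, fpr, test_pos))
    print(o_int / o_ctrl)  # same ~0.167 conditional on T+ and on T-
```

The likelihood ratio cancels exactly in the final division, which is why the conditional ORs reproduce the marginal OR whenever the likelihoods are shared across arms.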
I had regarded this condition regarding ORs as self-evident in section 10 of my 2018 paper: The scope and conventions of evidence‐based medicine need to be widened to deal with "too much medicine" (https://onlinelibrary.wiley.com/doi/abs/10.1111/jep.12981). I also suggested in that paper that an assumption of constant ORs should replace the assumption of constant RRs when modelling conditional probabilities on control and intervention, as part of a reform of EBM's conventions. This is because, among other things, with RRs we can get probabilities greater than one in our models. You (Suhail Doi, Frank Harrell et al) referred to this point from my paper in your rebuttal of Chu's arguments and those of Leuch.
It does not seem possible for the strict collapsibility of ORs, and the associated condition of equivalent likelihoods, to be demonstrated precisely from stochastically variable data; it can only be assumed for the purpose of modelling. The resulting probabilities created by the model then have to be calibrated. The calibrated curves may not demonstrate an OR or RR relationship, and are unlikely to do so precisely. However, as new data are collected and used to modify the model and recalibrate it, things may change. It is possible that little calibration will be necessary for many data sets. In the meantime I think we should base our models on ORs rather than RRs, and calibrate.
I would like to make another point. Marginal RRs always seem to be collapsible (but marginal ORs are not). This is the basis of the simultaneous equations in section 2.1 of my preprint. The marginal priors giving rise to the RR (represented by 'x' in the equations) are the proportions with 'outcome and test positive' and with 'outcome and test negative' in the control and intervention sets. The ratios of these proportions collapse to the RR of the overall outcome proportions in the control and intervention sets. This collapsibility of all marginal RRs applies irrespective of issues of causality, and of the subsequent collapsibility of conditional RRs and ORs, as shown in Figures 3, 5 and 6 of the preprint.
I don't think we should be surprised that the OR is collapsible only if the likelihood ratios for the control and intervention sets are identical. Essentially, we should perhaps look at collapsibility as the abnormal state and non-collapsibility as the desired state of affairs; we can then put your observation into perspective as follows: a good effect measure can deviate from its expected behaviour (non-collapsibility) if, and only if, the covariate added to the model is not prognostic for the outcome.
Thank you for pointing that out, I will add a reference to this paper, as another instance of someone who independently came up with this idea. I am very curious: What made you change your mind?
Thank you Suhail for your comment. However, in order for me to understand your point better, can you explain what you mean by 'abnormal' (e.g. does it mean very rare or unusual)? Also, could you explain what you mean by 'not prognostic for the outcome' with an example? For instance, when M is the value of a dichotomous covariate (e.g. male), F is its complement (e.g. female) and O is the outcome, does it mean that p(O|M) = p(O|F) = p(O)? My understanding of collapsibility of the OR is that, for example, it can be assumed to be constant for all numerical values of an albumin excretion rate, so that the odds of nephropathy with treatment can be 'calculated' from the odds of nephropathy on control by multiplying or dividing the latter by the OR at any value of the AER. Does this understanding concur with yours?
Thanks Huw. I have posted a data table below that has y as the binary outcome (dead/alive), x as the intervention (Y/N) and gender as the third variable (M/F). This explains what I mean by not prognostic for the outcome, i.e. no association between gender and y (independent of x). The table contains the proportion with the outcome y=1 and the cell frequency.
By abnormal I also mean an unusual property for an effect measure. For example, if we see an airplane that does not fly, that is an unusual property of an airplane, so we question whether it is indeed an airplane. Similarly, when we see an effect measure that demonstrates collapsibility, that (in my view) should be flagged as unusual, so I then question whether it is indeed an effect measure.
Thank you again Suhail. By the way, there is a small typo in the top right-hand corner of your table: that probability should be 0.485, not 0.495. My understanding of what you said previously was correct, in that if p(O|M) = p(O|F) = p(O) for X=0 and X=1, then M and F being 'not prognostic' is a sufficient condition for these probabilities to form collapsible RRs. However, p(O|M), p(O|F) and p(O) are also collapsible between Figures 3 (control) and 5 (intervention) in my preprint (https://arxiv.org/pdf/1808.09169) despite p(O|M) ≠ p(O|F) ≠ p(O) in Figures 3 and 5. This is because p(PCRpos) = 343/100,000 and p(PCRneg) = 99,657/100,000 are the same in Figure 3 (for control) and Figure 5 (for intervention), this being another sufficient condition for the conditional RR to be collapsible. However, from your table, p(M) = 105/200 in X=0 and p(M) = 105/200 in X=1 are also the same, so this also satisfies my condition for conditional RRs to be collapsible. Your example is therefore a special case of the general condition that the RRs are collapsible if and only if the prior probability of the covariate is the same in the control and intervention set.
Thank you also for confirming that by ‘abnormal’ you mean rare.
Instead of:
However p(O|M), p(O|F) and p(O) are also collapsible between Figures 3 (control) and 5 (intervention) in my pre-print despite p(O|M) ≠ p(O|F) ≠ p(O) in Figures 3 and 5
Perhaps it would have been clearer if I had said:
However p(Viral spread|PCRpos), p(Viral spread|PCRneg) and p(Viral spread) are also collapsible between Figures 3 (control) and 5 (intervention) in my pre-print despite p(Viral spread|PCRpos) ≠ p(Viral spread|PCRneg) ≠ p(Viral spread) in Figures 3 and 5.
Thank you. I listened to your talk with much interest. You emphasised that there are many approaches and you explained your choice very clearly. Can you summarise your reasons for not using the odds ratio please?
The primary reason I don't like the odds ratio is that it is very hard (in my view, probably impossible) to construct a hypothetical biological system ("causal model") with a biologically interpretable mechanism that leads to stability of the odds ratio.
Scientists often have to make assumptions of the type "I do not know the true value of the effect parameter (RR/OR/RD), but I am willing to assume that it is equal between men and women". The goal here is to evaluate the plausibility of that assumption. To do so, we need a biological model which is specified sufficiently for the effect parameter to be stable, but not sufficiently for the effect parameter to be known. We can then evaluate to what extent our beliefs about biology differ from that model; if the model seems to describe our biological beliefs well, we have reason to expect effect homogeneity.
Since nobody has proposed a biological model that leads to stability of the odds ratio (and since, in my view, this would be impossible), it is not possible to evaluate to what extent our beliefs about biology differ from a model with stable odds ratios.
In sections 6.1 to 6.5 of my above preprint on T, T & I for Covid-19, I give examples of one causal mechanism for preventing viral spread that leads to collapsible RRs and another that leads to collapsible ORs.
If we isolate individuals who are PCR positive, this action does not change the proportion of potential spreaders who tested positive, so p(PCRpos) = 343/100,000 is the same in the control group (see Figure 3) and the intervention/isolation group (see Figure 5). As I explained to Suhail above, this results in collapsibility of the RR. For example, in Figure 3, p(Viral spread|PCRpos^Control) = 0.7, whereas p(Viral spread|PCRpos^Isolation) = 0.175, a RR of 0.175/0.7 = 0.25. Thus the RR for the marginal totals of 100/100,000 and 400/100,000 is also 0.25.
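A sketch of that calculation using the numbers quoted above; the PCR-negative risks are not stated here, so I back them out from the marginal totals of 400 and 100 per 100,000 (an assumption consistent with Figures 3 and 5):

```python
n = 100_000
pos = 343      # PCR-positive potential spreaders, same in both arms
neg = n - pos  # 99,657 PCR-negative

# p(viral spread | test result, arm); PCR-negative risks implied by
# the marginal totals of 400 (control) and 100 (isolation) per 100,000
p_pos_ctrl, p_pos_iso = 0.7, 0.175
p_neg_ctrl = (400 - pos * p_pos_ctrl) / neg  # ~0.0016
p_neg_iso = (100 - pos * p_pos_iso) / neg    # ~0.0004

print(p_pos_iso / p_pos_ctrl)  # RR given PCR+ : 0.25
print(p_neg_iso / p_neg_ctrl)  # RR given PCR- : 0.25
print(100 / 400)               # marginal RR   : 0.25 -- the RR collapses
```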
If instead of isolation we give an instantly active antiviral drug to those testing positive, then this does reduce the proportion with a positive PCR. We can assume that the marginal RR of 0.25 is the same as for isolation. However, the effect on conditional probabilities can be modelled with a constant odds ratio, where the 'sensitivity' and false positive rates are the same in the control group (Figure 3) and the intervention group (Figure 6). Thus in the control group p(Viral spread|PCRpos^Control) = 0.7 again (Figure 3), BUT in the treated group p(Viral spread|PCRpos^Treatment) = 0.367 (Figure 6). This gives an odds ratio of about 0.249, which is collapsible: the odds ratio for the marginal totals of 100/100,000 and 400/100,000 is also 0.249. However, the risk reduction is smaller for this mechanism of reducing spread: the conditional RR is 0.52 instead of the 0.25 of the collapsible-RR example above. These examples are based on simulated data, so the observed p(Viral spread|PCRpos^Intervention) were not available.
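And the corresponding check for the antiviral mechanism, again with the quoted numbers:

```python
def odds(p):
    return p / (1 - p)

print(odds(0.367) / odds(0.7))  # conditional OR given PCR+ : ~0.249
print(odds(100 / 100_000) / odds(400 / 100_000))  # marginal OR : ~0.249
print(0.367 / 0.7)  # conditional RR ~0.52: a smaller risk reduction than isolation
```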
I did not draw P Maps for the other example in the preprint, on Albumin Excretion Rate and Nephropathy. However, real data are available for this example, and tantalisingly, the observed value of p(Nephropathy|AERpos) was about halfway between what we would expect assuming a collapsible OR and what we would expect from a collapsible RR. However, from the theory of the mechanism of action of the treatment (an angiotensin receptor blocker), the AERpos should fall, so the most plausible assumption is that the OR will be collapsible and the RR will not. I am currently writing another paper/preprint on this.
So in conclusion, I agree with you that it should not be said that assuming RR or OR in our model is wrong. The choice depends on the theoretical causal mechanisms, as you suggest.
The biological model issue applies equally to all one-number effect summary measures. For these issues I'm just pragmatic: which measure provides the most concise transportable summary, i.e. requires the fewest interaction terms in the statistical model?
No, it doesn’t apply equally to all one-number effect summary measures. The entire point of all my work is to show that we have a biological model that corresponds to Sheps’ preferred variant of the relative risk. Can you please either refute my work, or stop making claims like this?