Should one derive the risk difference from the odds ratio?

Prove it. And while you are at it: Can you please refute my impossibility theorem?

Update: If by “factors that make the outcomes have different tendencies for different subjects” you mean every cause of the outcome, then sure, the odds ratio will be stable, as will every other effect measure. In this situation, the choice becomes irrelevant. But this is just a theoretical curiosity; you will never be able to condition on every cause of the outcome.

Much of what you wrote could be called a theoretical curiosity just as much as what I wrote.

I am not sure why it is not cherished, and I assume you mean noncollapsibility. If so, why is that considered a problem? Say we start with an RCT. We can assume there is no need to adjust for (say) gender, because randomization implies that the unadjusted analysis is perhaps valid. However, we may wish to adjust for gender to look at potential heterogeneity. What has repeatedly been pointed out as problematic with logistic regression is that conditioning on a different prognostic covariate (say smoking, or even a different set of prognostic covariates) will lead to a different conditional estimand, and this is supposedly not good. Why is this flagged as a problem? Shouldn’t this be the expected behavior of a true effect measure? Unless the latter can be disproved, we have a serious problem with collapsible effect measures.

Hi Anders

May I take you away from the theoretical to the practical? In the video you referred to above, you said that your aim was to help physicians like me improve the way in which we make use of effect measures. This will help me to understand what you are doing.

This figure displays 3 curves. The top blue curve displays the estimated probabilities of nephropathy (as indicated by heavy urinary protein over a specified threshold) after 2 years on placebo on the vertical axis, conditional on each possible baseline albumin excretion rate (AER). The middle curve shows the probability of nephropathy after 2 years on treatment, calculated assuming a constant OR of 0.459. The bottom curve shows the probability of nephropathy on treatment assuming a constant RR of 0.499.
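
To make the construction of the two lower curves concrete, here is a minimal Python sketch of how a constant OR or a constant RR maps each placebo risk to a treated risk. The placebo risks below are illustrative placeholders, not the actual trial data:

```python
import numpy as np

def risk_under_constant_or(p0, odds_ratio):
    """Treated risk implied by applying a constant odds ratio to control risk p0."""
    treated_odds = odds_ratio * p0 / (1 - p0)
    return treated_odds / (1 + treated_odds)

def risk_under_constant_rr(p0, risk_ratio):
    """Treated risk implied by applying a constant risk ratio to control risk p0."""
    return risk_ratio * p0

# Placeholder placebo risks across the AER range (not the trial data)
p0 = np.linspace(0.05, 0.60, 12)

middle_curve = risk_under_constant_or(p0, 0.459)  # constant OR of 0.459
bottom_curve = risk_under_constant_rr(p0, 0.499)  # constant RR of 0.499
```

Plotting these treated risks against p0 also gives the Pr(Y = 1 | X = 1) versus Pr(Y = 1 | X = 0) comparison suggested below.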

If you had calculated a curve showing the probability of nephropathy using your effect measure where do you think the curve would lie in relation to these 3 curves?

Perhaps it would be better to plot Pr(Y = 1 | X = 1, AER) against Pr(Y = 1 | X = 0, AER) for each effect measure to make the comparison clearer (where X is Irbesartan/placebo, coded 1/0).

The graph based on the switch relative risk would be identical to the graph based on the RR. This intervention reduces the risk of the outcome; the switch relative risk therefore selects the standard risk ratio instead of the survival ratio. We will only see differences between the switch relative risk and the RR when we consider adverse events.
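
For readers following along, here is a minimal sketch of the selection rule just described; the function name is mine, not from any particular package:

```python
def switch_relative_risk(p0, p1):
    """Switch relative risk, following the rule described above:
    p0 = risk under control, p1 = risk under treatment."""
    if p1 <= p0:
        # Treatment reduces risk: use the standard risk ratio
        return p1 / p0
    # Treatment increases risk (e.g., adverse events): use the survival
    # ratio, i.e., reverse the reference category
    return (1 - p1) / (1 - p0)
```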

It is interesting how many of us in this discussion have a medical background. I left medicine after my internship (pre-registration house officer) in 2010, but my clinical background certainly informs how I conceptualize these methodological issues.

If you use the odds ratio to control for gender, the adjusted (conditional) odds ratio will differ from the marginal odds ratio even if the odds ratio in men is equal to the odds ratio in women, i.e. in the absence of both confounding and effect modification by gender.

I agree that in the presence of suspected heterogeneity, we should expect the true conditional effect in men to be different from the true conditional effect in women. This is a matter of effect heterogeneity, not collapsibility. But when the true conditional effect in men equals the true conditional effect in women, we should expect this to also equal the true conditional effect in the full population. This is not the case in the absence of collapsibility.
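
A minimal numerical sketch of this point, with made-up risks (equal conditional ORs, no confounding, a 50/50 split of men and women):

```python
def odds(p):
    return p / (1 - p)

# Hypothetical risks: men (treated 0.8, control 0.6), women (treated 0.4, control 0.2)
or_men   = odds(0.8) / odds(0.6)   # = 2.67
or_women = odds(0.4) / odds(0.2)   # = 2.67, same conditional OR in both strata

# Marginal risks in a 50/50 population
p1 = 0.5 * 0.8 + 0.5 * 0.4         # = 0.6
p0 = 0.5 * 0.6 + 0.5 * 0.2         # = 0.4
or_marginal = odds(p1) / odds(p0)  # = 2.25, not 2.67: the OR is noncollapsible
```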

No, the unadjusted OR assumes within-treatment homogeneity of probabilities of outcomes. Randomization does not save the day. Recall how Pearson developed the $2 \times 2$ table $\chi^2$ test, which assumes there are two constant probabilities $p_1, p_2$.
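
For reference, the test in question compares observed and expected counts under exactly that homogeneity assumption:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n},$$

where the expected counts $E_{ij}$ are computed under the null hypothesis $H_0: p_1 = p_2$, and the model itself assumes a single constant outcome probability within each sample.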

Adjusting for gender is not primarily to look for treatment-effect heterogeneity but rather to account for outcome heterogeneity.

That is a good thing. The two were never intended to be equal. And as I stated above, marginal (i.e., simple) ORs were invented only for the homogeneous case.

By valid I meant no confounding, and I take your point on this. But how about the case where gender is not prognostic for the outcome?

Agreed, but perhaps if you explain why this is bad we can then understand your position better. From my perspective this is good, because we expect exactly this behavior from a good effect measure (if gender is prognostic for the outcome): not conditioning on gender cannot really be expected to result in the same magnitude of the estimated effect. I keep mentioning “prognostic” here for obvious reasons.

Intended by whom? How is it possibly a good thing that the estimates change when controlling for sex even if treatment effect does not depend on it? Honestly, I am puzzled by this claim.

In my impossibility theorem, I show that if this is the case (i.e. if the effect measure is noncollapsible), then it is impossible to reach effect homogeneity by controlling for those covariates that predict the distribution of individual-level determinants of treatment effect, unless we control for so many things that every effect measure is homogeneous.

You are not even making an attempt to justify why this is good.

If gender is not prognostic for either treatment group (i.e., there is no gender main effect and no interaction with treatment) then gender can be completely ignored.

Understanding that, IMHO, is crucial to the discussion. In the history of statistics, starting with Bernoulli and his attention to the one-sample problem where probabilities were homogeneous, to Pearson’s use of homogeneity in the two-sample proportions problem, to Gosset’s two-sample t-test assuming homogeneity of distributions within treatment group, to the first uses of risk and odds ratios, all of these assumed homogeneity. The same is true of the Cox proportional hazards model when there is no covariate adjustment. All of these measures were proposed for homogeneous cases and were not intended to work in other cases. For example, when one fails to adjust for an important gender effect in the Cox model, the within-treatment survival distributions are messy mixtures of male and female distributions, and non-proportional hazards for treatment will arise.
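
To illustrate the last point, here is a small sketch (with made-up hazards) showing that even when the treatment hazard ratio is exactly 0.5 within each gender, the marginal hazard ratio of the male/female mixture drifts over time:

```python
import numpy as np

# Illustrative conditional model: exponential survival, treatment HR = 0.5
# within each gender, men with a higher baseline hazard than women.
hr = 0.5
hazard = {"men": 1.0, "women": 0.2}   # control-arm hazards (made up)

def marginal_hazard(hazards, t):
    """Hazard of a 50/50 mixture of exponential survival distributions."""
    surv = sum(0.5 * np.exp(-h * t) for h in hazards)
    dens = sum(0.5 * h * np.exp(-h * t) for h in hazards)
    return dens / surv

t = np.linspace(0.1, 5.0, 50)
h_control = marginal_hazard(list(hazard.values()), t)
h_treated = marginal_hazard([hr * h for h in hazard.values()], t)

marginal_hr = h_treated / h_control
# marginal_hr starts near 0.5 but drifts up to roughly 0.72 before slowly
# returning toward 0.5 at late times: proportional hazards fail marginally
# even though they hold exactly within each gender.
```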

The field of statistics has spent a lot of time arguing over problems of our own creation by ignoring heterogeneity.

I fully agree that a lot of traditional work in statistics relies on unjustified homogeneity assumptions, and that this has caused considerable methodological confusion.

But I want to be clear that the marginal odds ratio is just a mathematical object, it isn’t something that was “invented” by some individual whose intentions are relevant to the discussion.

It wasn’t invented so much as it was an improper use of the simple odds ratio. Continuing to pursue methods built for homogeneity when there is heterogeneity is not very fruitful, IMHO.

To quote extensively from Kuha & Mills as well as Buis, if we view the outcome in terms of an assessment of how likely it is that an event occurs in a group of individuals, this will be dependent not only on the nature of the treatment and outcome, but also on the set of groups for whom the effect is evaluated. Causal effects are thus group dependent because individuals are heterogeneous in their responses to any treatment, and because groups are heterogeneous collections of individuals. Some of this heterogeneity is likely to be due to differences in other observable characteristics of the individuals that have causal effects of their own on the response.

Clearly then, regression assessment of how likely it is that an event occurs will depend on how much information we have and on the ability of the chosen effect measure to correctly use this information. The correct effect measure should be able to measure the extent to which “how likely” is not the same across different groups. If obesity were unobserved, and obesity were prognostic for the outcome, the modeled effect of treatment in this group should be assessed as smaller in magnitude than in a group where obesity was measured and included in the model. This is because the distribution of unmeasured obesity across treatment arms must influence our effect measure’s assessment of “how likely”. So adding the additional variable obesity to the model should increase the estimated effect of treatment, via the effect measure, even if treatment and obesity were uncorrelated. In other words, the coefficient in the regression equation does not represent a sort of universal constant. If obesity (whether observed or not) is inherently involved in what the average causal effect of treatment in a population means, then the distribution of obesity in that population sets the context in which the effect of the treatment is realized. We must not implicitly or explicitly think that estimates of effects for the obese should also apply to the non-obese or to some combination of obese and non-obese.

The odds ratios from a logistic regression show exactly this behavior because the odds ratios are noncollapsible. The odds ratios from a RCT will estimate effects that are group-specific quantities and will have the following properties:

  1. The treatment odds ratio will be dependent on which variables are included in the model, which is not a problem but actually a requirement for the correct assessment of “how likely”.

  2. Odds ratios across models with different sets of explanatory variables compare across different groups, and our estimate of “how likely” is supposed to change when groups are different. Only a comparison of noncollapsible effect measures across groups provides an accurate description of the difference in treatment effects across these groups.

The extent to which the probability of the outcome differs between groups defined by baseline covariates such as obesity, is captured by the parameter for the covariate. There is no difference between the odds ratio, the risk difference and the risk ratio in terms of their ability to capture any baseline risk differences between groups due to the confounders.

If obesity were unobserved, and obesity were prognostic for the outcome, the modeled effect of treatment in this group should be assessed as smaller in magnitude than in a group where obesity was measured and included in the model.

You need to be much clearer about the role of the parameter for baseline risk, the parameter for obesity, and the parameter for the intervention.

This is because the distribution of unmeasured obesity across treatment arms must influence our effect measure’s assessment of “how likely”.

In a randomized trial (and in the absence of confounding), the distribution of obesity will be independent of treatment arms. Avoiding confusion about this is one of the primary reasons that I insist on the need for using counterfactuals.

So adding the additional variable obesity to the model should increase the estimated effect of treatment, via the effect measure, even if treatment and obesity were uncorrelated. In other words, the coefficient in the regression equation does not represent a sort of universal constant. If obesity (whether observed or not) is inherently involved in what the average causal effect of treatment in a population means, then the distribution of obesity in that population sets the context in which the effect of the treatment is realized.

There are two separate questions to consider here:

  1. Is the prevalence of obesity independent of treatment arm?
  2. Assuming that the prevalence of obesity is independent of treatment arm, should the effect of treatment depend on the prevalence of obesity?

As explained above, I am going to assume we are in a situation where question one can be answered in the affirmative. This is expected to hold by design in an (infinitely large and perfect) randomized trial, but whether it is true in the data is not really relevant if we just define our effect measures in terms of counterfactuals.

For the second question, I agree that it is possible that the prevalence of obesity determines treatment effects. But in order to violate collapsibility, it must be the case that even when the odds ratio in the group where obesity=0 (i.e. a group where the prevalence of obesity is 0% in both the intervention arm and the control arm) is equal to the odds ratio in the group where obesity=1 (where the prevalence is 100%), the pooled odds ratio (i.e. in a group where the prevalence of obesity is somewhere between 0% and 100%) is not equal to that common value. You are going to have to find some way to explain this U-shape.
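
A small sketch of that U-shape with made-up numbers: hold the conditional OR fixed at 8/3 in both obesity strata and sweep the prevalence of obesity from 0 to 1; the pooled OR equals 8/3 only at the endpoints.

```python
import numpy as np

def odds(p):
    return p / (1 - p)

def treated_risk(p0, odds_ratio):
    o = odds_ratio * odds(p0)
    return o / (1 + o)

cond_or = 8 / 3                      # common conditional OR in both strata
p_ctrl = {0: 0.2, 1: 0.6}            # made-up control risks, non-obese / obese

prev = np.linspace(0, 1, 11)         # prevalence of obesity
p1 = (1 - prev) * treated_risk(p_ctrl[0], cond_or) + prev * treated_risk(p_ctrl[1], cond_or)
p0 = (1 - prev) * p_ctrl[0] + prev * p_ctrl[1]

pooled_or = odds(p1) / odds(p0)
# pooled_or is 2.67 at prev = 0 and prev = 1, but dips (e.g. to 2.25 at
# prev = 0.5) for every prevalence strictly between the endpoints.
```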

I also note that nobody is claiming that regression equations represent universal constants. I will however insist that the validity of a model depends on the extent to which the homogeneity assumptions that define it are to some approximation reflective of biological reality.

As an undertrained reader following this thread and understanding very little of the exchange, it seems like these last few entries are highlighting fundamentally different worldviews (which may ultimately not be reconcilable?). The heated nature of the dialogue suggests that the question is very important. To this end, could it be valuable to translate the gist of the thread into a language that can be better understood by the masses?

Would this be a fair way to summarize the distinction between epidemiologists’ worldview and statisticians’ worldview?

Epidemiologists’ “default” position:

  • the world (including the human body) is a web of unrecognized cause/effect relationships;
  • for this reason, our default assumption should be that treatment effects will differ from one patient to the next, even if we have trouble providing definitive evidence that this phenomenon is widespread;
  • since the odds ratio does not seem sufficiently responsive to changes in patient-related variables (is this where the term “non-collapsibility” comes in?), it is a poor choice of effect measure and other measures are preferable (?);
  • statisticians hold a naive view of the world that does not reflect advances in epidemiological understanding of cause/effect. Their default is to assume that most events occur randomly, rather than for (as yet unidentified) reasons. This is why statisticians don’t seem bothered by the “non-collapsibility” of the odds ratio (?)

Statisticians’ “default” position:

  • the world (including the human body) is highly susceptible to random events; the extent to which randomness /stochasticity can explain the events we observe is under-appreciated by epidemiologists;
  • newer causal inference methods do not provide convincing evidence of the existence of a multitude of unrecognized cause/effect relationships in the world;
  • in contrast, it is possible to show that many of the things we observe that might suggest inter-individual differences in treatment effects (e.g., apparent “subgroup” effects) can be explained on the basis of randomness/chance;
  • until there is compelling concrete (rather than theoretical) evidence of widespread between-patient heterogeneity in treatment response, the fully adjusted odds ratio remains a reasonable preferred choice of effect measure (?)

Clinically, there are clear instances where the same treatment has the potential to harm one patient but help another. Treating two patients with penicillin for tonsillitis could cure one but kill the other if the second patient has a history of penicillin-related anaphylaxis. Patients with gout who are of Han Chinese ethnicity are at higher risk for severe cutaneous reactions to allopurinol than patients of other ethnicities since they may be more likely to carry certain HLA alleles.

Aside from rare genetic/immunologically-based idiosyncratic adverse reactions, there are also much more common clinical scenarios where the same drug can affect two patients differently. Giving furosemide to a dehydrated patient will make him worse, but giving it to a patient with congestive heart failure could make him better. Importantly though, the deterioration of the first patient does not reflect an “intrinsic” harm from furosemide itself, but rather improper patient selection.

So the fact that some patients can be helped but others harmed by the same treatment is not in doubt. The main question seems to be the extent to which examples like those above (unpredictable/as-yet-unidentified idiosyncratic reactions or obvious improper clinical use) are relevant to the choice of RCT effect measure. Secondly, is it more likely than unlikely that there is a latent “web” of other unidentified factors that can influence a patient’s response to a given therapy? If there is such a web, should its existence influence the choice of effect measure in an RCT?

This would be my summary, taking into account current practical considerations for physicians and other scientists. In an RCT or other such experimental study, a reasonable general approach is to:

  1. Use non-collapsible parameters such as ORs or HRs as summaries of relative treatment effect between groups. Adjust them for treatment effect heterogeneity by additively including strong prognostic factors in the model. This is standard practice and generally works well in oncology for example.

  2. Convert these parameters into clinically interpretable collapsible outputs such as risk ratios, which are ratios of probabilities. The switch risk ratio framework becomes a good guide during this process: if the tested treatment reduces the risk of an event such as death versus control, then use the typical RR, which uses death (and not survival) as the reference category. If the tested treatment increases risk, then use the switch relative risk, which reverses the reference category. Note that the OR is essentially a product of both relative risks (see the identity after this list).

  3. The approaches we typically use to determine effect heterogeneity at the OR or HR scales are causal (e.g., experiments in the lab) and the statistical debate regarding collapsibility will likely not help us much.
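
One way to make the “product of both relative risks” remark in point 2 explicit, writing $p_1$ and $p_0$ for the risks under treatment and control:

$$\mathrm{OR} = \frac{p_1/(1-p_1)}{p_0/(1-p_0)} = \frac{p_1}{p_0} \times \frac{1-p_0}{1-p_1},$$

i.e. the OR is the risk ratio for the event multiplied by the reciprocal of the survival ratio (the risk ratio for the complementary event).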

I wouldn’t agree, because if that were the case this discussion would be unnecessary. Let’s first take the situation “in the absence of both confounding and effect modification”.

This is a table of frequencies. Clearly, how likely death (Y) is in the treated (X) compared with the untreated is the same on the RD scale, but not on the OR scale, when obesity is observed (and accounted for) versus unobserved.

Note that the distribution of obesity is independent of treatment arms in this example, and obesity is prognostic for the outcome (much more so than is the treatment).
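
Since the table itself did not carry over here, the following made-up cell counts have the properties described (obesity balanced across arms and strongly prognostic, identical stratum RDs) and reproduce the contrast:

```python
def odds(p):
    return p / (1 - p)

# Made-up counts: (deaths, n) per arm and obesity stratum, 100 subjects
# per cell, so obesity is balanced across treatment arms.
counts = {
    ("treated", "obese"):     (80, 100),
    ("treated", "non-obese"): (40, 100),
    ("control", "obese"):     (60, 100),
    ("control", "non-obese"): (20, 100),
}

def risk(arm, stratum):
    deaths, n = counts[(arm, stratum)]
    return deaths / n

# Conditional on obesity: RD = 0.2 and OR = 2.67 in both strata
for s in ("obese", "non-obese"):
    rd = risk("treated", s) - risk("control", s)
    or_ = odds(risk("treated", s)) / odds(risk("control", s))
    print(s, round(rd, 2), round(or_, 2))

# Ignoring obesity: RD is still 0.2 (collapsible), but the OR drops to 2.25
p1, p0 = (80 + 40) / 200, (60 + 20) / 200
print("marginal", round(p1 - p0, 2), round(odds(p1) / odds(p0), 2))
```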