Should one derive risk difference from the odds ratio?

Can you explain to me the advantage of estimating the marginal RR from two different conditional RRs as you have done using ‘counterfactual collapsibility’?

One use case would be if you are trying to standardize the risk ratio from one setting to another.

To illustrate, suppose you have a randomized trial in Norway, and are interested in knowing the marginal risk ratio in Wales, where treatment is not yet available. You can then decompose the RR in Wales into RR(Wales) = RR(Welsh men) * Weight(Welsh men) + RR(Welsh women) * Weight(Welsh women). If you are willing to assume that the RR(Welsh men) is equal to RR(Norwegian men), and that the same holds for women, you can then plug in the sex-conditional risk ratios from the Norwegian trial. The weights in Wales can be estimated from observational data in a population where treatment is unavailable.

Note in particular that since such collapsibility weights do not exist for the odds ratio, it would be impossible to standardize the conditional odds ratios from the Norwegian population into a valid estimate for the marginal odds ratio in Wales. If you wanted to estimate the marginal odds ratio in Wales, you would have to standardize the absolute probabilities, but this depends on stronger assumptions than stability of the relative risk.
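A minimal numeric sketch of this standardization (the function name and numbers are illustrative, not from the posts above; I am assuming the collapsibility weights for the RR are the stratum shares among untreated cases, i.e. Weight(s) = P(stratum = s | case, untreated), which makes the weighted average of conditional RRs reproduce the marginal RR):

```python
def standardized_rr(conditional_rr, case_weights):
    """Marginal RR as a weighted average of stratum-specific RRs.

    conditional_rr: dict mapping stratum -> RR assumed transportable
    case_weights:   dict mapping stratum -> P(stratum | case, untreated)
                    in the target population (weights must sum to 1)
    """
    assert abs(sum(case_weights.values()) - 1.0) < 1e-9
    return sum(conditional_rr[s] * case_weights[s] for s in conditional_rr)

# Hypothetical sex-conditional RRs from a trial, standardized with
# hypothetical stratum shares among untreated cases in the target setting
rr_target = standardized_rr(
    {"men": 1/3, "women": 3/4},     # conditional RRs assumed transportable
    {"men": 21/29, "women": 8/29},  # shares of untreated cases by stratum
)
print(round(rr_target, 3))  # 0.448
```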

2 Likes

s_doi: If both methods using data from the same trial give different results and if we agree that method A uses Bayes’ rule then by default method B must violate Bayes’ rule

Sorry, but I don’t think this follows. To give another example, suppose we have data from a trial and we want to estimate the OR using two Bayesian methods. Method A uses a strongly informative prior that OR < 1, while Method B uses a strongly informative prior that OR > 1. Even with the same data, this could obviously lead to different results of the same magnitude as in your example. So is Method A in violation of Bayes’ theorem, or is it Method B?

2 Likes

We are discussing something slightly different here - a prior probability is being updated with the likelihood of the observed trial data (similar to the result of a diagnostic test) to generate an updated (posterior) probability. The OR is just the ratio of two likelihoods. This is the Bayesian updating method A being compared to one where there is no such concept (method B). Two Bayesian updating methods are not being compared, and neither are two prior probabilities.

I can see that there is still confusion about what all this means in my posts so perhaps this might help:

[Image: new table]

Thank you @AndersHuitfeldt. I have gone through the exercise of using the result of a RCT in a Norwegian cancer centre to estimate what would happen in Wales. Figure 1 shows a ‘P Map’ that sets out simple natural frequencies observed on treatment and control. (I use these ‘P Maps’ for teaching in the Oxford Handbook of Clinical Diagnosis.) There were 100 subjects in each of the control and treatment limbs and of these, 60/100 were female and 40/100 were male. Of the 60 females in the control limb, a proportion of 8/60 died. Of the 40 males in the control limb, 12 died. In the treatment limb, 6/60 females died giving a RR of 6/60 x 60/8 = 3/4. In the treatment limb, 4/40 males died giving a RR of 4/40 x 40/12 = 1/3.

The result of an observation study in a Welsh cancer centre is shown in Figure 2. There were 100 subjects in the study and of these, 40/100 were female and 60/100 male (different to the Norwegian study). Of the 40 females, a proportion of 8/40 died. Of the 60 males, 21 died. Assuming a RR of 3/4 for women, the expected proportion of women who would die with treatment in Wales would be 8/40 x 3/4 = 6/40. Assuming a RR of 1/3 for men, the expected proportion of men who would die with treatment in Wales would be 21/60 x 1/3 = 7/60.

Of the 29 people who died in the observation study, 8 were female and 21 were male. This gave a weight for Welsh men of 21/29 and a weight for Welsh women of 8/29. This gives RRmen x weight for Welsh men = 1/3 x 21/29 = 7/29. For women the RRwomen x weight for Welsh women = 3/4 x 8/29 = 6/29. The resulting estimation of the marginal RR is 7/29 + 6/29 = 13/29 = 0.448.

This result is confirmed by another method, which can be formulated with the insight provided by the ‘P Maps’ that display ‘natural frequencies’. This is done by combining the proportion of Welsh females who would be expected to die on treatment (6/40) with the proportion of Welsh males who would be expected to die on treatment (7/60), giving the proportion of Welsh people who would be expected to die on treatment: (6+7)/(40+60) = 13/100. The latter represents collapsible natural frequencies. The proportion of Welsh people who died in the observation study was 29/100. From this we can calculate the predicted marginal RR of (13/100)/(29/100) = 0.448, a result that coincides with the approach based on counterfactuals.

‘Natural odds’ are also collapsible and can be combined in the same way. The natural odds for 7/60 are 7/53 and the natural odds for 6/40 are 6/34. These collapse to the natural odds corresponding to the natural frequency of 13/100 thus: (7+6)/(53+34) = 13/87. The marginal odds ratio is therefore (13/87)/(29/71) = 0.366. Note that probabilities, risk differences, risk ratios and odds ratios may not always be collapsible, as pointed out by @Sander in his “Noncollapsibility, confounding, and sparse-data bias. Part 2: What should researchers make of persistent controversies about the odds ratio?”, but natural frequencies and natural odds are always collapsible in the above way.
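The arithmetic above can be checked directly from the natural frequencies (a sketch using the counts given for the two figures; variable names are mine):

```python
# Welsh observational (untreated) counts from Figure 2
women_n, women_deaths_control = 40, 8
men_n,   men_deaths_control   = 60, 21

# Apply the sex-conditional RRs transported from the Norwegian trial
women_deaths_treated = women_deaths_control * 3/4   # 6
men_deaths_treated   = men_deaths_control * 1/3     # 7

# Natural frequencies collapse by simple addition
deaths_treated = women_deaths_treated + men_deaths_treated   # 13
total          = women_n + men_n                             # 100
deaths_control = women_deaths_control + men_deaths_control   # 29

marginal_rr = (deaths_treated / total) / (deaths_control / total)
print(round(marginal_rr, 3))   # 0.448  (= 13/29)

# Natural odds collapse the same way: 7/53 and 6/34 -> 13/87
odds_treated = deaths_treated / (total - deaths_treated)     # 13/87
odds_control = deaths_control / (total - deaths_control)     # 29/71
marginal_or = odds_treated / odds_control
print(round(marginal_or, 3))   # 0.366
```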

Thank you again Anders for your kindness and patience in explaining this approach to me and @f2harrell for helping us all to understand each other better.

1 Like

The perspective that @Sander has attempted to explain in a number of posts and papers might be more easily grasped by substituting “weighted average” (or expectation) for “collapsible effect measure.”

In actuarial science, the question of how to derive a fair price for an individual who needs to insure a risk is a fundamental one, and the tools for analyzing the situation from the group and individual levels are known in insurance circles as credibility theory. It originated in property/casualty insurance, but there are adaptations to health. You might be more familiar with the term “partial pooling” of estimates, if I remember some of your comments on meta-analysis.

Credibility methods have very close relationships to Bayes estimators.

In credibility theory, the analyst attempts to compute a price that combines the unique aspects of an individual risk with the historical, aggregate experience of the risk pool in question.

The best estimator is a weighted average of the prices, where “more credible” information (lower variance estimate) gets more weight. By convention, the credibility weight Z is applied to the individual component, and the remainder is derived by computing 1-Z for the class component. A credibility of 0 indicates the individual component is homogeneous, and provides no additional information not available from the class mean. A credibility of 1 indicates historical experience is not relevant, the estimates are heterogeneous.
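As a sketch, the credibility-weighted estimate is just this convex combination (the function name and numbers are illustrative):

```python
def credibility_estimate(individual_mean, class_mean, z):
    """Convex combination: weight Z on the individual's own experience,
    1 - Z on the class (pool) experience."""
    assert 0.0 <= z <= 1.0
    return z * individual_mean + (1.0 - z) * class_mean

# Z = 0: the individual component adds nothing beyond the class mean
print(credibility_estimate(120.0, 100.0, 0.0))   # 100.0
# Z = 1: historical class experience is irrelevant
print(credibility_estimate(120.0, 100.0, 1.0))   # 120.0
# Partial pooling, as in random-effects meta-analysis
print(credibility_estimate(120.0, 100.0, 0.3))   # 106.0
```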

The process is remarkably similar to random effects meta-analysis.

The notion of “best effect measure” seems to have an analog in insurance as “best loss reserve estimate.” For these, estimates that can be interpreted as weighted averages (aka “collapsible”) are seen as most relevant. None of that seems to stop actuaries from using models such as logistic regression, and then computing the needed estimate from it, as Sander recommends.

A medical analogy might be computing the QALY under treatment adjusted for adverse effects of the treatment, derived from a mechanistic understanding of drug action, for a particular patient.

Related Papers

Venter, G. (2003). Credibility Theory for Dummies. In CAS E-Forum (pp. 621-627).

The following description of “greatest accuracy credibility” would be compatible with the example in @AndersHuitfeldt’s post describing the use of data from one population to make inferences about another.

The greatest accuracy approach measures relevance as well as stability and looks for the weights that will minimize an error measure. The average of the entire class could be a very stable quantity, but if the members of the class tend to be quite different from each other, it could be of less relevance for any particular class. So the relevance of a wider class average to any member’s mean is inversely related to the variability among the members of the class

Blum, K. A., & Otto, D. J. (1998). Best estimate loss reserving: an actuarial perspective. In CAS Forum Fall (Vol. 1, No. 55, p. 101). (PDF)

After a long discussion of the properties of a “best estimate”, they conclude for actuarial applications (underscore in original, bold my emphasis):

It is the undiscounted, unmargined, unbiased, best estimate of the probability
weighted average of all possible unpaid loss amounts.

There is an extensive discussion of various measures of central tendency, and the mathematical justification for using weighted averages.

Nelder, J. A., & Verrall, R. J. (1997). Credibility theory and generalized linear models. ASTIN Bulletin: The Journal of the IAA, 27(1), 71-82. (PDF)

Norberg, R. (2004). Credibility theory. Encyclopedia of Actuarial Science, 1, 398-406. (PDF)

In actuarial parlance the term credibility was originally attached to experience rating formulas that were convex combinations (weighted averages) of individual and class estimates of the individual risk premium. Credibility theory, thus, was the branch of insurance mathematics that explored model-based principles for construction of such formulas. The development of the theory brought it far beyond the original scope so that in today’s usage credibility covers more broadly linear estimation and prediction in latent variable models.

Here is a case study from a Society of Actuaries research paper.

Zhu, Z., Li, Z., Wylde, D., Failor, M., & Hrischenko, G. (2015). Logistic regression for insured mortality experience studies. North American Actuarial Journal, 19(4), 241-255. (PDF)

In summary, a logistic regression modeling approach allows use of less but more relevant data to address multiple challenges in quantifying insured mortality

1 Like

I understand that you’re not comparing two Bayesian methods, but that’s irrelevant to the point I was making. You stated that Method B “rejects Bayes’ theorem”. When asked for clarification, you made an if-then argument that said “If two methods disagree and if one method uses Bayes’ theorem, then the other method by default must violate Bayes’ rule”. My example serves as a counterexample showing why your reasoning isn’t correct, using two explicitly Bayesian methods just to emphasize the point. More generally, the concept of a method “violating Bayes’ theorem” is not really well-defined. As @AndersHuitfeldt has pointed out, probabilities and functions thereof can’t “violate” Bayes’ theorem. The only way to “violate Bayes’ theorem” that I can think of is to write it down incorrectly.

To go to your specific example, the discrepancy between the two methods stems from the fact that Method A assumes the OR in Stratum 1 is the same as in Stratum 2 while Method B assumes the RR in Stratum 1 is the same as in Stratum 2. But, as @HuwLlewelyn pointed out, the RR in Stratum 1 is not the same as in Stratum 2 in this dataset and so Method B makes a false assumption leading to an incorrect result. This has nothing to do with Bayes’ theorem or violations thereof and everything to do with the validity of assumptions about how various measures of effect vary across groups. In this particular example, the OR is the same across strata while the RR differs, so a method that makes the former correct assumption will lead to the correct answer while a method that makes the latter incorrect assumption will of course lead to a different, incorrect answer. But, as others have pointed out, there’s no reason to assume that this pattern holds more generally. Which measures of effect vary across strata in any particular context depends entirely on the underlying data generating mechanism. To get valid results we therefore need to make good assumptions about the data generating mechanism, i.e. we need to start from good assumptions about the underlying causal model. When @Sander says that you are lost in your math and that his objections are more primary, I believe this is what he is referring to. You have not made any errors in your calculations, but the calculations themselves are a red herring. What matters are the assumptions we make about how effect measures behave in our population(s) of interest, and rearranging terms in expressions for the OR/RR/RD according to Bayes’ rule is not going to provide any insight into that.

5 Likes

If you make assumptions about the data that lack an empirical framework then I guess we will just have to agree to disagree on this

Addendum: You say “Method A assumes the OR in Stratum 1 is the same as in Stratum 2 while Method B assumes the RR in Stratum 1 is the same as in Stratum 2”.
I think you have missed the point here. I was only interested in showing that predicted probabilities differ by both methods and therefore there has to be a correct basis for choosing one over the other.

What makes you think there “has to be” a correct basis for choosing one over the other? Do you mean that the mathematics alone should unambiguously tell you which effect measure is “correct”?

Suppose we have the numbers “2” and “3”, and that we are unsure about whether we want to add them or multiply them. We notice that the first approach results in a “5” and the second approach in a “6”. Does that mean we know that one of the operations is “incorrect”? Do you think that we can mathematically prove that one of the operations is the “correct” choice? Do you think one of the approaches has to be inconsistent with the Peano Axioms?

In my view, if we want to know whether we should “add” or “multiply” these numbers, this depends on the context. In applied cases, it will depend on what the numbers are intended to represent. Do you have two apples and want to know how many apples you would have if you bought three more? Then you should probably use addition.

When we use probability theory in our epistemology for the purposes of making inferences from data, the “correct” operation depends on what physical objects and physical actions we are trying to represent using mathematical objects and mathematical operations. Mathematical logic alone is not sufficient. We use terms like “data generating mechanism” so that we can be precise about what underlying objects we are discussing.

In this case, sscogges is simply telling you that mathematics alone is never going to give you an unambiguous answer for what effect measure to choose. This point alone is sufficient to refute your claims about the odds ratio, and it does not depend on any “assumptions that lack an empirical framework”.

One might then suggest that one way to get to an answer about which approach to use, would be to make justifiable and transparent assumptions about the data generating mechanism. This may not be ideal, but at least it allows transparent scientific discussion about plausibility of the analytic choices. Perhaps more importantly, nobody has suggested a better alternative.

Question for discussion: Is anyone other than me getting concerned yet about the fact that BMJ Evidence-Based Medicine published a paper by Suhail (Likelihood ratio interpretation of the relative risk | BMJ Evidence-Based Medicine) in which the highlighted key message is that “The interpretation of the risk ratio as an effect measure in a clinical trial is naïve and better replaced by its interpretation as a likelihood ratio” because of this alleged “conflict with Bayes theorem”?

2 Likes

Not clear why average risks are interesting.

Read the paper I linked to above on best estimate loss reserving. To allocate resources adequately, a decision maker needs weighted averages in order to estimate future claims and maintain adequate reserves. This is entirely uncontroversial in the actuarial realm, where organizations bear the cost of miscalculation.

This would be very important to an Accountable Care Organization (e.g. a PACE program) that receives capitated payments for Medicare beneficiaries. You would very much like to know if you have members whose utilization risk is higher than what the HCFA-calculated premium considers “fair.”

Accountable Care Models from American Academy of Actuaries

1 Like

Hi again Suhail
I would be grateful for clarification @s_doi. In the BMJEBM ~~abstract~~ rapid response - see below - you state that p(Y1|Z1, C) = [0.2] and p(Y1|Z1, T) = [0.4]. Also, please give the marginal probabilities [by inserting in the brackets] p(Y1|C) = [ ], p(Y1|T) = [ ] and also P(Z1|Y1, C) = [ ], P(Z1|Y0, C) = [ ], P(Z1|Y1, T) = [ ] and P(Z1|Y0, T) = [ ].

1 Like

Hi Huw, the paper does not have an abstract … please update your post to which example you are referring to.

Okay, that was not the abstract - that was a rapid response
We only use marginal probabilities here and state that p(Y1|C) = [0.2] and p(Y1|T) = [0.4]. We then update the p(Y1|C) = [0.5] in another population to deliver a prediction for p(Y1|T) by applying the trial results under the two methods above (A and B).

Thank you Suhail. If we use the terms of Bayes’ rule then p(Y1|Z1, T)/p(Y1|Z1, C) = 0.4/0.2 = 2 is a first risk ratio and p(Y0|Z1, T)/p(Y0|Z1, C) = 0.6/0.8 = 0.75 is a second risk ratio. By dividing the 1st risk ratio by the 2nd risk ratio (i.e. 2/0.75 = 2.67) you have created a ratio of risk ratios, NOT a ratio of likelihood ratios (i.e. in the nomenclature of Bayes’ rule).

The ratios of likelihood ratios would be p(Z1|Y1, T)/p(Z1|Y0, T) (the 1st likelihood ratio) divided by p(Z1|Y1, C)/p(Z1|Y0, C) (a 2nd likelihood ratio in the nomenclature of Bayes rule) to give a ratio of likelihood ratios.

By changing the nomenclature you are confusing me. If you were able to provide example probabilities for the empty brackets in my earlier post from your BMJ-EBM paper I might be able to understand your reasoning. So in summary, my understanding of the Bayes rule nomenclature is that p(Y|Z) is a risk or posterior probability, p(Z|Y) is a likelihood and p(Y) is a prior probability or marginal risk. I think that @ESMD (from the ‘peanut gallery’) is right in that many of our difficulties in these discussions are due to problems with nomenclature.

1 Like

@AndersHuitfeldt has helped to transform my understanding of effect measure collapsibility. I understand that one purpose of collapsibility for epidemiologists is to try to use the conditional RRs or ORs observed in a RCT in one population (e.g. in Norway) to estimate the marginal effect measure that might be observed in a different population (e.g. in Wales) where the conditional risks based on a covariate (e.g. males / females) are different. However this can be done using the collapsible property of natural frequencies and natural odds, which I have known to be always collapsible for as long as I have been using ‘P Maps’ (although I did not use that word). This is explained in my previous post ( Should one derive risk difference from the odds ratio? - #540 by HuwLlewelyn ). This illustrates the point about nomenclature made by @ESMD and perhaps the reason that it has taken me so long to understand @Sander and others.

What interests me as a clinician however, is how to apply the result of a RCT to patients with the same disease but various degrees of disease severity (fundamental to clinical practice) and also to apply them to patients with different baseline risks for other reasons (e.g. the presence of multiple covariates that change the baseline risks). This can be helped when conditional and marginal effect measures such as RRs and ORs are identical as well as being natural frequency and natural odds collapsible.

The RR and OR can be very helpful for low risks encountered in epidemiology (when they can be very similar numerically). However the RR not only gives implausibly large RDs at high probabilities encountered in the clinical setting familiar to me, but also probabilities > 1 if the treatment increases risk (a problem that can be prevented by the switch RR of course). The OR does not have these bad habits and allows the marginal OR to be transported plausibly to baseline probabilities that are higher and lower than the marginal probability in a convenient way. However the OR is not an accurate model because when the conditional odds ratios are the same, the marginal odds ratio is not equal to them.
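The point about implausible transport at high baseline risks can be illustrated numerically (a sketch; function names and the RR/OR value of 1.5 are mine, and the OR is applied on the odds scale):

```python
def apply_rr(baseline_risk, rr):
    # Risk-ratio transport: can exceed 1 at high baseline risks
    return baseline_risk * rr

def apply_or(baseline_risk, odds_ratio):
    # Odds-ratio transport: always stays within (0, 1)
    odds = baseline_risk / (1 - baseline_risk)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

for p0 in (0.1, 0.5, 0.8):
    print(p0, round(apply_rr(p0, 1.5), 3), round(apply_or(p0, 1.5), 3))
# p0 = 0.1 -> RR gives 0.15,           OR gives ~0.143
# p0 = 0.5 -> RR gives 0.75,           OR gives 0.6
# p0 = 0.8 -> RR gives 1.2 (impossible), OR gives ~0.857
```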

Another approach is to fit logistic regression functions independently to both the control and treatment data, but this too can introduce biases (e.g. due to the choice of the training odds used). My solution is to use any of the above when convenient for practical reasons and then to calibrate the results by regarding the ‘calculated curves’ as provisional. I explained how I do this in another recent post: Risk based treatment and the validity of scales of effect , that also explores a different aspect of causal inference when dealing with multiple covariates.

Hi Huw, p(Y1|T)/p(Y1|C) and p(Y0|T)/p(Y0|C) are both RRs and LRs in the nomenclature of Bayes’ rule. In this direction they are LRs for Y as a test for T status.

In the other direction T as a test of Y status, you mixed up the nomenclature and this should be p(T|Y1)/p(T|Y0) and p(C|Y1)/p(C|Y0). These two are also LRs but this time are NOT RRs.

Note also that:
p(Y1|T)/p(Y1|C) divided by p(Y0|T)/p(Y0|C)
is exactly the same as:
p(T|Y1)/p(T|Y0) divided by p(C|Y1)/p(C|Y0)
That is why I can use either ratio of LRs interchangeably.

Addendum: Z removed for clarity
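The interchangeability claimed above can be checked numerically. For illustration I assume a trial with equal allocation, p(T) = p(C) = 0.5 (my assumption, not stated in the posts), together with the marginal risks from the rapid response, p(Y1|T) = 0.4 and p(Y1|C) = 0.2:

```python
p_T, p_C = 0.5, 0.5          # assumed equal allocation
p_y1_T, p_y1_C = 0.4, 0.2    # marginal risks under treatment and control

# Ratio of RRs: [p(Y1|T)/p(Y1|C)] / [p(Y0|T)/p(Y0|C)]
rr_ratio = (p_y1_T / p_y1_C) / ((1 - p_y1_T) / (1 - p_y1_C))

# Reverse direction via Bayes' rule: p(T|Y1), p(T|Y0), p(C|Y1), p(C|Y0)
p_y1 = p_y1_T * p_T + p_y1_C * p_C
p_y0 = 1 - p_y1
p_T_y1 = p_y1_T * p_T / p_y1
p_T_y0 = (1 - p_y1_T) * p_T / p_y0
p_C_y1 = p_y1_C * p_C / p_y1
p_C_y0 = (1 - p_y1_C) * p_C / p_y0

# Ratio of LRs: [p(T|Y1)/p(T|Y0)] / [p(C|Y1)/p(C|Y0)]
lr_ratio = (p_T_y1 / p_T_y0) / (p_C_y1 / p_C_y0)

print(round(rr_ratio, 4), round(lr_ratio, 4))  # both 2.6667 (the OR)
```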

What you mean to say here is “…when the conditional odds ratios are the same, it does not equal the marginal odds ratio”. This is the definition of noncollapsibility, and this whole discussion started because the causal inference group could not understand why this happens and called it an inaccuracy, just as you did. What is the evidence that this is an inaccuracy? Just because we cannot get a weighted average of the conditionals to deliver the marginal does not imply inaccuracy, and there needs to be a valid reason before noncollapsibility can be tagged “bad”.

Also, averaging as you have indicated above is not interesting for decision making as Frank has said - please see Frank’s blog here

There is no “causal inference group” in this discussion. I speak as an individual, not as a member of some united front of causal inference researchers. Moreover, I understand “why this happens” (i.e. why the odds ratio is non-collapsible). I never called it an “inaccuracy”, can you please point out where you think this term was used?

The key question for discussion is whether we ever have reason for believing that an effect measure might be stable between different groups, and if so, whether we can pinpoint the applications in which such stability is expected to hold. Non-collapsibility is somewhat relevant, in part because I provided a proof that certain broad classes of data generating mechanisms will never lead to stability of a non-collapsible effect measure. But really, non-collapsibility is a red herring, since nobody even tried to formalize a biological generative model that leads to such stability. I don’t understand how collapsibility has taken on such an outsize role in this conversation.

The chance that conditional risks are different is lower than the near certainty that marginal effects are different, due to shifts from predominantly high-risk patients to predominantly low-risk patients or vice-versa. Still trying to wrap my head around the utility of marginal estimates.

1 Like

This approach doesn’t use what we’ve learned about predictive modeling in the past 40 years. It should be contrasted with an additive (in the logit) model that properly penalizes all the interaction terms that are implied by the separate fitting approach, to avoid overfitting.

1 Like