Should one derive risk difference from the odds ratio?

In this dataset, the collapsibility weight for men is 3/4 (calculated as the probability of being a man, given that you would have died had you not taken treatment) and the collapsibility weight for women is 1/4 (the probability of being a woman, given the same condition). Using these weights, it is easy to verify that RR(men) × 3/4 + RR(women) × 1/4 = 1.5 × 3/4 + 3 × 1/4 = 1.875, which equals the marginal (“collapsed”) RR.

This only works because the dataset was unconfounded (probability of treatment was equal between men and women).


Excellent. Thank you. Are you able to spell out your calculation of the probability of being a man, given that you would have died if you did not take treatment (i.e. 3/4), and the probability of being a woman, given the same condition (i.e. 1/4)? Does this involve estimating the probability of an individual (1) dying with and without treatment (never survive), (2) surviving with and without treatment (always survive), (3) dying without treatment, surviving with treatment (benefit), and (4) dying with treatment, surviving without treatment (harm)? Why can a calculation not give RR × (weight for women) = 1.875 and RR × (weight for men) = 1.875, as this would represent my understanding of collapsibility?

If there is no confounding (e.g. a randomized trial), you can calculate this by conditioning on being someone who was untreated and died, and then calculate their probability of being in a stratum. For example, in Table 1 of Sander’s paper, there were 30 men and 10 women who were untreated and died. Among these 40 people, there was a 3/4 chance of being a man, and a 1/4 chance of being a woman.
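For concreteness, this weight calculation and the weighted average can be checked with a short script. It uses the counts quoted above (30 men and 10 women among the untreated who died) together with the stratum RRs from the earlier post (1.5 for men, 3 for women); this is only a re-computation of those published numbers, not new data:

```python
from fractions import Fraction

# Untreated subjects who died (counts quoted above from Table 1)
untreated_deaths = {"men": 30, "women": 10}
total = sum(untreated_deaths.values())  # 40

# Collapsibility weight = P(stratum | died without treatment)
weights = {s: Fraction(n, total) for s, n in untreated_deaths.items()}

# Stratum-specific risk ratios from the earlier post
rr = {"men": Fraction(3, 2), "women": Fraction(3, 1)}

# Marginal ("collapsed") RR as the weighted average of the stratum RRs
marginal_rr = sum(rr[s] * weights[s] for s in rr)

print(weights["men"], weights["women"], float(marginal_rr))  # 3/4 1/4 1.875
```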


Can your calculation give RR*(weight for women) = 1.875 and RR*(weight for men) = 1.875 as this would represent my understanding of collapsibility?

No, that is not the definition I am using. Collapsibility does not necessarily mean that the effect measure takes the same value in two strata. (But most definitions of collapsibility imply that if the effect measure takes the same value in two strata, then the marginal effect measure takes that same value.)

OK. If the marginal and conditional (on a covariate) RDs, RRs and ORs are the same (one meaning of collapsibility), the marginal results of an RCT are often used to estimate the effect of treatment at different baseline probabilities conditional on that covariate (rightly or wrongly). However, can you explain to me the advantage of estimating the marginal RR from two different conditional RRs as you have done?

Can you explain to me the advantage of estimating the marginal RR from two different conditional RRs as you have done using ‘counterfactual collapsibility’?

One use case would be if you are trying to standardize the risk ratio from one setting to another.

To illustrate, suppose you have a randomized trial in Norway, and are interested in knowing the marginal risk ratio in Wales, where treatment is not yet available. You can then decompose the RR in Wales as RR(Wales) = RR(Welsh men) × Weight(Welsh men) + RR(Welsh women) × Weight(Welsh women). If you are willing to assume that RR(Welsh men) is equal to RR(Norwegian men), and that the same holds for women, you can then plug in the sex-conditional risk ratios from the Norwegian trial. The weights in Wales can be estimated from observational data in a population where treatment is unavailable.

Note in particular that since such collapsibility weights do not exist for the odds ratio, it would be impossible to standardize the conditional odds ratios from the Norwegian population into a valid estimate for the marginal odds ratio in Wales. If you wanted to estimate the marginal odds ratio in Wales, you would have to standardize the absolute probabilities, but this depends on stronger assumptions than stability of the relative risk.
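As a sketch, the decomposition might look like this in code. The sex-conditional RRs and the target-population weights below are hypothetical placeholders, not estimates from any real trial or survey:

```python
# Hypothetical inputs: sex-conditional RRs assumed transportable from the
# trial population, and collapsibility weights P(sex | would die untreated)
# estimated from observational data in the target population.
rr_trial = {"men": 1.5, "women": 3.0}        # illustrative trial RRs
weights_target = {"men": 0.5, "women": 0.5}  # illustrative target weights

# Marginal RR in the target population via the decomposition above
rr_target = sum(rr_trial[s] * weights_target[s] for s in rr_trial)

print(rr_target)  # 2.25
```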


s_doi: If both methods using data from the same trial give different results and if we agree that method A uses Bayes’ rule then by default method B must violate Bayes’ rule

Sorry, but I don’t think this follows. To give another example, suppose we have data from a trial and we want to estimate the OR using two Bayesian methods. Method A uses a strongly informative prior that OR < 1, while Method B uses a strongly informative prior that OR > 1. Even with the same data, this could obviously lead to different results of the same magnitude as in your example. So is Method A in violation of Bayes’ theorem, or is it Method B?


We are discussing something slightly different here - a prior probability is being updated with the likelihood of the observed trial data (similar to the result of a diagnostic test) to generate an updated (posterior) probability. The OR is just the ratio of two likelihoods. This is the Bayesian updating method A, being compared to one where there is no such concept (method B). Two Bayesian updating methods are not being compared, and neither are two prior probabilities.

I can see that there is still confusion about what all this means in my posts so perhaps this might help:

new table

Thank you @AndersHuitfeldt. I have gone through the exercise of using the result of an RCT in a Norwegian cancer centre to estimate what would happen in Wales. Figure 1 shows a ‘P Map’ that sets out simple natural frequencies observed on treatment and control. (I use these ‘P Maps’ for teaching in the Oxford Handbook of Clinical Diagnosis.) There were 100 subjects in each of the control and treatment limbs and of these, 60/100 were female and 40/100 were male. Of the 60 females in the control limb, a proportion of 8/60 died. Of the 40 males in the control limb, 12 died. In the treatment limb, 6/60 females died, giving a RR of 6/60 x 60/8 = 3/4. In the treatment limb, 4/40 males died, giving a RR of 4/40 x 40/12 = 1/3.

The result of an observational study in a Welsh cancer centre is shown in Figure 2. There were 100 subjects in the study and of these, 40/100 were female and 60/100 male (different to the Norwegian study). Of the 40 females, a proportion of 8/40 died. Of the 60 males, 21 died. Assuming a RR of 3/4 for women, the expected proportion of women who would die with treatment in Wales would be 8/40 x 3/4 = 6/40. Assuming a RR of 1/3 for men, the expected proportion of men who would die with treatment in Wales would be 21/60 x 1/3 = 7/60.

Of the 29 people who died in the observational study, 8 were female and 21 were male. This gave a weight for Welsh men of 21/29 and a weight for Welsh women of 8/29. This gives RRmen x weight for Welsh men = 1/3 x 21/29 = 7/29. For women, RRwomen x weight for Welsh women = 3/4 x 8/29 = 6/29. The resulting estimate of the marginal RR is 7/29 + 6/29 = 13/29 = 0.448.

This result is confirmed by another method, which can be formulated with the insight provided by the ‘P Maps’ that display ‘natural frequencies’. This is done by combining the proportion of Welsh females who would be expected to die on treatment (6/40) with the proportion of Welsh males who would be expected to die on treatment (7/60), giving the proportion of Welsh people who would be expected to die on treatment: (6+7)/(40+60) = 13/100. The latter represents collapsible natural frequencies. The proportion of Welsh people who died in the observational study was 29/100. From this we can calculate the predicted marginal RR of (13/100)/(29/100) = 0.448, a result that coincides with the approach based on counterfactuals.

‘Natural odds’ are also collapsible and can be combined in the same way. The natural odds corresponding to 7/60 are 7/53 and the natural odds corresponding to 6/40 are 6/34. These collapse to the natural odds corresponding to the natural frequency 13/100 thus: (7+6)/(53+34) = 13/87. The marginal odds ratio is therefore (13/87)/(29/71) = 0.366. Note that probabilities, risk differences, risk ratios and odds ratios may not always be collapsible, as pointed out by @Sander in his paper ‘Noncollapsibility, confounding, and sparse-data bias. Part 2: What should researchers make of persistent controversies about the odds ratio?’, but natural frequencies and natural odds are always collapsible in the above way.
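The whole calculation can be verified with exact fractions. This is purely a re-computation of the numbers already given (Norwegian stratum RRs, Welsh weights, natural frequencies and natural odds), not a new analysis:

```python
from fractions import Fraction as F

# Norwegian trial (per the P Map): sex-conditional risk ratios
rr = {"women": F(6, 60) / F(8, 60),   # 3/4
      "men":   F(4, 40) / F(12, 40)}  # 1/3

# Welsh observational study: (untreated deaths, stratum size)
welsh = {"women": (8, 40), "men": (21, 60)}

# Expected deaths on treatment in Wales, applying the transported RRs
expected = {s: F(d, n) * rr[s] * n for s, (d, n) in welsh.items()}

deaths_treated = int(sum(expected.values()))          # 6 women + 7 men = 13
deaths_untreated = sum(d for d, _ in welsh.values())  # 8 + 21 = 29
n_total = sum(n for _, n in welsh.values())           # 100

marginal_rr = F(deaths_treated, deaths_untreated)     # 13/29

# Natural odds collapse the same way: (7 + 6) / (53 + 34) = 13/87
odds_treated = F(deaths_treated, n_total - deaths_treated)        # 13/87
odds_untreated = F(deaths_untreated, n_total - deaths_untreated)  # 29/71
marginal_or = odds_treated / odds_untreated

print(marginal_rr, round(float(marginal_rr), 3))  # 13/29 0.448
print(marginal_or, round(float(marginal_or), 3))  # 923/2523 0.366
```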

Thank you again Anders for your kindness and patience in explaining this approach to me and @f2harrell for helping us all to understand each other better.


The perspective that @Sander has attempted to explain in a number of posts and papers might be more easily grasped by substituting “weighted average” (or expectation) for “collapsible effect measure.”

In actuarial science, the question of how to derive a fair price for an individual who needs to insure a risk is a fundamental one, and the tools for analyzing the situation at the group and individual levels are known in insurance circles as credibility theory. It originated in property/casualty insurance, but there are adaptations to health insurance. You might be more familiar with the term “partial pooling” of estimates, if I remember some of your comments on meta-analysis.

Credibility methods have very close relationships to Bayes estimators.

In credibility theory, the analyst attempts to compute a price that combines the unique aspects of an individual risk with the historical, aggregate experience of the risk pool in question.

The best estimator is a weighted average of the prices, where “more credible” information (a lower-variance estimate) gets more weight. By convention, the credibility weight Z is applied to the individual component, and the class component receives the remainder, 1 − Z. A credibility of 0 indicates the individual component is homogeneous with the class and provides no information beyond the class mean. A credibility of 1 indicates the class’s historical experience is not relevant: the estimates are heterogeneous.
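A minimal sketch of that convex combination; the function name and numbers are purely illustrative, not taken from any actuarial standard:

```python
def credibility_estimate(individual_mean: float, class_mean: float, z: float) -> float:
    """Convex combination: weight Z on the individual component, 1 - Z on the class."""
    if not 0.0 <= z <= 1.0:
        raise ValueError("credibility Z must lie in [0, 1]")
    return z * individual_mean + (1.0 - z) * class_mean

# Z = 0: the individual adds nothing beyond the class mean
# Z = 1: the class's historical experience is irrelevant
print(credibility_estimate(120.0, 100.0, 0.0))   # 100.0
print(credibility_estimate(120.0, 100.0, 1.0))   # 120.0
print(credibility_estimate(120.0, 100.0, 0.25))  # 105.0
```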

The process is remarkably similar to random effects meta-analysis.

The notion of “best effect measure” seems to have an analog in insurance as “best loss reserve estimate.” For these, estimates that can be interpreted as weighted averages (aka “collapsible”) are seen as most relevant. None of that seems to stop actuaries from using models such as logistic regression, and then computing the needed estimate from it, as Sander recommends.

A medical analogy might be computing the QALY under treatment adjusted for adverse effects of the treatment, derived from a mechanistic understanding of drug action, for a particular patient.

Related Papers

Venter, G. (2003). Credibility Theory for Dummies. In CAS E-Forum (pp. 621-627).

The following description of “greatest accuracy credibility” would be compatible with the example in @AndersHuitfeldt post describing the use of data in one population to make inferences about another.

The greatest accuracy approach measures relevance as well as stability and looks for the weights that will minimize an error measure. The average of the entire class could be a very stable quantity, but if the members of the class tend to be quite different from each other, it could be of less relevance for any particular class. So the relevance of a wider class average to any member’s mean is inversely related to the variability among the members of the class

Blum, K. A., & Otto, D. J. (1998). Best estimate loss reserving: an actuarial perspective. In CAS Forum Fall (Vol. 1, No. 55, p. 101). (PDF)

After a long discussion of the properties of a “best estimate”, they conclude for actuarial applications (underscore in original, bold my emphasis):

It is the undiscounted, unmargined, unbiased, best estimate of the probability weighted average of all possible unpaid loss amounts.

There is an extensive discussion of various measures of central tendency, and the mathematical justification for using weighted averages.

Nelder, J. A., & Verrall, R. J. (1997). Credibility theory and generalized linear models. ASTIN Bulletin: The Journal of the IAA, 27(1), 71-82. (PDF)

Norberg, R. (2004). Credibility theory. Encyclopedia of Actuarial Science, 1, 398-406. (PDF)

In actuarial parlance the term credibility was originally attached to experience rating formulas that were convex combinations (weighted averages) of individual and class estimates of the individual risk premium. Credibility theory, thus, was the branch of insurance mathematics that explored model-based principles for construction of such formulas. The development of the theory brought it far beyond the original scope so that in today’s usage credibility covers more broadly linear estimation and prediction in latent variable models.

Here is a case study from a Society of Actuaries research paper.

Zhu, Z., Li, Z., Wylde, D., Failor, M., & Hrischenko, G. (2015). Logistic regression for insured mortality experience studies. North American Actuarial Journal, 19(4), 241-255. (PDF)

In summary, a logistic regression modeling approach allows use of less but more relevant data to address multiple challenges in quantifying insured mortality


I understand that you’re not comparing two Bayesian methods, but that’s irrelevant to the point I was making. You stated that Method B “rejects Bayes’ theorem”. When asked for clarification, you made an if-then argument that said “If two methods disagree and if one method uses Bayes’ theorem, then the other method by default must violate Bayes’ rule”. My example serves as a counterexample showing why your reasoning isn’t correct, using two explicitly Bayesian methods just to emphasize the point. More generally, the concept of a method “violating Bayes’ theorem” is not really well-defined. As @AndersHuitfeldt has pointed out, probabilities and functions thereof can’t “violate” Bayes’ theorem. The only way to “violate Bayes’ theorem” that I can think of is to write it down incorrectly.

To go to your specific example, the discrepancy between the two methods stems from the fact that Method A assumes the OR in Stratum 1 is the same as in Stratum 2, while Method B assumes the RR in Stratum 1 is the same as in Stratum 2. But, as @HuwLlewelyn pointed out, the RR in Stratum 1 is not the same as in Stratum 2 in this dataset, and so Method B makes a false assumption leading to an incorrect result. This has nothing to do with Bayes’ theorem or violations thereof, and everything to do with the validity of assumptions about how various measures of effect vary across groups.

In this particular example, the OR is the same across strata while the RR differs, so a method that makes the former (correct) assumption will lead to the correct answer, while a method that makes the latter (incorrect) assumption will of course lead to a different, incorrect answer. But as others have pointed out, there is no reason to assume that this pattern holds more generally. Which measures of effect vary across strata in any particular context depends entirely on the underlying data generating mechanism. To get valid results we therefore need to make good assumptions about the data generating mechanism, i.e. we need to start from good assumptions about the underlying causal model.

When @Sander says that you are lost in your math and that his objections are more primary, I believe this is what he is referring to. You have not made any errors in your calculations, but the calculations themselves are a red herring. What matters are the assumptions we make about how effect measures behave in our population(s) of interest, and rearranging terms in expressions for the OR/RR/RD according to Bayes’ rule is not going to provide any insight into that.


If you make assumptions about the data that lack an empirical framework then I guess we will just have to agree to disagree on this

Addendum: You say “Method A assumes the OR in Stratum 1 is the same as in Stratum 2 while Method B assumes the RR in Stratum 1 is the same as in Stratum 2”. I think you have missed the point here. I was only interested in showing that the predicted probabilities differ between the two methods, and therefore there has to be a correct basis for choosing one over the other.

What makes you think there “has to be” a correct basis for choosing one over the other? Do you mean that the mathematics alone should unambiguously tell you which effect measure is “correct”?

Suppose we have the numbers “2” and “3”, and that we are unsure about whether we want to add them or multiply them. We notice that the first approach results in a “5” and the second approach in a “6”. Does that mean we know that one of the operations is “incorrect”? Do you think that we can mathematically prove that one of the operations is the “correct” choice? Do you think one of the approaches has to be inconsistent with the Peano Axioms?

In my view, if we want to know whether we should “add” or “multiply” these numbers, this depends on the context. In applied cases, it will depend on what the numbers are intended to represent. Do you have two apples and want to know how many apples you would have if you bought three more? Then you should probably use addition.

When we use probability theory in our epistemology for the purposes of making inferences from data, the “correct” operation depends on what physical objects and physical actions we are trying to represent using mathematical objects and mathematical operations. Mathematical logic alone is not sufficient. We use terms like “data generating mechanism” so that we can be precise about what underlying objects we are discussing.

In this case, sscogges is simply telling you that mathematics alone is never going to give you an unambiguous answer for what effect measure to choose. This point alone is sufficient to refute your claims about the odds ratio, and it does not depend on any “assumptions that lack an empirical framework”.

One might then suggest that one way to get to an answer about which approach to use, would be to make justifiable and transparent assumptions about the data generating mechanism. This may not be ideal, but at least it allows transparent scientific discussion about plausibility of the analytic choices. Perhaps more importantly, nobody has suggested a better alternative.

Question for discussion: Is anyone other than me getting concerned yet about the fact that BMJ Evidence-Based Medicine published a paper by Suhail (Likelihood ratio interpretation of the relative risk | BMJ Evidence-Based Medicine) in which the highlighted key message is that “The interpretation of the risk ratio as an effect measure in a clinical trial is naïve and better replaced by its interpretation as a likelihood ratio” because of this alleged “conflict with Bayes theorem”?


Not clear why average risks are interesting.

Read the paper I linked to above on best estimate loss reserving. To allocate resources adequately, a decision maker needs weighted averages in order to estimate future claims and maintain adequate reserves. This is entirely not controversial in the actuarial realm, where organizations bear the cost for miscalculation.

This would be very important to an Accountable Care Organization (ie. a PACE program, for example) that receives capitated payments for Medicare beneficiaries. You would very much like to know if you have members that have a utilization risk higher than what the HCFA calculated premium considers as “fair.”

Accountable Care Models from American Academy of Actuaries


Hi again Suhail
I would be grateful for clarification @s_doi . In the BMJ EBM rapid response (I originally wrote ‘abstract’, sorry) - see below - you state that p(Y1|Z1, C) = [0.2] and p(Y1|Z1, T) = [0.4]. Also, please give the marginal probabilities [by inserting in the brackets] p(Y1|C) = [ ], p(Y1|T) = [ ] and also P(Z1|Y1, C) = [ ], P(Z1|Y0, C) = [ ], P(Z1|Y1, T) = [ ] and P(Z1|Y0, T) = [ ].


Hi Huw, the paper does not have an abstract … please update your post to indicate which example you are referring to.

Okay, that was not the abstract - that was a rapid response, and we only use marginal probabilities here, stating that p(Y1|C) = [0.2] and p(Y1|T) = [0.4]. We then update to p(Y1|C) = [0.5] in another population to deliver a prediction for p(Y1|T) by applying the trial results under the two methods above (A and B).

Thank you Suhail. If we use the terms of Bayes’ rule, then p(Y1|Z1, T)/p(Y1|Z1, C) = 0.4/0.2 = 2 is a first risk ratio and p(Y0|Z1, T)/p(Y0|Z1, C) = 0.6/0.8 = 0.75 is a second risk ratio. By dividing the 1st risk ratio by the 2nd risk ratio (i.e. 2/0.75 = 2.67) you have created a ratio of risk ratios, NOT a ratio of likelihood ratios (i.e. in the nomenclature of Bayes’ rule).

The ratio of likelihood ratios would be p(Z1|Y1, T)/p(Z1|Y0, T) (the 1st likelihood ratio) divided by p(Z1|Y1, C)/p(Z1|Y0, C) (the 2nd likelihood ratio, in the nomenclature of Bayes’ rule).
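Using the two probabilities quoted above (0.2 under control, 0.4 under treatment), a few lines of arithmetic show that the ratio of risk ratios described here is exactly the conditional odds ratio, which is a simple algebraic identity rather than anything specific to this dataset:

```python
# Probabilities quoted above: p(Y1 | Z1, C) and p(Y1 | Z1, T)
p_death_control = 0.2
p_death_treated = 0.4

rr_death = p_death_treated / p_death_control                 # 0.4 / 0.2 = 2.0
rr_survival = (1 - p_death_treated) / (1 - p_death_control)  # 0.6 / 0.8 = 0.75

ratio_of_rrs = rr_death / rr_survival  # 2 / 0.75

# The same quantity written directly as the conditional odds ratio
odds_ratio = (p_death_treated / (1 - p_death_treated)) / (
    p_death_control / (1 - p_death_control))

print(round(ratio_of_rrs, 3), round(odds_ratio, 3))  # 2.667 2.667
```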

By changing the nomenclature you are confusing me. If you were able to provide example probabilities for the empty brackets in my earlier post from your BMJ-EBM paper, I might be able to understand your reasoning. So in summary, my understanding of the Bayes’ rule nomenclature is that p(Y|Z) is a risk or posterior probability, p(Z|Y) is a likelihood, and p(Y) is a prior probability or marginal risk. I think that @ESMD (from the ‘peanut gallery’) is right in that many of our difficulties in these discussions are due to problems with nomenclature.
