This is an incredibly long thread that has technical details beyond my pay grade. But I would like to add a different perspective and some history.
First, I would like to thank those who cited some of our work, and to make some clarifications. Our JCE paper showing that the OR overestimates the risk ratio was a teaching example. Suggesting that we were proposing constancy of an effect across different baseline risks in general, or even in the condition we used, is a misinterpretation that surprised me.
Second, Doi suggests that the Jon Deeks paper showing the OR to be more homogeneous supports his view. The paper with Charlie Poole and Tyler VanderWeele explaining the limitations of Jon Deeks's analysis is extremely important. That paper came about because I asked Charlie a question at a conference after he said the RD was the most appropriate measure; I suggested the OR and RR were more homogeneous, based on Jon's study, which I was aware of. Tyler became interested because heterogeneity is just a form of "interaction". Tyler's expertise in understanding "interactions" allowed him to see what the rest of us had missed: Jon's results were expected because of power issues, even if the OR was not truly more homogeneous. But the paper has more implications if you read it carefully, along with the accompanying editorial by Panagiotou and Trikalinos ("On Effect Measures, Heterogeneity, and the Laws of Nature"). What does it actually mean to be more or less homogeneous? How are we measuring this when the scales are different? Until we address this philosophical question, the issue of differences in "homogeneity" with baseline risk across different scales cannot really be answered. As one example of the associated dangers, I-squared (a measure of relative heterogeneity often used in meta-analysis) is so often misinterpreted as a measure of absolute heterogeneity that several papers have been written trying to correct this (see Borenstein, Higgins, Hedges and Rothstein, Res Synth Methods 2017, DOI: 10.1002/jrsm.1230 for an excellent review).
Third, "homogeneity" on any scale will depend on the true data generating process (DGP), and we simply don't know what that is in any particular context. As Sander suggested, it is likely different for different contexts or subpopulations. Examining effect heterogeneity in a single study by including interaction terms sounds like a good idea, but these analyses are usually grossly underpowered and therefore provide little insight. Relying on p-values to draw conclusions will likely lead to many incorrect inferences.
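To see the power problem concretely, here is a minimal simulation sketch (not from any paper discussed in this thread; the risks, sample size, and number of simulations are hypothetical choices for illustration only). It fits a logistic model with a treatment-by-subgroup interaction to repeated simulated trials and estimates how often the interaction test reaches p < 0.05:

```python
# Minimal sketch: power to detect a subgroup-by-treatment interaction
# in a single two-arm trial. All numbers are hypothetical illustrations.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n_per_arm, n_sims, alpha = 500, 1000, 0.05

# Hypothetical true risks: the treatment effect differs between subgroups,
# so a non-zero interaction exists on the log-odds scale
# (subgroup 0: OR ~1.7; subgroup 1: OR ~3.3).
risk = {(0, 0): 0.20, (0, 1): 0.30,   # subgroup 0: control, treated
        (1, 0): 0.20, (1, 1): 0.45}   # subgroup 1: control, treated

hits = 0
for _ in range(n_sims):
    subgroup = rng.integers(0, 2, size=2 * n_per_arm)
    treat = np.repeat([0, 1], n_per_arm)
    p = np.array([risk[(g, t)] for g, t in zip(subgroup, treat)])
    y = rng.binomial(1, p)
    X = sm.add_constant(np.column_stack([treat, subgroup, treat * subgroup]))
    fit = sm.Logit(y, X).fit(disp=False)
    hits += fit.pvalues[3] < alpha  # p-value of the interaction term

print(f"Estimated power for the interaction test: {hits / n_sims:.2f}")
# Even with 1,000 patients and a fairly large difference in subgroup
# effects, power for the interaction test is often well below the
# conventional 80% target.
```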
In the end, we conduct studies to help inform decision makers, who may be public health officials, clinicians, or patients. We need to communicate effectively with these groups. I found Doi's example above with subgroups quite difficult to follow, although perhaps others found it easy. That said, I want to highlight that Doi suggests converting the increase in the odds of death into a number needed to harm (NNH) for the patient. I am not a fan of NNT/NNH because my experience is that patients do not understand it well, notwithstanding claims by some others; that is a separate discussion. Regardless, the NNH is simply the inverse of the RD. If Doi believes we should be conveying the NNH, he is saying that the RD, not the OR, is the more relevant measure to convey to patients.
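To make that relationship explicit, here is a minimal sketch (the risks below are hypothetical and are not taken from any study mentioned in this thread):

```python
# NNH is just the reciprocal of the risk difference (RD).
def nnh(risk_treated: float, risk_control: float) -> float:
    """Number needed to harm = 1 / (risk_treated - risk_control)."""
    return 1.0 / (risk_treated - risk_control)

# Hypothetical illustration: if treatment raises the risk of an adverse
# event from 2% to 7%, RD = 0.05, so NNH = 20:
# one extra harm per 20 patients treated.
print(f"{nnh(0.07, 0.02):.0f}")  # 20
```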
My own perspective is that for a dichotomous treatment and outcome, patients and decision makers must understand the risk with treatment and the risk without treatment (see Stovitz SD, Shrier I. Medical decision making and the importance of baseline risk. Br J Gen Pract 2013;63:795-7 for a more detailed discussion). The risk (not the risk ratio or risk difference) is preferable to the odds (not the odds ratio) because risks are more easily understood. I would suggest you work through the following example. Would you want early surgery based on the following?
The odds of a good outcome without treatment are 1:1, and the odds of a good outcome with treatment are 3:1. However, if you have surgery, the odds are also 1:9 that you will have scarring and permanent damage due to the surgery. If you delay surgery, the results are not as good, and the odds of a good outcome are 2:1.
How long did it take you to work through those numbers? And this is a site for statisticians and epidemiologists. Now imagine you are talking to a patient or decision maker who was never any good at math. Why not do your own study and try it with some of your family members?
Now try the same exercise with risks (probabilities).
The probability of a good outcome (the risk) is 50% without treatment and 75% with treatment. However, the risk of scarring and permanent damage due to surgery is 10%. If you wait to see whether you heal without surgery but don't, and therefore opt for late surgery, the probability of a good outcome is 67%.
From the risks, we can easily see that if surgery were routinely performed, we would be operating on the 50% of people who don't need it (because 50% do well without surgery). Of those who would benefit from early surgery (an additional 25 per 100), about 2/3 (17 of the 25 per 100) would have done just as well with late surgery. With a little more math that can be explained to patients (preferably with iconograms so they can visualize the differences), early surgery helps 8 more people per 100 (75/100 vs 67/100, after including those who heal without surgery and those who do well with late surgery). However, 10 per 100 will have scarring and permanent damage. Do you want the surgery?
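For anyone who wants to verify that the two versions of the example agree, here is a minimal sketch of the conversion and the trade-off arithmetic (the numbers are exactly those in the example above; odds of a:b correspond to a probability of a/(a+b)):

```python
# Convert the odds from the first version of the example into the risks
# (probabilities) used in the second version: odds of a:b -> a / (a + b).
def odds_to_prob(a: float, b: float) -> float:
    return a / (a + b)

p_no_treatment = odds_to_prob(1, 1)   # 1:1 -> 0.50
p_early        = odds_to_prob(3, 1)   # 3:1 -> 0.75
p_harm         = odds_to_prob(1, 9)   # 1:9 -> 0.10
p_late         = odds_to_prob(2, 1)   # 2:1 -> ~0.67

# The trade-off described in the text:
extra_benefit = p_early - p_late   # ~0.08 -> 8 more good outcomes per 100
print(f"Extra good outcomes, early vs late surgery: {extra_benefit:.2f}")
print(f"Risk of scarring/permanent damage: {p_harm:.2f}")
```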
Some will decide for early surgery and some against it, but that is not the point; you may even come up with the same answer using both methods. If you find it easier to use odds instead of probabilities, then you should use odds. If you find it easier to use risks/probabilities, you should use risks/probabilities. If you are communicating with someone else, you should use whatever they find easiest to understand. My own experience is that probabilities are understood much better by my patients than other measures.