I have carefully read Frank's comment and your response, but first a clarification: you said that I keep introducing information that was not given in your example, followed by computation that is not performable under what was actually given. All of my computations come from your JCE data tabulated above, and all of them can be verified against that data. You used those data for the masks example, and I completed the analysis using all the data (not just what you reported in the post).
Coming back to my response: Frank has already responded, but I am adding my two cents since some of the queries were about Doi et al. First, I have to agree with Frank that both papers come across as saying "non-collapsibility is bad because it is bad". I say this because the main argument given is that lack of mathematical collapsibility is somehow a horrendous characteristic of an effect measure, and all the other arguments flow from there. JCE is aimed at clinicians, who are experts at inferential decision making, so this may not be an acceptable argument to them. I agree with Frank that non-collapsibility could actually be a desirable property of an effect measure if it means the measure retains a monotonic relationship with indices of discrimination for a binary outcome. That is why I brought in diagnostic tests and the AUC: both have a direct mathematical parallel to the OR that cannot be refuted.
In your response to Frank, you said that non-collapsibility is bad because it means the subgroup ORs can be arbitrarily far from every individual effect, even if all the individual effects are identical and there is no uncontrolled study bias (such as confounding). This interpretation is true, but it is not bad. It simply follows from the fact that, without accounting for outcome heterogeneity, discrimination for the outcome is at its lowest level, and so too is the OR; as we continue to account for heterogeneity, both improve. You will have to demonstrate to us that the latter is not the case to convince us otherwise.
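To make the arithmetic behind this concrete, here is a minimal sketch in Python (the stratum risks and the OR of 3 are made-up illustrative numbers, not the JCE or mask data): with an identical OR in every stratum and no confounding, the collapsed OR still sits below the stratum OR.

```python
def risk_from_or(p0, or_):
    """Treated risk implied by a control risk p0 and an odds ratio."""
    odds1 = or_ * p0 / (1 - p0)
    return odds1 / (1 + odds1)

# Two equal-sized strata (e.g. low-risk and high-risk patients) with an
# identical within-stratum OR of 3 and no confounding, by construction.
p0 = [0.2, 0.6]                        # control risks per stratum (assumed)
p1 = [risk_from_or(p, 3.0) for p in p0]

# Marginal (collapsed) risks are simple averages over the equal strata.
m0 = sum(p0) / 2
m1 = sum(p1) / 2
marginal_or = (m1 / (1 - m1)) / (m0 / (1 - m0))
print(f"stratum ORs: 3.0 and 3.0, collapsed OR: {marginal_or:.2f}")  # ~2.48
```

The collapsed OR of about 2.48 is attenuated purely by averaging over the baseline risks, which is exactly the mechanism under discussion.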
You said that, furthermore, this phenomenon (I assume you mean non-collapsibility) should be expected in practice because there is always unmeasured baseline risk variation (as shown by the fact that individual outcomes vary within our finest measured subgroupings). There is no disagreement here at all, and we are not saying otherwise; but, as Frank said, the only reason you then link this to a problem for the OR is that it is non-collapsible and therefore bad. This is the main point of contention: we do not know what is bad about non-collapsibility in this scenario, so please tell us. If you mean that non-collapsibility implies an ever-increasing OR that does not collapse as we account for sources of heterogeneity, then to us that is good, because we expect that to happen, as I demonstrated for your example on masks.
Then you say that the clinician can never supply a patient-specific OR estimate; she can only supply the OR estimate for the finest subgroup she has seen reported (which is rarely anything finer than sex plus some very crude age grouping like "above/below 60"). That is true, and there is no disagreement at all. The question is what the implication is, because you left it at that. Perhaps you mean that further consideration of heterogeneity of treatment effects will give us a different OR; we agree, but why is that bad?
You then say that you are thus forced to conclude that we think it not bad that a reported OR will always mislead the user when applied to individual decisions. No, this is not what we think. We think that it is not bad that an OR is non-collapsible, and this is where we differ. Why should a reported OR always mislead a user? If you are suggesting it misleads because it is non-collapsible, then you are back to the same point Frank mentioned: non-collapsibility is bad because it is bad. Why are larger subgroup ORs that do not collapse bad and misleading? I have demonstrated that they have to be larger if the factor is prognostic for the outcome, because discrimination is incrementally improved (see the sketch below). Is that wrong? If so, why?
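To illustrate what I mean by "larger because discrimination improves", here is a small simulation sketch; the coefficients, prevalences, and libraries are my assumptions, not anything taken from the papers. The conditional OR for treatment is identical in both strata by construction, and adjusting for the prognostic factor simultaneously recovers that larger OR and raises the model's AUC.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, n)          # treatment indicator
z = rng.integers(0, 2, n)          # prognostic factor, independent of x
# Identical conditional OR of 3 for x within both z strata, by construction.
p = 1 / (1 + np.exp(-(-1.5 + np.log(3) * x + 2.0 * z)))
y = rng.binomial(1, p)

for cols, label in [((x,), "x only"), ((x, z), "x + z")]:
    X = sm.add_constant(np.column_stack(cols))
    fit = sm.Logit(y, X).fit(disp=0)
    auc = roc_auc_score(y, fit.predict(X))
    print(f"{label}: OR for x = {np.exp(fit.params[1]):.2f}, AUC = {auc:.3f}")
```

The x-only model returns the attenuated marginal OR (about 2.4 here) together with the lower AUC; adding the prognostic factor moves the OR up toward 3 and the AUC up with it, which is the monotone link to discrimination we keep pointing to.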
Then you say that, mathematically, what is going on is that the OR is variation independent (logically) of the baseline risk, which is a technical advantage for model simplicity and fitting, but which also means that it conveys less information about individual effects than an RD. Why does the OR convey less information? If it is because it is non-collapsible, then we are back to the same point. My previous analysis of your data in the JCE paper (re-interpreted in terms of masks above) demonstrates that the OR conveys as much information as the RD, because both the overall and the male-specific OR could return both the RR and RD of interest. The reason we do not prefer the RR and RD is that their numerical values can change across subgroups with different baseline risks, independently of the treatment's impact on the outcome (Doi et al), and this does not happen with non-collapsible measures. As an example, the RR for males went lower even though the OR increased in the mask example.
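The conversion I keep referring to is just this (a minimal sketch; the baseline risks below are illustrative placeholders, not the mask numbers): given a subgroup's baseline risk, a single OR returns that subgroup's RR and RD, and the same fixed OR yields a falling RR as the baseline risk rises, which is the pattern seen for males.

```python
def risks_from_or(p0, or_):
    """Recover the treated risk, RR and RD implied by a (constant) OR
    and a subgroup's baseline (control) risk p0."""
    odds1 = or_ * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)
    return p1, p1 / p0, p1 - p0

# Illustrative baseline risks only; a fixed OR of 3 throughout.
for p0 in (0.05, 0.20, 0.60):
    p1, rr, rd = risks_from_or(p0, 3.0)
    print(f"p0={p0:.2f}: p1={p1:.2f}, RR={rr:.2f}, RD={rd:.2f}")
```

The OR stays at 3 in every row while the RR drops from about 2.7 to about 1.4, so nothing is lost: the OR plus the subgroup's baseline risk returns the subgroup's RR and RD whenever they are wanted.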
You say that the RD has exactly the opposite pairing: it is severely dependent on the baseline risk, which is a technical disadvantage for model simplicity and fitting, but which also means it conveys more information about individual effects (specifically, the individual RDs must average to the group RD). Again, you are back to the same point, which is that it is good because it is collapsible and therefore more informative because it must average to the group RD. We believe the opposite is true, and the mathematical evidence stacks up to this (Doi et al), unless you can point out a flaw in the mathematical expressions we report. One flaw you have suggested previously is the bias that McNutt suggested, but as Karp pointed out in a letter, there was no bias; we need to pair the right baseline risk with the right OR.
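For readers following along, the averaging property being claimed is simply this (illustrative numbers reused from the sketch above): with the stratifying factor balanced across arms, the size-weighted stratum RDs reproduce the crude RD exactly. That is what "collapsible" means here; our disagreement is over whether the property is informative, not whether it holds.

```python
w  = [0.5, 0.5]             # stratum weights (balanced across arms, assumed)
p0 = [0.2, 0.6]             # control risks per stratum (illustrative)
p1 = [0.428571, 0.818182]   # treated risks implied by a common OR of 3

stratum_rds  = [a - b for a, b in zip(p1, p0)]
weighted_avg = sum(wi * rd for wi, rd in zip(w, stratum_rds))

# Crude RD computed directly from the collapsed (marginal) risks:
crude_rd = sum(wi * a for wi, a in zip(w, p1)) - sum(wi * b for wi, b in zip(w, p0))
print(stratum_rds, weighted_avg, crude_rd)   # weighted average == crude RD
```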
Coming to the resolutions suggested:
people should stop touting the OR as some kind of magic special measure – we did not give it any special status in the papers; we just demonstrated that it gives the correct posterior probabilities because its numerical value is not impacted by baseline risk, and therefore it conveys direct information and is less misleading than the RR/RD. We did, however, also suggest converting to risks, but only for interpretation; it is a primary analysis on the risk scale that is misleading, because of the baseline risk problem.
for those who do want a measure that represents average individual effects, the OR simply will not suffice if the risk is not always "low" (in the sense explained in my papers) – this we disagree with: as pointed out above, the OR produces posterior odds that are accurate within subgroups. You say that a 10% deviation from the average is the maximum tolerable error, but this only applies if we are interested in a numerical interpretation of the OR as the RR, and that is not the case we are arguing for (see the sketch below).
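As a closing illustration of why that 10% figure is tied to the OR-as-RR reading (a minimal sketch with an assumed OR of 3; the risk grid is mine, not from either paper): the numerical gap between a fixed OR and the RR it implies grows with baseline risk, and it only matters if one insists on reading the OR numerically as an RR.

```python
# With the OR held fixed at an assumed 3, how far does it drift from the
# RR it implies as the baseline risk rises?
for p0 in (0.01, 0.05, 0.10, 0.30):
    odds1 = 3.0 * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)
    rr = p1 / p0
    print(f"p0={p0:.2f}: RR={rr:.2f}, OR vs RR deviation = {3.0 / rr - 1:+.0%}")
```

At a baseline risk of 0.05 the deviation is already the full 10%, and it reaches 60% by a baseline risk of 0.30; none of this is an error in the OR itself, only in reading it as an RR.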