Huw: This has been illustrated in many articles over the past 40 years, including Greenland, Robins & Pearl (Statistical Science, 1999) and most recently here: Noncollapsibility, confounding, and sparse-data bias. Part 2: What should researchers make of persistent controversies about the odds ratio? (ScienceDirect)
Please read them and their primary source citations.
See also Ch. 4 and 15 of Modern Epidemiology 2nd ed 1998 or 3rd ed 2008 for definitions of standardized measures.
Huw: I also strongly recommend that you (and Harrell and Doi) read thoroughly and in detail the book Causal Inference: What If by Hernán & Robins; the newest (Mar. 2023) edition is available as a free download here:
The text is searchable, so you can look up the book’s discussions of potential outcomes, counterfactuals, noncollapsibility, randomization, standardization etc.
Sander, this was a really good choice of reference: the book is clearly written, with all terminology explained at first use, and its first part defines all the key concepts in a clear, easy-to-understand style while clarifying what the ‘accepted’ definitions and positions are (for the causal-inference community).
In relation to odds ratios (page 44), however, they toe the same line and offer a personal opinion: “We do not consider effect modification on the odds ratio scale because the odds ratio is rarely, if ever, the parameter of interest for causal inference.” There is no further information that sheds light on why this position is taken, so I assume it is the noncollapsibility angle here too; but they also do not tell us why it is bad or why it is ‘logic not-respecting’.
I personally think that ignoring the fact that the RR is a likelihood ratio has perhaps created this situation, because LRs are collapsible only if we ignore the pair (pLR and nLR). Had they been considered as a pair, then we would be back to the OR. If we take your example data above, the outcome prevalence in the three groups is 15%, 50% and 85%, and the difference between true positives and false positives (aka the RD) is 0.1 because the prevalence is being ignored. If, however, we compute the expected r1 and r0 for the same prevalence, using the pLR for r1 and the nLR for r0, then the RD in the same example is noncollapsible, but this is being ignored. Why is it reasonable to ignore the variation dependence on prevalence?
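To make that computation concrete, here is a minimal sketch using the odds form of Bayes’ theorem. The pLR and nLR values are assumed for illustration (the example’s exact values are not restated here); only the three group prevalences are taken from above:

```python
def posterior_risk(prevalence, lr):
    """Convert a prior risk into a posterior risk via a likelihood ratio."""
    prior_odds = prevalence / (1 - prevalence)
    post_odds = prior_odds * lr
    return post_odds / (1 + post_odds)

pLR, nLR = 2.0, 0.5  # assumed values, for illustration only

for prev in (0.15, 0.50, 0.85):  # the three group prevalences above
    r1 = posterior_risk(prev, pLR)  # expected risk given a 'positive'
    r0 = posterior_risk(prev, nLR)  # expected risk given a 'negative'
    print(f"prevalence={prev:.2f}  r1={r1:.3f}  r0={r0:.3f}  RD={r1 - r0:.3f}")
```

With these assumed LRs the implied RD is about 0.18 at 15% and 85% prevalence but 0.33 at 50%: the RD implied by the (pLR, nLR) pair varies with prevalence, which is the dependence I am pointing to.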
Suhail: We’re approaching 500 comments just in this thread, and there are scores and scores of papers and book chapters to be read on the topics. Noncollapsibility spans some 70 years if we go back to Simpson, 120 if we go back to Yule, while causal modeling with potential outcomes goes back a century, and their confluence is around 40 years old.
On this thread I’ve reviewed and cited some fundamentals. I regret that has not worked for you, and am afraid you will have to do more in-depth reading on your own if you actually want to understand what you are missing about causal models in general and noncollapsibility in particular (same for Frank and Huw).
The oddest thing for me is this: Basic causal models are simple tools, involving only high-school algebra - in my experience I found that entering students had much less trouble with them than with probability. Yet established professors who did not encounter causal models in their statistical training (i.e., most of those in my and Frank’s cohort) often seem to never quite get what the models are about or what they encode that ordinary probability (“statistical”) models miss.
It’s as if becoming immersed in probability models (which are full of brain twisters) can sometimes obstruct comprehension of causal models, while the probability-naive often find causal models far more intuitive than probability. I speculate the obstruction happens when a person becomes so habituated to thinking in terms of probability and randomized experiments that they can no longer think through a problem of aggregates without invoking probability or randomization, even when doing so only hinders solving the problem. This is one reason I advocate teaching causality as a foundation for probability and statistics (e.g., see Greenland S (2022). The causal foundations of applied probability and statistics. Ch. 31 in Probabilistic and Causal Inference: The Works of Judea Pearl. ACM Books, no. 36, 605-624; arXiv:2011.02677). In any case, it is not the only example of blindness induced by statistical training; significance testing provides a far more common and damaging example (e.g., see McShane BB, Gal D (2016). Blinding us to the obvious? The effect of statistical training on the evaluation of evidence. Management Science, 62, 1707–1718).
Of course Pearl has remarked about the cognitive obstruction of causality by statistics, and rather undiplomatically at times; but harsh exposition should not blind us to the reality of the phenomenon. Our anecdotal observations are I think supported by some work in experimental psychology; @AndersHuitfeldt has noted as much, I believe.
When properly combined with probability, causal models place what are often very severe restrictions on longitudinal probability distributions. Those restrictions enforce basic intuitions about how causality operates, starting with time order. Once that machinery is mastered, the noncollapsibility issue can be framed in terms of loss functions for comparing action choices, e.g., noncollapsible measures do not adequately track losses proportional to average risks (which arise in all public-health problems I’ve looked at).
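As a minimal numeric sketch of that loss framing (all numbers invented): take two equal-sized strata with identical stratum-specific odds ratios and no confounding. The collapsible RD on the margin directly gives the expected cases prevented when loss is proportional to average risk, while the marginal OR cannot be recovered from the stratum ORs by any weighting:

```python
def odds(r):
    return r / (1 - r)

# Invented example: stratum -> (risk if treated, risk if untreated)
risks = {
    "stratum 1": (0.80, 0.50),
    "stratum 2": (0.50, 0.20),
}

for name, (r1, r0) in risks.items():
    print(f"{name}: RD = {r1 - r0:.2f}, OR = {odds(r1) / odds(r0):.2f}")

# Marginal risks under treat-all vs treat-none (equal stratum sizes):
m1 = sum(r1 for r1, _ in risks.values()) / len(risks)
m0 = sum(r0 for _, r0 in risks.values()) / len(risks)
print(f"marginal : RD = {m1 - m0:.2f}, OR = {odds(m1) / odds(m0):.2f}")

# Loss proportional to average risk: expected cases prevented by
# treating everyone in a population of N is N * marginal RD.
N = 1000
print(f"expected cases prevented out of {N}: {N * (m1 - m0):.0f}")
```

Both stratum ORs are 4.0 yet the marginal OR is about 3.45 despite there being no confounding, so no weighted average of the stratum ORs reproduces it; the RD collapses exactly (0.30 everywhere) and converts directly into expected cases prevented.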
With the decision and loss framing in mind you might find it easier to approach the topics using the kind of decision-theoretic causal models used by Dawid and Didelez, which reproduce the randomization-identification results derived from PO models without the joint PO distribution that seems to trip up some statisticians. Some of the other causal models competing in recent decades are reviewed in detail in papers by Robins & Richardson cited in the Hernán & Robins book, notably their 2010 book chapter that focuses on graphical formulations, and their subsequent 2013 SWIG fusion of graphs and potential outcomes.
Hi again Sander. Thank you for your advice on reading material, which I gratefully accept and have started reading. The diagnostic process is of course causal inference in the medical context and goes back thousands of years. Modern causal inference in medicine is largely based on concepts of positive and negative feedback mechanisms (the latter described as homeostasis and self-repair) that can be represented by quantitative cyclical graphs as well as DAGs. The result of causal inference during the diagnostic process is a ‘diagnosis’. A diagnosis is therefore the title to a series of facts, assumptions and reasoning that form a hypothesis, to be tested with a medical test or treatment in an individual and with RCTs or other experiments on groups.
Randomisation can thus be regarded as an attempt to emulate inaccessible counterfactual situations (in the current absence of a time machine that allows someone to go back in time and ‘do’ a counterfactual act) and to test hypotheses created by causal inference. In order to diagnose and treat ethically, I have to listen carefully to patients, scientists, statisticians, those studying causal inference in general, and everyone else who can contribute to my aims for a patient. I cannot thank @f2harrell enough for setting up this site to encourage this multi-disciplinary process by allowing us to synchronise concepts from different disciplines through questions and answers.
I have done searches on this very well written book ‘Causal Inference: What If’ by Hernán and Robins but could not find any reference to the special conditions that allow collapsibility to be present for OR, RR etc (some of which were described by me in an earlier post 478 [ Should one derive risk difference from the odds ratio? - #478 by HuwLlewelyn ]). Can you @Sander or other readers point me to the place in ‘Causal Inference: What If’ (or in another source about causal inference) where such special conditions for collapsibility are specified or discussed?
I have referred many times in this thread and others to this article of ours (section 2.1 and the appendix) that popularizes collapsibility for clinicians drawing on decades of work by @Sander and others. In there, we cite key papers (also previously linked to in this thread and forum) that answer your question such as this, this, and this.
This quote is a keeper.
Huw: If by “special conditions” you mean necessary and/or sufficient conditions for collapsibility, the math-stat literature on odds ratios goes back at least to the work of Yvonne Bishop and colleagues in the 1960s, which was amended and extended by Alice Whittemore in the 1970s. In the 1980s-1990s odds-ratio results were extended to other measures by many authors. These extensions could be seen intuitively by replacing the survivor denominator of odds by a person or person-time denominator (those denominators being of focal interest when losses are proportional to average risks or average hazard rates, respectively). There are also parallel results for graphical models (e.g., see Greenland S, Pearl J (2011). Adjustments and their consequences – collapsibility analysis using graphical models. International Statistical Review, 79 (3), 401-426).
In teaching I found that the intuitions behind these results were obscured by expressing them in terms of probability instead of counts and graphs - in line with Gigerenzer’s experimental-psych results on the superiority of counts to probability for teaching frequency behavior, DeFinetti’s advocacy of expectations (previsions) over probability for teaching foundations of coherent decision making (betting), and advocacy by others of graphs for teaching causal thinking.
The problem I see in blogs such as this one (and in more than a few papers) is that discussions like the present one often take place in a scholarly vacuum, reinventing wheels, and often square ones at that - rather ironic given what is now at our fingertips. To address that with a historical start on collapsibility and causality see the citations in GRP 1999 and GP 2011 given above; for work in the present century see the citations Pavlos kindly provided (thanks Pavlos!) and those in my 2021 JCE articles. I also recommend Ch. 6 of Pearl’s book Causality. To get a sense of the extent of the general literature on collapsibility, I suggest searching JSTOR using the keyword “collapsibility”, which when I tried it alone led to over 27,000 links, and then again using “noncollapsibility”, which alone led to a more manageable but still formidable 77 links to cull through.
Thank you again @Sander and @Pavlos_Msaouel for further reading suggestions. Perhaps I should be more specific with my question. My understanding is that
(a) marginal RRs will be collapsible if p(Z=1|Y=1∩X=0) = p(Z=1|Y=1∩X=1)
(b) RRs will be collapsible if and only if p(Z=1|Y=1∩X=0) = p(Z=1|Y=1∩X=1) and p(Z=1|X=0) = p(Z=1|X=1)
(c) ORs will be collapsible if and only if p(Z=1|Y=1∩X=0) = p(Z=1|Y=1∩X=1) and p(Z=1|Y=0∩X=0) = p(Z=1|Y=0∩X=1)
Because the condition p(Z=1|Y=1∩X=0) = p(Z=1|Y=1∩X=1) is common to (a), (b) and (c), the marginal RRs will be collapsible under (a), (b) and (c). I discuss this in detail in my preprint https://arxiv.org/ftp/arxiv/papers/1808/1808.09169.pdf
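For anyone who wishes to experiment with these criteria, here is a minimal sketch (invented numbers) that constructs a joint distribution satisfying criterion (c), i.e. Z independent of X within each level of Y, and confirms that the X-Y odds ratio then collapses over Z:

```python
from itertools import product

p_x = {0: 0.5, 1: 0.5}          # p(X=x)
p_y_given_x = {0: 0.3, 1: 0.6}  # p(Y=1 | X=x)
p_z_given_y = {0: 0.4, 1: 0.7}  # p(Z=1 | Y=y), same for X=0 and X=1 -> criterion (c)

# Joint probabilities p(X=x, Y=y, Z=z):
joint = {}
for x, y, z in product((0, 1), repeat=3):
    py = p_y_given_x[x] if y == 1 else 1 - p_y_given_x[x]
    pz = p_z_given_y[y] if z == 1 else 1 - p_z_given_y[y]
    joint[x, y, z] = p_x[x] * py * pz

def odds_ratio(cell):
    """X-Y odds ratio from a function cell(x, y) giving cell mass."""
    return (cell(1, 1) * cell(0, 0)) / (cell(1, 0) * cell(0, 1))

for z in (0, 1):
    or_z = odds_ratio(lambda x, y: joint[x, y, z])
    print(f"OR in stratum Z={z}: {or_z:.2f}")

or_marg = odds_ratio(lambda x, y: joint[x, y, 0] + joint[x, y, 1])
print(f"marginal OR:       {or_marg:.2f}")
```

Here the OR is 3.5 in both Z strata and marginally, as (c) predicts; checking the risks in the same way shows the stratum RRs differ (about 1.69 and 2.43 against a marginal RR of 2.0), so (c) alone does not make the RR strictly collapsible.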
My specific question is: If the above criteria are well known, what citation should I add to my list of references in this pre-print?
I suggest you focus on the notion of the characteristic collapsibility function (CCF) carefully defined here.
But criteria (a), (b) and (c) are not mentioned explicitly in that paper. I am after a reference that does so.
You are using an associational definition of collapsibility, which is not a definition that captures what people care about when they talk about collapsibility. With a proper counterfactual definition of collapsibility, such as the one in “On the collapsibility of measures of effect in the counterfactual causal framework” (Emerging Themes in Epidemiology), the RD and the RR are ALWAYS collapsible regardless of the data, while the odds ratio is only collapsible if (A) treatment has no effect in any stratum, or (B) the baseline risk is stable between strata and the odds ratio is also stable between strata.
Associational definitions such as the one you are relying on are less interesting, because such definitions conflate collapsibility with confounding. Any results you can prove based on an associational definition are unlikely to be relevant to this discussion. If you insist on going in this direction, I suggest you read the relevant chapters in the textbook “Biostatistical Methods in Epidemiology” by Newman; if I remember correctly this book has results closely related to your criteria.
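To illustrate the counterfactual definition numerically (stratum sizes and counterfactual risks invented for the sketch): the marginal RR is always recoverable as a weighted average of the stratum RRs, with weights proportional to stratum size times baseline risk, whereas the marginal OR here falls outside the range of the stratum ORs, so no weighting can recover it:

```python
def odds(r):
    return r / (1 - r)

# Invented strata: (size n, risk if treated r1, risk if untreated r0)
strata = [
    (300, 0.60, 0.30),
    (700, 0.20, 0.05),
]

total = sum(n for n, _, _ in strata)
m1 = sum(n * r1 for n, r1, _ in strata) / total  # marginal risk, all treated
m0 = sum(n * r0 for n, _, r0 in strata) / total  # marginal risk, none treated

# RR: weights proportional to n * r0 (expected untreated cases) recover
# the marginal RR exactly from the stratum RRs.
weights = [n * r0 for n, _, r0 in strata]
rr_avg = sum(w * r1 / r0 for w, (_, r1, r0) in zip(weights, strata)) / sum(weights)
print(f"marginal RR = {m1 / m0:.2f}, weighted stratum RR = {rr_avg:.2f}")

# OR: the marginal OR lies below both stratum ORs.
stratum_ors = [round(odds(r1) / odds(r0), 2) for _, r1, r0 in strata]
print(f"stratum ORs = {stratum_ors}, marginal OR = {odds(m1) / odds(m0):.2f}")
```

This prints a marginal RR of 2.56 matching the weighted stratum RR, while the stratum ORs are 3.50 and 4.75 against a marginal OR of about 3.29, i.e. outside their range.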
The comments by Pavlos and also my 2021 JCE papers gave general modern cites, notably Daniel et al. and Didelez & Stensrud. The earliest correct necessary and sufficient result for odds ratios I know of is Whittemore 1978; an extension of her results to risk ratios (“relative risks”) and other measures was described by Shapiro 1982 http://www.jstor.org/stable/2684092?origin=JSTOR-pdf
although earlier authors had noted the extension exists, and Neyman’s former student Myra Samuels (1981, cited in my second 2021 JCE paper) had noted that the noncollapsibility problem for odds was a consequence of Jensen’s inequality. I suggest you look up these and other cited papers and read them to find citable results.
Thank you for that possible reference. I must emphasise that (a) is a necessary criterion for strict collapsibility of the MRR, RR and OR in a three-way table of the type in @Sander’s Part 2 paper and in recent discussions here. I think that criterion (b) is definitive of RR collapsibility and (c) is definitive of OR collapsibility in a three-way table, but I would be interested to see a counter-example. There may be other definitions of the term ‘collapsibility’ of course.
Huw, you are simply mistaking sufficient conditions for necessary ones. As Whittemore 1978 showed with counterexamples for odds ratios, (a) is only necessary in the two-stratum case. Counterexamples for risk and rate ratios follow immediately by altering her examples so that the outcome-absent distribution becomes the marginal exposure (treatment) distribution among persons and person-time. I think you need to read more before offering conclusions.
Thank you again @Sander for those references. I only intended that these criteria should apply to a two-stratum case. I suggest that (a) is necessary and sufficient (and therefore definitive) for collapsibility of the MRR, but only necessary for the OR and RR. I use this property in my preprint. (b) is necessary and sufficient (and therefore definitive) for collapsibility of the RR, and (c) is necessary and sufficient (and therefore definitive) for collapsibility of the OR. However, I only suggest this for the type of three-way table with two strata as described in your table in ‘Part 2’ [ Noncollapsibility, confounding, and sparse-data bias. Part 2: What should researchers make of persistent controversies about the odds ratio? ]. I find these criteria useful because they make it easy to create tables that illustrate collapsibility of the OR, RR and MRR.
Hi Sander, while I appreciate the length of this thread, and that this concept spans many decades, I would like to assure you that neither Frank nor I are here trying to further our careers (we are done with that) or acting on any such motive. We see an anomaly, we cannot understand why it should be given the consideration it has been given, and we are asking the questions that editors of journals are afraid to entertain (because they themselves do not understand this and are afraid to question the status quo). I note that my calculations and probability concepts have gone both unanswered and unquestioned – if there is a fault, why not point this out mathematically?
You are right, the fundamentals have not worked for me because, from my point of view, all I can see in the literature is that noncollapsibility is bad because it is bad. As a physician who attempts to take responsibility for his decisions, I cannot simply accept that. As you are aware, all individual decision making in healthcare emerges from studies of populations; making decisions from evidence generated by clear methods is a fundamental role of the physician, and I cannot outsource that. I really do not think this is a matter that can be resolved by in-depth reading because, as you have said, this is in itself a very simple concept. Some people call it ‘logic not-respecting’ when the OR does not lie within the range of its substratum values (in the absence of confounding and effect modification), and noncollapsibility when there is no set of weights that can deliver the marginal from the stratified values. This issue is not related to understanding or not understanding causal models; we first need to establish why the behavior mentioned above for the OR is bad before building a whole discipline of study on avoiding it – which is what causal inference groups have done.
There are no probability brain twisters in my examples – I just asked a simple question for which there has been no answer: Is the RR actually a likelihood ratio and therefore, by extension, is the issue of noncollapsibility moot? Is this really too much to ask? I have done the math and posted it. Where is the flaw? What is wrong with it? Why are we coming back to causal models without answering this question? Why should we divorce probability and randomized experiments from effect measures and their behavior? I am sure saying a whole lot of things about professions afflicted by blindness induced by statistical training is much more difficult for you than examining the computations I presented and refuting them.
Yes, Pearl has said many harsh things about decision makers in Medicine, but, with thousands of citations to his papers, he has not had even one single real-life application of his life’s work in Medicine – please correct me if I am wrong. If a method fails to get uptake, it is not always the target user that is at fault; perhaps these are just a series of complex analyses on a wrong framework, as we saw in the ‘individual response’ thread.
I am happy that Huw is examining the concepts critically in this thread and am hopeful we can get more insights after he examines my calculations.
Suhail: Fortunately, for practical purposes our entire argument is easily laid to rest by simply agreeing that in typical settings a patient and physician should want to see the whole potential-outcome function under different treatment and behavioral options, a view which I thought Frank agreed to long ago.
I should probably stop there. But since you keep asking about the math, I’ll give you my opinion that, as far as I can tell you are lost in your math, essentially having made a mental prison out of it (as is common in statistics, seen up to the highest levels). This blinds you to the objections to noncollapsible measures. Those objections are not fundamentally mathematical, and the objections to your writings are not that you have made a math mistake. The objections are more primary, having to do with the need to recognize and establish real-world goals with utilities or loss functions before diving into math, and the causal thinking and modeling needed to account for the causal elements in all rational decision making.
Many recognize the shortcomings of Pearl’s writings in not offering realistic illustrations of the issues, which I find forgivable because Pearl is a computer-science researcher (an award-winning one at that) working with AI systems that are very simple compared to any in our bodies (although they are increasingly complex and taking over many domains from humans). Despite that simplicity, his messages remain valid: After all, just because someone explains arithmetic with only simple 2+2=4 examples does not mean number theory should be ignored in encryption concerns; likewise, just because someone explains causality with only simple examples does not mean causal modeling should be ignored in decision theory.
Regardless, there have been decades of NIH-supported research with real medical applications of causality theory in work by those like Robins (who is an MD) and colleagues like Hernán (also an MD), especially on longitudinal treatment regimes. Other medical research teams have been taking up their ideas and methods in real, complex medical problems, as I think Pavlos can attest. You might note again that Anders is also an MD, so hardly naive about medical complexities.
Due to decades of experience, however, I may be more exasperated with your resistance to stepping outside your framework, and how you seem committed to defend your framework as sufficient for understanding what we are writing about. I have encountered many others who seemed to be similarly “cognitively locked” in a rigid traditional framework, one that ossified before explicit formal causal methodology for medical problems emerged.
In light of that experience I can only repeat my view that breaking out of that framework requires intuitive nonmathematical understanding of issues that several of us here and many elsewhere have been trying to get across for decades. Again, it is telling that those not habituated into “statistical thinking” can understand these issues relatively quickly, at least if the problems are explained in terms of mechanisms and counting of events rather than with probabilities (which are, in the end, derived mental constructs of “propensities” and “bets” based on observed and expected count frequencies).
The RR is a likelihood ratio for a diagnostic test (a highly hypothetical one, that would never be used in practice). This is a trivial observation, and it has absolutely no bearing on the collapsibility discussion, nor on the broader discussion about how to choose between measures of effect. You are acting as if you have proven that because RR is a “likelihood ratio” then it cannot possibly be a measure of effect. This does not follow.
As far as I can tell, your argument is the following syllogism:
Premise A: The risk ratio is a likelihood ratio
Premise B: Likelihood ratios are not measures of effect
Conclusion: Risk ratios are not measures of effect.
Premise A is true. Premise B is false. The conclusion is false, and the argument is not sound.
Sander, my post is not meant to be a complaint - I should clear the air that my colleagues and I have learnt a lot from your papers over the years, and Medicine has benefited from your insights. But on noncollapsibility I stand firmly with Frank and other like-minded clinicians, biostatisticians and epidemiologists, because math and probability do not concur with noncollapsibility being bad; what I meant by finding fault with the math is a math/probability error, not an opinion. Let’s see what Huw finds as an independent observer with neutral views (as of now).