Subhail: We’re approaching 500 comments just in this thread, and there are scores and scores of papers and book chapters to be read on the topics. Noncollapsibility spans some 70 years if we go back to Simpson, 120 if we go back to Yule, while causal modeling with potential outcomes goes back a century, and their confluence is around 40 years old.
On this thread I’ve reviewed and cited some fundamentals. I regret that this has not worked for you, and I’m afraid you will have to do more in-depth reading on your own if you actually want to understand what you are missing about causal models in general and noncollapsibility in particular (the same goes for Frank and Huw).
The oddest thing for me is this: basic causal models are simple tools, involving only high-school algebra; in my experience, entering students had much less trouble with them than with probability. Yet established professors who did not encounter causal models in their statistical training (i.e., most of those in my and Frank’s cohort) often never quite seem to grasp what the models are about or what they encode that ordinary probability (“statistical”) models miss.
It’s as if becoming immersed in probability models (which are full of brain twisters) can sometimes obstruct comprehension of causal models, while the probability-naive often find causal models far more intuitive than probability. I speculate the obstruction happens when a person becomes so habituated to thinking in terms of probability and randomized experiments that they can no longer think through a problem of aggregates without invoking probability or randomization, even when doing so only hinders solving the problem. This is one reason I advocate teaching causality as a foundation for probability and statistics (e.g., see Greenland S (2022). The causal foundations of applied probability and statistics. Ch. 31 in Probabilistic and Causal Inference: The Works of Judea Pearl, ACM Books no. 36, 605-624, arXiv:2011.02677). In any case, this is not the only example of blindness induced by statistical training; significance testing provides a far more common and damaging example (e.g., see McShane, B. B., and Gal, D. (2016). Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence. Management Science, 62, 1707–1718).
Of course Pearl has remarked on the cognitive obstruction of causality by statistics, and rather undiplomatically at times; but harsh exposition should not blind us to the reality of the phenomenon. Our anecdotal observations are, I think, supported by some work in experimental psychology; @AndersHuitfeldt has noted as much, I believe.
When properly combined with probability, causal models place what are often very severe restrictions on longitudinal probability distributions. Those restrictions enforce basic intuitions about how causality operates, starting with time order. Once that machinery is mastered, the noncollapsibility issue can be framed in terms of loss functions for comparing action choices, e.g., noncollapsible measures do not adequately track losses proportional to average risks (which arise in all public-health problems I’ve looked at).
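To make the noncollapsibility point concrete, here is a minimal numerical sketch (my own toy numbers, not taken from any of the cited papers): a binary covariate Z that is not a confounder, since treatment X is independent of Z, yet the marginal odds ratio falls below the common stratum-specific odds ratio, while the risk difference is identical within strata and marginally and so tracks the average risks directly.

```python
# Toy illustration of odds-ratio noncollapsibility without confounding.
# Hypothetical risks P(Y=1 | X, Z), chosen only for illustration.

def odds(p):
    return p / (1 - p)

risk = {                       # (x, z): risk
    (1, 1): 0.8, (0, 1): 0.5,  # stratum Z=1: OR = 4, RD = 0.3
    (1, 0): 0.5, (0, 0): 0.2,  # stratum Z=0: OR = 4, RD = 0.3
}
p_z1 = 0.5                     # P(Z=1), same in both treatment arms (no confounding)

for z in (1, 0):
    or_z = odds(risk[(1, z)]) / odds(risk[(0, z)])
    rd_z = risk[(1, z)] - risk[(0, z)]
    print(f"Z={z}: OR={or_z:.2f}, RD={rd_z:.2f}")

# Marginal (population-averaged) risks under each treatment choice
marg = {x: p_z1 * risk[(x, 1)] + (1 - p_z1) * risk[(x, 0)] for x in (1, 0)}
print(f"Marginal: OR={odds(marg[1]) / odds(marg[0]):.2f}, "
      f"RD={marg[1] - marg[0]:.2f}")

# Both stratum ORs are 4.00 but the marginal OR is about 3.45 (noncollapsible),
# even though Z is not a confounder; the risk difference is 0.30 everywhere
# (collapsible), matching the change in average risk that a loss proportional
# to average risk would register.
```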
With the decision and loss framing in mind, you might find it easier to approach these topics using the kind of decision-theoretic causal models used by Dawid and Didelez, which reproduce the randomization-identification results derived from PO models without the joint PO distribution that seems to trip up some statisticians. Some of the other causal models competing in recent decades are reviewed in detail in papers by Robins & Richardson cited in the Hernán & Robins book, notably their 2010 book chapter focusing on graphical formulations and their subsequent 2013 SWIG fusion of graphs and potential outcomes.