Great points. I find @f2harrell’s approaches to be elegantly practical and well thought out. My statistical methodology work has been tremendously influenced by his philosophy, including an emphasis on risk modification in clinical decision-making that we have incorporated into our utility-based framework. However, there are aspects of research into what we can perhaps call effect modification where non-collapsibility is problematic. The table discussed above makes that point clear, and of course there are challenges with transportability, meta-analysis, etc. It helps to discuss and be mindful of these issues.
I agree with you and indeed this has been my struggle too over the years! Coming to your last point, perhaps causal inference experts are trying to imagine theoretical ideas of mechanisms for the natural world rather than infer those mechanisms from observation of the natural world. The reason the latter is very difficult to do is that the mechanisms in the natural world are obscured by random and systematic error and thus we end up with this discussion since we are unsure whether probabilities in the real world follow a particular pattern. This is not new and Ronald Fisher has said “it has often been imagined that probabilities a priori can be set up axiomatically [i.e. in a way that seems obviously true]; this has led to many inconsistencies and paradoxes…”
Any regression model is a claim that the natural world follows certain patterns. The saturated, non-parametric analysis will be much too high dimensional, so statisticians use models as a dimension-reduction strategy. Dimension-reduction will require assumptions about how the natural world works. Those assumptions can be good or bad. If they are bad, the model will lead to bad predictions.
The assumptions are always there, whether you acknowledge them or not. By using a logistic model, you are choosing to analyze the data as if you have biological reasons for believing that the odds ratio is stable between strata. If that is true, you will get the right results. If it is not true, you will get the wrong results. The assumption is not empirically tested by the estimation procedure, which just gives you the best fit given the assumption that the odds ratio is stable.
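A minimal simulation sketch may make this concrete (my own hypothetical numbers and Python/statsmodels code, not anything from the thread): the data-generating mechanism below has a constant risk difference across strata, so the true odds ratio necessarily differs between strata, yet a main-effects-only logistic model still reports a single odds ratio, because constancy of the conditional OR is built into that model form rather than tested by it.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical mechanism: constant risk difference of 0.20 across strata,
# which implies the true odds ratio differs between strata.
rng = np.random.default_rng(0)
n = 200_000
stratum = rng.binomial(1, 0.5, n)              # high- vs low-risk stratum
treat = rng.binomial(1, 0.5, n)                # randomized treatment
risk = 0.10 + 0.30 * stratum + 0.20 * treat    # constant RD of 0.20
y = rng.binomial(1, risk)

df = pd.DataFrame({"y": y, "treat": treat, "stratum": stratum})
fit = smf.logit("y ~ treat + stratum", data=df).fit(disp=0)
print("single fitted OR for treatment:", round(float(np.exp(fit.params["treat"])), 2))

# The true stratum-specific ORs implied by the risks above differ:
for s, p0 in [(0, 0.10), (1, 0.40)]:
    p1 = p0 + 0.20
    print(f"true OR in stratum {s}:", round((p1/(1-p1)) / (p0/(1-p0)), 2))
```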
I agree that finding the correct mechanisms is difficult (often impossible), and may not even lead to stability of any identifiable effect measure. What I don’t agree with is your implicit next step: That therefore, we should just give up, and proceed as if we knew that the true biological mechanism is one that leads to stability of the odds ratio. You don’t have a solution to the problem that the mechanism is unknowable, you just want to sweep the problem under the rug and pretend you know that the true mechanism leads to stability of the odds ratio.
Finally, let me point out something that may or may not be obvious: Suppose a decision maker is choosing between option A and option B. Any regression model that is used to inform the decision by providing a prediction for what happens under option A, and a prediction for what happens under option B, is being used as a causal model. It will just be a hopelessly bad causal model unless it is constructed in accordance with principles of causal inference and in accordance with your best understanding of the mechanism/data generating function. You don’t get to use a model for decision making and then claim you aren’t doing causal inference and therefore don’t need to worry about causality or mechanisms.
Great thoughts. Our understanding of causal mechanisms has and likely always will be subjective. But we can still make plausible causal assumptions. As a medical example, take Figure 3 from our article here. It is pretty clear from our current understanding of oncology that it is not the “number of metastases” that cause the kidney cancer histology. It is the opposite: more aggressive histologies are more likely to produce metastases.
The quote by Fisher has wisdom in warning us about the challenges of prior elicitation and use. This is in many ways orthogonal to the current discussion. I use Bayesian models (priors and hyperpriors) and utilities in many of my apps (the ones that are actually most congruent with @f2harrell’s philosophy) and find them to be quite efficient and valuable. Relevant to the current discussion, the null hypothesis often used to generate the p-values that Fisher popularized is rarely true in nature. Even more so, the more global null hypothesis assumption used during multiplicity correction (e.g., when looking at multiple gene pathways) is extremely implausible. But p-values and their multiplicity corrections have been successfully used in many scientific applications.
I acknowledge that non-collapsibility is even more unnatural (to the point of appearing more consistent with magic than with the natural world). This is why some people have such a viscerally strong reaction against non-collapsible metrics. The fact that it is a mathematical property of certain metrics we have found to be useful does not make non-collapsibility a property of nature. A good parallel is infinity. Infinity comes up in the formulas of many valuable equations successfully used in the natural sciences. But as far as we can tell to date, infinity does not actually exist in nature. Every time we get infinity in a calculation we know that the story is incomplete.
What about non-collapsibility is bad?
On the broad discussion, first of all I want to say that this is a truly excellent discussion that we should tweet out from time to time. Second, I want to return to the pragmatic approach, starting with a few background observations:
- If there is only one variable in need of consideration and that variable represents mutually exclusive categories, all link functions (logit, probit, etc.) will yield the same estimated category probabilities (see the sketch after this list).
- If there is only one variable and it is continuous, fitting the variable flexibly (e.g., using a cubic spline with 4-5 knots) will result in virtually the same estimated probabilities no matter what the link function.
- Things get more interesting when there is > 1 predictor.
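To illustrate the first bullet, here is a small sketch (hypothetical data; Python with statsmodels as an assumed toolset, and the capitalized link classes of recent statsmodels versions): with a single categorical predictor the model is saturated, so every link reproduces the observed category proportions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: one categorical predictor with 3 mutually exclusive levels
rng = np.random.default_rng(1)
x = rng.integers(0, 3, 5000)
y = rng.binomial(1, np.array([0.15, 0.40, 0.70])[x])
df = pd.DataFrame({"y": y, "x": x})

links = {"logit": sm.families.links.Logit(),
         "probit": sm.families.links.Probit(),
         "cloglog": sm.families.links.CLogLog()}

for name, link in links.items():
    fit = smf.glm("y ~ C(x)", data=df,
                  family=sm.families.Binomial(link=link)).fit()
    # Saturated in x, so fitted probabilities equal observed proportions for every link
    print(name, np.sort(np.unique(np.round(fit.fittedvalues, 4))))

print("observed:", df.groupby("x")["y"].mean().round(4).values)
```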
For the > 1 predictor case I think pragmatically about what is a good default starting point. I don’t think for a minute that a default approach will work for all situations/datasets. Data analysis is hard and prior knowledge about data patterns is not frequently available. So we need help. If a particular approach worked best in 8 out of 10 representative datasets I would continue to use it in many cases as a default approach. So what do we mean by “worked best?” Here is the way I think about that. Ignoring nonlinearity for the moment, I would choose 10 moderately large “representative” datasets and four or more link functions such as the identity function (additive risk model), logit (logistic model), log probability (relative risk model), and probit. For each dataset and each link function compute A = likelihood ratio \chi^2 statistic for a model containing all the predictors and all their two-way interactions, and B = the same but omitting all interactions. Compute the ratio B/A, the so-called “adequacy index” in the maximum likelihood chapter of Regression Modeling Strategies. B/A is the relative extent to which interaction terms were needed to explain outcome variation (high values indicating less need for interactions). Display B/A ratios for all datasets and all link functions. See what you learn. I postulate that link functions that carry no restrictions, i.e., logit and probit, will fare the best most of the time. I seek models in which additivity tends to explain the great bulk of outcome variation.
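A sketch of what that computation might look like for a single dataset (hypothetical data and Python/statsmodels code on my part; in practice this would be looped over the 10 representative datasets and the full set of links):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical dataset with three predictors and a modest x1*x2 interaction
rng = np.random.default_rng(2)
n = 20_000
x1, x2 = rng.binomial(1, 0.5, n), rng.binomial(1, 0.4, n)
x3 = rng.normal(size=n)
lp = -1.0 + 0.8*x1 + 0.5*x2 + 0.6*x3 + 0.4*x1*x2
df = pd.DataFrame({"y": rng.binomial(1, 1/(1 + np.exp(-lp))),
                   "x1": x1, "x2": x2, "x3": x3})

for name, link in {"logit": sm.families.links.Logit(),
                   "probit": sm.families.links.Probit()}.items():
    fam = sm.families.Binomial(link=link)
    null = smf.glm("y ~ 1", df, family=fam).fit()
    main = smf.glm("y ~ x1 + x2 + x3", df, family=fam).fit()        # main effects only
    full = smf.glm("y ~ (x1 + x2 + x3)**2", df, family=fam).fit()   # + all two-way interactions
    A = 2 * (full.llf - null.llf)   # LR chi-square, predictors + two-way interactions
    B = 2 * (main.llf - null.llf)   # LR chi-square, predictors only
    print(f"{name}: A = {A:.1f}, B = {B:.1f}, adequacy index B/A = {B/A:.3f}")
```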
Then, whatever link function you prefer, state estimates and uncertainty/compatibility intervals for the parameters and then use the model to estimate absolute risk reductions and relative risks when one predictor is changed and the rest are held at representative values.
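And a sketch of that last step (again with hypothetical data and variable names of my choosing), reporting the absolute risk difference and relative risk for a change in one predictor with the others held at representative values:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: a binary treatment and one continuous covariate
rng = np.random.default_rng(3)
n = 20_000
age = rng.normal(65, 10, n)
treat = rng.binomial(1, 0.5, n)
p = 1 / (1 + np.exp(-(-6 + 0.07*age + 0.9*treat)))
df = pd.DataFrame({"y": rng.binomial(1, p), "age": age, "treat": treat})

fit = smf.logit("y ~ age + treat", data=df).fit(disp=0)

# Toggle the treatment while holding age at a representative value (its median)
ref = pd.DataFrame({"age": [df["age"].median()] * 2, "treat": [0, 1]})
risk = fit.predict(ref)

print(f"risk(treat=0) = {risk[0]:.3f}, risk(treat=1) = {risk[1]:.3f}")
print(f"absolute risk difference = {risk[1] - risk[0]:.3f}, relative risk = {risk[1]/risk[0]:.2f}")
```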
This proposal is very similar to what @Sander wrote in his 1979 paper on logistic regression limitations:
> A general paradigm can be derived … The necessity of adding complex terms (eg. product terms) to a multiplicative (additive) statistical model in order to achieve a “good fit” may indicate that the exposures under study do not influence the disease risk through a process following the multiplicative (additive) biological model; the form of the best fitting, parsimonious statistical model may in turn suggest a form of the biological model in action… Results of such generalized model searches may be combined with physiologic observations and data from animal experiments to test old and generate new models…
I also don’t understand this preference for “Marginal” models over conditional ones. Quoting directly from the Nelder paper I mentioned above:
> An advantage of a random effects model is they allow conditional inferences in addition to marginal inferences. … The conditional model is a basic model, which leads to a specific marginal model.
Anders, there is a point needing correction, or at least much greater precision, here: You should say “By using a main-effects-only (first-order) logistic model…”. Remember that any good modeler with enough data will consider product terms (“interactions” or second-order terms) that allow the odds ratio to vary with covariates, and may also investigate which link function provides the simplest linear term, as Frank discusses in his comment. Then too, with our modern computing capabilities we can use “machine learners” that start with a highly saturated model and penalize for complexity in a manner that need not shrink toward constancy of the odds ratio.
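As one concrete (and entirely hypothetical) illustration of that last point, a ridge-penalized logistic regression fit to a design matrix expanded with all two-way product terms shrinks the product-term coefficients toward zero but does not force them to zero, so the fitted odds ratios remain free to vary across covariate patterns. A sketch using scikit-learn (my choice of tool, not anyone’s stated workflow):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegressionCV

# Hypothetical data with a true product-term effect between the first two covariates
rng = np.random.default_rng(4)
n = 5_000
X = rng.normal(size=(n, 4))
lp = -0.5 + X @ np.array([0.8, 0.4, 0.0, 0.3]) + 0.5 * X[:, 0] * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-lp)))

# Expand to main effects plus all two-way product terms (a "highly saturated" start)
expand = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X2 = expand.fit_transform(X)

# Ridge (L2) penalty tuned by cross-validation: complexity is penalized,
# but product-term coefficients are shrunken rather than forced to zero
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=5000).fit(X2, y)
for term, coef in zip(expand.get_feature_names_out(), model.coef_[0]):
    print(f"{term}: {coef:.3f}")
```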
We need more attention to the distinction between the statistical task of smoothing or data stabilization (which arises because we have noisy, finite data) and the broader task of causal explanation, which is part of the search for patterns stable enough to generalize to settings beyond the one that generated the data under analysis; such generalization is what decisions demand (we make decisions for patients who were not in the study). Causal considerations are important for both tasks but become central for decisions.
I agree with this, I definitely should have been clearer. I am not sure this is central to the point I was trying to make, but the sentence on its own should have been clearer.
I prefer to think of the need for interaction terms as arising from the fact that you can often make a more plausible argument for conditional effect measure homogeneity (conditional on some subset of covariates in the model) than for complete effect measure homogeneity. Conditional effect measure homogeneity is still a homogeneity assumption, and subject to the same kind of reasoning about whether the biology is consistent with the choice of link function.
Here is what I mean by that: Suppose you have a model with predictors X, Y and Z, and a product term is needed between X and Z. You can parametrize the logistic model with separate odds ratio parameters for the effect of Z conditional on X=0 and the effect of Z conditional on X=1. This gives your model a structure where the effect of Z conditional on (X=0, Y=0) is equal to the effect of Z conditional on (X=0, Y=1) on the odds ratio scale, and where the effect of Z conditional on (X=1, Y=0) is equal to the effect of Z conditional on (X=1, Y=1) on the odds ratio scale.
This is not the standard parametrization, but the parameters in the standard parametrization are just deterministic functions of the parameters in my parametrization. The advantage of this parametrization is that it makes it easier to appreciate the nature of the conditional effect homogeneity assumptions made by a model with interactions.
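In symbols (my own notation, writing the binary outcome as D): the standard parametrization with a product term is

\operatorname{logit}\Pr(D=1 \mid X, Y, Z) = \beta_0 + \beta_X X + \beta_Y Y + \beta_Z Z + \beta_{XZ} XZ,

while the parametrization described above writes the same model as

\operatorname{logit}\Pr(D=1 \mid X, Y, Z) = \beta_0 + \beta_X X + \beta_Y Y + \theta_0 (1-X) Z + \theta_1 X Z,

so the mapping between the two is deterministic in both directions: \theta_0 = \beta_Z and \theta_1 = \beta_Z + \beta_{XZ}.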
In a few months at most, I will share a new preprint with significantly more detail about my thoughts on regression models. I look forward to sharing that manuscript with this forum; hopefully it will make my perspective clearer to everyone.
Sounds like you may get into coding options for regressors. If so I’ll refer to Ch. 20 of Modern Epidemiology 3rd ed 2008, p. 401-408.
This is for me a strange question given that there are now 40 years of articles trying to explain why it is bad. Even more strange is that this is a very subtle numeric phenomenon, yet in my experience it seems more easily appreciated as a problem by clinicians than by statisticians.
I’ll give an illustrative example based on the math in Greenland (“Interpretation and choice of effect measures in epidemiologic analysis”. Am J Epidemiol 1987;125:761–8): Suppose (quite hypothetically and simplified of course) an attending physician at an ICU has just been given the abstract from a large RCT done on a random sample of all covid ICU admissions, with the treatment being oxygen via mask alone vs. intubation+sedation, and the outcome being 60-day mortality. The abstract says the average age of participants was 65, that 38% of them died within 60 days, and the observed overall odds ratio was 2.75, but results are not given for subgroups.
The physician is then confronted with a new covid ICU admission of a 65-year old man who is accompanied by a wife concerned about the long-term cognitive effects of intubation+sedation and asks for mask alone. Should the physician tell her that this will increase the patient’s odds of death by a factor of 2.75? If so, that would be wrong, because the risk of death varies tremendously across patients according to known covariates, and consequently the increase in odds is likely far higher than 2.75 for this and every single specific patient! But given only the OR, the physician would be unable to say anything more precise, such as how much higher.
In contrast, given that the RD was 23%, the physician could at least say that on average the individual increases were 23%, and so if this patient is not visibly atypical of trial participants then 23% would be a reasonable expectation for the risk increase from refusing intubation. In this way, the OR is much less informative about subgroups and individuals than is the RD.
For further numeric discussion see the last two articles (“Noncollapsibility, confounding, and sparse-data bias” parts 1 and 2) on this JCE special issue page: Journal of Clinical Epidemiology | Key Concepts in Clinical Epidemiology | ScienceDirect.com by Elsevier
BTW our response to the latest from Doi et al. is posted at [2106.02673] Odds Ratios are far from "portable": A call to use realistic models for effect variation in meta-analysis
Frank: Why are you taking ratios of LR statistics? They are already logs of the LRs so have been transformed from the ratio scale of the likelihoods to the additive scale of the loglikelihoods. The A-B loglikelihood difference D is the Kullback-Leibler information divergence of B from A (a measure of how much information B loses compared to A); 2D is an LR statistic (with df equal to the number of product terms in A) which provides a P-value for testing B against A.
I can compare the D, their P-values, and their S-values (surprisals) -log2(p) across links to get comparisons scaled in ways I am familiar with from 70 years of literature on model checking and selection. What is added by the B/A ratio? What does it mean in frequentist, Bayesian, likelihoodist or information terms?
I agree with Frank – we urgently need to answer this question: What about non-collapsibility is bad?
Let’s take the same example given by @Sander (which is from the data in the JCE paper that I have tabulated above). The overall mortality is 38.33% and the OR is 2.75 for masks versus intubation (assuming such a trial was allowed). When a new patient 65 years old comes in and we only have this information, then all we know is that in this population, given that we have a M:F ratio of 1:2 and we refuse to know the gender of the new patient (or any other demographic information), the observed mortality odds increase 2.75-fold. The wife may not know what odds are, so a quick calculation can be made and the patient told that 26.7% of the intubated and 50% of the mask patients died, that the risk difference is 23.3% (for every 5 treated with a mask compared to intubation, 1 more patient died at 60 days), and that the fold increase in risk was 1.875 (RR), which is almost two-fold. All this is correct and deduced from the OR (note all risk-related information was derived solely from the OR using the equations in Doi et al), so I do not see any “fault” of the OR here. Perhaps @Sander can explain further why the OR was bad and the RD was not bad so we understand what we are missing here. In particular, please explain why the OR is far higher for this and every single specific patient but the RD was okay even if it is a direct derivation from the OR?
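Written out, the arithmetic behind those numbers (restating the 26.7% and 50% risks quoted above) is:

\text{RD} = 0.50 - 0.267 = 0.233, \qquad \text{RR} = \frac{0.50}{0.267} \approx 1.875, \qquad \text{OR} = \frac{0.50/0.50}{0.267/0.733} \approx 2.75.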
I suspect @Sander will say that in males the OR is increased 6-fold for masks, so if we had this information it is not congruent with the earlier 2.75 increase in odds overall. But why is that a problem? Male gender is a very strong risk factor for death, and if we refuse to know how males were distributed at baseline, would we not expect the discrimination of deaths by mask to weaken? Again, from the OR of 6 (Doi et al), I can tell the wife that the death risk went from 60% to 90%, the RD was 30%, and for every 4 male patients treated with a mask compared to intubation, 1 more patient died at 60 days. This is a risk increase of 1.5-fold. Note the earlier risk increase was 1.875-fold. Obviously the RR dropped when the OR, NNT and RD worsened, and given that the OR is a transformation of the C statistic, we can clearly deduce that there is no monotonic relationship of the RR with death discrimination.
Your response is remarkable for slipping in information where none was given: Nowhere did I state that the M:F ratio was 1:2 or that the male OR was 6, nor was any other sex-specific information given or needed to make the point. Now, in the example from which I took the numbers, the male OR was 6.0 and the female OR was 3.9, both much higher than the overall causal OR of 2.75. Which only further proves the point: Without specific information about prognostic factors in the trial population, we cannot from the total OR alone provide a specific bet or even a betting range for the impact of treatment on the new patient’s odds or risk, even if that patient is a typical patient from the same population on which the trial was conducted. In sharp contrast, the RD alone provides a coherent bet for the patient’s absolute risk increase.
If you don’t like the betting formulation, read the stochastic potential-outcome formulation in my 1987 AJE paper, where I show that every single individual can have an OR higher than the group OR (rendering the group OR misleading when applied to individuals) whereas this defect is not present with RRs or RDs, and group RDs are always simple averages of individual RDs. Read also Greenland Robins Pearl (“Confounding and collapsibility in causal inference”. Statistical Science 1999;14:29–46.) [For completeness I add again that hazard rate ratios suffer the problem, but in practice to a much lesser extent than ORs do.]
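A minimal numeric sketch of that phenomenon (my own illustrative numbers, not those of the 1987 paper): every stratum below has the same conditional OR and treatment is independent of stratum (no confounding), yet the marginal OR is pulled toward 1, while the marginal RD is exactly the weighted average of the stratum RDs.

```python
import numpy as np

# Two hypothetical risk strata of equal size, no confounding,
# and the SAME conditional odds ratio of 6 in each stratum.
weights = np.array([0.5, 0.5])
risk_untreated = np.array([0.10, 0.60])   # assumed baseline risks
conditional_or = 6.0

odds0 = risk_untreated / (1 - risk_untreated)
odds1 = conditional_or * odds0
risk_treated = odds1 / (1 + odds1)

# Marginal risks under "treat everyone" vs "treat no one"
p1 = np.sum(weights * risk_treated)
p0 = np.sum(weights * risk_untreated)

marginal_or = (p1 / (1 - p1)) / (p0 / (1 - p0))
print("conditional OR in every stratum:", conditional_or)
print("marginal OR:", round(marginal_or, 2))          # noticeably smaller than 6
print("stratum RDs:", (risk_treated - risk_untreated).round(2))
print("marginal RD:", round(p1 - p0, 2),
      "= weighted mean of stratum RDs:",
      round(float(np.sum(weights * (risk_treated - risk_untreated))), 2))
```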
Again, the problem with ORs has been obvious to many people and explained in many ways over some 40 years, and does not require deep math to see; simple numeric examples suffice. So, as I said way before in this long thread, I cannot ascribe your continuing failure to see it and your responses (which as Anders noted keep slipping in assumptions or irrelevancies) to anything other than a cognitive blockage. Again, I suspect that blockage arises from finding it impossible to believe that you overlooked something so obvious, plus a commitment to defend ORs as something special beyond the technical advantages (which we all recognize). Those technical advantages are countered by the amply documented interpretational disadvantage of severe noncollapsibility, which in the end makes the OR a classic example of one of the most fundamental laws of methodology: the NFLP (No-Free-Lunch Principle).
I read the first of Sander’s two new J Clin Epi papers, and, like the replies above, the paper comes across as “non-collapsibility is bad because it is bad”. In the medical example there is nothing bad about non-collapsibility. What is bad is that the paper the clinician is quoting did not provide a patient-specific effect estimate. If we start by treating “non-collapsibility” as pejorative, we tend to end a discussion with it being pejorative.
In the history of the development of methods for a 2x2 table, the assumption of homogeneity is made (e.g., the probability of the outcome for everyone on treatment i is p_i). The simplest forms of risk differences, risk ratios, and odds ratios were designed to apply to this situation, not to the case where there is within-group outcome heterogeneity. So in my personal view, indexes such as unadjusted odds ratios were designed for the homogeneous case. In the face of heterogeneity they are far less useful.
For the current purpose I could have used the classic likelihood ratio \chi^2 statistic \Delta = A-B. That statistic is proportional to the sample size. I am seeking an index that is sample-size independent. B/A is that. Another approach might be the difference in pseudo R^2; one of the terms would be \exp(-\frac{A}{n}). One could also argue for \exp(-\frac{\Delta}{n}) or just \frac{\Delta}{n}, but interpretation depends on the nature of Y. For example, in time-to-event analysis the literature has shown us that the proper n to use in the denominator is somewhere between the number of events and the number of subjects.
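To spell out the candidates in symbols (my reading of the above; the Cox–Snell/Maddala form of the pseudo R^2 is an assumption on my part):

\Delta = A - B \ (\text{grows with } n), \qquad \frac{B}{A} \ (\text{sample-size-independent adequacy index}), \qquad \left[1 - e^{-A/n}\right] - \left[1 - e^{-B/n}\right] = e^{-B/n} - e^{-A/n} \ (\text{difference in pseudo } R^2).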
First, I agree with @Sander that the OR debate does not require deep math to see; simple numeric examples clearly should suffice and everyone on this blog should be able to follow easily, though I am not sure why the debate has gone on for 40 years!
In summary, the main argument from @Sander (in this post) is that subgroup ORs are wrong because they do not collapse onto the overall OR, and subgroup RDs or RRs are right because they collapse onto the overall RD or RR. The implication is that even if we get the outcome model wrong by refusing to account for outcome heterogeneity that could easily be accounted for, the RR/RD result from the wrong model can provide a coherent bet for the patient’s relative/absolute risk increase because the two subgroups collapse onto the overall.
Let’s see if this holds up to further scrutiny. If we consider masks as an index test and death as the gold standard, the overall Se and Sp for masks in relation to death are 65.2% and 59.5%, respectively. In the male subgroup the Se and Sp are 60% and 80%, respectively. The AUC, computed as (Se+Sp)/2, is 0.62 overall and 0.7 in males. Thus there is greater discrimination of death through masks in the male subgroup compared to overall, and the OR reflects this, increasing from 2.75 to 6. The numerical RR value however declines from 1.875 to 1.5. Granted, without specific information about prognostic factors in the trial population we did not know the gender information at baseline; what we did know at baseline was that the RR is collapsible and hence will not reflect within-subgroup death discrimination accurately, while the OR is non-collapsible and therefore will reflect within-subgroup death discrimination accurately. Now what shall I tell the wife of the new patient? I could say one of the following:
a. There is at least a 2.75 fold increase in death odds (convert to NNH for her) with masks in this population and since the trialists refused to account for any other patient related factors, we have no additional information for your husband but if we did know subgroup values for people more like him, the subgroup values are likely to indicate the same death-discrimination or greater for people like your husband.
b. There is a 1.875 fold increase in death risk (convert to NNH for her) with masks in this population and since the trialists refused to account for any other patient related factors we have no additional information for your husband but if we did know subgroup values for people more like him, the subgroup values may be inconsistent with actual death-discrimination for people like your husband.
What do I mean by inconsistent? The RR of 1.5 in men is not comparable to 1.875 overall and lacks true comparative meaning – in short it does not imply that the death risk is less in men than in the overall population. On the other hand the OR does have this interpretational consistency. I and everyone else would be keen to see your arguments on this.
At last, Frank, you have stated your perception in a way that looks to me truly colorblind to explicit statements of why noncollapsibility is considered by so many to be bad: It’s bad because it means the subgroup ORs (which is all clinicians know about, because that is what is highlighted in reports and quoted by editorials and medical media) can be arbitrarily far from every individual effect, even if all the individual effects are identical and there is no uncontrolled study bias (such as confounding). Furthermore, this phenomenon should be expected in practice because there is always unmeasured baseline risk variation (as shown by the fact that individual outcomes vary within our finest measured subgroupings). So the clinician can never supply a patient-specific OR estimate; she can only supply the OR estimate for the finest subgroup she’s seen reported (which is rarely anything finer than sex plus some very crude age grouping like “above/below 60”).
I am thus forced to conclude that you think it not bad that a reported OR will always mislead the user when applied to individual decisions, unless it is combined with other information to transform it into a collapsible measure like the RD or component risks. You need to confirm or deny that impression.
Mathematically, what is going on is that the OR is variation independent (logically independent) of the baseline risk, which is a technical advantage for model simplicity and fitting but which also means that it conveys less information about individual effects than an RD. The RD has exactly the opposite pairing: It is severely dependent on the baseline risk, which is a technical disadvantage for model simplicity and fitting, but which also means it conveys more information about individual effects (specifically, the individual RDs must average to the group RD). The RR falls somewhere between these extremes, being not as technically well behaved as the OR but providing some information about individual effects; it is also not as ill-behaved technically as the RD while providing some less simple information about individual effects.
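In symbols (my summary of the variation-dependence point, not a quotation): for a baseline risk R_0 strictly between 0 and 1,

\text{OR} \in (0, \infty) \text{ for any } R_0, \qquad \text{RD} \in [-R_0,\; 1 - R_0], \qquad \text{RR} \in [0,\; 1/R_0],

and for the averaging property, \text{RD}_{\text{group}} = \frac{1}{N}\sum_{i=1}^{N} \text{RD}_i over the N individuals in the group.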
If after a few generations there is no agreement about noncollapsibility then we should not expect any in your and my remaining lifetimes. What we might agree on is that, for clinical applications, the focus of reporting should be not on effect measures but on full reporting of risk functions. I have been advocating that from the beginning of this controversy in the 1970s, as have others.
This resolution would however require concessions on your part that (1) people should stop touting the OR as some kind of magic special measure (which is what the papers by Doi et al. and Doi’s comments here come across as doing) and that (2) for those who (unlike you and Doi) do want a measure that represents average individual effects, the OR simply will not suffice if the risk is not always “low” (in the sense explained in my papers, e.g., if 10% deviation from the average is the maximum tolerable error then the OR shouldn’t be used if the odds can exceed 10%).
Nobody has ever claimed that collapsibility is sufficient for being a good effect measure (let alone being the generally “correct” effect measure, which is an idea that you need to let go of).
You will get different predictions depending on whether you use RR or RD, and they are both collapsible, so clearly collapsibility is not sufficient.
I do however believe that collapsibility is a necessary criterion for being a plausible effect measure, for reasons that have been explained in detail, both in this thread and in the literature, going back half a century.
Is it possible that what is happening is that clinical researchers, who are simultaneously placed in the individual patient decision role as well as the analyst/modelling role, believe that modelling for individual risk must be done on the risk scale, and do not understand the technical advantages of a transformation to the OR for statistical modelling and smoothing?
Frank gave an example I posted above where he combines the RD and OR in a logistic model to come up with a nice nomogram showing how individual risk interacts with treatment (in terms of absolute risk reduction).
Is it also possible that statisticians, only finding themselves in the modelling portion of clinical research, focus more on the technical advantages of the OR, and simply assume clinicians will know how to transform it into a useful metric?
And epidemiologists, being focused on observational research and risk, find problems with the OR due to many factors in study design they cannot control?
Doi: I think Anders already commented on the assumptions you slip in here to save the OR with what looks to OR critics like circular reasoning (or worse), e.g., that the clinician can immediately access the assumed information from available research reports, and also has the facility to combine the OR with that information to obtain more comprehensible measures like NNH, and that AUC is a desirable target. One key point is that violation of all these assumptions is the norm, so slipping them in looks to us like cheating to win.
The AUC angle is however a bizarre distraction in another sense: Anyone who knows screening knows that sensitivity and specificity information (which is what AUC is derived from) is woefully inadequate for clinical prediction; predictive values are what is needed and those are exquisitely sensitive to the baseline risk information missing from AUC just as it is missing from OR. Hence the OR correlation with AUC does nothing to enhance the credibility of the OR for us, it just looks like another way to miss our points. The comments about RR keep missing the same points. Those points are explained once more in my reply to Frank, so please read those carefully before replying.
R-cubed: Clinicians should not be assumed to be in an analyst/modeling role. They usually have no meaningful training or understanding of data analysis or modeling, and have no time for that. And in terms of medical topics, they rarely have time for reading more than brief summaries given in editorials in their journals and medical newsletters such as Medscape and MedPage, and in continuing education materials (which look like they are summaries of summaries). They are thus critically dependent on what is shown in those summaries, which is not much. Because of lack of time and sheer volume of literature, few can go further and most of the tiny minority that do (usually med faculty) still retain only what is highlighted in abstracts.
Yet I find many statisticians write as if typical clinicians are somehow sophisticated enough to appreciate technicalities that few could begin to comprehend. Perhaps this is because the statisticians are insulated in medical schools dealing with research faculty, who are typically far more conversant in statistics than the vast majority of clinicians (although, as many can attest, clinical faculty are still far more vulnerable to errors of their statistical authorities, as witnessed by continuing use of correlations and “standardized” coefficients as effect measures in some reports).
Finally, I’ll repeat that the OR controversy has absolutely nothing to do with uncontrolled factors in study design (potential bias sources or validity problems). It would arise even if every study was a perfect randomized trial on a random sample of the clinical target population (as in my example).