This would be my summary too, except that when you say “convert these parameters into clinically interpretable collapsible outputs” my concern is why the term “collapsible” is needed here, as it gives causal inference experts the wrong idea. Simply saying “convert these parameters into clinically interpretable outputs that are easier to translate to patient care and easier for patients to understand” would be the key message here. Their being collapsible has nothing to do with the reason for doing this.
Fair enough. I do agree that including “collapsibility” as a necessary advantage for interpretation is debatable and can completely stand behind your rephrasing.
Having said that, my unproven impression is that clinicians and patients are more likely to understand information presented on probability scales such as risks, from which collapsible measures such as the risk difference are derived. Modeling on scales such as the logit is a technical move to facilitate proper range restriction of risks. This technical move at the same time induces non-collapsibility and uses measures such as odds and hazard rates (their logarithms, to be more precise) that are harder to interpret. I believe this may be the reason why clinicians and even some statisticians will mistakenly describe an HR = 0.75 as a 25% reduction in “risk”.
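To illustrate the non-collapsibility point with toy numbers (purely hypothetical risks, not from any study): the stratum-specific odds ratio can be identical in every stratum and yet differ from the marginal odds ratio, while the risk difference collapses as expected.

```python
# Toy illustration of non-collapsibility: hypothetical risks in two equal-sized strata.
risks = {
    "stratum A": {"treated": 0.20, "untreated": 0.50},
    "stratum B": {"treated": 0.50, "untreated": 0.80},
}

def odds(p):
    return p / (1 - p)

for name, r in risks.items():
    or_s = odds(r["treated"]) / odds(r["untreated"])
    rd_s = r["treated"] - r["untreated"]
    print(f"{name}: OR = {or_s:.2f}, RD = {rd_s:.2f}")   # OR = 0.25, RD = -0.30 in both

# Marginal (crude) risks, assuming the strata are the same size:
p_t = (0.20 + 0.50) / 2   # 0.35
p_u = (0.50 + 0.80) / 2   # 0.65
print(f"marginal OR = {odds(p_t) / odds(p_u):.2f}")   # ~0.29, not 0.25
print(f"marginal RD = {p_t - p_u:.2f}")               # -0.30, same as in each stratum
```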
Some parallels to our current discussion here - dehydration and hypervolemia (in heart failure) can co-exist but so can overhydration with hypervolemia (non-osmotic release of ADH). Then what do we do - give furosemide or withhold it? If we do not conceptualize volemia and hydration as completely distinct problems to be addressed separately, we will not make the right decision. We are in that situation in this thread - we can’t agree on the same issue, this time noncollapsibility and confounding, and hence many papers discussing these two as we have with hydration and volemia.
Of course, strictly speaking, the primary goal of using penicillin for tonsillitis is not actually “cure” either (but rather to decrease the risk of acute/delayed complications and transmission). Was just using these as simple (though admittedly imprecise) examples of how the same treatment can help or harm depending on the patient in question.
But your broader point that maybe statisticians and epidemiologists are not “speaking the same language” is well taken. How much of the animosity that I sometimes perceive between these two fields is due to the differing worldview that I alluded to above versus recurring miscommunication because the two fields share terms but not a common understanding of what they mean? And if there are profound differences of opinion re the meaning of commonly-used terms, is it any wonder that the terms are used so inconsistently in the literature and in introductory texts designed for students?
- I don’t think there is much disagreement that the human body is highly susceptible to random events. We can have an interesting philosophical discussion about what randomness means, but I don’t think much of what we disagree about here hinges on it. Rather, the disagreement is about how we can structure the randomness in order to make sense of it and predict future observations. Without finding patterns to the randomness, humans are epistemically lost; any attempt at inference about the world is contingent on some assumed structure to the randomness.
I have proposed a particular structure to the randomness that I believe matches biology. Some people on this board disagree with that structure and try to articulate their objections by invoking “randomness”. But that cannot possibly be their real objection, because their own preferred approach relies just as much on structure to the randomness, just a different structure, one that in my view is not biologically plausible.
- The problem with non-collapsibility is that it is implausible (I would argue: impossible) for any biological mechanism to generate data in which a non-collapsible effect measure is approximately stable across groups.
- There seems to be a misunderstanding here, as if modern methods for causal inference make strong assumptions about cause-and-effect relationships and therefore are able to reach stronger conclusions, whereas pre-causal biostatisticians reject those assumptions and therefore reach weaker, more conservative and better justified conclusions. In my view, the world is an unrecognized web of cause and effect relationships (what else could it possibly be?), and nothing anybody can do would change that. Our job as data analysts is to consider the set of all possible variations of cause-effect relationships, rule out the ones we think are completely implausible (for example: because they violate our knowledge of the temporal order), and then ask ourselves whether we can answer the research question in a way that gives the same answer for all the variations that haven’t been ruled out. Pre-causal biostatisticians rely on an impoverished ontology which doesn’t have a language for doing this, but since this counterfactual reasoning is necessary to justify causal conclusions, this just means they are sweeping the problems under the carpet (unless they exclusively restrict their attention to the magic of randomized trials, where the data generating mechanism is so simple that counterfactual reasoning is not required to answer some important clinical questions)
I would like to add that until about 20 years ago, Biostatistics was dominated by an empiricist, positivist and realist tradition. This paradigm was challenged by the emergence of the modern paradigm of causal inference, which relies on mathematical objects that do not always correspond to observable quantities (“counterfactual variables”). From the perspective of those who came out of the causal inference tradition, the “positivist” idea that we should only consider observable quantities is problematic because it isn’t possible to describe all the ways nature can work using only observables. If we want to reason about which of the possible states we can rule out, we need an ontology - a conceptual language - that allows doing maths over objects we cannot observe. Trying to make an epistemology work with a purely observable ontology is wishful thinking: it would be convenient if possible, but it just isn’t.
While these advances in causal inference primarily came out of theoretical work in epidemiology/computer science/philosophy (e.g. Jamie Robins’s technical work was inspired by Olli Miettinen’s less technical attempts to formalize epidemiological reasoning), this paradigm has deep roots in biostatistics, and importantly, it has already taken over and become the dominant paradigm within theoretical biostatistics. Just look at their journals: I challenge you to find any recent work on methods innovation that hasn’t adopted the technical machinery, notation and ontological framework of causal models. There are, of course, some holdouts from the earlier positivist tradition, but they aren’t relevant to the current state of theoretical biostatistics.
So if I am claiming that causal inference has already “won” and that the opposition to the switch relative risk comes from holdouts from an outdated academic tradition, then why aren’t these causal inference methodologists accepting the switch relative risk? That is an interesting question. The causal inference community has decided, en masse, that we should try to avoid relying on effect measures altogether, in an attempt to be non-parametric. They therefore consider this whole discussion as an irrelevant waste of time. I am a lone dissenter from the causal inference mainstream when I argue that it is impossible to get anything done without homogeneity assumptions that are made in terms of effect measures, and that the choice between them is therefore hugely important. Interestingly, my sense is that most pre-causal statisticians enthusiastically agree with me that effect measures are necessary (while disagreeing strongly about what effect measure to use).
No. Absolutely not. This completely misses the point.
Let’s consider a two-stage process:
- We first choose some effect measure to rely on for our models
- Then we translate the output from this model to some other form for consumption by clinicians and patients.
From my perspective, I have absolutely no interest in the second part of this two-stage process. The entire ball game is about what effect measure we rely on for step one, and the odds ratio just isn’t plausible. In my view, future doctors will literally consider it malpractice to rely on the odds ratio for step 1.
If I had to say something about step 2, it is that we should translate the results to absolute risks (i.e. “your risk if untreated is 10% and your risk if treated is 5%”) rather than any effect measure. This gives the decision maker strictly more information than any effect measure.
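As a small sketch of why the pair of absolute risks is strictly more informative (using the hypothetical 10% and 5% figures above): every common effect measure can be computed from the two risks, but no single effect measure lets you recover the risks.

```python
# Hypothetical risks from the example above.
p_untreated, p_treated = 0.10, 0.05

risk_difference = p_treated - p_untreated                                   # -0.05
risk_ratio = p_treated / p_untreated                                        # 0.5
odds_ratio = (p_treated / (1 - p_treated)) / (p_untreated / (1 - p_untreated))

print(risk_difference, risk_ratio, round(odds_ratio, 3))
# The reverse is not possible: knowing only the odds ratio (or any other single
# effect measure) does not let a decision maker recover either absolute risk.
```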
It’s not just randomness, but randomness beyond our ability to measure the patterns that would explain some of it.
Only non-collapsible measures have been observed to be stable for the majority of medical and epidemiologic studies that have been analyzed. So your theory does not pan out empirically.
Citation needed. And also, like I have said before: Failure to reject in tests for heterogeneity is an incredibly weaksauce argument for homogeneity.
Yes, this causal fundamentalism (philosophical view and not a derogatory description) is at the heart of your arguments in this thread and others. I am certainly a big supporter of causal inference and I am generally more of a realist than an anti-realist, hence why I run a wet lab looking into underlying biological mechanisms in cancer. However, I have yet to find any convincing arguments that causality is fundamental as opposed to a model of how we experience the physical world. Pearl is aware for example that the “arrow-breaking” aspect of his structural causal models is not compatible with physical laws such as gravity which cannot be turned off. Nor would his structural causal models fit with truly fundamental physics that describe the universe as a whole, as there would be no “outside” to cause an intervention.
Even if causality were fundamental, our models of it may be imperfect, as we still lack a universal theory of everything, and we also need to account for the practical fact that our data will always be noisy and need smoothing. But it certainly becomes easier to accept non-collapsible constructs such as ORs as noise-reduction devices from a causal non-fundamentalist point of view.
Do you agree that there are regularities to the universe? Surely, something is generating those regularities? Let’s define that as “causality”. Maybe you would prefer a different definition, but this is how I use the term, so if you prefer a different definition, please try to find your own preferred term for the referent, and whenever I use the term “causality” translate it in your head.
Given that definition of causality, you have three options:
(1) Believe there are no regularities in the universe
(2) Believe causality generates regularities in the universe
(3) Believe something else - which we may here term “magic” - generates regularities in the universe
Of course, it is true that given this definition of causality, humans have no way to experience it directly, and our knowledge of it will be incomplete and speculative. We create representations of causality in our heads. Of course this is imperfect, but it is the best we can do. The closer we get to capturing “real” causality, the better we will be able to predict the consequences of our actions.
If a logistic model reliably predicts the consequences of actions (in multiple different settings), then the odds ratio is structurally homogeneous between groups (i.e. stable for causal reasons). It means that some biological phenomenon results in stability of the odds ratio between groups.
Your argument boils down to this: “our knowledge of causation is imperfect. Since we have no way of knowing whether any effect measure is structurally stable, we should act as if the odds ratio is stable”. But for this to work, there needs to be either a causal or a magical reason for stability of the odds ratio.
Such regularities can be structural relations that do not need to include key causal concepts such as time or even how we experience entropy. In fact, there are quite a few fundamental (or thought to be fundamental) physical laws that do not include time. Your argument about magic is indeed sound in that non-collapsible measures appear to have strong emergent properties that we have never observed in nature but are consistent with magic. I have made the same argument against naive subgroup comparisons on the OR and HR scale in the past.
But the point remains: our statistical models are informed by but are not embodiments of physical laws. They also act as smoothing devices and for this purpose it helps to choose the most numerically stable measures. That is the argument for ORs and HRs. It is the old bias-variance trade-off in action. On the other hand, this quest for parsimony should not smooth away potentially useful patterns in the data. Hence the value of taking your viewpoint seriously and interrogating the data in multiple ways and steps.
What is your thought on the post by @Sander (in response to me) earlier in this thread:
I’d convert your “Do the statistics on the (additive) log odds scale, but convert baseline risks to the probability scale” to something more like “Do the modeling on the log odds (logit) scale, but convert the final results to the probability scale”, because we shouldn’t want to limit the logit model to additive or linear (for example, “machine-learning” logit models may become arbitrarily nonadditive and nonlinear). The use of the logit link is strictly a technical move to ensure proper range restriction of risks, variation independence of parameters, well-behaved fitting algorithms, and optimal calibration.
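To make the quoted suggestion concrete, here is a minimal sketch (simulated data, hypothetical variable names and effect sizes, not anyone’s actual analysis) of fitting on the logit scale and then reporting average risks on the probability scale:

```python
# Minimal sketch: simulate data, model on the logit scale, report on the probability scale.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({"age": rng.normal(60, 10, n), "treat": rng.integers(0, 2, n)})
lp = -4 + 0.05 * df["age"] - 0.7 * df["treat"]        # linear predictor on the logit scale
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

fit = smf.logit("y ~ age + treat", data=df).fit(disp=0)

# Convert the final results to the probability scale: average predicted risk
# with everyone set to treated vs everyone set to untreated.
risk_treated = fit.predict(df.assign(treat=1)).mean()
risk_untreated = fit.predict(df.assign(treat=0)).mean()
print(f"risk treated {risk_treated:.3f}, untreated {risk_untreated:.3f}, "
      f"difference {risk_treated - risk_untreated:.3f}")
```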
What about my attempt at a synthesis of @f2harrell and Sander in this post:
It seems to me that the dispute confuses presentation of final results with their extraction from the raw data.
If there is a difference of opinion between these two, Sander seems more concerned with presenting comprehensible final results to clinicians (with little statistical training), while Frank places more emphasis on a reasonable default modelling approach.
I take it that Frank’s concern is that use of the risk scale (based on a preference for collapsible measures) for statistical calculations will lead to inferring causative factors (based on fit for a particular data set) that will not hold up out of sample.
A preference for logit models in the causative inference context doesn’t imply a strong prior against any causative effect; it can be consistent with a prior against a specific factor.
You are begging the question/assuming the conclusion. This entire discussion is about what we think is the most numerically stable effect measure. What on earth makes you think it is the odds ratio? I just don’t get it.
The short answer is that there are other ways to achieve the same objectives, and among those are models based on the switch relative risk. For the longer answer, I will refer to an upcoming preprint that Rhian Daniel, Daniel Farewell, Mats Stensrud and I will hopefully make available sometime around February
Challenge: in a number of large clinical trials quantify the variation in OR, RR, and switch risk ratios. My money is on less variation for ORs. Suggested quantification: proportion of log likelihood that is explained by non-additive effects.
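As a rough, hedged sketch of one possible reading of that quantification (simulated data only, logit scale only, hypothetical variable names): compute the share of the treatment-related likelihood-ratio chi-square that comes from non-additive (treatment-by-covariate interaction) terms.

```python
# Data are simulated with an additive effect on the logit scale, so the
# non-additive share should be close to zero here.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({"treat": rng.integers(0, 2, n), "x": rng.integers(0, 2, n)})
logit = -1.0 + 0.8 * df["x"] - 0.6 * df["treat"]
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

null = smf.logit("y ~ x", data=df).fit(disp=0)           # covariate only
main = smf.logit("y ~ x + treat", data=df).fit(disp=0)   # additive treatment effect
full = smf.logit("y ~ x * treat", data=df).fit(disp=0)   # plus interaction

lr_total = 2 * (full.llf - null.llf)       # all treatment-related information
lr_nonadd = 2 * (full.llf - main.llf)      # information in the interaction alone
print(f"share of LR chi-square from non-additive terms: {lr_nonadd / lr_total:.3f}")
# Refitting with other links (e.g. log) would give the analogous share on the RR scale.
```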
We do have a fundamental “absence of evidence” disagreement. I believe that you should show evidence for non-constancy of ORs and you believe I should show evidence for constancy.
I think we are indeed getting somewhere practical. My intuition is that the unlimited range of odds (log-odds) will result in more stability for ORs versus RRs. But this intuition could certainly be wrong. Empirical evidence will help.
Challenge: in a number of large clinical trials quantify the variation in OR, RR, and switch risk ratios. My money is on less variation for ORs.
Sure, I will eventually have to do some empirical work along these lines. There are multiple reasons I haven’t done it yet:
- I do not have access to a suitable dataset
- I am not someone who enjoys doing empirical data analysis (and was hoping to eventually find a collaborator/student who would be interested in doing it)
- Most importantly: I don’t think we have any good criteria for evaluating relative homogeneity of different effect measures. Standard tests for homogeneity have different power for different effect measures (see for example Poole et al, Epidemiology, 2015), and I am not a huge fan of Q or I2, especially in this context. I also do not think that the proportion of log likelihood that is “explained” captures anything particularly relevant. I have had theoretical ideas for new measures of heterogeneity that are “neutral” between effect measures, and my plan has been to do the empirical work after these new measures have been established. But I got stuck on some obstacles in developing them…
This empirical work will eventually be done, but it will be easier to incentivize empirically oriented collaborators to join me in doing the work if we can first at least get some consensus, on theoretical grounds, that the idea behind stability of the switch risk ratio has face validity.
We do have a fundamental “absence of evidence” disagreement. I believe that you should show evidence for non-constancy of ORs and you believe I should show evidence for constancy.
The odds ratio is just one specific member of an infinitely large set of possible effect measures.
When you argue that I need to show evidence for heterogeneity of the OR, I can always use exactly the same argument structure as you, and claim that you need to show evidence for heterogeneity of the SRR.
In contrast, when I argue that we should rely on theoretical considerations until there is empirical evidence, you will be unable to repurpose my argument structure to argue the opposite case.
My intuition is that the unlimited range of odds (log-odds) will result in more stability for ORs versus RRs.
The disagreement is not OR vs RR, but OR vs SRR. SRR also has this property
I’m looking forward to your paper. I’m curious as to what others think about my off-the-cuff intuitions:
- I suspect that even if the switch risk ratio can be shown to be a consistent estimate of the causal effect, this estimation procedure will have a higher variance (due to using the data to select which version of the estimate to use) vs. the logit model (which can be specified prior to data collection).
- How does one synthesize a collection of SRRs with different units (i.e. a count of the living and a count of the dead)?
The switch risk ratio model is specified prior to data collection. The SRR just resolves to a number in the range [-1, 1], where negative numbers mean the effect is protective and positive numbers mean it is harmful. It doesn’t use any data to determine whether the effect is positive or negative, any more than the odds ratio uses data to determine whether the OR is below or above 1.
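For concreteness, here is a sketch of one formulation consistent with the range and sign convention described above (the exact definition should be checked against the forthcoming preprint):

```python
# One possible formulation: negative = protective, positive = harmful, range [-1, 1].
def switch_relative_risk(p_treated, p_untreated):
    if p_treated <= p_untreated:                              # protective or null
        return p_treated / p_untreated - 1                    # in [-1, 0]
    return (p_treated - p_untreated) / (1 - p_untreated)      # harmful, in (0, 1]

print(switch_relative_risk(0.05, 0.10))   # -0.5   (risk is halved)
print(switch_relative_risk(0.30, 0.20))   #  0.125 (harmful)
```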
How does one synthesize a collection of SRRs with different units (i.e. a count of the living and a count of the dead)?
Are you asking about one predictor and multiple outcomes, or multiple predictors and one outcome? The case of one predictor, multiple outcomes is trivial. As for multiple predictors, one outcome: Sorry, you’re gonna have to wait for the preprint for the answer to that one
If we take the complementary outcomes as Y and not_Y, then the SRR is RR_not_y, and mathematically the relationship with the OR is quite simple:
OR_y = RR_y / RR_not_y
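A quick numeric check of this identity with hypothetical risks:

```python
# Numeric check of OR_y = RR_y / RR_not_y with hypothetical risks.
p0, p1 = 0.40, 0.25                           # risk untreated, risk treated
rr_y = p1 / p0                                # 0.625
rr_not_y = (1 - p1) / (1 - p0)                # 0.75 / 0.60 = 1.25
or_y = (p1 / (1 - p1)) / (p0 / (1 - p0))      # 0.5
print(or_y, rr_y / rr_not_y)                  # both 0.5
```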
There is no question that OR_y is a likelihood ratio that connects the outcome probability under no-intervention to that under intervention. Therefore, neither RR_y nor RR_not_y can lay claim to this property and thus I do not anticipate SRR will be vindicated in any new empirical study.