It provides a gentle introduction to DAGs for physicians and uses a few clinical scenarios to illustrate their use in research design.

To summarize (I think): researchers would ideally construct a DAG in the design phase of their study, depicting how different variables can act as potential confounders, colliders, or mediators. This DAG would clearly show which variables need to be adjusted for ("conditioned on") during the analysis phase, and those for which adjustment could introduce bias(?)
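The "adjustment could introduce bias" part can be made concrete with a toy simulation (hypothetical variables, not from the article): X and Y below are causally unrelated, both cause a collider C, and "conditioning" on C manufactures a spurious association between them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X and Y are causally independent; both cause the collider C.
X = rng.normal(size=n)
Y = rng.normal(size=n)
C = X + Y + rng.normal(size=n)

def slope(x, y):
    """OLS slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x)

print(f"crude X-Y slope:          {slope(X, Y):.2f}")   # ~0: no causal link
# "Adjust" for C by regressing both on C and correlating the residuals.
rX = X - slope(C, X) * C
rY = Y - slope(C, Y) * C
print(f"X-Y slope adjusted for C: {slope(rX, rY):.2f}")  # ~-0.5: spurious
```

The adjusted slope comes out near -0.5 even though X has no effect on Y at all, which is exactly why a DAG drawn at design time, marking C as a collider, would warn against conditioning on it.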

Novice question (sorry): How can researchers be confident in depicting the various relationships in their DAG if those relationships were elucidated using previous studies that probably didn’t follow a rigorous DAG construction process? Won’t the final DAG used to design any new study only be as reliable as the studies from which the various DAG relationships were derived?

Good question. There will never be a perfect DAG. Nor a perfect diagram accurately illustrating what is happening inside a cell. The map is not the terrain. Even if the previous studies did not use DAGs, they still contained implicit causal assumptions. This is inevitable for statistical science and science in general. DAGs can be a useful tool to explicitly represent such assumptions.

Note also that in some cases the statistical analysis model will be the same even if the DAG changes. But in other cases readers may find that the pre-specified DAG was not plausible according to their contextual knowledge, and the DAG they find plausible leads to a different analysis model. In that scenario, if it is the peer reviewers who want to explore a different causal model, the new DAG and corresponding analysis can be included as additional material in the manuscript. Or (if the primary data are available) readers can do their own analysis after publication.

As an example of the danger of "indiscriminate adjustments": here in early 2020, a study adjusted for SOFA score and showed that COVID-19 was LESS deadly than the flu. SOFA combines six variables into a single output, so constructing the full DAG with all those variables (which include blood-gas variables affected by COVID-19) would have given pause.

This is VERY important material for critical care research, which is struggling under a mountain of non-reproducibility. Following up: how can this be taught and promulgated? As an example, here is a causation graph I published (from 2009).

This is only a fraction of the variables. I see little interest in engaging with, or even discussing, this complexity. Rather, studies in this field are consistently performed in an oversimplified way, as if this complexity does not exist.

Pavlos makes some key points regarding the use of DAGs. DAGs help you define which causal questions can be answered with the data at hand. That is the first point in Dr. Harrell's list, and it refers to whether the stable-unit-treatment-value assumption (SUTVA) is met. DAGs also help you design your study. In particular, they help you figure out which variables you need to measure and when, so that a causal effect can be identified. This not only adds to the quality of your study, but can be cost-saving.

In addition, DAGs help you identify the minimum set of variables you need to condition on to identify a causal effect. This is a great improvement over the dominant approach of including variables in a model based on statistical criteria or (which in my opinion is unavoidable) on our conscious or unconscious biases about how our findings should look. I mean, that approach lets us play around with the data until we get what we consciously or unconsciously deem "the right model". It is not only biased, but compromises the validity of statistical tests and confidence intervals. In that regard, DAGs provide a setting for statistical testing that is closer to an ideal randomized experiment than any other procedure for variable selection.

Moreover, DAGs are explicit and allow the discussion of findings to proceed in an objective way. In particular, they allow us to make a useful judgment about the role of unmeasured confounders in observational studies. Instead of the generic argument that "this is an observational study and, therefore, findings could be partially or completely explained by an unmeasured confounder", an argument that does not advance research, we are compelled to specify what that confounder would be, and to assess whether the claim is plausible by adding the confounder to the DAG. For those reasons, I think that drawing a DAG before designing your study or analyzing your data will contribute to reproducibility.

Of course, DAGs take time during the design of the study or the "design" of the data analysis. The main difficulty comes from the need to include in the DAG all variables that are parents (directly or indirectly) of both the exposure and the outcome, whether measured or not. However, this helps us identify gaps in knowledge, recognize the limitations of our research, and bring observational studies closer to randomized trials with regard to the identification of causal effects.
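The minimum-adjustment-set point above can be sketched with a small simulation (hypothetical variables and effect sizes): with a single confounder Z of the X-Y relationship, conditioning on Z alone recovers the causal effect that the crude estimate misses.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Z confounds the X -> Y relationship; the true effect of X on Y is 1.0.
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)
Y = 1.0 * X + 2.0 * Z + rng.normal(size=n)

def fit(design, y):
    """OLS coefficients via lstsq."""
    return np.linalg.lstsq(design, y, rcond=None)[0]

b_crude = fit(np.column_stack([np.ones(n), X]), Y)
b_adj   = fit(np.column_stack([np.ones(n), X, Z]), Y)
print(f"crude X effect: {b_crude[1]:.2f}")  # ~2.0, confounded
print(f"adjusted for Z: {b_adj[1]:.2f}")    # ~1.0, the true effect
```

Here the DAG says {Z} is the minimal adjustment set; no statistical variable-selection criterion is needed, and adding further variables could only cost precision or (if they are colliders or mediators) introduce bias.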

I will embark on the thought exercise and dialogue with a few lemmas (for our dilemma ;-):

Nature (and reality) is complex, and Nature is indifferent to what is cognitively convenient for humans.

Humans are ‘obligate abstractors’: abstraction at some level is necessary for all human functioning, whether we are aware of it (or like it) or not. Otherwise, it’s “turtles, all the way down” (the regress argument in epistemology).

Scientists always have to make decisions about delineating or circumscribing the boundaries of a problem, and factoring out the operative elements of the system that is the object of their inquiry. Often those decisions involve ambiguity and entail annoying contingency. I am sympathetic that this is both discomforting and discomfiting.

This is all by way of asserting that the annoying fact that it isn’t easy does not contradict the merit of the method proposed. And the fact that we are vexed by complexity does not mean that indiscriminate, arbitrary, or agnostic conditioning is what we have to ‘satisfice’ with.

An important thing to recognize and keep in mind in using SCMs/cDAGs is that the basic cDAG is useful for static acyclic problems: problems where time and feedback are not aspects of the system. Problems like time-dependent confounding require special methods. I suspect that awesome ‘confuse-o-gram’ you provided above (I love it!) is one of those aporias best addressed with marginal structural models and all that really fancy stuff, but not tractable with the ‘bog-standard’ static cDAG.

For instance, in observational studies with treatments that vary over time, standard approaches for adjustment of confounding are biased when there exist time-dependent confounders that are also affected by previous treatment. Estimation and inferential complications arise when there is a time-dependent covariate that is a predictor of the outcome of interest, and also predicts subsequent treatment, and past treatment also predicts subsequent level of the covariate.
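A stripped-down two-time-point simulation of exactly that structure (a hypothetical data-generating process, with the true propensities used for simplicity; in practice they would be estimated): a naive outcome regression that conditions on the time-dependent confounder L1 is biased for the earlier treatment's total effect, while a marginal structural model fit by inverse-probability weighting recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# DGP: L1 is a time-dependent confounder, affected by earlier treatment A0,
# predicting later treatment A1, and affecting the outcome Y.
U  = rng.normal(size=n)                       # unmeasured frailty
A0 = rng.binomial(1, 0.5, n)                  # randomized baseline treatment
L1 = U + 0.5 * A0 + rng.normal(size=n)        # time-dependent confounder
p1 = 1 / (1 + np.exp(-L1))                    # P(A1 = 1 | L1)
A1 = rng.binomial(1, p1)
Y  = U + 1.0 * A0 + 1.0 * A1 + 0.5 * L1 + rng.normal(size=n)
# True joint effects: A0 -> 1.0 + 0.5 * 0.5 = 1.25 (direct plus via L1), A1 -> 1.0

def wls(X, y, w):
    """Weighted least squares via the normal equations."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

ones = np.ones(n)
# Naive approach: condition on L1 in an outcome regression.
b_naive = wls(np.column_stack([ones, A0, A1, L1]), Y, ones)
# MSM fit with inverse-probability-of-treatment weights.
w = 1.0 / (0.5 * np.where(A1 == 1, p1, 1 - p1))
b_ipw = wls(np.column_stack([ones, A0, A1]), Y, w)

print(f"naive A0 effect: {b_naive[1]:.2f} (biased; truth 1.25)")
print(f"IPW   A0 effect: {b_ipw[1]:.2f}")
```

Conditioning on L1 blocks part of A0's effect (and opens a path through the unmeasured U), so the naive estimate lands well below the true 1.25; the weighted pseudo-population breaks the L1 → A1 dependence instead of conditioning, which is the core MSM idea.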

The problem expressed in the network diagram you provided above is likely not acyclic, and looks to me more like something appropriate for ‘MSMs and fancier stuff’. There are methods proposed for handling time-dependent confounding and other complexities of that ilk. But they require sophistication, specialist support, and lots of work and time: and that just sounds to me like Science.

So my initial and provisional response to your query “what can be done?” is: take a deep breath, look at other ways of factoring the problem, and look for other statistical tools indicated for the problem. But I would take issue with anyone throwing up their hands and saying it is too complicated and that the best we can do is what we have done in the past, throwing things into the model until it strains to converge in the interest of getting a publishable result, only to grow the “mountain of non-reproducibility” further.

I hope this is helpful. I will reflect more on this and hopefully provide a more ‘operationalizable’ response.

A ‘grab-bag’ of guidance to consider:

Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550-560.

Hernán MA, Brumback B, Robins JM. Marginal structural models to estimate the joint causal effect of nonrandomized treatments. J Am Stat Assoc. 2001;96(454):440-448. doi:10.1198/016214501753168154

VanderWeele TJ. Marginal structural models for the estimation of direct and indirect effects. Epidemiology. 2009;20(1):18-26.

Talbot D, Rossi AM, Bacon SL, Atherton J, Lefebvre G. A graphical perspective of marginal structural models: an application for the estimation of the effect of physical activity on blood pressure. Stat Methods Med Res. 2018;27(8):2428-2436. doi:10.1177/0962280216680834

Hernán MA, McAdams M, McGrath N, Lanoy E, Costagliola D. Observation plans in longitudinal studies with time-varying treatments. Stat Methods Med Res. 2009;18(1):27-52. doi:10.1177/0962280208092345

Chiba Y, Azuma K, Okumura J. Marginal structural models for estimating effect modification. Ann Epidemiol. 2009;19(5):298-303. doi:10.1016/j.annepidem.2009.01.025

Platt RW, Schisterman EF, Cole SR. Time-modified confounding. Am J Epidemiol. 2009;170(6):687-694. doi:10.1093/aje/kwp175

I recently realized that for the problems I deal with, the G-methods for longitudinal data that @Drew_Levy mentioned are probably more appropriate. However, the issue is that many clinicians and biomedical researchers are not yet familiar with these methods. I did recently attempt to go down the rabbit hole of MSMs/SCMMs and G-computation, and I chose a Bayesian approach on top of that, but I ran into skepticism of the results. First of all, many biomedical researchers want clear-cut p-values (despite the problems), and getting these for my model, which had plausible interactions I wanted to regularize (hence the Bayesian methods), was not straightforward. And I think they are simply used to the standard, even if biased, approaches that have worked and been published before.

Causal inference computational methods (beyond drawing a DAG to understand the problem) are largely seen right now as more experimental than the vanilla association-based methods that researchers are used to. I fear the issue is that these methods require a significant mathematical background even to approach, and on top of that are difficult to communicate to non-statistician/ML people.

I see your frustration. The implications are cause for despair.

First, it is becoming increasingly evident that the ‘standard’ approaches ‘that have worked and been published before’ are not, in fact, working; and being published is a poor measure of veridicality. The merit of much published work may be limited to providing the comfort of familiarity: a p-value (from anything other than a well-conceived, well-designed, well-controlled, well-conducted RCT) tells one only that some inconsistency with an artificial null notion may register in the data.

The attitude you describe (stalwart attachment to the status quo) will be familiar to those acquainted with the history of Science (“And yet it moves” —Galileo; “I know of nothing more terrible than the poor creatures who have learned too much”—Ernst Mach [ironically]).

We get what we pay for; and for good information the price is often dear. There is a design-analysis trade-off: what we elide by not investing up front in design is mortgaged to be paid later (perhaps with usurious compound interest) in complexity of analysis and viscid uncertainty. Uncertainty adds to the cost of forgoing investment in design. The only recourse to the conundrum you describe is greater investment in design, but we’ll have to ‘pony-up’ with time, energy and treasure.

It might be worth exploring I.J. Good’s Reverse Bayes approach, which Robert Matthews and Leonhard Held have described as the Bayesian Analysis of Credibility. I posted a few re-analyses of studies (that reported frequentist intervals) deriving the Bayesian prior that would render the report credible with the null, or with the existence of an effect.

This paper mentioned by @AndrewPGrieve is also worth study

I didn’t know quite where to flag this, but this really excellent study, co-authored by Dr. Msaouel (see his posts in this thread), was just published:

I don’t have the training to properly understand the statistical portions of the article, but the overall quality of this piece of observational research stands out like a beacon in the sea of poorly considered, data-dredged (and therefore clinically non-actionable) muck that is today’s medical literature. Causal models are shown in the Supplementary Materials.

This is a perfect example of the colossal effort that will likely be needed in order for newer causal inference methods to start making an impact in medicine. The research question was clear, important, driven by clinical observation, and supported by strong biologic plausibility. Multiple methods were used to try to establish support for causation. Sadly, such effort seems missing in >99% of research that gets published today. Congratulations to the authors: an example of how things should be done.

Thank you for the kind words. It was this particular challenge of linking renal medullary carcinoma with high-intensity exercise in persons with sickle cell trait that led us into causal diagrams and the do-calculus to guide our analyses.

We were essentially facing a similar dilemma to the one scientists faced in the mid-20th century with smoking and lung cancer: 1) we had a plausible mechanistic hypothesis connecting high-intensity exercise with renal medullary carcinoma in the setting of sickle cell trait; 2) we had animal experiments showing that high-intensity exercise in mice with sickle cell trait would damage the kidneys as predicted by our model; 3) we had observational clinical data but a randomized controlled trial would be impractical and potentially unethical.

And we realized that a key tool that Fisher was missing was a language to explicitly represent and test his causal assumptions. This is what led us into causal diagrams and the corresponding do-calculus mathematical syntax, which then helped us gather the clinical data and interrogate them in a principled way guided by our mechanistic causal assumptions to address the challenge of renal medullary carcinoma pathogenesis.

Before that, my Bayesian training instructed me to condition on everything. But what does that even mean in this case? How do we decide in this hopeless maze of statistical possibilities which variables to even extract? That approach made no sense. On the other hand, causal diagrams were remarkably similar to the approach we use to design lab experiments. They very naturally allowed the integration of our laboratory findings with our clinical observations.

We first did a test run of causal diagrams versus conditioning on everything in our institutional kidney cancer database. The causal diagrams yielded results consistent with established knowledge whereas conditioning on everything produced only noise. This ended up being Figure 3 of our causal diagram review article. Dan and I wrote that review to teach ourselves first. After those results, we knew that the approach would be the right fit for the renal medullary carcinoma analyses.

Thank you for the link to your interesting review about causal diagrams. Regarding Figure 3B, you write in the article that “On the other hand, body mass index (BMI) and age are not confounders, mediators, or colliders, and controlling for these “neutral” variables would neither increase nor reduce bias when estimating the causal effect of RCC histologic subtype on overall survival.” However, if BMI and age are associated with overall survival, wouldn’t you gain a more precise hazard ratio estimate of RCC histologic subtype by including them in the model as well?

Good question. Strictly speaking, due to the non-collapsibility of the hazard ratio, these “neutral” variables would increase the power of null hypothesis tests and not the precision. See here also for additional description of the phenomenon in odds ratios. To keep that review simple, we chose to focus only on bias reduction. We do elaborate more on collapsibility and precision in this review that again uses kidney cancer as its motivating example.
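For readers unfamiliar with non-collapsibility, a small exact calculation illustrates it (hypothetical coefficients; the odds ratio is used because it is easy to compute by hand, though hazard ratios show the same phenomenon): a prognostic covariate that is independent of treatment, hence no confounding at all, still makes the marginal odds ratio differ from the conditional one.

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def odds(p):
    return p / (1 - p)

# A binary prognostic factor Z (50/50), independent of treatment X, with a
# conditional odds ratio for X of exp(2) at every level of Z.
b0, bx, bz = -2.0, 2.0, 2.0

def marginal_risk(x):
    # Average the *risks* over the distribution of Z (not the odds).
    return 0.5 * sigmoid(b0 + bx * x) + 0.5 * sigmoid(b0 + bx * x + bz)

or_conditional = math.exp(bx)
or_marginal = odds(marginal_risk(1)) / odds(marginal_risk(0))
print(f"conditional OR: {or_conditional:.2f}")  # 7.39 at every level of Z
print(f"marginal OR:    {or_marginal:.2f}")     # ~4.97, despite no confounding
```

The gap between 7.39 and about 4.97 is not bias from confounding; it is purely the non-collapsibility of the odds ratio, which is why "adjusted" and "unadjusted" estimates here target different estimands rather than one being a corrupted version of the other.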

If you follow Fisher’s article that I linked to, as well as the whole debate he had with others at the time on smoking and lung cancer you will note that his main arguments are causal. Even genius level causal intuition can be helped by explicit representation of the presumed causal assumptions. We were not Fisher but thankfully we had tools that he did not.

Have you identified specific points in his train of thought where he demonstrably floundered precisely because he lacked access to (e.g.) graphical methods? (Metacausal reasoning may be involved here — causal reasoning about causal reasoning.)

Yes, an easy one is that he did not accept this counterargument which is very easily understood with graphs. Note that we used the same kind of reasoning in our renal medullary carcinoma study (sensitivity analysis and supplementary figure 2).

I assume by “neutral” you mean not an effect modifier? Because these measures are non-collapsible, adjustment for a prognostic covariate tends to reduce bias (not so for collapsible binary effect measures), and of course in a large study the bias dominates. I believe, however, that a study has to be really small for loss of precision to dominate if there is gross heterogeneity of treatment effect across these covariates.

The authors of the odds ratio paper also state that “It should be noted that for large sample sizes the mean square error will be dominated by its bias rather than variance component, so that for sufficiently large samples the adjusted estimator will always be preferable.”

Another fantastic question that merits clarification. A major challenge I sense here again, as has happened in a prior thread, is the different terminologies / notations used between fields. @Sander belongs to the rare breed that can seamlessly connect them but they otherwise remain a great source of misunderstanding.

Fully cognizant that it can be an oversimplification, it helps in this case to adopt Pearl’s demarcation between statistical (associational) concepts such as collapsibility, odds ratios, hazard ratios, and regression, and causal concepts such as confounding, randomization, and instrumental variables. Our review focused on causal concepts, whereas all your questions refer to statistical concepts. Thus, when we say “neutral” we do not mean the absence of statistical effect modification (e.g., a multiplicative interaction in a regression model), but rather a variable that does not confound (in the causal sense), mediate, or serve as a collider for the exposure-outcome relationship of interest. This is described both in our aforementioned review and in other references offered there and in this thread.

The difference in terminologies between statisticians and causal inference methodologists applies to the term “bias” as well, and you are very correct that I should have clarified this. @AndersHuitfeldt excellently elaborated on this distinction in this thread. Your comment on bias follows the statistical definition and the related bias-variance trade-off. I am working on a draft paper where we do, for example, connect the bias-variance trade-off with causal concepts, consistent with the notion that the demarcation between “statistical” and “causal” is far from perfect even though in our particular thread here it helps us maintain focus.