Hi @f2harrell and @Drew_Levy. I would like to ask about some terminology and concepts, and how they relate to the framework presented in the RMS course.
Frameworks for prognostic research:
My understanding of prognostic research is largely based on the framework proposed in the PROGRESS series of publications, but even more on a publication by Kent et al. (2020), “A conceptual framework for prognostic research” https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-01050-7. I found it very helpful, including for a general understanding of the literature and published work (especially Table 1 in the Kent et al. paper).
However, in Chapter 1 of RMS (Uses of models: hypothesis testing, estimation, prediction), you do not seem to make a further distinction within the “prediction” type of model strategy. This made me wonder about your perspectives on the above frameworks, and whether there are reasons why you do not distinguish, for example, between studies/purposes aimed at “association”/“predictor finding” and those aimed at “prediction model development”.
Distinction between causal/etiology studies vs. prediction studies and implications for adjustment:
A further distinction is often made between causal/etiological and prediction research, as described, for example, by van Diepen et al. (2017), “Prediction versus aetiology: common pitfalls and how to avoid them” https://academic.oup.com/ndt/article/32/suppl_2/ii1/3056968?login=true. I think this distinction is related to the distinction between “estimation” and “prediction” uses of models made in Chapter 1 of RMS. According to van Diepen et al., confounding (and the corresponding adjustment) is not an issue in prediction research, whereas it is an issue in causal/etiological research. However, as I understood from the course, even if the purpose of a study is purely prognostic, we should always be concerned about proper adjustment where possible. Is this correct, and how does it relate to the view of van Diepen et al.? The van Diepen et al. perspective is intuitive to me, and I would appreciate any thoughts on it and on how it relates to the RMS course.
I also tried to use DAGs to think about confounder adjustment in both cases, etiological and prognostic studies. As I understand it, in etiological research we are interested in the effect of a particular variable X on an outcome Y, and in this case we need to adjust for confounders in order to estimate the effect of interest correctly. This is how I understand the use and application of DAGs as introduced in the course. However, when we want to develop a prediction model, we are interested in a whole set of variables (X1, X2, etc.) that can predict the outcome Y. So how do I know what to adjust for when there is no focus on a specific effect? In other words, how would I draw a DAG in this case: do I need to consider confounders of all effects between the Xs and Y simultaneously? I would be very grateful if you could help me untangle this confusion.
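To make the etiological case concrete, here is a minimal sketch using the dagitty R package (my choice of tool, not one named in the course materials), with hypothetical variables X (exposure), Y (outcome), and C (a common cause):

```r
# Hypothetical DAG: C confounds the X -> Y effect
library(dagitty)

g <- dagitty("dag { C -> X ; C -> Y ; X -> Y }")

# Which variables must be adjusted for to estimate the effect of X on Y?
adjustmentSets(g, exposure = "X", outcome = "Y")
#> { C }   (adjusting for C closes the backdoor path X <- C -> Y)
```

Note that the query is defined relative to one exposure-outcome pair; with many predictors and no single effect of interest, there is no analogous single adjustment set, which is exactly the puzzle above.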
These are great issues to bring up. I hope that @Drew_Levy can discuss the causal inference part of this in addition to anything else he wants to discuss.
In Chapter 4 I discuss 3 strategic goals at the end. Association assessment comes under the Estimation and Hypothesis Testing goals. But the text there neglects to emphasize that the importance of confounder adjustment is unique to these two goals and, as you stated, is not so important for a pure prediction task. You’re motivating me to improve this section, which I’ll start doing very shortly. I hope to update the online notes by 2023-05-30.
You are asking a difficult (and intelligent) question: difficult because I think a satisfying answer requires nuanced argument.
Reconciling all the many different perspectives and treatments of this problem (e.g., Kent et al.; van Diepen et al.) is a reach. So I will attempt to reconcile only what we covered in the RMS short course.
Fundamentally, we should prefer models that make good predictions (see McElreath, p. 13). This means not only models that predict well in the sample but also models that reliably predict future observations. We see from Frank and others that there are different ways to predict well. If you understand the data generating process (subject matter knowledge is available and reasonably complete), have collected the essential or necessary variables in that process, and carefully design your model in accord with the data generating process, you can expect that your model will likely perform well in the sample and in future observations (out of sample). This scenario recommends the causal models approach that I tried to represent in the RMS short course, as so many others have advocated (and done better than I). A model that captures the data generating process may be expected to reliably predict outcomes even among new observations, to the extent that it is faithful to the underlying causal mechanism.
However, a model might also predict well under a different scenario: you have a lot of information in a lot of variables, you include as much of that information as possible in your model (either because your sample size allows it or because you used data reduction techniques that preserve the essential information in the predictors), and you discipline that complex model so that it does not overfit the sample (e.g., with shrinkage), so that the model has acceptable expected predictive accuracy.
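As one illustration of that discipline, here is a minimal sketch using Frank's rms package, following the pentrace pattern from its documentation; the data set dat and the variable names are hypothetical:

```r
# Fit a flexible logistic model, then choose a penalty (shrinkage)
# by effective AIC and refit with it
library(rms)

f <- lrm(y ~ rcs(x1, 4) + rcs(x2, 4) + x3, data = dat,
         x = TRUE, y = TRUE)            # store design for validation
p <- pentrace(f, seq(0, 20, by = 0.5))  # scan candidate penalties
f_pen <- update(f, penalty = p$penalty) # refit with the chosen penalty

validate(f_pen, B = 200)  # bootstrap optimism-corrected indexes
```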
So there are these two approaches to a model with good expected predictive accuracy: one approach may be said to be more mechanistic and the other more empirical. Confounding complicates the former (causal or mechanistic) approach, but not so much the latter (empirical) approach. Of course, you are less able to draw conclusions about specific variables and effects in the latter approach; but if your interests and objectives (and incentives) do not include understanding the particular role of specific variables for outcomes, then the non-causal ‘empirical’ approach may be useful and gratifying. As the latter does not leverage causal (‘mechanistic’) understanding for its expected predictive accuracy, it does not benefit from a causal DAG in selecting variables for adjustment.
The important or essential part of your query, “how do I know what to adjust for when there is no focus on a specific effect”, suggests that your objectives are primarily prediction and your interest is primarily in optimizing expected predictive accuracy, so the latter scenario above seems most consistent with the motivation for your question. And it is Frank’s recommendations in RMS (such as data reduction maneuvers and shrinkage) that are especially helpful in disciplining the model to generate good expected predictive accuracy in the empirical scenario.
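For completeness, a minimal sketch of one such data reduction maneuver, unsupervised principal components, again with hypothetical data and variable names:

```r
# Replace a cluster of correlated predictors with its first
# principal component, then use that summary score in the model
library(rms)
library(survival)

pc <- prcomp(dat[, c("x1", "x2", "x3", "x4")], scale. = TRUE)
dat$pc1 <- pc$x[, 1]  # summary score; the outcome was never consulted

f <- cph(Surv(time, status) ~ pc1 + rcs(age, 4), data = dat)
```

Because the reduction never looks at the outcome, it spends no degrees of freedom on Y and does not contribute to overfitting.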
Thanks for your helpful insights and feedback! I will make sure to read Chapter 4 of RMS again. Your two comments have already helped me to clarify, and add some complexity to, the frameworks/perspectives mentioned.
Yes, it is very difficult to reconcile the different perspectives/frameworks! So thank you very much for taking the time to share your thoughts and experiences!
This is such a great summary, Frank - just sent the link to one more colleague, and have been using it ever since in my courses - thank you so much for this wonderful set of guidelines.
I have an RCT where there is an intermediate variable and an ultimate endpoint of survival (a time-to-event endpoint).
I want to establish that the intermediate variable is a causal mediator of the treatment effect on the ultimate endpoint, i.e., overall survival (OS).
I wonder how I can do that, and whether there is a specific R package that can help me with this. The difficulty here is that the confounders affect only the intermediate variable (the mediator) and the main endpoint (OS), while the treatment is randomized and not affected by confounders. So even though I tried to use the mediation package, I have been facing difficulty in formulating the two models. In my mind, the outcome model is …
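One possible formulation, offered as a minimal sketch rather than a definitive answer: the mediation package’s mediate() takes a mediator model and an outcome model, and for survival outcomes it accepts a parametric survreg fit (not coxph), as I understand its documentation. The variable names below (trt, med, time, status, z1, z2) are hypothetical:

```r
library(survival)
library(mediation)

# Mediator model: confounders enter here because they affect med
med_fit <- lm(med ~ trt + z1 + z2, data = dat)

# Outcome model: parametric survival model for OS, with treatment,
# mediator, and the mediator-outcome confounders
out_fit <- survreg(Surv(time, status) ~ trt + med + z1 + z2,
                   data = dat, dist = "weibull")

med_res <- mediate(med_fit, out_fit, treat = "trt",
                   mediator = "med", sims = 1000)
summary(med_res)  # ACME = indirect (mediated) effect, ADE = direct effect
```

Randomization removes treatment-mediator and treatment-outcome confounding, but the mediator-outcome relation is still confounded, which is why z1 and z2 appear in both models.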
My question is regarding g-estimation. Is it accurate to say that, for a continuous treatment, traditional regression estimates are g-computation estimates under the assumptions of (a) existence and consistency of counterfactuals, (b) positivity, and (c) rank preservation?
G-estimation and G-computation are different approaches to the estimation of causal quantities. I am not sure which one you are interested in, but your criteria are not correct in either case. The easiest way to see this is to consider a time-dependent confounder.
Hi Anders, I am sorry, but I did not understand how adding time as a confounder would turn this into g-computation. Here’s an example showing g-computation when treatment is binary:
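(A minimal sketch, with hypothetical names: A = binary treatment, L = confounder, Y = binary outcome.)

```r
fit <- glm(Y ~ A + L, family = binomial, data = dat)

# Standardization: predict everyone's risk under A = 1 and under A = 0,
# then average over the observed distribution of L
EY1 <- mean(predict(fit, newdata = transform(dat, A = 1), type = "response"))
EY0 <- mean(predict(fit, newdata = transform(dat, A = 0), type = "response"))
EY1 - EY0  # marginal risk difference from the g-formula
```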
G-computation was developed to handle situations where there is a time-dependent treatment and time-dependent confounders. If you want to understand how G-computation differs from regression, you have to think about situations where the treatment can change over time. There is no advantage to using G-computation if treatment does not change over time; this setting is only used as a toy example to illustrate how the method works in the simplest case.
G-computation can rely on regression models to estimate the components of the G-formula. Usually you need more than one model, and the idea of G-computation is that it shows you how to produce a prediction by putting the predictions from each model together correctly. In the special case of a toy example where treatment is time-fixed, you only need one model, and the predictions from G-computation will correspond to the predictions of that model. This will be the case regardless of the criteria you have listed (the first two of which are required for the validity of the G-formula and any other method for causal inference; the third is not relevant in this setting).
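A quick simulated check of that correspondence, under the (assumed) setup of a time-fixed binary treatment and a linear, no-interaction outcome model:

```r
set.seed(1)
n <- 1e4
L <- rnorm(n)                  # confounder
A <- rbinom(n, 1, plogis(L))   # time-fixed treatment
Y <- 1 + 2 * A + L + rnorm(n)  # outcome; true effect of A is 2
d <- data.frame(L, A, Y)

fit <- lm(Y ~ A + L, data = d)
gcomp <- mean(predict(fit, transform(d, A = 1))) -
         mean(predict(fit, transform(d, A = 0)))

c(regression = unname(coef(fit)["A"]), g_computation = gcomp)
# the two numbers agree exactly for a linear, no-interaction model
```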
Got it. I was assuming that to look at the effect of A, one has to consider only the intercept, because B, C, D = 0 and the intercept is what represents A. Thank you Frank; also, I will move this to a different thread.