Thinking Clearly about Association Studies (Risk Factors and Causal Salad included)

Following a brief email exchange with Frank, he asked me to post my questions here. I’ve made some light edits. Sorry for the wall of text in advance :wink: .
If you want to skip to the questions, go to the section “Questions”.

Context

I entered the world of science through (retrospective) clinical research in Radiology/Oncology, surrounded mainly by clinicians/biologists/scientists without formal statistical training. By now I have transitioned to (Clinical) Virology, not least to have time to foster my statistical interests and to gain easier access to prospective and experimental research.

For quite some years I have been in what I like to provocatively call an “epistemological crisis” about many of the retrospective studies to which I had initially been exposed.
You know the drill: We have disease/outcome Y, an interesting/new measurement x_1 and adjust for some other variables x_2, x_3, .... Then we claim association between x_1 and Y. Variations on the theme are:

  • looking at multiple similar outcomes Y_a, Y_b, Y_c, ... to show some “robustness of association”.
  • looking at multiple different outcomes (and then maybe only report those that are interesting, i.e. statistically significant for x_1 after varying covariate adjustments, …).
  • performing various (non-linear) transformations on x_1, including combination with other x.
  • Maybe even some prospective-ness or secondary/subgroup RCT analysis thrown in.
  • More recently (actually somewhat surprising to me), even so-called experimental work is often not experimental; rather, anything (wet-)lab-related gets claimed as experimental.

This feels connected to what Richard McElreath calls “causal salad”.

I have been highly conflicted about the value of these studies. In the beginning I went along, assuming a logical resolution would follow once I learned enough. However, my impression from the online (applied) stats community is that these studies are almost always useless and, in the bigger picture, harmful.

The applied researchers justify them as a “first step”, “you have to start somewhere”, and an “indirect glimpse” at “the actual question”.

Yet the applied researchers seem unable to characterize the (implicit) assumptions required for their studies to be useful: for instance, how “the actual question” and the “indirect glimpse” of a given study logically relate to one another, such that one couldn’t plausibly argue “well, the direction of the effect in question could also be the other way around”, reducing the entire argument of the study to “We (the authors) think the relationship in question is X”.

On top, as you know, it is often claimed that the goal of these studies is explicitly “association” and nothing else.

I generally agree with the impression that “association” is the purely descriptive (pun not intended) term that evolved into the “lowest common denominator”, stemming from a (well-intended) suspicion of causal claims in observational studies.

In my experience, when I ask applied researchers what the larger goal of these association studies is, and whether it’s descriptive, predictive, or causal/aetiological, I get, in my opinion, confused answers along the lines of “it’s association, which is an end in itself”.

Having been confused about this state myself: rationally, I lean towards this research being generally useless or even harmful; still, I am inclined to assume that my understanding might still be incomplete.

My hunch is that the majority of the applied (clinical) research community does not know, because many are essentially performing what Feynman called “cargo cult science”, mostly without malicious intent, I assume. At the same time, my (statistical) knowledge is largely based on intuition, so I often fail to put forward a critique that is simple enough to be easily understood while being mathematically sharp enough that it cannot be dismissed by (A) hand-waving, (B) what is, in my opinion, lip service to the typical items found in limitation sections, or (C) a claim that the critique is too technical. And sometimes I even encounter people with (some) formal stats training (e.g., epidemiologists, bioinformaticians, …) who seem to be fine with this general kind of scientific approach.

Questions

The (pointed) questions I want to ask your opinion(s) on are:

  • Can there be something like an “association study” that does not have a descriptive, predictive, or causal intent? Can there be any study at all that does not have one of the three goals?
  • Are most of said studies at best descriptive? With some misguided causal-wording under the hood?
  • If so, can they still be useful even if unknowingly descriptive? (Think of a probably misguided adjustment set chosen for causal reasons, e.g., “more adjustments are better adjustments” combined with dividing the number of observations by 10 or 30 to set the maximum number of covariables; or, in a prediction setting, (borderline) useless tools to assess added predictive value in a seemingly muddled prediction setting.)
  • How and where (!) do you walk the fine line between the two extremes:
    • “Every statistical output is at heart association. Wrong things will be weeded out as time goes to infinity if we trust the scientific process. So just go with the flow and on average it will be fine.”
    • “Any study that fails to clearly articulate its objective is likely useless or harmful and should therefore not be published.”

Thank you all in advance! :slight_smile:

8 Likes

Kezios went and looked, which is more than most critics of this practice have done. In a random sample of 100 observational epidemiologic studies, 13% stated explicitly causal aims and 69% used associational framing while clearly pursuing a causal question. Two were predictive. Two were descriptive. Fifteen had goals so vague that classification was impossible. The “association study” as a distinct scientific category is essentially a fiction: in practice it’s nearly always a causal study that has decided not to say so.

Her term for the dominant mode is seemingly causal. A specific exposure-disease relationship is front and center, but the aims are written as “to examine the independent association between X and Y” or “is X a risk factor for Y?” This isn’t a genuine fourth epistemological category sitting alongside description, prediction, and causation. It’s a rhetorical posture: causal intent with the causal accountability stripped out. The authors want you to read the finding causally. They just don’t want to be held to the standards that causal inference requires.

The alignment data make this concrete. Only 9% of studies achieved full alignment of goal, methods, and interpretation. But among explicitly causal studies the rate was 38%, versus 4% for the seemingly causal ones. Stating your goal in causal terms (“effect”, “cause”, “intervention”) more than triples your probability of actually executing the analysis correctly. The associational framing isn’t a neutral stylistic choice. It’s operationally corrosive. When you don’t name the causal question, you don’t build the DAG, you don’t specify the sufficient adjustment set, and you don’t feel obligated to defend your identifiability assumptions, because officially you never made any.

Among studies with unclear goals and adequately reported methods, 91% used an outcome-focused variable selection strategy, selecting covariates based on their association with Y alone, and every single one of them went on to interpret or discuss coefficients inappropriately. This is the “known risk factors for the outcome” heuristic and its companion, the EPV rule, operating exactly as trained. It feels principled. It produces a tidy multivariable table with p-values and confidence intervals. What it actually does is ignore the exposure-confounder relationship entirely, create systematic risk of mediator and collider inclusion, and generate effect estimates whose bias is uncharacterized and unacknowledged. The analysis looks like causal inference. It satisfies none of its requirements.
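A minimal simulation (mine, not from the Kezios paper) makes the mediator half of this concrete. Here M sits on the causal path X → M → Y, so M is strongly associated with Y and would pass any outcome-focused covariate screen; yet “adjusting” for it erases the very effect the study set out to estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Data-generating process: X -> M -> Y, so M is a mediator, not a confounder
x = rng.normal(size=n)
m = x + rng.normal(size=n)  # strongly associated with Y: passes an outcome-focused screen
y = m + rng.normal(size=n)  # total causal effect of X on Y is 1.0

def ols_coefs(outcome, *covariates):
    """Plain least-squares fit; returns coefficients with the intercept first."""
    design = np.column_stack([np.ones(len(outcome)), *covariates])
    return np.linalg.lstsq(design, outcome, rcond=None)[0]

b_unadjusted = ols_coefs(y, x)[1]   # recovers the total effect, ~1.0
b_adjusted = ols_coefs(y, x, m)[1]  # "adjusting" for the mediator erases it, ~0.0
print(f"X coefficient, unadjusted:     {b_unadjusted:.2f}")
print(f"X coefficient, adjusted for M: {b_adjusted:.2f}")
```

The multivariable table from the second fit looks more “thorough”, but its X coefficient no longer answers the question the study posed.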

I review manuscripts in veterinary medicine, where this methodology is essentially the default, so the Kezios numbers are a reasonable lower bound on the problem. Her sample came from the top five general epidemiology journals, fields with far more methodological infrastructure and causal-reasoning tradition than veterinary clinical research. If 91% of unclear-goal studies in those journals used outcome-focused variable selection and then over-interpreted the results, the corresponding figure in the equine or small-animal observational literature is unlikely to be lower.

Associationstudies.pdf (680.0 KB)

6 Likes

+1

Agree completely. When poorly-considered, poorly-designed, and poorly-reported clinical research gets press coverage, there’s no question it can harm patients. I’ve seen this happen countless times.

Here’s another recent publication you might find useful:

https://www.bmj.com/content/bmj/392/bmj-2025-085749.full.pdf

4 Likes

This is one way to think about stages and goals of research, with some overlap between them.

  • Initial data analysis, which is blinded to any linkage between X and Y and should precede both experimental and observational data analysis
  • Descriptive studies, which deal mainly with characterizing univariate distributions and relationships among Xs using correlation matrices and variable clustering and other unsupervised learning techniques
  • Descriptive studies of relationships between individual Xs and Y without prediction or inference
  • Prediction, where explanations are not very important and parsimony is not usually a goal
  • Association studies that try to find associations between some Xs and Y that are not trivial, i.e., are not explained by boring Xs. These are often improperly used to make causal statements, or may be used to inform a causal analysis, assuming that a factor needs to have a nonzero association with Y before you care about causation.
  • Development of causal diagrams using only subject-matter knowledge, which then feeds into the next item
  • Formal causal inference

An open question is whether association studies help or hurt the last two steps, i.e., whether empirical association analysis should inform the development of the causal diagram.

Then there is the always-important issue of how prospectively the study needs to be designed in the first place. It is extremely important to elicit expert advice about potential confounders before revealing to the experts which variables are available in the dataset. This avoids data availability bias.

6 Likes

The circular logic required to answer this question always does my head in. If everyone agrees that DAG-free studies touting “associations” are insufficient for causal inference, how can we then justify using these same studies as “evidence” to inform construction of a DAG for a subsequent observational study on the same topic (??)

Published associational studies (often derived from administrative databases) are often DAG-free, with weak supporting biological plausibility. Potential non-causal explanations for the findings can be easy to identify. There is a real risk of harm to patients from this type of research. Often, authors gaslight readers by saying that their study isn’t capable of proving causation, while simultaneously either implying, or outright advising, that they act on the results.

@Pavlos_Msaouel has been using DAGs elegantly to inform design of his oncology RCTs. The DAGs in his papers seem, clinically-speaking, eminently sensible. I’d be interested to hear how he develops them.

I’ve asked myself why Pavlos’ concise DAGs, applied for the purpose of improving RCT design, seem so much more credible to me (as a physician) than the few DAGs that I’ve seen constructed for the purpose of deriving causal inferences from observational studies (which look, to me, like hopelessly subjective birds’ nests). I thought of a few possible explanations. First, I suspect that the evidence underpinning his DAGs is of a much higher volume and rigour, as necessitated by the high-stakes nature of the drug development gauntlet (?) Second, it occurred to me that his incentives, and the incentives of drug sponsors, are very different from those of academic observational researchers. As non-scientific as this might sound, it’s human nature for us to factor in the researchers’ incentive structure when gauging the credibility of published research. And finally, the consequences of him getting the DAG “wrong” will work against him, not in his favour (as can potentially be the case for sloppily-constructed DAGs in the observational research realm).

Different types of researchers can be differentially incentivized to construct DAGs in the design phase of their studies. In using DAGs to inform oncology RCT design, it seems like Pavlos is trying to optimize assay sensitivity (the ability of his trials to detect signals of intrinsic therapeutic efficacy). If his DAGs are unreliable/incorrect, the consequence is not likely to be “false” signals of therapeutic efficacy (which drug sponsors could then milk for profit), but rather failure to detect signals of therapeutic efficacy that are actually present (which would cause patients to lose out on potentially life-lengthening therapeutic advances). His incentive (and that of the drug sponsor) to both use DAGs and to “get them right” is very strong.

The incentive structure around DAG construction seems more complicated for observational researchers than clinical trialists. An observational researcher who is motivated primarily by the prospect of contributing one strand to an evidentiary web for an important causal clinical question will, ideally, first survey subject matter experts to hone the clinical question and then construct a DAG. Only THEN will/should he examine his data source to see if he has the right data to address the question. If he is honest, he will likely OFTEN find that the administrative data at his disposal are insufficiently comprehensive/granular to inform the clinical question- and he will abandon his plan. In this situation, construction of a solid DAG has effectively penalized him personally (as he won’t be able to publish and advance his career- and, therefore, feed his family). And even if he DOES make an effort to construct a DAG, there’s a good chance that the quality of the DAG will be inversely proportional to the likelihood of the research getting published (if the DAG contributes to a less eye-catching result). This incentive misalignment does not reflect personal failure on the part of the researcher, but rather a corrupted research ecosystem.

The “publish or perish” culture in academic research needs to be addressed before good research practices will take root. The onus is on funders not to fund badly-designed research and on universities not to make promotion contingent on publication metrics.

4 Likes

Thank you all three of you for your input, this is highly appreciated!
The papers are now all added to my library :slight_smile:

I fully share the sentiment. Playing devils advocate:
How would you engage researchers who claim they in fact do not want me to read the finding causally? Let’s exclude malicious intent. In my experience the researchers in question have been brought up in this mode of thinking and are so entrenched in it that they truly believe there is a fourth epistemological category (though they don’t know about the “3 categories” in the first place).

Is the only way out going back to basics and teaching statistics from the ground up, to the point where they understand enough to see that the practice isn’t logically valid?

1 Like
  • IDA seems close to Tukey’s EDA. Is this observation correct?
  • Am I correct in reading f2harrell’s characterization of descriptive studies as: A descriptive study can hardly include a multivariable model of Y? (exceptions to this clunky rule are granted)
  • I agree with ESMD that I find this, and the idea of a non-zero association as a “first step”, unsettling. If we assume this might be true, again playing devil’s advocate: does this grant plausible deniability of any scientific confusion? Or, in the best case, does it make association studies somewhat justifiable?
    Do you happen to have more literature on the view that there might be some value in these studies?
1 Like

That is very kind of you. Most of our work in the lab and clinic involves experimental designs, which are well suited for causal diagrams both preclinically and clinically. When we do observational research it is usually connected with experimental models (e.g., this work evaluating high-intensity exercise as a risk factor for renal medullary carcinoma in the presence of sickle cell trait).

It is typically much harder to encode the causal assumptions of purely observational studies. But in our practice I have found the concept of Target Trials to be helpful, particularly because it emphasizes the importance of carefully specifying the “time zero” of the interventions. Doing so helps one notice, and potentially reduce, common design biases such as immortal time bias and collider selection bias. A recent high-profile example of collider selection bias here. These design biases can influence causal inferences at least as much as, or more than, other more commonly cited biases such as confounding.

There is indeed some misalignment between the “publish or perish” academic culture and truth-seeking. This is supported, e.g., by our (observational :blush:) data showing that academic oncology RCTs are more likely than industry ones to commit the Table 1 fallacy, omit key visual elements of survival plots, and incompletely report toxicity data. On the other hand, industry incentives can also be misaligned resulting in more frequent unsupported claims of differential treatment effects, and I recently read this fascinating history of adaptive Bayesian clinical trials (certainly of interest to datamethods contributors) discussing how the industry sponsor (and the academic investigators) chose to not report the ROAR trial using the estimates from the Bayesian hierarchical modeling of the primary endpoint and erroneously decided to report the frequentist estimates – despite the FDA prompting them to prioritize the Bayesian hierarchical analyses.

Also related to this thread, this Judea Pearl festschrift book chapter by @Sander superbly argues that every real data analysis rests on a causal model of the data-generating process, even for purely descriptive goals.

4 Likes

Agree. My collaborators frequently point out that I should use precise language, not vague language. I often lack confidence, so this must be my problem. Then, when I use causal language, they criticize me harshly, saying “it’s observational research, you shouldn’t use causal language!” :person_shrugging:

About three years ago, I strengthened the adjustment for confounding factors in an analysis and reported that the results were not statistically significant. My boss pointed out that no top journal would accept an article without statistical significance (I do not think so).
I insisted on not making any revisions and gave my reasons. Even without enough adjustments, the lower limit of the 95% confidence interval was 1.00. He then threatened to withdraw data support if I made the adjustments. Finally, I quit the project. :laughing:

5 Likes

This has also been confusing / bothering me for a long time. We use expert consensus to draw a DAG, but the expert consensus is based on poor studies and clinical experience of us biased humans.

I think it’s helpful to keep in mind that causal inference with DAGs is deductive. We try to agree on some theory (the DAG), and if we believe that theory to be true, our conclusions should be true as well. You could have a very different theory than @Pavlos_Msaouel and thus not believe the conclusions. I think that DAGs serve mostly a bureaucratic function (as does much of statistics, imo). They allow stakeholders to speak a common language and agree on some rules that guide decisions.

I’d also be very interested in @f2harrell 's view here. Personally, I’m not so sure / on the fence. It might still be useful to make ceteris paribus comparisons, which don’t necessarily have a causal interpretation. I could for example be interested in whether men have a larger VO2max than women, but I want to compare men and women of equal height and weight.

I could model (y \sim \text{gender} * (\text{height} + \text{weight})) and use that to predict some contrasts for men and women of different heights and weights. I think the interpretation of these results can still be descriptive, i.e. “men and women of equal height and weight have different VO2max”, vs. “being male causes higher VO2max”. Happy to be convinced otherwise though.
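For concreteness, here is a sketch of that contrast on simulated data, using the same formula. Every number below (effect sizes, the 175 cm / 70 kg evaluation point) is hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500

# Simulated data: men are, on average, taller, heavier, and have higher VO2max
gender = rng.choice(["F", "M"], size=n)
male = (gender == "M").astype(float)
height = rng.normal(170, 10, size=n) + 8 * male
weight = rng.normal(70, 10, size=n) + 8 * male
vo2max = 30 + 0.10 * height - 0.05 * weight + 8 * male + rng.normal(0, 3, size=n)
df = pd.DataFrame({"vo2max": vo2max, "gender": gender,
                   "height": height, "weight": weight})

# y ~ gender * (height + weight): gender-specific intercepts and slopes
fit = smf.ols("vo2max ~ gender * (height + weight)", data=df).fit()

# Descriptive contrast: predicted VO2max for a man vs. a woman
# at the same height (175 cm) and weight (70 kg)
grid = pd.DataFrame({"gender": ["F", "M"],
                     "height": [175.0, 175.0], "weight": [70.0, 70.0]})
pred = fit.predict(grid)
contrast = pred[1] - pred[0]
print(f"M - F difference in VO2max at 175 cm / 70 kg: {contrast:.1f}")
```

The contrast is read off the fitted surface at a fixed covariate point, which is exactly the “equal height and weight” comparison described above, with no causal claim attached.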

I’d appreciate some concrete input on the following:

  • We are interested in the pretest probability of a certain biomarker, based on some clinical variables
  • Our sample size for a true prediction model is too low and the study won’t get larger
  • There’s no literature on which clinical variables might be causal / predictive
  • Is it really better to not try to identify (through careful multivariable modeling strategies as taught in RMS) some variables that might be important to inform future data collection? Sure, our power is low, so we might miss some, but you still have to start somewhere

For example, we might find that radiotherapy appears to be an important predictor. This could be because of radiotherapy itself or because people who receive radiotherapy have some other important features. As long as we clearly acknowledge this, isn’t this at least worth knowing and potentially exploring?

6 Likes

Yes with the exception of hiding Y for most of the IDA.

Yes. Multivariable models can provide the best descriptive statistics because they can analyze more than two variables at a time. A few examples:

  • Showing the relative explained variation of a set of potential confounders in predicting treatment choice. It’s amazing how many papers using propensity scores fail to decode the scores.
  • Showing age-adjusted outcomes by strata
  • Computing the first canonical correlation relating a set of variables to another set of variables, to gauge the total strength of relationship between the sets
  • Using nonlinear principal components analysis to find pre-transformations of Xs (how one X relates to all the other Xs)
2 Likes

Blah- Jiaqi, you should write a book about your experiences some day :slightly_frowning_face:

Choosing not to draw a DAG because 1) you don’t know how; 2) are afraid it will delay your manuscript/publication; and/or 3) are afraid that a DAG could force more careful covariate adjustment and generate a less eye-catching result is one thing. These are passive errors- errors of omission. But the scenarios you describe go way beyond this- they represent an active effort to distort study results. This egregious type of behaviour most definitely reflects a personal failure on the part of the researcher.

I’m sure your experiences are not unique and they certainly don’t restore confidence in the observational research ecosystem.

4 Likes

I think some of the causal people take issue with this being called descriptive research, because confounders are a causal concept. For example here. Your emphasis on explained variation vs. presenting a table of conditional coefficients and (implicitly) interpreting them as causal is important, but maybe it’s also a bit of a slippery slope?

2 Likes

I don’t see this as a problem. I would not be trying to learn that a specific factor is what causes a treatment to be selected but rather (1) how non-random is the treatment choice and (2) what are the biggest players in predicting treatment choice. The latter could be deemed an association study.

2 Likes

Advocate here. Agree that there are many examples of the misuse of associations in observational studies that mislead and even harm patients.

On the other hand, even if not commonly applied, the natural history of the disease changes the degree of inference we can draw when an intervention is associated with a good outcome. High-quality natural history data can sometimes replace a placebo group in clinical trials. If the natural course is well documented and severe, a treated group showing better outcomes can provide a statistically valid inference of efficacy. An example: Tazverik received accelerated approval in 2020 for epithelioid sarcoma. Researchers established an external control arm to track the natural history of patients on standard therapies, which helped demonstrate the drug’s benefit in this rare, aggressive cancer. This is the rationale for FDA accelerated approvals.

2 Likes

Makes sense although I don’t know how you’d validate that the external controls mimic natural history and don’t have any strange selection process going on.

Check out the new paper from Hernan et al. where he questions some of the above:

I wonder if any exceptional responses [1] were observed. @karlamoPA would you be aware of any? Would there be a published study report with this kind of info?

  1. Marx V. Cancer: A most exceptional response. Nature. 2015;520(7547):389-393. doi:10.1038/520389a

Thanks. I think I might be misunderstanding you here. The example seems to imply that a descriptive analysis can very much include multivariable models, whereas in your earlier statement I thought you said it (usually/often) cannot. Would you mind clarifying once more for the slow thinker?

In the two sections I’m quoting, the examples sound descriptive to me, but somehow I’m still stumbling over the word “predict”. I might be nit-picking and overthinking… Do you see any issue in your (or the general) usage of “predict” in this and its widespread context? Or do you think the correct interpretation can easily be assumed from context?

For the example I gave, worth noting: Tazverik (tazemetostat) was voluntarily withdrawn from all global markets and indications on March 9, 2026, due to safety concerns. For dire indications with no effective therapy, arguably best supportive care can be a control that will be acceptable to some patients and to IRB reviewers.

1 Like