As a summer exercise I have been testing code to standardize the exploration of the concept of “vibration of effects”.
In essence, the code estimates the effect of a variable X1 on survival while repeatedly refitting the model over B combinations of K covariates.
For example,
cph(south ~ X1 + X2 + X3 …)
cph(south ~ X1 + X2 …)
cph(south ~ X1 + X3 …)
There can be thousands of ways to combine the confounding variables.
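As a minimal sketch of what such a loop does (assumptions: a small simulated dataset and ordinary least squares stand in for the Cox models fit with cph; all variable names and effect sizes here are invented for illustration, but the logic — refit over every covariate subset and collect the exposure coefficient — is the same):

```python
# Vibration-of-effects sketch: refit the model for every subset of the
# candidate covariates and record the exposure coefficient each time.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=(n, 5))                        # candidate covariates Z1..Z5
X = 0.5 * Z[:, 0] + rng.normal(size=n)             # exposure, confounded by Z1
Y = 1.0 * X + 0.8 * Z[:, 0] + rng.normal(size=n)   # true exposure effect = 1.0

def exposure_coef(cols):
    """OLS of Y on X plus the chosen covariate subset; return X's coefficient."""
    design = np.column_stack([np.ones(n), X] + [Z[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return beta[1]

# Refit for all 2^5 = 32 covariate subsets.
estimates = {
    cols: exposure_coef(cols)
    for k in range(6)
    for cols in itertools.combinations(range(5), k)
}
print(f"{len(estimates)} models; exposure estimate ranges "
      f"{min(estimates.values()):.2f} to {max(estimates.values()):.2f}")
```

In this toy setup the models that include the true confounder Z1 cluster near the true effect, while the models omitting it are biased upward — the spread between the two groups is exactly what the vibration graph displays.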
To test the code I compared the effect of treatment A vs. B, with up to 15 confounding variables, so that there would be at least 10 degrees of freedom per covariate. I show the graph I obtained, which is impressive. Most curious of all, with some combinations of variables the effect is reversed, sometimes with statistical significance.
However, given the inflation of type I error due to multiplicity, I believe these p-values cannot be interpreted in the standard way.
So I wanted to ask your expert opinion.

I’m trying to think of which statistical principles would give rise to such a method and I’m at a loss. This seems to be far afield from a rational Bayes approach where you encode all model uncertainties (e.g., which model form to use; which variables to use; which transformations to apply to the variables) then pay the price by having a wider posterior distribution that accounts for all these uncertainties. The uncertainty in the exposure effect will be of key interest.

I read the article because Ioannidis usually says interesting things. I think the spirit was to show that the basic conclusions of an observational study on the effect of X can change drastically depending on the set of covariates used as confounding factors (e.g., researchers play with the model but don’t reflect that in the Methods). As I understood it, the idea was to expose this problem graphically. My concern was that the p-values really were not truly interpretable (multiplicity).

I concur with “albertoca”. Ioannidis usually says interesting things, and the purpose of his paper was probably to call attention to a known problem in both observational and experimental research (i.e., RCTs). I would have been extremely surprised if parameters from models with different sets of variables had been consistent (i.e., if they had shown little “vibration of effects”). High variability should be expected because each estimate of the association is by definition a different estimate (each is estimating a different association). The vast majority of them are biased, and only a small fraction would be unbiased (i.e., only a small fraction would be “causal” parameters). Questioning observational studies on the basis of these results is therefore itself questionable. If one only includes in the model sets of variables that block all biasing pathways, then the estimates will be consistent (asymptotically). It is unfair to judge observational research based on flawed arguments.

It is not good statistical practice to remove potential confounders because of any observed relationships with the exposure variable or with the outcome. Overadjustment is better than underadjustment.

I agree with both of you that the selection of variables to include in a model should not be based on their observed effects on the outcome or the exposure. I’m not sure about “overadjustment” being better than “underadjustment”. Do you mean we should adjust for X if we’re doubtful whether X is a confounder or not? It seems to me that in those cases we should present both analyses, with and without adjustment for X, and let the reader make her own call. After all, adjusting for a non-confounder could introduce bias and will likely decrease precision.

No, no. Don’t present both analyses, as biased readers will pick the one that agrees with their bias. Yes, adjust for X if it may be a confounder, since the data are not reliable for informing us about whether it really is a confounder. This is why you don’t use statistical tests to select variables for propensity scores. Note also that any attempt at parsimony using supervised learning will result in invalid standard errors (too small) for the exposure effect (and for all other effects remaining in the model).
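A toy Monte Carlo makes the unreliability of test-based selection concrete (assumptions: a linear model rather than a Cox model, a single weak confounder Z, and the naive rule “drop Z if its t-statistic is below 1.96”; all numbers are invented for illustration). Averaged over many replicates, the always-adjust estimator centers on the true effect, while the test-then-drop estimator does not:

```python
# Compare "always adjust for Z" with "drop Z when it tests non-significant".
import numpy as np

rng = np.random.default_rng(2)
n, reps = 300, 500
adjusted, selected = [], []
for _ in range(reps):
    Z = rng.normal(size=n)                        # weak confounder
    X = 0.7 * Z + rng.normal(size=n)              # exposure, caused by Z
    Y = 1.0 * X + 0.15 * Z + rng.normal(size=n)   # true exposure effect = 1.0
    D = np.column_stack([np.ones(n), X, Z])
    beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
    resid = Y - D @ beta
    cov = (resid @ resid / (n - 3)) * np.linalg.inv(D.T @ D)
    adjusted.append(beta[1])
    if abs(beta[2] / np.sqrt(cov[2, 2])) > 1.96:  # Z looks "significant": keep it
        selected.append(beta[1])
    else:                                          # Z "non-significant": drop, refit
        D0 = np.column_stack([np.ones(n), X])
        b0, *_ = np.linalg.lstsq(D0, Y, rcond=None)
        selected.append(b0[1])

mean_adjusted = float(np.mean(adjusted))
mean_selected = float(np.mean(selected))
print(f"always adjust: {mean_adjusted:.3f}; test-then-drop: {mean_selected:.3f}")
```

Because the test has only moderate power against a weak confounder, Z is dropped in a large share of replicates, and precisely in those replicates the unadjusted estimate absorbs the residual confounding — so the selection-based estimator drifts away from the truth.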

The issue that concerns me is that I cannot know to what extent authors have altered the vector of confounding variables according to observed effects, but I would like to know to what extent the data were sensitive to such manipulations. The question is whether including a graph such as the one above as an annex to the analysis of observational studies could be useful in this sense.

There we differ. I don’t want to know what could have happened had authors used a bad analytical strategy by attempting to be parsimonious at the wrong time. But you have a good point in this sense: if the authors removed any variables they should disclose that, and then also do the correct analysis.

I’d still do an analysis with and without adjustment for X and report both estimates. I don’t want other people to inherit my biases. Supervised learning (machine learning?) was not what I was thinking of when I argued that a more parsimonious model may be preferable. Suppose a causal effect could be identified by adjusting for either of two sets of variables: (X, W, Z) and (X, V). The model including (X, V) will be as valid as the model including (X, W, Z) and will likely be more precise. It should, therefore, be preferred. On the other hand, we will never know which sets of variables an investigator tried when fitting his model, unless he himself tells us. If he presents a “vibration of effects” graph, I would be very concerned about his findings, as this provides evidence that he has “tortured the data” to make them confess something that matches his beliefs. In other words, he has probably fooled himself. A way around this problem is to pre-define, based on subject matter and mathematical rules, which variables should be included in the model, stick to those variables, and report and justify any deviation from the proposed model (i.e., inclusion of new variables or exclusion of pre-selected variables). Of course, this would be a significant change in current practice, but it’s already on its way.

IMHO reporting both adjusted and unadjusted analysis is a dangerous bias-producing strategy, and the unadjusted estimates don’t even have a scientific definition. Good points about pre-specified models.

Suppose one is certain X and W are confounding factors. One also knows Z is a risk factor for the outcome, but is uncertain if Z is a cause of the exposure. If in fact Z is not a cause of the exposure, then adjusting for X, W, and Z will result in a valid estimate of the effect of the exposure on the outcome. But adjusting for X and W will also result in a valid estimate of the effect, and it will be more precise (since a parameter for Z was not really needed). I fail to see why the X and W-adjusted estimate has no scientific interpretation.

It will actually be less precise to not adjust for Z, because of, e.g., non-collapsibility of the odds ratio. In my book it is a statistical mistake to fail to account for easily accountable outcome heterogeneity (e.g., to ignore the fact that older patients die sooner, even if the entire age distribution is balanced across exposure levels). For linear models, omitting Z will not bias the exposure estimate, but it will make it much less precise, because omitting Z inflates the residual variance σ².
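The precision claim for the linear case can be checked directly (assumptions: a randomized exposure X independent of Z, a strong outcome predictor Z, invented coefficients; the standard errors are the usual model-based OLS ones):

```python
# Adjusting for a pure outcome predictor Z shrinks the SE of the X coefficient.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=n)                         # exposure, independent of Z
Z = rng.normal(size=n)                         # strong outcome predictor
Y = 1.0 * X + 2.0 * Z + rng.normal(size=n)

def ols_se(design, y):
    """Model-based standard error of the X coefficient (column 1)."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return float(np.sqrt(cov[1, 1]))

se_unadjusted = ols_se(np.column_stack([np.ones(n), X]), Y)
se_adjusted = ols_se(np.column_stack([np.ones(n), X, Z]), Y)
print(f"SE without Z: {se_unadjusted:.3f}; SE with Z: {se_adjusted:.3f}")
```

With these numbers the residual variance falls from roughly 5 to roughly 1 once Z enters the model, so the standard error of the exposure coefficient shrinks by a factor of about √5, even though Z is balanced across exposure levels.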

I think I understand your reasoning. More or less, you mean that you already know that a bad specification damages the model, so you don’t see the point of making a graph of vibrating effects. What I’m wondering is whether the graph conveys something useful about the underlying structure of the database. The reason is that the vibration-of-effects method involves refitting the model thousands of times with all possible combinations of the covariate vector, rearranged in every conceivable way. My question is whether the distribution of those thousands of estimates, by the law of large numbers, carries any generalizable information about the underlying structure of that database.
For example, there may be databases with a covariate structure such that the effect estimates are stable against the researchers’ choices. In other databases, this may not be the case.

I think I fall on the side that the problems with inference outweigh anything you can meaningfully discover. Here, while it might seem the procedure allows you to exhaustively probe every possible combination, haven’t you only defined one of many possible estimands you might want/need to get out of the data? Similarly, one or more of the covariates may be mediators or effect modifiers, in which case the model is misspecified and the p-values worthless anyway. On the other hand, if you’re certain your model is correct and meaningful in the form of Y ~ X + W, then variations would be due to incomplete confounding control (incorrect functional forms), measurement error (e.g. if certain models contain more or less error-prone variables), or parametric issues such as non-collapsibility or non-positivity. I think you’re right that it could graphically represent the range of output of a particular form given data, but then you are still fooled into thinking that you have exhaustively represented the possible distributions of even something like an average treatment effect.

I think Ioannidis was trying to take another stab at nutritional epi but then got caught up in stretching his point too far by trying to make inferences from this exercise.

However, I do think there is some sense behind what he’s doing as a kind of sensitivity analysis in the following way:

People could reasonably disagree about the causal structure of the data in an area where there’s not much experimental data to back up assumptions.

One could therefore come up with different plausible DAGs and different sets of covariate adjustments.

Running all of these would be an honest way of doing your analysis.

Beyond that, I’m not sure what the point of this is.