Should one derive risk difference from the odds ratio?

Yesterday, I said that I was leaving this discussion. After having slept on it, I changed my mind. My work is being publicly ridiculed by a Professor of Clinical Epidemiology who clearly has not made even a rudimentary attempt to understand the structure of our argument. I consider this a shocking violation of academic ethics. This calls for a public response, even if I am unable to give one within the constraints of Frank’s request for civil discourse.

The structure of our argument is that we consider four types of “switches” which determine whether treatment has an individual-level effect:

  1. If no switch is present, treatment has no effect
  2. If a switch of type B is present, treatment is a sufficient cause of the outcome
  3. If a switch of type C is present, treatment is a sufficient cause of not having the outcome
  4. If a switch of type D is present, absence of treatment is a sufficient cause of not having the outcome
  5. If a switch of type E is present, absence of treatment is a sufficient cause of the outcome

If the effect of treatment is determined only by one of these types of switch (in all members of the population), and if the prevalence of that switch is stable between groups, then a characteristic effect measure associated with the switch will be stable. The particular effect measure depends on the type of switch.
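
To make the arithmetic concrete (my informal notation here, not necessarily the notation used in the manuscript): if a type-B switch is present with prevalence p, independently of a group’s baseline risk a, then the risk under treatment and the resulting survival ratio are

$$
r_1 = a + (1 - a)\,p
\qquad\Rightarrow\qquad
\frac{1 - r_1}{1 - a} = \frac{(1 - a)(1 - p)}{1 - a} = 1 - p ,
$$

which is the same in every group sharing that prevalence, whatever its baseline risk.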

We provide a thought experiment to illustrate how this works for the special case of a switch of type B. This leads to stability of the survival ratio. We clearly state that if the effect were determined by a different type of switch, then another effect measure might be stable instead. We then proceed to show the biological reasons why switches of type B are more plausible than switches of type E in most settings where an exposure increases risk of the outcome.

Nowhere do we express “surprise” that the survival ratio is stable. The purpose of this section is to illustrate how our causal model works, to give the reader an opportunity to evaluate whether it captures their intuition about how the biological system works.

All of this is very clearly explained in the manuscript. Having to repeat it on this forum is a waste of my time, and even worse, a waste of the time of readers. Methodologists will give me a limited “budget” for how much of their attention they are willing to give, and I am not happy about having to spend this on refuting hostile bad-faith criticism on matters that are very clearly explained in the paper.

It’s not good practice to make things personal or to make threats on a blog. If you don’t like the idea, call it nonsense, but even that will not stack up without a valid reason. If you feel that criticism of ideas amounts to ridicule, a violation of ethics, or some other social problem, then why are you posting?

Coming to your idea: it does not stack up on the math, regardless of the philosophical backdrop.
If r0(Y) = a and r1(Y) = b, then you define b as follows in your paper:
b = a + (1 - a)(0.01)
therefore
b = 0.99a + 0.01
1 - b = 1 - 0.99a - 0.01
1 - b = 0.99(1 - a)
RR(AH) = (1 - b)/(1 - a) = 0.99(1 - a)/(1 - a) = 0.99
Thus the math is clear on the point that you predefined RR(AH) = 0.99, so why are you surprised that it’s stable at 0.99?

3 Likes

What threats did I make??

I did not “predefine” that the survival ratio is 0.99. Rather, I predefined a causal model, and showed that this model implies that the survival ratio equals one minus the prevalence of B. The maths involved is simple, and you have summarized it here. The idea is to illustrate the logic of this mode of reasoning, in its simplest possible form. Later, I provide a way to extend this kind of reasoning to settings where there are multiple types of switches interacting with each other.
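
For readers who prefer to see this numerically, here is a minimal simulation sketch of the type-B model (toy code of my own, not taken from the manuscript; the prevalence of B is set to 0.01 as in the algebra above):

```r
# Toy simulation of the type-B switch model: a type-B switch, carried with
# prevalence 0.01 independently of baseline risk, makes treatment a sufficient
# cause of the outcome. The survival ratio should then sit near 1 - 0.01 = 0.99
# regardless of the baseline risk.
set.seed(1)
n   <- 1e6
p_B <- 0.01

survival_ratio <- function(baseline_risk) {
  has_B <- rbinom(n, 1, p_B)             # carriers of a type-B switch
  y0    <- rbinom(n, 1, baseline_risk)   # outcome without treatment (background causes only)
  y1    <- pmax(y0, has_B)               # under treatment, B-carriers also get the outcome
  (1 - mean(y1)) / (1 - mean(y0))        # survival ratio, treated vs untreated
}

sapply(c(0.05, 0.20, 0.50), survival_ratio)   # all close to 0.99
```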

I also give you a different causal model, based on switches of type E. With that model, you can show (using similar maths) that the standard risk ratio is stable. Then I discuss at length why the model based on switches of type B reflects biology better than models based on switches of type E.

This is, to me, the least controversial part of the discussion, and I concur. I’d just add that it is a useful exercise to find out whether this process is made worse or better by allowing for a non-constant treatment effect on some “reasonable scale”. When the sample size is not large, adding even “real” interaction terms to models may give worse predictions. A very simple example of this was given here. This is just the old bias-variance tradeoff, and I wanted to point out that the performance of personalization methods is sample-size dependent.
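
To make that concrete, here is a rough sketch of the kind of simulation one could run (not the example linked above; all numbers are arbitrary): generate data from a logistic model that really does contain a product term, fit models with and without it at a modest sample size, and compare out-of-sample predictive performance. Whether the product-term model wins depends on the sample size and on how large the interaction is.

```r
# Rough sketch: does adding a *real* product term improve out-of-sample prediction
# at a modest sample size? (Arbitrary illustrative numbers; higher mean log-likelihood is better.)
set.seed(2)
sim_once <- function(n_train, n_test = 5000, b_int = 0.5) {
  gen <- function(n) {
    x1 <- rbinom(n, 1, 0.5)
    x2 <- rbinom(n, 1, 0.5)
    p  <- plogis(-1 + 0.8 * x1 + 0.8 * x2 + b_int * x1 * x2)  # true model has an interaction
    data.frame(x1, x2, y = rbinom(n, 1, p))
  }
  train <- gen(n_train)
  test  <- gen(n_test)
  main  <- glm(y ~ x1 + x2, family = binomial, data = train)   # no product term
  full  <- glm(y ~ x1 * x2, family = binomial, data = train)   # with product term
  mean_loglik <- function(fit) {
    p_hat <- predict(fit, newdata = test, type = "response")
    mean(test$y * log(p_hat) + (1 - test$y) * log(1 - p_hat))
  }
  c(main = mean_loglik(main), full = mean_loglik(full))
}

rowMeans(replicate(200, sim_once(n_train = 100)))   # compare average out-of-sample log-likelihoods
```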

4 Likes

I am converted - I will not look at product terms in GLMs the same way again. If you remember, my first post on this blog was about this topic, but at that time it was a bit fuzzy in my mind; it’s clearer now.

Just to take an example: if we analyze the data example in Sander’s part 2 paper in JCE, the correct OR for treatment is 4.5, and what we get per stratum or by adding a product term is just an artifact of the sample - right?

1 Like

I am not sure I am looking at the same table as you… Are you seriously arguing that in Table 2, which is used to show heterogeneous treatment effects, the appearance of heterogeneity is just an artifact of small sample size, and moreover, that you can somehow infer that the true conditional odds ratio is 4.5 in both strata? Where does the number 4.5 even come from?

This is not what Frank was arguing. He was arguing that even if there is true treatment effect heterogeneity, the bias due to omitting the interaction term may be less problematic than the loss of precision from including the interaction parameter. I am not fully sold on that argument, but this is a fairly standard view in statistics which a reasonable person might argue for. I am genuinely confused about how someone would think this means that any apparent differences between strata have to be an artifact…

My personal view is that trading off bias for precision is “cheating” and gives you uninterpretable inferential statistics. In my opinion, the correct move is to accept that the standard errors will be large unless we can increase the sample size. I would rather have an honest but imprecise prediction. But that is an argument for another day, and not really related to this thread.

If I understand this correctly, your view would be contrary to known results in decision theory, where the admissible frequentist procedures are in the class of Bayes procedures (or limits of them). Gelman goes into detail about bias/variance trade-offs here:

"How can this be? Where does Bayes come in? For a Bayesian, the problem with the “bias” concept is that it is conditional on the true parameter value. But you don’t know the true parameter value. There’s no particular virtue in unbiasedness."

I’m willing to have this discussion in another thread, if you make one. It would require me to move out of my comfort zone into areas outside my expertise, with a much higher risk that I will say something wrong. That could still be an interesting discussion, something I might learn from, but I would prefer to clarify that it has no connection with this manuscript.

1 Like

Agree with R_cubed. Take the case of meta-analysis: any empirical set of weights must introduce bias (relative to natural weights), but we accept that because we only do a meta-analysis once, and therefore we are chasing the bias-variance trade-off. If this were not the case and we simply wanted unbiased results, the arithmetic mean of the study estimates would be fine - I believe Shuster argued for this in Stat Med a while back, but in my view that was incorrect.

I find the structure of this argument very strange. You seem to be arguing that because the statistical community has settled on a form of meta-analysis that introduces bias, accepting bias must sometimes be the right thing to do.

Maybe the statisticians are right. If so, reproduce their argument instead of just showing what conclusion they arrived at.

It is another issue worth exploring in another thread. The Bayesian skeptical-prior approach, with the interaction “half in and half out” of the model, comes into play here. To me that is the most rational way to deal with the variance-bias tradeoff. Some general idea of the sample size needed before any interactions should be added to a model would be helpful.

I think the concepts and methods for all this were pretty much worked out long ago. By the 1990s there was a literature on allowing product terms and nonlinear trend terms “partially” in the model using estimated (empirical-Bayes) or partially-specified (semi-Bayes) priors on the terms. A key advantage of these partial-Bayes or partially penalized methods is that they do not require complete prior specification, and their theory and computation fall neatly under hierarchical (multilevel) and penalized regression. They were briefly discussed in Modern Epidemiology Ch. 21 (in both the 2nd ed. 1998 p. 427-432 and 3rd ed. 2008 p. 435-439). A discussion of how to use ordinary ML logistic regression software to carry them out on product terms is on p. 199 of Greenland S. “Bayesian methods for epidemiologic research. II. Regression analysis.” Int J Epidemiol 2007;36:195-202.

As discussed in those cites, for all analyses involving higher-order terms it is crucial to properly re-center and re-scale all covariates in these terms as necessary to ensure that main-effect and intercept estimates are stably estimated and have sensible interpretations.

Sample size is automatically accounted for by the relative weighting of the prior and the likelihood function implicit in Bayes’ rule. For a typical mean-zero prior on a higher-order term and little data information on departures from main effects, that means the estimated higher-order terms (e.g., their partial-posterior mean or mode) will be close to zero, and the final model may not provide very different predictions from the model one gets from selection procedures that kick out higher-order terms with high P-values. When there is more data information on the terms, the predictions will look like a coherent averaging of the all-in and all-out (main-effects-only) models. Here’s one of my overviews of that perspective: Greenland S. “Multilevel modeling and model averaging”. Scand J Work Environ Health 1999;25(suppl 4):43-48. And here’s a primer on the shrinkage concepts behind the methods: Greenland S. “Principles of multilevel modelling”. Int J Epidemiol 2000;29:158–167.
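
For readers who want a quick feel for the “partially in the model” idea without setting up the full machinery in those references, here is one crude stand-in (a sketch only, not the data-augmentation recipe described in the 2007 paper above): ridge-penalize only the product term in a logistic regression, leaving the intercept and main effects unpenalized. The penalty plays the role of a mean-zero prior on the product term; its strength would normally be chosen from prior information or by cross-validation rather than fixed as below.

```r
# Crude stand-in for a mean-zero prior on a product term: ridge-shrink only that
# term, leaving the intercept and main effects unpenalized (via glmnet's penalty.factor).
library(glmnet)

set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
y  <- rbinom(n, 1, plogis(-0.5 + 0.7 * x1 + 0.4 * x2 + 0.3 * x1 * x2))

X   <- cbind(x1 = x1, x2 = x2, x1x2 = x1 * x2)
fit <- glmnet(X, y, family = "binomial",
              alpha = 0,                     # alpha = 0 -> ridge (quadratic) penalty
              penalty.factor = c(0, 0, 1),   # penalize only the product term
              lambda = 1)                    # penalty strength (stands in for prior precision)
coef(fit)  # product-term coefficient is pulled toward 0; main effects are left alone
```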

3 Likes

Excellent. The scaling and centering can be done “in secret” without being seen by the user. In many applications, Bayesian modeling with Stan involves a hidden centering step in which the entire design matrix is replaced with its QR decomposition (for orthonormalization); the posterior draws of the parameters are then back-transformed upon completion of posterior sampling. This behind-the-scenes work can also be done in other contexts. The net effect, for one example, is that you don’t care which basis functions you use, i.e., extreme collinearity doesn’t matter. But on the bigger picture, I favor any method that results in properly wider uncertainty intervals that account for how much we don’t know. Some empirical Bayes methods (e.g., DerSimonian and Laird) underestimate variances. But I’m getting away from the main points…
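
For concreteness, the back-transform is just linear algebra; here is a bare-bones sketch with ordinary least squares standing in for posterior sampling (decompose X = QR, fit on the orthonormal Q, then recover the coefficients on the original scale via R):

```r
# Bare-bones illustration of the QR trick (ordinary least squares standing in for
# posterior sampling): fit on the orthonormal Q, then back-transform to the
# coefficients on the original design-matrix scale.
set.seed(4)
n <- 500
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
X <- cbind(X, x1x2 = X[, "x1"] * X[, "x2"])    # add a (collinear-ish) product column
y <- 1 + X %*% c(0.5, -0.3, 0.2) + rnorm(n)

decomp <- qr(X)
Q      <- qr.Q(decomp)
R      <- qr.R(decomp)
gamma  <- coef(lm(y ~ Q))[-1]        # coefficients on the orthonormal basis
beta   <- backsolve(R, gamma)        # back-transform: beta = R^{-1} gamma
cbind(qr_beta = beta, direct_beta = coef(lm(y ~ X))[-1])   # the two agree
```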

“The scaling and centering can be done “in secret” without being seen by the user.”

No, sorry, but those hidden steps are insufficient, as they only address fitting instabilities and can still leave the resulting output unintelligible. The scaling and centering I’m talking about needs to be done before fitting and then reported by the user, to ensure the final output coefficients aren’t artefactually tiny and the lower-order terms don’t refer to impossible covariate values. E.g., for diastolic blood pressure (DBP) x we’d need to convert from mmHg to cmHg, recentered around 80 mm = 8 cm, so the coefficients involving DBP refer to deviations from 8 cm (the clinical reference for normotensive), while for smoking z we’d need to convert from cigs/day to packs/day, etc. If we didn’t do that, the xz coefficient would be tiny even if important, since it would refer to tiny fractional units of mm*(1 cigarette), and the smoking coefficient would refer to the effect of 1 cig/day on people with zero blood pressure (i.e., the dead). This is explained on p. 392 of Modern Epidemiology 3rd ed., in Greenland 2007 (above) p. 199, and on p. 93-94 of Greenland and Pearce, “Statistical foundations for model-based adjustments”, Annual Review of Public Health 2015;36:89-108.
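
In code, the preprocessing is a one-liner per variable; here is a hedged sketch with made-up data (the variable names dbp_mmHg, cigs_per_day, and y are mine, purely for illustration):

```r
# Illustrative only: simulate raw-scale measurements, then recenter/rescale
# before fitting, as described above.
set.seed(5)
n   <- 1000
dat <- data.frame(dbp_mmHg = rnorm(n, 90, 12), cigs_per_day = rpois(n, 8))
dat$y <- rbinom(n, 1, plogis(-3 + 0.03 * (dat$dbp_mmHg - 80) + 0.05 * dat$cigs_per_day))

dat$dbp_c <- (dat$dbp_mmHg - 80) / 10   # cmHg, centered at 80 mmHg = 8 cm (normotensive reference)
dat$packs <- dat$cigs_per_day / 20      # packs/day rather than cigarettes/day

fit <- glm(y ~ dbp_c * packs, family = binomial, data = dat)
summary(fit)
# The 'packs' coefficient now refers to +1 pack/day at DBP = 80 mmHg (not at DBP = 0),
# and the product term is per (1 cmHg) x (1 pack/day), so its coefficient is not
# artefactually tiny the way a per-mmHg-per-cigarette coefficient would be.
```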

"My personal view is that trading off bias for precision is “cheating” and gives you uninterpretable inferential statistics."

That is just flat wrong by itself: there’s a 50-year literature on shrinkage/ridge/Stein/empirical-Bayes/semi-Bayes/penalized/hierarchical (multilevel) regression explaining exactly how to account for the apparent cheating by using data and background information to achieve better estimates than the extremes of the all-in and all-out models. And it’s all easy to apply with standard software. It’s even briefly introduced in Modern Epidemiology Ch. 21 (see 2nd ed. 1998 p. 427-432, 3rd ed. 2008 p. 435-439). These methods (when used as indicated, a caution that applies to every method) outperform traditional methods in theory, in simulation, and in real applications, and so are part of standard statistical toolkits in some areas of engineering - but not in health-med, which often lags by generations in certain branches of statistical methodology (as seen in this very thread). That you were apparently not taught about them in practical depth is indicative of the problem that few grad students acquire skills outside the set possessed by their instructors and mentors.

1 Like

OK, I’ll trust you on this. It certainly goes against my intuition, and I don’t understand the technical details, so I can’t claim to believe it on a gut level. But this is a topic where I know that I don’t have a full understanding and that you do, which is sufficient to get me to 99% confidence that you are right, and I therefore concede the point. This isn’t really what I was here to discuss anyway; sorry for sidetracking with a discussion on which I may have been wrong.

1 Like

Great, but I hope you don’t just trust me on this and instead someday get some direct knowledge of the topic, because in terms of practical importance for the kinds of controversies that have undermined the credibility of epidemiology, I think it’s far more general and important than (say) longitudinal causal modeling (as important as that is where needed). Because I put a lot of effort into explaining how these methods work for epidemiology, I’m going to list just a few of the dozens of articles I’ve written on the topic, which (after reading the intro passage in Modern Epidemiology 3rd ed.) may give you a sense of how your intuition went astray and how it can be upgraded to modern statistics (as usual, the articles are available directly from me if you have any access problems). At a minimum, every epidemiologist with pretensions to understanding modern regression methodology should read the first one:
Greenland S. Principles of multilevel modelling. Int J Epidemiol 2000;29:158–167.
Witte JS, Greenland S, Kim LL, Arab LK. Multilevel modeling in epidemiology with GLIMMIX. Epidemiology 2000;11:684–688.
Greenland S. When should epidemiologic regressions use random coefficients? Biometrics 2000;56:915–921.
Greenland S. Hierarchical regression for epidemiologic analyses of multiple exposures. Environ Health Perspect 1994;102(suppl 8):33–39.
Also, here’s an article directly about how the Bayesian ideas in these methods arise in bias analysis and modeling the relation of ORs to baseline risk (the topic which started this thread):
Greenland S. Sensitivity analysis, Monte-Carlo risk analysis, and Bayesian uncertainty assessment. Risk Analysis 2001;21:579–583.

6 Likes

I handle the scaling problems you mentioned by not interpreting regression coefficients “as they stand” but by forming clinically relevant contrasts. I don’t modify the meaning of model parameters in order to achieve this, since it can be handled on the back-end.
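
For instance (reusing the made-up blood-pressure/smoking setup sketched a few posts up), one can fit the model on whatever raw scales are convenient and then compute a clinically framed contrast on the back-end:

```r
# Back-end contrast from a model fit on raw scales (same made-up data-generating
# setup as the earlier sketch; the covariate settings below are just examples).
set.seed(5)
n   <- 1000
dat <- data.frame(dbp_mmHg = rnorm(n, 90, 12), cigs_per_day = rpois(n, 8))
dat$y <- rbinom(n, 1, plogis(-3 + 0.03 * (dat$dbp_mmHg - 80) + 0.05 * dat$cigs_per_day))

fit_raw <- glm(y ~ dbp_mmHg * cigs_per_day, family = binomial, data = dat)

ref <- data.frame(dbp_mmHg = 80, cigs_per_day = 0)    # clinical reference: DBP 80 mmHg, non-smoker
new <- data.frame(dbp_mmHg = 90, cigs_per_day = 20)   # DBP 90 mmHg, one pack/day
log_or <- predict(fit_raw, newdata = new, type = "link") -
          predict(fit_raw, newdata = ref, type = "link")
exp(log_or)   # odds ratio for this clinically framed contrast, whatever the model's coding
```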

1 Like

That may be OK if you are the analyst and always remember to recenter and rescale each output. But that is more laborious than just doing both upfront, so it is no wonder I see plenty of articles where recentering and rescaling didn’t happen and the tabulated estimates refer to impossible treatment levels or are unintelligibly small and thus hard to rescale to a reasonable range. That’s why I advise teaching recentering and rescaling in data preprocessing as part of good practice for analysis-database construction, along with careful documentation of all transforms. It’s also good analytical hygiene, as it makes it much easier to spot problems mid-analysis instead of at the back end, such as evidence of coding errors that produced estimates of suspicious size in context; plus it avoids the display underflows that can plague outputs involving higher-order terms.

We have different styles on that. I’ve written a lot of easy-to-use back-end R tools to handle all of that, without the need for reparameterizing, scaling, or centering.