Should one derive risk difference from the odds ratio?

If I understand this correctly, your view would be contrary to known results in decision theory, where the admissible frequentist procedures are in the class of Bayes procedures. Gelman goes into detail about bias/variance trade-offs here:

“How can this be? Where does Bayes come in? For a Bayesian, the problem with the ‘bias’ concept is that it is conditional on the true parameter value. But you don’t know the true parameter value. There’s no particular virtue in unbiasedness.”

I’m willing to have this discussion in another thread, if you make one. It would require me to move out of my comfort zone into areas outside my expertise, with a much higher risk that I will say something wrong. That could still be an interesting discussion, one I might learn from, but I would prefer to clarify that it has no connection with this manuscript.

1 Like

Agree with R_cubed. Take the case of meta-analysis: any empirical set of weights must introduce bias (relative to natural weights), but we accept that because we only do a meta-analysis once and are therefore chasing the bias-variance trade-off. If that were not the case and we simply wanted unbiased results, the arithmetic mean of the study estimates would be fine - I believe Shuster argued for this in Statistics in Medicine a while back, but in my view that was incorrect.

I find the structure of this argument very strange. You seem to be arguing that because the statistical community has settled on a form of meta-analysis that introduces bias, accepting bias is therefore sometimes the right thing to do.

Maybe the statisticians are right. If so, reproduce their argument instead of just showing what conclusion they arrived at.

It is another issue worth exploring in another thread. The Bayesian skeptical-prior approach, with the interaction being “half in and half out” of the model, comes into play. To me this is the most rational way to deal with the variance-bias trade-off. Some general idea of the sample size needed before any interactions should be added to a model would be helpful.

I think the concepts and methods for all this were pretty much worked out long ago. By the 1990s there was a literature on allowing product terms and nonlinear trend terms “partially” in the model using estimated (empirical-Bayes) or partially-specified (semi-Bayes) priors on the terms. A key advantage of these partial-Bayes or partially penalized methods is that they do not require complete prior specification, and their theory and computation fall neatly under hierarchical (multilevel) and penalized regression. They were briefly discussed in Modern Epidemiology Ch. 21 (in both the 2nd ed. 1998 p. 427-432 and 3rd ed. 2008 p. 435-439). A discussion of how to use ordinary ML logistic regression software to carry them out on product terms is on p. 199 of Greenland S. “Bayesian methods for epidemiologic research. II. Regression analysis.” Int J Epidemiol 2007;36:195-202.

As discussed in those cites, for all analyses involving higher-order terms it is crucial to properly re-center and re-scale all covariates in these terms as necessary to ensure that main-effect and intercept estimates are stably estimated and have sensible interpretations.

Sample size is automatically accounted for by the relative weighting of the prior and the likelihood function implicit in Bayes’ rule. For a typical mean-zero prior on a higher-order term and little data information on departures from the main effects, that means the estimated higher-order terms (e.g., their partial-posterior mean or mode) will be close to zero, and the final model may not provide very different predictions from the model one gets from selection procedures that kick out higher-order terms with high P-values. When there is more data information on the terms, the predictions will look like a coherent averaging of the all-in and all-out (main-effects-only) models. Here’s one of my overviews of that perspective: Greenland S. “Multilevel modeling and model averaging”. Scand J Work Environ Health 1999;25 (suppl 4):43-48. And here’s a primer on the shrinkage concepts behind the methods: Greenland S. “Principles of multilevel modelling”. Int J Epidemiol 2000;29:158–167.
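To make the idea concrete, here is a minimal sketch in R (my own toy illustration, not the data-augmentation procedure described in Greenland 2007): a mean-zero normal prior is placed on the product term only, and the model is fit by penalized maximum likelihood, so the interaction is “partially in” the model to a degree set by the prior SD and the sample size.

```r
# Semi-Bayes-style fit: mean-zero normal prior on the product term only
# (hypothetical simulated data; tau is the assumed prior SD).
set.seed(1)
n <- 500
x <- rbinom(n, 1, 0.5)                       # treatment
z <- rbinom(n, 1, 0.4)                       # covariate
y <- rbinom(n, 1, plogis(-1 + 0.8 * x + 0.5 * z + 0.3 * x * z))

X   <- cbind(1, x, z, x * z)                 # design matrix with product term
tau <- 0.35                                  # prior SD: 95% prior interaction-OR range exp(+/-1.96*0.35), about (0.5, 2)

negpost <- function(b) {
  eta <- drop(X %*% b)
  -sum(y * eta - log(1 + exp(eta))) +        # binomial negative log-likelihood
    b[4]^2 / (2 * tau^2)                     # normal(0, tau^2) prior on x*z only
}

fit <- optim(rep(0, 4), negpost, method = "BFGS", hessian = TRUE)
se  <- sqrt(diag(solve(fit$hessian)))        # approximate posterior SEs
round(cbind(estimate = fit$par, se = se), 3)
```

With small samples the x*z estimate is pulled toward zero; as n grows the likelihood dominates the prior and the estimate approaches the ordinary ML fit.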

3 Likes

Excellent. The scaling and centering can be done “in secret” without being seen by the user. In many applications, Bayesian modeling with Stan involves a hidden centering step in which the entire design matrix is replaced by its QR decomposition (for orthonormalization); the posterior draws of the parameters are then back-transformed upon completion of posterior sampling. This behind-the-scenes work can also be done in other contexts. The net effect, for one example, is that you don’t care which basis functions you use, i.e., extreme collinearity doesn’t matter. But on the bigger picture, I favor any method that results in properly wider uncertainty intervals that account for how much we don’t know. Some empirical Bayes methods (e.g., DerSimonian and Laird) underestimate variances. But I’m getting away from the main points …
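A minimal sketch of that idea in plain R (using glm as a stand-in for the Stan machinery; the data are invented): fit in the orthonormal Q basis, then back-transform the coefficients to the original scale.

```r
# QR reparameterization done "in secret": same estimates, better conditioning.
set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)                  # nearly collinear with x1
X  <- cbind(x1, x2, x1x2 = x1 * x2)
y  <- rbinom(n, 1, plogis(0.5 * x1 - 0.3 * x2))

Xc  <- scale(X, center = TRUE, scale = FALSE)  # hidden centering
qrX <- qr(Xc)
Q   <- qr.Q(qrX)                               # orthonormal columns
R   <- qr.R(qrX)

fitQ <- glm(y ~ Q, family = binomial)          # fit in the well-conditioned basis
beta <- backsolve(R, coef(fitQ)[-1])           # back-transform to the X scale
cbind(qr_backtransformed = beta,
      direct_fit = coef(glm(y ~ X, family = binomial))[-1])  # agree up to rounding
```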

“The scaling and centering can be done “in secret” without being seen by the user.”

No, sorry, but those hidden steps are insufficient: they only address fitting instabilities and can still leave the resulting output unintelligible. The scaling and centering I’m talking about need to be done before fitting and then reported by the user, to ensure the final output coefficients aren’t artefactually tiny and the lower-order terms don’t refer to impossible covariate values. E.g., for diastolic blood pressure (DBP) x we’d need to convert from mmHg to cmHg recentered around 80 mm = 8 cm, so the coefficients involving DBP refer to deviations from 8 cm (the clinical reference for normotensive), while for smoking z we’d need to convert from cigs/day to packs/day, etc. If we didn’t do that, the xz coefficient would be tiny even if important, since it would refer to tiny fractional units of mm*(1 cigarette), and the smoking coefficient would refer to the effect of 1 cig/day on people with zero blood pressure (i.e., the dead). This is explained on p. 392 of Modern Epidemiology 3rd ed., also in Greenland 2007 (above) p. 199, and p. 93-94 of Greenland and Pearce “Statistical foundations for model-based adjustments”, Annual Review of Public Health 2015;36:89-108.
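For concreteness, a minimal sketch of that pre-fit recentering and rescaling (simulated data; the variable names are illustrative only):

```r
# Recenter/rescale to clinically meaningful units BEFORE fitting.
set.seed(2)
n   <- 1000
dat <- data.frame(dbp_mmHg = rnorm(n, 90, 12),
                  cigs_per_day = rpois(n, 8))
dat$case <- rbinom(n, 1, plogis(-3 + 0.02 * (dat$dbp_mmHg - 80) +
                                0.03 * dat$cigs_per_day))

dat$dbp_cm <- (dat$dbp_mmHg - 80) / 10   # cmHg above the 80 mmHg reference
dat$packs  <- dat$cigs_per_day / 20      # packs/day instead of cigarettes/day

fit <- glm(case ~ dbp_cm * packs, family = binomial, data = dat)
summary(fit)$coefficients
# dbp_cm: +1 cmHg at 0 packs/day; packs: +1 pack/day at the 80 mmHg reference;
# the product term is no longer in mmHg-times-single-cigarette units.
```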

That is just flat wrong by itself: there’s a 50-year-long literature on shrinkage/ridge/Stein/empirical-Bayes/semi-Bayes/penalized/hierarchical (multilevel) regression explaining exactly how to account for the apparent cheating by using data and background information to achieve better estimates than the extremes of all-in and all-out models. And it’s all easy to apply with standard software. It’s even briefly introduced in Modern Epidemiology Ch. 21 (see 2nd ed. 1998 p. 427-432, 3rd ed. 2008 p. 435-439). These methods (when used as indicated, a caution that applies to every method) outperform traditional methods in theory, simulation, and real applications, and so are part of standard statistical toolkits in some areas of engineering - but not in health-med, which often lags by generations in certain branches of statistical methodology (as seen in this very thread). That you were apparently not taught about them in practical depth is indicative of the problem that few grad students acquire skills outside the set possessed by their instructors and mentors.

1 Like

OK, I’ll trust you on this. It certainly goes against my intuition, and I don’t understand the technical details, so I can’t claim to believe it on a gut level. But this is a topic where I know that I don’t have a full understanding and that you do, which is sufficient to get me to 99% confidence that you are right, so I concede the point. This isn’t really what I was here to discuss anyway; sorry for sidetracking with a discussion on which I may have been wrong.

1 Like

Great, but I hope you don’t just trust me on this and instead someday get some direct knowledge of the topic, because in terms of practical importance for the kind of controversies that have undermined the credibility of epidemiology I think it’s far more general and important than (say) longitudinal causal modeling (as important as that is where needed). Because I put a lot of effort into explaining how these methods work for epidemiology, I’m going to list a sequence of just a few of the dozens of articles I’ve written on the topic, which (after reading the intro passage in Modern Epid 3) may give you a sense of how your intuition went astray and can be upgraded to modern statistics (as usual they are available directly from me if you have any access problems) - at a minimum every epidemiologist with pretenses to understanding modern regression methodology should read the first one:
Greenland S. Principles of multilevel modelling. Int J Epidemiol 2000;29:158–167.
Witte JS, Greenland S, Kim LL, Arab LK. Multilevel modeling in epidemiology with GLIMMIX. Epidemiology 2000;11:684–688.
Greenland S. When should epidemiologic regressions use random coefficients? Biometrics 2000;56:915–921.
Greenland S. Hierarchical regression for epidemiologic analyses of multiple exposures. Environ Health Perspect 1994;102(suppl 8):33–39.
Also, here’s an article directly about how the Bayesian ideas in these methods arise in bias analysis and modeling the relation of ORs to baseline risk (the topic which started this thread):
Greenland S. Sensitivity analysis, Monte-Carlo risk analysis, and Bayesian uncertainty assessment. Risk Analysis 2001;21:579–583.

6 Likes

I handle the scaling problems you mentioned by not interpreting regression coefficients “as they stand” but by forming clinically relevant contrasts. I don’t modify the meaning of model parameters in order to achieve this, since it can be handled on the back-end.

1 Like

That may be OK if you are the analyst and always remember to recenter and rescale each output. But that is more laborious than just doing both upfront, so it is no wonder I see plenty of articles where recentering and rescaling didn’t happen and the tabulated estimates refer to impossible treatment levels or are unintelligibly small and thus hard to rescale to a reasonable range. That’s why I advise teaching recentering and rescaling in data preprocessing as part of good practices for analysis-database construction, along with careful documentation of all transforms. It’s also good analytical hygiene, as it makes it much easier to spot problems mid-analysis instead of at the back end, such as evidence of coding errors that produce estimates of suspicious size in context; plus it avoids display underflows that can plague outputs involving higher-order terms.

We have different styles on that. I’ve written a lot of easy-to-use backend R tools to handle all of that, without need for reparameterizing/scaling/centering.

“I’ve written a lot of easy-to-use backend R tools to handle all of that, without need for reparameterizing/scaling/centering.”

You are confusing what you do and can do with what typical users do and can do. At most I’d say a conscientious project ought to have someone in charge of database management who is attending to necessary cleaning, logical and biological checking, and essential construction of analysis variables (like BMI, which is already the higher-order term weight/height^2, with attendant problems; see Michels, Greenland & Rosner. “Does body-mass index adequately capture the relation of body mass and body composition to health outcomes?”, Am J Epidemiol 1998;147:167-172). In my experience that data manager can much more easily deal with recentering and rescaling before regression and have it be done for good than anyone can afterward.

Or do you really think typical users know how to transform treatment main effects and their SEs to account for the shifting covariate-specific intercepts when treatment is in a product term? That isn’t an issue at all if the recentering and rescaling are done before regression; the program will then take care of it automatically. Even if your R proc does all the necessary calculations (does it recalculate SEs?), do you think most researchers or their assistants are going to know and use R, or have someone on their payroll who knows R and your procs and has time to port their outputs into your procs or write their own for that? Or that they should trust that the R procs have been as thoroughly field-tested as SAS or Stata to ensure important bugs have been caught and removed?
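For readers following along, here is a minimal sketch (my own invented example) of the back-end recalculation being debated, for a treatment indicator tx in a product term with a covariate z that is never near zero:

```r
# Treatment effect and SE at a clinical reference value z0, recovered by hand
# from an uncentered fit (what a back-end tool must do), vs. the automatic
# answer from recentering z before fitting.
set.seed(4)
n  <- 800
tx <- rbinom(n, 1, 0.5)
z  <- rnorm(n, 80, 10)                       # e.g., DBP in mmHg
y  <- rbinom(n, 1, plogis(-3 + 0.5 * tx + 0.02 * z + 0.01 * tx * z))

fit <- glm(y ~ tx * z, family = binomial)
b   <- coef(fit); V <- vcov(fit)
z0  <- 80                                    # clinical reference level
eff <- unname(b["tx"] + z0 * b["tx:z"])      # log odds ratio for tx at z = 80
se  <- sqrt(V["tx", "tx"] + z0^2 * V["tx:z", "tx:z"] + 2 * z0 * V["tx", "tx:z"])
c(effect = eff, se = se)

# Same numbers appear directly as the "tx" row when z is recentered first:
summary(glm(y ~ tx * I(z - 80), family = binomial))$coefficients["tx", ]
```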

Finally, please don’t mix up reparameterization with recentering and rescaling - even though the latter can be viewed as trivial special cases of reparameterization, they are much more basic operations for maintaining the interpretability of variables, coefficients, and program outputs.

This is something I have specialized in for many decades, and I wholeheartedly disagree with you. The tools I’ve developed deal with the issues you listed. For example, contrast(fit, list(age=50, treatment='A'), list(age=50, treatment='B')) will provide the needed contrast regardless of whether age is linear or age and treatment interact. I don’t like the practice of creating datasets that do what you mentioned on the front end. Some like to center on the median instead of the mean. Tools in the R rms package are coding-independent. I know that you will need to have the last word on this, but I’ll leave it at that.
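A runnable sketch of that back-end approach with the rms package (the model and data here are my own invented illustration; the contrast call is the one quoted above):

```r
# Back-end contrast at age 50, with SE and CI, no recentering of the data.
library(rms)
set.seed(6)
d <- data.frame(age = round(runif(300, 30, 80)),
                treatment = factor(sample(c("A", "B"), 300, replace = TRUE)))
d$y <- rbinom(300, 1, plogis(-3 + 0.03 * d$age + 0.5 * (d$treatment == "B")))

dd <- datadist(d); options(datadist = "dd")
fit <- lrm(y ~ age * treatment, data = d)

contrast(fit, list(age = 50, treatment = "A"),
              list(age = 50, treatment = "B"))
```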

Damn right I’ll go for the last word, because this apparent side issue brings us back full circle to the underlying tension behind the odds ratio debate. It’s a tension squaring off statisticians who prioritize numerical and computational properties over contextual and causal properties against those who place scientific context over decontextualized statistical abstractions.

Also, I’ve specialized in the centering/scaling issue for at least as long as you and have publications back to the mid-1980s to prove it, so don’t try to argue from authority with me (not that you should do that with anyone). And on that topic, what are you “wholeheartedly disagreeing” with and why? What exactly is the evidence that doing it at the front end is worse than doing it at the back end? As far as I can see you are just being defensive in the usual “that’s how I’ve always done it” way of senior statisticians, and as with the OR debate turning a blind eye to the clinical causal perspective.

We both agree that recentering and rescaling are essential at some point, whether at the front or the back end, right? I am simply saying that my observation from decades of teaching is that students and clinicians (most of whom were definitely not Frank Harrell) find it much easier to comprehend these transforms when they are done first, because they can understand that (for example) a 1 mm change in blood pressure is not clinically meaningful but 1 cm is, so it is better to reframe the entire analysis in meaningful, simple, transportable units. This is a cognitive observation, not the algebraic or statistical issue you seem to be making it out to be.

Which brings me to your comment that “some like to center on the median instead of the mean”. That raises the separate issue of transportability of estimates, which was thoroughly botched by the statistical community for over half a century and only recently worked out correctly - by the causal-inference community (specifically by epidemiologic methodologists and computer scientists, not mainstream statisticians). When product terms are in play, transportability tells us to center at a standard clinical reference level like 80mmHg for blood pressure, because that ensures the main treatment effect refers to a level understood by clinicians. Sample means, medians etc. vary across studies, hence centering based on the data harms transportability of the estimated main effects. The same objection applies to rescaling by SDs instead of clinically meaningful units, as arises with “standardized” coefficients - an objection made by Tukey in the 1950s!

Statisticians seem averse to using contextually or clinically meaningful and transportable centering and scaling. I think that’s because it requires more attention to the topic and less to the particular sample at hand, and requires the statistician to surrender some authority to the clinical context. Letting the sample numbers drive these choices instead, as statisticians seem to prefer, is in my view replacing contextual information with statistical numerology (which sadly characterizes much statistical advice and practice, as is seen with claims of superiority of odds ratios for measuring effects). There are technical advantages to mean recentering (e.g., it improves the orthogonality of estimator components, thus minimizing numerical instabilities and speeding convergence), but as you noted that can be done hidden inside the fitting software and undone before the output is reported.

I am thinking this discussion could result in a great book at some time in the future to help everyone understand how to resolve discrepant perspectives - if such a resolution ever occurs in this thread. Most of the statistical arguments above are way above my training, so I will return to basics.

There is some true data generating process (DGP). Frank’s default is that this should be homogeneous, based on experience in clinical trials that interactions rarely occur. Sander points out the lack of power for interactions in trials. And there have been comments about the mechanisms of homogeneity, etc.

Let’s use a simple biochemical example of mixing substrates and enzymes in a solution to obtain a product. The rate of conversion is temperature dependent. Let’s say we double the temperature from 10 deg to 20 deg and the rate doubles. And when we double the temperature from 20 deg to 40 deg, the rate again doubles. Now, what is the true DGP? We are likely tempted to say it is a multiplicative DGP with rate = k*temp. Fair enough. But what if we now measure temperature on the Fahrenheit or Kelvin scale? The same relationship becomes, e.g., rate = k*(tempF - 32)*5/9, which is no longer a simple proportionality: our formula becomes much more complicated and requires interactions. Note that we haven’t even discussed changing the outcome scale, just the scale of the predictor.
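A quick numerical check of that scale-dependence point (my own toy numbers):

```r
# Rate proportional to temperature in Celsius is not proportional in Fahrenheit.
k     <- 0.1
tempC <- c(10, 20, 40)
rate  <- k * tempC
rate[2] / rate[1]             # 2: doubling 10 -> 20 deg C doubles the rate
rate[3] / rate[2]             # 2: doubling 20 -> 40 deg C doubles it again
tempF <- tempC * 9 / 5 + 32   # same temperatures in Fahrenheit: 50, 68, 104
rate / tempF                  # not a constant ratio on the Fahrenheit scale
```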

If we want to know the accumulation of product, we need to add a formula for its clearance in addition to the one above for its production. Once that is done, we have the change in concentration. If one wants binary variables, we can consider dichotomizing the concentration as above or below a particular threshold that will cause disease.

I use this example for a couple of reasons. First, human biology is based on substrates and enzymes and is exposed to temperature, so the example is directly related to functions within cells. To link this back to terms more relevant in epidemiology, we can consider that cancer occurs when certain types of DNA damage are present. The rate of DNA damage corresponds to the rate of production of the product, and DNA repair to the rate of clearance. The DNA damage present depends on the difference between these rates.

Second, it shows that many of our conclusions begin with the most elementary assumptions, which we often overlook. One of the things I think Jamie Robins has emphasized, and that is too often overlooked, is the importance of telling the story that you think underlies the data generating process. Maybe some of the differences above could be resolved by each person clearly telling the story that leads to their underlying DGP. I had originally written much more below and then deleted it; my post is already too long. I would just like to read the stories. I personally find it easy to come up with stories for DGPs that act as a multiple of risks (RR), or as a summation of risks (RD), when thinking at the biological level. But I find it harder to come up with stories for DGPs that act as a multiple of odds at the biological level, with the exception of exponential decay rates for radiation. But that might just be my lack of knowledge or understanding.

3 Likes

Would you be able to formalize the biological story for a data generating process that leads to stability of the risk difference? Or stability of the risk ratio when the exposure increases incidence of the outcome?

I am not saying it is impossible, but I will claim I have never seen a convincing example. I would be surprised if one exists.

I have seen several claims of a biological story that results in stability of the risk difference. These are often taught in epidemiology courses and are based on causal pie models. In my view, this argument only works when exposure increases incidence, and it is really a subtly incorrect argument for stability of the survival ratio (which approximately implies stability of the risk difference when the outcome is rare). The general version of this argument is contained in our preprint at [2106.06316] Shall we count the living or the dead?

1 Like

Anders: Since at least the 1920s there have been mechanistic models that lead to additive risk differences, and others that lead to multiplicative risk ratio models. Some simple basics derived from potential-outcome models (and known for generations before) are covered in Ch. 5 of Modern Epidemiology 3rd ed. 2008. By the 1950s there was the Armitage-Doll multi-stage carcinogenesis model which for uncommon diseases produced rate additivity when the two factors acted only at one and the same stage, and multiplicativity when they each acted at a stage distinct from the other; there were other implications such as power laws for age incidence (as observed for typical carcinomas) under certain conditions. By the 1980s there was a fair bit of literature on these topics, e.g., see Weinberg CR, “Applicability of the simple independent-action model to epidemiologic studies involving two factors and a dichotomous outcome”. Am J Epidemiol 1986;123:162–173.
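A toy numerical illustration (mine, not from the cited papers) of that qualitative point about the multistage model: incidence is proportional to the product of the stage transition rates, so factors acting on different stages combine multiplicatively in the rate ratio, while factors adding to the same stage’s rate combine additively in the excess rate ratio.

```r
# Toy multistage sketch: 4 stages with arbitrary baseline transition rates.
stage_rates <- c(1, 2, 0.5, 1.5)
incidence <- function(fx = 1, fz = 1, same_stage = FALSE) {
  r <- stage_rates
  if (same_stage) {
    r[1] <- r[1] * fx + r[1] * (fz - 1)   # x and z both add to stage 1's rate
  } else {
    r[1] <- r[1] * fx                     # x acts on stage 1
    r[2] <- r[2] * fz                     # z acts on stage 2
  }
  prod(r)                                 # incidence ~ product of stage rates
}
incidence(2, 3) / incidence()                     # 6 = 2 * 3 (multiplicative)
incidence(2, 3, same_stage = TRUE) / incidence()  # 4 = 1 + (2-1) + (3-1) (additive excess)
```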

I’ve noticed your posts indicate you know next to nothing about the literature that preceded your student days, and from that you commit the fallacy of writing as if various inquiries and results didn’t exist before you came along (very millennial, according to one stereotype). That is pretty much the norm - as Kuhn lamented, much science teaching is ahistorical or grossly oversimplified, with emphasis on a few famed main branches and mythologized individuals, pretending the copious side branches never existed. Hence most writers write as if what happened to be covered in their program is all that ever existed (and sometimes, especially regarding statistics, as if what they learned is all that anyone will ever really need to know).

But in reality, thanks to the information explosion that has followed the population, economic, publishing, and computing explosions since the Second World War, no teaching program or person covers or knows more than a tiny fraction of what has been done, even in specialty fields like epidemiology. The best we can do is recognize that the probability that no one studied an issue before we learned of it keeps diminishing exponentially in tandem with that information explosion.

I will however give you credit for noting (as I have) the apparent absence of mechanistic models that generate a homogeneous causal odds ratio. Even if there were such a mechanism in nature, unlike additive mechanisms its existence would be obscured by odds-ratio noncollapsibility.