Hi all, I’m new here but have taken a lot from these forums over the last months. This is, I think, a very simple question.
There’s debate in epi regarding the relative utility of odds vs risks in studies with binary outcomes. The former have favourable mathematical properties (not constrained to be between 0 and 1), while the latter are more intuitive and easier to interpret.
Some have suggested that, rather than estimating odds ratios, one can use log-binomial models (which often fail to converge) or Poisson models with a robust variance estimator to estimate risk ratios directly. This makes sense to me.
But why can’t we just fit logistic regression models to calculate the fitted odds and then use the inverse logit transformation to calculate risks (probabilities)?? For a continuous exposure, we could then plot the probability of the outcome across the entire range of the exposure. For categorical exposures, this would give us the (adjusted) risks in each group.
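To make the idea concrete, here is a minimal sketch of the inverse logit step. The coefficients `b0` and `b1` are hypothetical values chosen for illustration, not estimates from any real model; in practice they would come from a fitted logistic regression.

```python
import numpy as np

def inv_logit(eta):
    """Map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical fitted coefficients (illustration only):
# log-odds of the outcome = b0 + b1 * exposure
b0, b1 = -3.0, 0.05

# Predicted risk across the range of a continuous exposure.
exposure = np.linspace(0, 100, 11)
risk = inv_logit(b0 + b1 * exposure)

for x, p in zip(exposure, risk):
    print(f"exposure={x:5.1f}  predicted risk={p:.3f}")
```

The curve is sigmoidal on the probability scale even though the model is linear on the log-odds scale, so the risk difference for a fixed change in exposure depends on where along the exposure range you start.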
But I’ve never seen this done so I presume that it’s incorrect. My intuition about why this is wrong is related to the sigmoidal shape of the logistic function, but I still can’t quite manage to grasp it… If anyone could help me clarify this, it would be much appreciated.
Why indeed. It’s not only the easiest way, it is the most appropriate way. I can never understand why it is a good use of people’s time to have to deal with log binomial or Poisson models for this context.
Huh! I asked a biostatistician about this and they dodged the question and just recommended I use a Poisson model with the sandwich variance estimator. It led me to conclude (along with never seeing it done in the literature) that calculating fitted probabilities must be somehow incorrect. It certainly makes the most sense to me, so maybe I’ll stick with it. If anyone has any resources/example reports on how to present logistic regression results on the probability scale, I would love to see them (I can’t recall if this comes up in RMS, so I’ll check there in the meantime).
It’s in RMS and also see this and several of my blog articles at fharrell.com where I use differences in two predicted probabilities from a logistic model. The use of the log link or Poisson regression for binary Y comes from a mistaken belief that risk ratios are good effect measures.
For binary outcomes that are not collapsed from counts, a Poisson or a negative binomial model would be inappropriate because Poisson models require a count outcome. Am I incorrect?
Yes, because I tell students that we use Poisson with count variables to model incidence rate ratios. I have not seen it used with a 0/1 variable before. I guess it could fit a Poisson distribution but would have a very short right skew.
This topic of model-based risk and rate estimation was being covered by authors like Cornfield, Bishop and Fienberg as far back as the 1960s and some say traces back to Deming in the 1940s. By the 1990s there was a sizeable literature on it. A few points I advise taking from that literature:
I am with Frank in that if you have a binomial (or Bernoulli) outcome then the statistically sensible approach is to fit a logistic model (possibly hierarchical, as with a prior or random effects) and then compute what is needed from the fitted values. The general idea is that if the model is just a smoothing or noise-reduction device as is almost always the case in health and med research (rather than an embodiment of physical laws) it’s best to use the most numerically stable and most rapidly converging form available in tested software (where “rapidly converging” means both in asymptotic behavior and in numerical fitting, which seem to go together). The biologic rationales for other models connect only weakly to the messy realities of epidemiologic data, and overlook how little can be learned about the actual biologic mechanisms from model fitting (which cannot substitute for getting more detailed data).
An important counterpoint is that if study of effects is the goal, the fitted model should be rich with enough terms to capture relevant detail, rather than based on the usual misplaced parsimony in deleting terms with p>0.05 or something equally biasing. The point is to not smooth away potentially important data patterns. This is a huge controversial topic so that’s all I’ll say here, but see my 2006 article listed below.
Often one should not stop at presenting exponentiated model coefficients (which usually estimate odds ratios or rate ratios, depending on the sampling design). The fitted logistic probabilities can be easily used to compute estimated risks, risk ratios, risk differences, attributable fractions etc. - whatever is called for by the study context. This is not a statistical choice, but one of topic relevance, e.g., if costs are proportional to risks then risks and their differences are more relevant than odds and their ratios.
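As a rough illustration of this last point, the sketch below fits a logistic model to simulated data and then averages the fitted probabilities with the exposure set to 1 and to 0 for everyone. The data, the variable names (`age`, `treated`) and the small Newton-Raphson fitter are all assumptions made for the example; any tested logistic regression routine could stand in for `fit_logistic`.

```python
import numpy as np

def expit(eta):
    """Inverse logit: map log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-eta))

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        grad = X.T @ (y - p)                       # score vector
        hess = X.T @ (X * (p * (1 - p))[:, None])  # information matrix
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated cohort (all values invented for illustration).
rng = np.random.default_rng(0)
n = 5000
age = rng.normal(size=n)                      # standardized confounder
treated = rng.binomial(1, expit(0.5 * age))   # exposure depends on age
y = rng.binomial(1, expit(-1.0 + 0.7 * treated + 0.8 * age))

X = np.column_stack([np.ones(n), treated, age])
beta = fit_logistic(X, y)

# Predict each person's risk with exposure set to 1, then to 0, and average.
X1, X0 = X.copy(), X.copy()
X1[:, 1], X0[:, 1] = 1.0, 0.0
risk1 = expit(X1 @ beta).mean()
risk0 = expit(X0 @ beta).mean()

print(f"adjusted odds ratio : {np.exp(beta[1]):.2f}")
print(f"risk if exposed     : {risk1:.3f}")
print(f"risk if unexposed   : {risk0:.3f}")
print(f"risk ratio          : {risk1 / risk0:.2f}")
print(f"risk difference     : {risk1 - risk0:.3f}")
```

Averaging over the observed covariate distribution is only one choice of reference population; the same fitted model can be averaged over any other population the study context calls for.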
In sum, I’ve been writing about this topic since 1979, as have many others before, during, and since. See p. 439-440 of Rothman et al. Modern Epidemiology 3rd ed. 2008 for a brief intro and a few citations. Here are some of my later articles (available from me if you can’t download them):
Greenland, S. (2004). Model-based estimation of relative risks and other epidemiologic measures in studies of common outcomes and in case-control studies. American Journal of Epidemiology, 160, 301-305. - This gives a general formulation with many citations to the literature on this topic up to 2004.
Greenland, S. (2004). Interval estimation by simulation as an alternative to and extension of confidence intervals. International Journal of Epidemiology, 33, 1389-1397. - This companion article discusses how to use simulation to get model-based CI for arbitrary measures rather than resorting to complex and usually unprogrammed analytic formulas or popular but often unstable and inaccurate nonparametric bootstrap-percentile methods.
Greenland, S. (2006). Smoothing observational data: a philosophy and implementation for the health sciences. International Statistical Review, 74, 31-46. - Gives a general theory for the approach I advise for these purposes.
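A minimal sketch of the simulation idea mentioned in the second reference, under stated assumptions: coefficients are drawn from a multivariate normal centered on the fitted values with the estimated covariance, the measure of interest is recomputed for each draw, and percentiles of the draws give the interval. This is one common version of the approach and may differ in detail from the article; the simulated data and function names are invented for the example.

```python
import numpy as np

def expit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson logistic fit; returns estimates and their covariance."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        step = np.linalg.solve(hess, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    p = expit(X @ beta)
    cov = np.linalg.inv(X.T @ (X * (p * (1 - p))[:, None]))
    return beta, cov

def risk_difference(beta, X):
    """Average risk with exposure (column 1) set to 1 minus set to 0."""
    X1, X0 = X.copy(), X.copy()
    X1[:, 1], X0[:, 1] = 1.0, 0.0
    return expit(X1 @ beta).mean() - expit(X0 @ beta).mean()

# Simulated data (invented for illustration).
rng = np.random.default_rng(0)
n = 3000
age = rng.normal(size=n)
treated = rng.binomial(1, expit(0.5 * age))
y = rng.binomial(1, expit(-1.0 + 0.7 * treated + 0.8 * age))
X = np.column_stack([np.ones(n), treated, age])

beta_hat, cov_hat = fit_logistic(X, y)
rd_hat = risk_difference(beta_hat, X)

# Draw coefficient vectors, recompute the risk difference for each draw.
draws = rng.multivariate_normal(beta_hat, cov_hat, size=2000)
rd_draws = np.array([risk_difference(b, X) for b in draws])
lo, hi = np.percentile(rd_draws, [2.5, 97.5])

print(f"risk difference {rd_hat:.3f}, 95% simulation interval ({lo:.3f}, {hi:.3f})")
```

The same loop works for any function of the coefficients (risk ratio, attributable fraction, and so on) by swapping out `risk_difference`.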
Thank you so much for this response Dr. Greenland. I will take what you say to heart as I have admired your work for over a decade. I will read up on this using the references you list and hopefully be able to pass some of the insights gained onto my students.
Just so those here have it, I posted on Twitter in reply to Ghement: There’s a literature on handling separation and sparsity in logistic regression going back many decades. Here are a few more recent examples, which have further cites: https://www.bmj.com/content/352/bmj.i1981
Wouldn’t they use Poisson for a binary outcome to account for varying exposure, i.e. as an offset variable? Because, as someone notes here, you cannot use logistic regression in this case: Logistic regression | varying exposure variable?
Logistic regression can be used on grouped data but is really intended to be used on raw individual 0/1 values. Can you do what you want to do using the most primitive form of the data?
Yes, I should say: I’m thinking about a particular scenario, i.e. the integrated safety summary submitted to a regulatory authority, where the occurrence of adverse events is analysed as 0/1 but the studies included in the report have quite different follow-up and exposure, and we want to account for this inconsistent exposure. This leads me to Poisson with an offset variable of log(exposure). I was just wondering about the popularity of Poisson for analysing 0/1 and thought maybe it is inspired by situations like this, where logistic regression seems inadequate.
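For concreteness, a minimal sketch of what such a model looks like, under assumptions that go beyond the description above: the exposure is taken to be each patient's time at risk (time to first event or end of follow-up), event times are simulated as exponential, and the fitter is a small hand-written Newton-Raphson routine. What it estimates is a rate ratio, not a risk ratio.

```python
import numpy as np

def fit_poisson_offset(X, y, offset, n_iter=50, tol=1e-10):
    """Poisson regression (log link) with a fixed offset, by Newton-Raphson.

    Assumes the first column of X is the intercept.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.sum() / np.exp(offset).sum())  # start at overall rate
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)
        grad = X.T @ (y - mu)
        hess = X.T @ (X * mu[:, None])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated safety data (all values invented for illustration).
rng = np.random.default_rng(1)
n = 4000
treated = rng.binomial(1, 0.5, n)
planned = rng.uniform(0.5, 3.0, n)                # planned follow-up, years
rate = 0.10 * np.exp(0.5 * treated)               # events per person-year
event_time = rng.exponential(1.0 / rate)

event = (event_time <= planned).astype(float)     # 0/1 adverse event flag
time_at_risk = np.minimum(event_time, planned)    # exposure used in offset

X = np.column_stack([np.ones(n), treated])
beta = fit_poisson_offset(X, event, np.log(time_at_risk))
rate_ratio = np.exp(beta[1])

# With a single binary covariate the fit reproduces the crude rate ratio.
crude = (event[treated == 1].sum() / time_at_risk[treated == 1].sum()) / (
    event[treated == 0].sum() / time_at_risk[treated == 0].sum()
)
print(f"model rate ratio {rate_ratio:.3f}, crude rate ratio {crude:.3f}")
```

With a 0/1 event flag and log time at risk as the offset, the Poisson likelihood has the same form as that of a constant-hazard (exponential) survival model, so this setup carries a constant-hazard assumption.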
I’ve never seen Poisson used with an offset and a per-observation denominator of 1. That doesn’t mean it’s impossible to work but I’m suspicious. Also the exposure may vary from patient to patient, requiring varying censoring times thus the use of something like a Cox PH model.