Risk Difference with: Normal likelihood + identity link + robust variance estimator?

In this interesting paper, the authors argue:

> For the risk difference, one may use a GLM with a Gaussian (i.e., normal) distribution and identity link function, or, equivalently, an ordinary least squares estimator. Doing so will return an exposure coefficient that can be interpreted as a risk difference. However, once again, the robust variance estimator (or bootstrap) should be used to obtain valid standard errors.
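For the record, here is a minimal base-R sketch of the approach they describe: OLS on a binary outcome, with an HC0 sandwich (robust) variance computed by hand. The data-generating setup is invented for illustration; in practice you would use `sandwich::vcovHC()` with `lmtest::coeftest()` rather than hand-rolling the sandwich.

```r
set.seed(1)
n <- 5000
x <- rbinom(n, 1, 0.5)                         # binary covariate
t <- rbinom(n, 1, 0.5)                         # randomized exposure
y <- rbinom(n, 1, 0.20 + 0.15 * t + 0.10 * x)  # true risks, linear in t and x

fit <- lm(y ~ t + x)         # Gaussian GLM with identity link = OLS
rd  <- coef(fit)[["t"]]      # exposure coefficient = adjusted risk difference

# HC0 robust (sandwich) variance: (X'X)^-1 [sum u_i^2 x_i x_i'] (X'X)^-1
X     <- model.matrix(fit)
u     <- residuals(fit)
bread <- solve(crossprod(X))
meat  <- crossprod(X * u)
robust_se <- sqrt((bread %*% meat %*% bread)["t", "t"])
```

With `set.seed(1)` the estimated `rd` lands near the true 0.15, and `robust_se` is the valid standard error the quoted passage refers to (the model-based OLS standard error is wrong here because a Bernoulli outcome is heteroskedastic by construction).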

They provide an R code along with the article.

Is this approach valid? I had never heard about it before.

I know @f2harrell prefers using the binomial likelihood with the logit link to estimate the conditional risk difference.

Perhaps I’m misunderstanding the authors, but despite referencing @Sander in their paper, they recommend OLS on the risk difference, a quantity that is constrained to a bounded range, since it is the difference between two probabilities. In this post, Sander agreed with Frank that the logistic model is useful in a wide range of scenarios.

> The fitted logistic probabilities can be easily used to compute estimated risks, risk ratios, risk differences, attributable fractions etc. - whatever is called for by the study context. This is not a statistical choice, but one of topic relevance, e.g., if costs are proportional to risks then risks and their differences are more relevant than odds and their ratios.

In the OR vs RR mega-thread, he had this comment:

> You know Frank I agree completely with your response and have said the same thing to colleagues who have misguidedly promoted use of log-linear risk or (worse) linear risk models. In fact I’ve been advocating our shared view on that since the 1970s (although, as I cited earlier, I have encountered exceptions in pair-matched cohorts in which log-linear risk models outperformed logistic models for a common outcome).

Some further criticism of linear models on probabilities:

1 Like

This approach is very common in econometrics, where it is called a linear probability model (LPM). I don’t think there is reason to believe it would transport very well to a new population, but it does provide a consistent estimate of the risk difference (and of the conditional mean). There is plenty of further reading online, but econometrics folks generally argue that the LPM does not require assuming a logistic error distribution (it models the conditional mean directly, and consistency does not require a normality assumption), that its parameters are at worst accurate estimates of the best linear projection of the conditional mean, and that it yields interpretable coefficients out of the box, particularly if you’re interested in interactions. Critics tend to dislike that it can make impossible predictions (fitted probabilities outside [0, 1]), that it is not expected to transport well to new but similar populations, and that it runs against the modeling philosophy of trying to describe the whole distribution of the data.

1 Like

One note about that Giles post: when he states that the LPM is biased/inconsistent, he does so by comparing the LPM (which averages over all covariate values) to the marginal effect at the mean, so I wouldn’t say the two are really estimating the same target. I think most people considering the LPM are interested in the former quantity, and when they say it is consistent they mean consistent for the standardized risk difference you get by using your logistic regression model to predict the mean outcome for all patients with trt = 0, then again with trt = 1, and comparing the two.

Toy example:

library(dplyr)

set.seed(1)  # for reproducibility

n  <- 1000000
x  <- rnorm(n)
x2 <- rnorm(n)
t <- rbinom(n, 1, plogis(qlogis(0.5) + log(2)*x))            # x confounds treatment assignment
p <- plogis(qlogis(0.2) + log(3)*t + log(1.5)*x + log(2)*x2) # true event probability
r <- rbinom(n, 1, p)                                         # binary outcome

dat <- data.frame(r, t, x, x2)

m.log <- glm(r ~ t + x + x2, family = binomial, data = dat)  # logistic regression (true model)

# Copies of the data with treatment set to 0 and to 1, for standardization
p.dat0 <- dat %>% dplyr::mutate(t = 0)
p.dat1 <- dat %>% dplyr::mutate(t = 1)

m.log0 <- mean(predict(m.log, type = "response", newdata = p.dat0))
m.log1 <- mean(predict(m.log, type = "response", newdata = p.dat1))

m.log1 - m.log0                  # risk difference from logistic regression, averaged over all patients
lm(r ~ t + x + x2, data = dat)   # LPM risk difference
lm(r ~ t, data = dat)            # failure to adjust for the confounder gives a biased answer


2 Likes

A dreadful idea. Why ever fit a model that can’t fit the data?

3 Likes

Thanks R-cubed for bringing in my post from June 2021. I agree with Frank and others that linear models are bad choices for risks. As I mentioned however, there are situations where log-linear risk models (as opposed to logistic = loglinear odds models) can perform well if they are parameterized to produce variation independence between the risk ratios and odds products (rather than in the traditional GLM parameterization). In that form they can provide doubly robust effect estimators and avoid the noncollapsibility objections to odds ratios. For an introduction to the topic see Richardson et al. JASA 2017,
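As a toy illustration of the variation-independent parameterization Sander refers to (function name invented; root-finding is used here instead of the closed-form solution given in Richardson et al.): any log risk ratio theta and log odds product phi in R² map to a unique valid pair of risks (p0, p1), so the two parameters can be modeled separately without range restrictions.

```r
# Map a log risk ratio (theta) and log odds product (phi) to risks (p0, p1).
# Any (theta, phi) in R^2 yields probabilities in (0, 1): variation independence.
op_to_risks <- function(theta, phi) {
  upper <- min(1, exp(-theta))   # need p1 = p0 * exp(theta) < 1
  f <- function(p0) {            # log odds product at (p0, p0*exp(theta)), minus phi
    p1 <- p0 * exp(theta)
    log(p0) + log(p1) - log(1 - p0) - log(1 - p1) - phi
  }
  # f is monotone increasing in p0, so the root is unique
  p0 <- uniroot(f, c(1e-9, upper - 1e-9), tol = 1e-12)$root
  c(p0 = p0, p1 = p0 * exp(theta))
}

op_to_risks(theta = log(2), phi = 0)  # returns p0 = 1/3, p1 = 2/3
```

At phi = 0 the odds product is 1, and with a risk ratio of 2 the unique solution is (1/3, 2/3), which is easy to verify by hand: the odds are 1/2 and 2, and their product is 1.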

Recent reviews and generalizations include Yin et al. Biometrika 2022,

and Wang et al. Biometrics 2022,
https://onlinelibrary.wiley.com/doi/full/10.1111/biom.13687
I think this risk-ratio based alternative to logistic modeling deserves attention for analyses that focus on robust and interpretable effect estimation (rather than risk prediction), especially when treatment is a longitudinal regime.

3 Likes

Sander influenced me to write Avoiding One-Number Summaries of Treatment Effects for RCTs with Binary Outcomes | Statistical Thinking, which suggests a general approach: use a model with more flexibility. The easiest such model is an LRM with lots of interaction terms. Use sufficient penalization (or Bayesian priors). Estimate the distribution of OR, RR, and RD over all subjects from that model…
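A stripped-down sketch of that idea (unpenalized, with one invented covariate and a single product term standing in for “lots of interaction terms”): fit the logistic model, then predict every subject under both treatment levels to obtain the distribution of per-subject risk differences.

```r
set.seed(2)
n   <- 5000
age <- rnorm(n, 50, 10)
trt <- rbinom(n, 1, 0.5)
p   <- plogis(-2 + 0.03 * age + (1.5 - 0.02 * age) * trt)  # effect varies with age
y   <- rbinom(n, 1, p)

fit <- glm(y ~ trt * age, family = binomial)  # logistic model with a product term

# Per-subject risk differences: predict everyone under trt = 1 and under trt = 0
d1 <- data.frame(trt = 1, age = age)
d0 <- data.frame(trt = 0, age = age)
rd_i <- predict(fit, newdata = d1, type = "response") -
        predict(fit, newdata = d0, type = "response")

quantile(rd_i, c(0.1, 0.5, 0.9))  # the RD varies across subjects
```

A real analysis would add splines for continuous covariates and penalize (or put priors on) the product terms; the point here is only that one flexible model yields a whole distribution of RDs rather than a single number.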

5 Likes

Thanks Frank for pointing to that post, which I had missed.
Two minor items in it I’d fix:

  1. Typo: “…receive little absolute benefit form…”
  2. Since the 1970s I’ve campaigned to replace “interaction term” with “product term” in modeling. Product terms are what you are discussing in the post. They are extremely dependent on the choice of link function, but are statistically identifiable (albeit very imprecisely in most examples). Calling them “interactions” invites confusion with physiological (mechanical or causal) interactions, whose presence does not depend on the link function we happen to choose and which are not necessarily identifiable without extra mechanistic assumptions. This distinction ought to be at least mentioned.
3 Likes

Good points which I’ll fix. I make this clear in my full RMS book and course where I say that product terms are sometimes oversimplified conveniences for general interaction concepts.

I appreciate the time you took to write up that simulation code to demonstrate the most charitable case for the linear probability model.

My criticism is similar to my perspective on the simulations used to justify metric models for ordinal data: it is the least favorable distribution that is relevant, because the assumption of normality (of the confounder, in your example) cannot be verified in practice.

Even proponents concede that it generates impossible predictions, and they fall back on the “interpretable inference” argument. Think about trying to perform a research synthesis on a set of papers that use a method that does not even claim to generate accurate predictions. No scientific community that instituted incentives for reproducibility (i.e., betting on future replications) would converge on the linear probability model.

In other posts, Giles points out that even if one desires a marginal estimate, the LPM cannot identify the correct parameters if there is even one error in the classification of Y.

> So, ask yourself the following question:
>
> “When I have binary choice data, can I be absolutely sure that every one of the observations has been classified correctly into zeroes and ones?”
>
> If your answer is “Yes”, then I have to say that I don’t believe you. Sorry!
>
> If your answer is “No”, then forget about using the LPM. You’ll just be trying to do the impossible - namely, estimate parameters that aren’t identified.
>
> And that’s not going to impress anybody!
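Giles’s nonidentification argument is formal, but even a crude simulation (setup invented for illustration) shows the mechanics: under nondifferential misclassification of Y, the LPM exposure coefficient no longer recovers the true risk difference, attenuating by a factor of (1 - 2 × error rate).

```r
set.seed(3)
n <- 200000
t <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, 0.20 + 0.15 * t)   # true risk difference = 0.15

flip  <- rbinom(n, 1, 0.05) == 1     # misclassify 5% of outcomes at random
y_obs <- ifelse(flip, 1L - y, y)

coef(lm(y ~ t))[["t"]]       # close to the true 0.15
coef(lm(y_obs ~ t))[["t"]]   # attenuated toward 0.9 * 0.15 = 0.135
```

The attenuation follows directly from E[y_obs | t] = 0.05 + 0.9 · E[y | t], so every contrast in the observed outcome shrinks by 0.9; with the error rate unknown, the true RD is not identified from y_obs alone.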

Further Reading

3 Likes