Consider a simple example where we investigate the sensitivity of two diagnostic tests by applying both tests to the same patients, all known to be sick. (For simplicity, I'll assume throughout that the diagnosis is binary.) Let's assume that the true sensitivity is 0.5 for both tests (i.e., the null hypothesis of no difference holds).
What is the appropriate way to statistically compare the tests? Here are a few options I could think of, allowing for potentially correlated results of the two tests:
library(survival)
library(data.table)
library(ggplot2)
# Simulate n patients measured with both tests and return p-values from
# several methods. Paired binary results are generated by thresholding a
# bivariate normal with latent correlation r at zero, so each test has a
# marginal sensitivity of 0.5 by construction.
simul <- function(n, r) {
  SimData <- data.frame(
    Subject = rep(paste0("Subject", 1:n), 2),
    Test = c(rep("Test1", n), rep("Test2", n)),
    # Result = sample(c("-", "+"), 2*n, replace = TRUE)) # uncorrelated alternative
    Result = c(mvtnorm::rmvnorm(
      n, mean = c(0, 0),
      sigma = matrix(c(1, r, r, 1), ncol = 2)) > 0))
  # 2x2 paired table: Test1 result vs Test2 result for the same subject
  SimTab <- table(SimData[SimData$Test == "Test1", ]$Result,
                  SimData[SimData$Test == "Test2", ]$Result)
  # Simple logistic regression, ignoring the pairing
  fit <- glm(Result ~ Test, data = SimData,
             family = binomial(link = "logit"))
  c(McNemar = mcnemar.test(SimTab)$p.value,
    McNemarNoContCorr = mcnemar.test(SimTab, correct = FALSE)$p.value,
    ExactMcNemar = exact2x2::exact2x2(SimTab, tsmethod = "central",
                                      paired = TRUE)$p.value,
    # Conditional logistic regression, stratified by subject (Wald test)
    CondLogReg = summary(clogit(Result ~ Test + strata(Subject),
                                data = SimData)
    )$coefficients["TestTest2", "Pr(>|z|)"],
    LogReg = summary(fit)$coefficients["TestTest2", "Pr(>|z|)"],
    # Logistic regression with cluster-robust variance estimates
    LogRegClusterBS = lmtest::coeftest(
      fit, sandwich::vcovBS(fit, cluster = ~ Subject)
    )["TestTest2", "Pr(>|z|)"],
    LogRegClusterHC3 = lmtest::coeftest(
      fit, sandwich::vcovCL(fit, cluster = ~ Subject, type = "HC3")
    )["TestTest2", "Pr(>|z|)"])
}
# Parameter grid: sample sizes and latent correlations to investigate
pargrid <- CJ(n = c(24, 50, 100, 200),
              r = c(0, 0.3, 0.5, 0.7, 0.9))
# Run 10,000 simulations per parameter combination in parallel
cl <- parallel::makeCluster(parallel::detectCores() - 1)
parallel::clusterExport(cl, c("simul", "pargrid"))
parallel::clusterEvalQ(cl, library(survival))
SimRes <- rbindlist(lapply(1:nrow(pargrid), function(i) {
  data.table(n = pargrid$n[i], r = pargrid$r[i],
             t(parallel::parSapply(cl, 1:1e4, function(j)
               simul(pargrid$n[i], pargrid$r[i]))))
}))
parallel::stopCluster(cl)
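As a quick sanity check on this data-generating process (a sketch, not part of the analysis itself): the marginal sensitivity is P(Z > 0) = 0.5 for both tests by construction, but note that the correlation of the binary results themselves is the phi coefficient, (2/pi)*asin(r), which is somewhat weaker than the latent correlation r:
# Sanity check: marginal sensitivities and the induced binary correlation,
# shown here for r = 0.7; the phi coefficient is (2 / pi) * asin(r)
set.seed(1)
x <- (mvtnorm::rmvnorm(1e6, sigma = matrix(c(1, 0.7, 0.7, 1), 2)) > 0) * 1
c(Sens1 = mean(x[, 1]), Sens2 = mean(x[, 2]),
  PhiEmpirical = cor(x[, 1], x[, 2]),
  PhiTheoretical = 2 / pi * asin(0.7))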
Let’s see the results for uncorrelated tests with a sample size of 24:
GGally::ggpairs(SimRes[r == 0 & n == 24,
                       .(McNemar, McNemarNoContCorr, ExactMcNemar,
                         CondLogReg, LogReg, LogRegClusterBS,
                         LogRegClusterHC3)],
                diag = list(continuous = "barDiag"),
                lower = list(continuous = GGally::wrap(
                  "points", size = 0.1, alpha = 0.3)))
And the empirical Type I error rates (again for uncorrelated tests, with a sample size of 24):
cbind(colMeans(SimRes[r == 0 & n == 24, !c("n", "r")] < 0.05))
McNemar           0.0255
McNemarNoContCorr 0.0466
ExactMcNemar      0.0262
CondLogReg        0.0302
LogReg            0.0617
LogRegClusterBS   0.0357
LogRegClusterHC3  0.0447
Now, I find this very surprising:
- Simple logistic regression does not account for the two measurements coming from the same patient, so I assumed it would be horribly invalid. Instead, it turned out to be acceptable; actually, even better than McNemar's test, which specifically adjusts for the pairing!
- What is probably even more surprising is that adjusting the logistic regression for clustering actually made the situation worse! (And, as a side note, the bootstrap seems to be worse than HC3.)
- Just a minor remark compared to these, but I was surprised by the results of conditional logistic regression too: I was under the impression that it is identical to McNemar's test, but clearly it is not (see the quick check after this list).
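To illustrate the last point (a minimal sketch reusing the data-generating code above): the p-value extracted from summary()$coefficients is a Wald test, which differs in small samples; if I understand correctly, it is the score test of the conditional logistic regression that should reproduce McNemar's test without continuity correction:
# Compare the clogit score test with McNemar's test without continuity
# correction on a single simulated dataset (n = 24, r = 0)
set.seed(1)
d <- data.frame(Subject = rep(paste0("Subject", 1:24), 2),
                Test = rep(c("Test1", "Test2"), each = 24),
                Result = c(mvtnorm::rmvnorm(24, sigma = diag(2)) > 0))
tab <- table(d[d$Test == "Test1", ]$Result, d[d$Test == "Test2", ]$Result)
s <- summary(clogit(Result ~ Test + strata(Subject), data = d))
c(CondLogRegScore = s$sctest[["pvalue"]],
  McNemarNoContCorr = mcnemar.test(tab, correct = FALSE)$p.value)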
What is the explanation for these findings? Or did I make a mistake in the simulation?
Update. As pointed out in the comments, the fact that simple logistic regression (which does not correct for correlation) works well is not surprising, as the tests were not actually correlated in this scenario (r = 0). (Although it is still strange that it is better than McNemar's test, or that the correction makes it worse.) Let's see what we have for different correlations, with a sample size of 24:
dcast(melt(SimRes[n == 24], id.vars = c("n", "r"))[
  , .(mean(value < 0.05, na.rm = TRUE)),
  .(n, r, variable)], variable ~ n + r, value.var = "V1")
            variable   24_0 24_0.3     24_0.5      24_0.7     24_0.9
              <fctr>  <num>  <num>      <num>       <num>      <num>
1:           McNemar 0.0255 0.0191 0.01910191 0.013513514 0.00317623
2: McNemarNoContCorr 0.0466 0.0416 0.04450445 0.048048048 0.04098361
3:      ExactMcNemar 0.0262 0.0193 0.01910000 0.013500000 0.00310000
4:        CondLogReg 0.0302 0.0183 0.01060106 0.003403403 0.00000000
5:            LogReg 0.0617 0.0350 0.02350000 0.009000000 0.00030000
6:   LogRegClusterBS 0.0357 0.0350 0.03820000 0.041300000 0.03230000
7:  LogRegClusterHC3 0.0447 0.0413 0.04450000 0.048114434 0.04325085
Now, what I see here:
- Logistic regression indeed gets (much) worse with correlation, so that's good news.
- Adjusting it for clustering indeed corrects it (even for the most extreme correlations), at least to the extent we saw without correlation. So that's again good news.
- HC3 now definitely seems to be better than the bootstrap.
- The problem is with McNemar's test: it gets horribly invalid even for moderate correlations. That seems practically impossible, as it specifically takes the pairing into account, so I suspect an error in the simulation (although I couldn't find one). But see the quick calculation below.
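One calculation that may be relevant here (a sketch under the data-generating model above, not a definitive explanation): McNemar's test uses only the discordant pairs, and their expected number shrinks rapidly with r, which could make the continuity-corrected and exact versions very conservative at n = 24:
# For a standard bivariate normal thresholded at 0,
# P(X > 0, Y > 0) = 1/4 + asin(r) / (2 * pi), hence
# P(discordant pair) = 1/2 - asin(r) / pi
sapply(c(0, 0.3, 0.5, 0.7, 0.9), function(r) 24 * (1/2 - asin(r) / pi))
# approximately 12.0, 9.7, 8.0, 6.1, 3.4 expected discordant pairs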
Update #2. Another point that came up in the comments is that sample size might be relevant; in particular, n = 24 is small, which might have an important impact on the results. (Although I thought the continuity-corrected or exact McNemar's test should work for small samples too!) Nevertheless, I carried out a more extensive analysis involving different sample sizes:
res <- melt(SimRes, id.vars = c("n", "r"))[
  , .(mean(value < 0.05, na.rm = TRUE)),
  .(n, r, variable)]
One can get the numerical results with dcast(res, n + variable ~ r, value.var = "V1"), but I'd now rather plot them:
ggplot(res, aes(x = r, y = V1, group = variable, color = variable)) +
  facet_wrap(~ n) + geom_line() + geom_point() +
  geom_hline(yintercept = 0.05)
Assuming that there are no bugs in my simulation, a couple of very interesting observations can be made:
- Now everything is clear regarding my initial question: simple logistic regression works well only when there is no correlation, and gets horribly wrong with correlation.
- What is very surprising (at least to me) is that for small samples the same is true for McNemar's test, for the exact McNemar's test and for conditional logistic regression.
- The second thing that I don't understand at all (to the point that I feel it might be a bug) is that neither McNemar's test nor its exact version is good even with a sample size of n = 200. Yes, they're much better than for small samples (in particular, they're no longer really affected by correlation), but still far from OK even with n = 200.
- The third thing that is very surprising is that switching the continuity correction off makes McNemar's test much, much better: it works reasonably well for all investigated correlations and sample sizes.
- Perhaps the most surprising is that the overall "winner" is pretty clearly logistic regression with the HC3 cluster adjustment: it proved to be the best for almost every correlation and sample size (i.e., it surpassed the purpose-built McNemar's test).
- It might be just a very strange coincidence, but I thought I'd mention that for larger sample sizes, logistic regression + HC3 and the non-continuity-corrected McNemar's test were almost exactly identical: the binary decisions were the same in every simulation out of 10,000, and the correlation of their p-values was 0.9999976 for the sample size of 200. That's pretty peculiar, as they are, I assumed, totally different approaches. (A way to check this directly from the stored p-values is sketched below.)
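If one wanted to verify this near-identity from the stored results, something along these lines should work (hypothetical follow-up code, computing the p-value correlation per correlation level):
# Correlation between the two methods' p-values at n = 200, by r
SimRes[n == 200,
       .(PCor = cor(McNemarNoContCorr, LogRegClusterHC3,
                    use = "complete.obs")),
       by = r]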