Censored binomial models

Ahmed_Sayed · May 5, 2026, 10:39pm

Hi all,

Suppose I have the following problem:

I have 100 surgeons, each of whom performs N_i procedures in a given year to treat a particular condition.
For this condition, they can perform one of 2 procedures: A or B.
I am interested in modelling the probability that a given physician would choose to perform A rather than B based on a given set of predictor variables.
I cannot simply use a binomial model/classic logistic regression because the database censors any count of 10 or fewer procedures for a surgeon in a given year. For example, if a surgeon performs operation A 16 times and operation B 7 times, I would be able to see that they did operation A 16 times but I would only know that they did operation B 10 times or fewer. Only surgeons with N of 11 or more are included, so you always know at least one of A or B. This sort of censoring is often done when there are concerns re: confidentiality on public databases.

One sensible approach I’ve found is outlined here. The approach basically consists of using the standard binomial likelihood for all exact observations (in this case, anything ≥11) and using the cumulative distribution function for all censored observations (<11).
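The approach above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions of mine: each surgeon's total N is treated as known, only the A count is subject to censoring, and a single probability `p` is shared by all surgeons. The function names and the `records` layout are hypothetical.

```python
import math

CENSOR_LIMIT = 10  # counts of 10 or fewer are reported only as "<= 10"

def binom_logpmf(k, n, p):
    """Log of the binomial pmf P(A = k | n, p), for 0 < p < 1."""
    if k < 0 or k > n:
        return float("-inf")
    return (math.log(math.comb(n, k))
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_logcdf(k, n, p):
    """Log of P(A <= k | n, p), summing the pmf over 0..min(k, n)."""
    total = sum(math.exp(binom_logpmf(j, n, p)) for j in range(min(k, n) + 1))
    return math.log(total)

def censored_binom_loglik(p, records):
    """records: list of (a, n) pairs; a is None when the A count is censored.

    Assumes n is known for every surgeon (an assumption of this sketch).
    """
    ll = 0.0
    for a, n in records:
        if a is None:
            ll += binom_logcdf(CENSOR_LIMIT, n, p)  # censored: P(A <= 10)
        else:
            ll += binom_logpmf(a, n, p)             # exact: P(A = a)
    return ll
```

In a regression setting, `p` would be replaced by the inverse logit of each surgeon's linear predictor, and the function handed to an optimizer or sampler.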

Another intuition that I had (which I suspect is wrong) is to just use an ordinal logistic model to directly model the count of procedure A done (where the outcome categories are ≤10, 11, 12, etc.) while controlling for the total number (A + B = N) of procedures performed by a given surgeon. I suspect this is not quite right because adjusting for N doesn’t factor in the fact that A is always ≤ N.

I’m wondering if anyone has tackled a similar issue before or whether they have any alternative suggestions on how to tackle the problem.

f2harrell · May 6, 2026, 11:30am

Excellent question. Side comment: the censoring used in the health system reflects a poor understanding of the random effects models commonly used in quality outcomes research whereby providers with small case counts are discounted / shrunk towards the mean.

js592 · May 6, 2026, 5:52pm

The derivation of the censored likelihood looks correct, but I wonder whether in this case, because the physicians have to choose, it might be best to approach it via a Bradley-Terry/discrete choice type model. I know that in the uncensored case Bradley-Terry ends up being equivalent to logistic regression, provided the winner/binary coding is kept consistent. I wonder if the consistency holds for the “win m of n” trials when censoring is present.

This might be one of the cases where it’s worth it to go through the combinatorics and derive the appropriate pmf probabilities directly.

Ahmed_Sayed · May 6, 2026, 6:25pm

Thanks! My understanding is that the censoring is usually not done for analytic purposes per se but more so for confidentiality (e.g., that we might be able to dig into who the patient is if they’re the only one who underwent a procedure with a given physician).

Ahmed_Sayed · May 7, 2026, 11:37pm

Oh, this is actually the first time I’d heard of a Bradley-Terry model! My hunch is that it would likely end up being the same since the censoring in a censored binomial is handled by summing up PMFs (CDF) of censored observations, so it should be equivalent to a Bradley-Terry model in the same way standard logistic regression is.

One thing that’s easier to use (since there isn’t a whole lot of support for censored binomial models) is a censored Poisson (you would use the log of N as an offset), which is implemented in more packages (mgcv, brms, INLA, and probably others as well). You would need to use robust standard errors to rectify the CIs (which I was planning to do anyway since I have a few clusters) and you would need to be okay with RRs rather than ORs as the default contrast.
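As a rough illustration of the censored Poisson idea, here is a hand-rolled log-likelihood in plain Python rather than any of the packages named above. It assumes N is known, an intercept-only linear predictor `beta0`, and a hypothetical `records` layout; the offset enters as `log(n)` so that `exp(beta0)` is on the scale of the proportion of procedures that are A.

```python
import math

CENSOR_LIMIT = 10  # counts of 10 or fewer are reported only as "<= 10"

def poisson_logpmf(k, mu):
    """Log of the Poisson pmf P(A = k | mu), for mu > 0."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def poisson_logcdf(k, mu):
    """Log of P(A <= k | mu), summing the pmf over 0..k."""
    return math.log(sum(math.exp(poisson_logpmf(j, mu)) for j in range(k + 1)))

def censored_poisson_loglik(beta0, records):
    """records: list of (a, n) pairs; a is None when the A count is censored.

    The mean is exp(beta0 + log(n)), i.e. log(n) acts as an offset.
    """
    ll = 0.0
    for a, n in records:
        mu = math.exp(beta0 + math.log(n))
        if a is None:
            ll += poisson_logcdf(CENSOR_LIMIT, mu)  # censored: P(A <= 10)
        else:
            ll += poisson_logpmf(a, mu)             # exact: P(A = a)
    return ll
```

Unlike the binomial version, nothing here forces A ≤ N, which is part of why the standard errors need separate attention.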

davidcnorrismd · May 15, 2026, 9:36am

Thanks for posting this very interesting problem, Ahmed. I made efforts toward a JAGS model for this, and have begun to think that it involves both truncation and censoring working together. (Surgeons for whom both A_i ≤ 10 and B_i ≤ 10 are truncated.)

I wonder what people think of the likelihood I’ve assembled here. This looks like a good opportunity to make subtle mistakes!

var S,    # number of surgeons (S ≡ No + Na + Nb)
    No,   # surgeons 1:No have both A and B counts observed
    Na,   # surgeons No+(1:Na) have only A count observed
    Nb,   # surgeons No+Na+(1:Nb) have only B count observed
    a[S], # a[s] = probability surgeon s chooses A
    N[S], # N[s] = true (poss. unk.) count of procedures for surgeon s
    L[S], # for s in (No+1):S, equal to whichever of {A,B} is observed
    A[S], # left-censored count of A procedures
    B[S]; # left-censored count of B procedures

# We introduce L[No+1:S] simply to avoid RUNTIME ERROR
# 'Possible directed cycle' in the truncated-N blocks
# of the model.
data {
  for (s in 1:No)         { L[s] = 0 }
  for (s in No+(1:Na))    { L[s] = A[s] }
  for (s in No+Na+(1:Nb)) { L[s] = B[s] }  
}

model {
  for (s in 1:S) {
    a[s] ~ dbeta(0.8, 0.6) # bimodal, as if surgeons tend to be A or B 'partisans'
  }
  for (s in 1:No) {
    N[s] ~ dnegbin(0.1, 4)
    A[s] ~ dbin(a[s], N[s])
  }
  for (s in No+(1:Na)) {
    N[s] ~ dnegbin(0.1, 4) |> T(L[s], L[s]+10)
    A[s] ~ dbin(a[s], N[s])
  }
  for (s in No+Na+(1:Nb)) {
    N[s] ~ dnegbin(0.1, 4) |> T(L[s], L[s]+10)
    B[s] ~ dbin(1-a[s], N[s])
  }
}

davidcnorrismd · May 17, 2026, 12:48pm

On further thought, I think the normalization done by JAGS’s truncation won’t yield the desired likelihood here. A snippet from the JAGS 5.0.0 User Manual:

But the likelihood we really want is not too difficult to write down, I think, and then you can just open up a can of MCMC on that.
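For concreteness, here is one way such a likelihood might be written, offered as a sketch and not necessarily the form intended above. It assumes N follows some pmf supplied by the caller (the hypothetical `n_pmf` argument), that A given N is Binomial(N, p), that a censored count is summed over 0..10, and that each included surgeon's contribution is renormalized by the probability of not being truncated (i.e. not having both counts ≤ 10).

```python
import math

C = 10  # counts <= C are censored; surgeons with both counts <= C are excluded

def joint_pmf(a, b, p, n_pmf):
    """P(A = a, B = b) when N = a + b has pmf n_pmf and A | N ~ Binomial(N, p)."""
    n = a + b
    return n_pmf(n) * math.comb(n, a) * p**a * (1 - p)**b

def surgeon_loglik(a, b, p, n_pmf):
    """Log-likelihood for one included surgeon.

    a or b is None when that count is censored (<= C); at most one may be None.
    """
    if a is None and b is None:
        raise ValueError("surgeons with both counts censored are truncated")
    if a is not None and b is not None:
        num = joint_pmf(a, b, p, n_pmf)                             # both observed
    elif b is None:
        num = sum(joint_pmf(a, j, p, n_pmf) for j in range(C + 1))  # B <= C
    else:
        num = sum(joint_pmf(j, b, p, n_pmf) for j in range(C + 1))  # A <= C
    # Probability that a surgeon is dropped from the database entirely.
    p_trunc = sum(joint_pmf(i, j, p, n_pmf)
                  for i in range(C + 1) for j in range(C + 1))
    return math.log(num) - math.log1p(-p_trunc)
```

Passing a negative binomial pmf as `n_pmf` would mirror the prior on N in the JAGS model; summing `surgeon_loglik` over surgeons gives a target that a generic MCMC sampler could work with.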