Sparse data with zero cells in logistic regression model

I would like to apply traditional logistic regression, but I am running into limitations. My data have a total of 18 variables (36 coefficients in the logistic regression model once the intercept and dummy variables are counted) and only 101 observations. My predictors are a mix of continuous and categorical variables.

I obtained the following warning messages:

1: algorithm did not converge 
2: fitted probabilities numerically 0 or 1 occurred 

The p-values are all 1 or nearly 1. I increased the maximum number of iterations by specifying control=glm.control(maxit=50). The algorithm then converged after 28 Fisher scoring iterations, but the following warning message still appears and the p-values are still essentially 1.

Warning message: fitted probabilities numerically 0 or 1 occurred 

I understand from reading some posts here that this is due to complete separation arising from the small sample size / sparse data, where some cells have zero observations. I would appreciate advice on how to proceed. Which of the following should I attempt:

  1. Exact logistic regression
  2. Penalised regression, e.g. penalized maximum likelihood logistic regression (Firth's method)
  3. Bayesian analysis

I explored (1) and read about the elrm package in R, but I gathered it only takes categorical predictors. Are there any resources I can refer to? Or perhaps the name of a suitable method, so that I can search for it on the internet?
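On option (2): Firth's bias-reduced likelihood lives in the R package logistf rather than in the mainstream Python toolkits, but the mechanism shared by all penalized fits is the same: the penalty keeps the coefficients finite even under complete separation. A hedged sketch using ridge (L2) penalized logistic regression in scikit-learn, on a toy, hypothetical, perfectly separated dataset (not the poster's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy perfectly separated data: unpenalized ML would push the slope to infinity.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

# L2 penalty (Gaussian-prior-like shrinkage); C is the inverse penalty strength.
ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

print(ridge.coef_)                   # finite, moderate slope despite separation
print(ridge.predict_proba(X)[:, 1])  # probabilities pulled away from 0 and 1
```

The open question, addressed in the answer below, is how to choose the amount of penalization when n is this small.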


Complete separation is caused by either an inadequate sample size (extreme overfitting) or by a predictor that is perfect in at least one direction (infinite regression coefficient). Your problem is the former. Your sample size is not adequate for fitting a binary logistic model containing the intercept and one pre-specified predictor. Your sample size is also too small for you to be able to solve for how much penalization should be done. The only hope is to use data reduction (unsupervised learning; masked to Y) to reduce the dimensionality of the problem down to, say, 2-3 dimensions, score those dimensions (e.g., using principal components), then fit those scores in the model.

Just to clarify the point about the sample size being too small to determine the appropriate amount of shrinkage: how useful would external information in the form of Bayesian priors be in a situation like this? For example, using information from previous studies to specify quite tight priors on all or most coefficients? Is this unlikely to solve the extreme overfitting problem when sample sizes are this small?


If the coefficients remain as free parameters, i.e., you don’t use any data reduction, then the use of prior information is the only hope for estimating the parameters. But if there is that much relevant prior information why did new data need to be collected? What proportion of the total information would come from the priors?
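For intuition on how much work such priors would have to do: a mean-zero Gaussian prior on each coefficient is equivalent to ridge (L2) penalization, with the prior variance playing the role of scikit-learn's C (roughly C = σ², up to scaling conventions). A hedged sketch on toy separated data (hypothetical, not from any real study) showing that tighter priors shrink harder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy perfectly separated data: unpenalized ML would not converge.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

# Smaller C acts like a tighter Gaussian prior centred at zero.
tight = LogisticRegression(penalty="l2", C=0.01).fit(X, y)  # ~ tight prior
loose = LogisticRegression(penalty="l2", C=10.0).fit(X, y)  # ~ vague prior

print(tight.coef_[0][0], loose.coef_[0][0])
# The tight prior pins the slope near zero; the vague one lets it grow.
```

With priors tight enough to stabilize 36 coefficients from 101 observations, most of the information in the posterior comes from the prior rather than the data, which is exactly the point of the question above.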


A complementary perspective on ‘data reduction’ is provided by the notion of ‘identifying restrictions’. Whereas the former operates heuristically in the empirical realm of data, the latter operates analytically in the realm of theory. (Perhaps, in a sufficiently abstract view of Bayesian modeling, these overlapping perspectives converge completely?) I like the latter notion because it invites deeper thinking about model form, and discourages the uncritical application of off-the-shelf models.

An example of identifying restrictions employed to estimate a Bayesian model against a small data set can be found in the arXiv paper I tweeted about here:

(Interestingly — somewhat in line with the question, “why did new data need to be collected?” — the paper concluded that the given data set shouldn’t have been collected.)