Logistic regression options for rare events and small samples

Hi there,

I’m working with a dataset of 62 participants (31 in the intervention group and 31 in the control group). The outcome variable is binary (0/1). Over a specified period of time there were three events in the intervention group (one per participant, so three participants had an event) and one event in the control group. I would like to use logistic regression, adjusting for two covariates (age, gender), to assess whether group membership is associated with the binary outcome of interest. I’ve been attempting to fit a standard logistic GLM:

model <- glm(outcome ~ group + age + gender, family=binomial, data=mydata)

However, my standard errors are astronomical. I looked into this, and I found a few solutions:

  • exact logistic regression (elrm package)
  • penalized logistic regression using the Firth approach (logistf package)
  • rare events logistic regression (Zelig package, model = "relogit")

While I was able to run those models, I’m still getting large (but markedly smaller) standard errors for my intercept (in the 1000s) and predictors. I’m wondering if anyone has any suggestions, or whether I should be using a different approach entirely.

Thank you.

As detailed in my two sets of course notes, [BBR](http://hbiostat.org/doc/bbr.pdf) and RMS, the minimum sample size needed to fit a binary logistic model with no predictors (i.e., only the intercept) is n=96.

I see that Frank has already replied, but I was going to say essentially that you don’t have enough data even to run a bivariate Fisher exact, uncorrected chi-square, or N-1 chi-square test comparing the two arms as is, much less fit a stable multivariable model.

For each of the above bivariate tests, you get p values of 0.6, 0.3, and 0.3, respectively.
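For anyone who wants to reproduce those three p values, here is a minimal base R sketch. It uses only the counts given in the question (3 of 31 versus 1 of 31); the object names are just for illustration, and the N-1 version is computed here by scaling the Pearson statistic by (N - 1)/N.

```r
# 2x2 table built from the counts reported in the question:
# 3 of 31 events in one group, 1 of 31 in the other.
tab <- matrix(c(3, 28,
                1, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("intervention", "control"),
                              outcome = c("event", "no event")))

# Fisher's exact test (two-sided)
p_fisher <- fisher.test(tab)$p.value

# Pearson chi-square without continuity correction.
# R warns about small expected counts here, which is the point of the thread.
chi <- suppressWarnings(chisq.test(tab, correct = FALSE))
p_chisq <- chi$p.value

# "N-1" chi-square: Pearson statistic scaled by (N - 1) / N
n_total <- sum(tab)
stat_n1 <- unname(chi$statistic) * (n_total - 1) / n_total
p_n1 <- pchisq(stat_n1, df = 1, lower.tail = FALSE)

round(c(fisher = p_fisher, chisq = p_chisq, n_minus_1 = p_n1), 1)
```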

So, presuming that this was a prospectively designed, randomized study, and this was your primary endpoint analysis, the study was markedly underpowered, or the design assumptions were fundamentally off.

If this was not your primary endpoint, but some kind of post hoc, perhaps even sub-group, exploratory analysis, you are experiencing the expected problems in that setting.


Thank you for the feedback. The study itself is prospectively designed, but observational. Due to COVID-19, data collection was cut short, so I’m working with a finite (the sample covers all possible individuals in the study’s scope), small dataset. Unfortunately, this outcome is my primary endpoint. So, you’re saying that no statistical test can be used, and that I should just report the raw values?


If the study is observational, how were subjects allocated to each treatment group?

Your data, albeit aborted, would suggest some kind of relatively equal allocations to the two groups, which is curious on the surface. Or, is that an artifact of the premature termination of the study?

This raises the likelihood of confounding relative to assessing outcome differences between the groups, and per the prior comments, you don’t really have enough data to effectively explore that.

If you had more data, with more events, one approach in an observational setting, albeit, not without its critics, would be to use a propensity score adjustment. You would create an initial logistic regression model where the treatment group would be the response, and you might be able to have a small number of covariates, with 31 subjects in each group:

Group ~ Age + Gender + …

You would then generate predictions on the logit scale from that model, and those predictions (PSLogit) would become an input covariate for each subject in a new model with your endpoint as the response, for example:

Outcome ~ Group + PSLogit

So the single covariate, PSLogit, would embody possibly multiple characteristics for each patient.
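A rough sketch of those two steps in R is below. The data are simulated purely for illustration (they are not the data from this thread, and the sample size, coefficients and variable names are all made up), so treat it as a template under those assumptions rather than a recommended analysis.

```r
set.seed(1)

# Simulated, hypothetical data: NOT the data discussed in this thread.
n <- 400
dat <- data.frame(
  Age    = rnorm(n, mean = 65, sd = 8),
  Gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
# Group membership depends on age, so age confounds the comparison
dat$Group <- rbinom(n, 1, plogis(-6.5 + 0.1 * dat$Age))
# Outcome depends on age and (weakly) on group
dat$Outcome <- rbinom(n, 1, plogis(-5 + 0.05 * dat$Age + 0.3 * dat$Group))

# Step 1: propensity model, with group membership as the response
ps_model <- glm(Group ~ Age + Gender, family = binomial, data = dat)

# Step 2: predictions on the logit (linear predictor) scale
dat$PSLogit <- predict(ps_model, type = "link")

# Step 3: outcome model adjusting for the propensity score logit
out_model <- glm(Outcome ~ Group + PSLogit, family = binomial, data = dat)
summary(out_model)$coefficients
```

Using `type = "link"` in `predict()` is what keeps the score on the logit scale; the default for `glm` objects is the same, but spelling it out avoids accidentally adjusting for the probability instead.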

However, that gets back to the original problem, which is insufficient data to begin with, relative to assessing differences in outcomes between the two groups. Even with propensity scores, you would be limited in being able to include a reasonable number of covariates in the initial model to reasonably explain observed differences between the groups, much less unobserved (unmeasured) differences.

I recently dealt with an aborted randomized study, well before enrollment was completed, and we made the a priori decision before analyses, to only report descriptive statistics. Thus, there was not a single p value or confidence interval anywhere in the report. That path might be apropos here for you.

Thank you again for your very thoughtful input. I meant “treatment” and “control” generally - I’m looking at participants with Medicare (group 1) vs. commercial insurance (group 2) to assess complication rates following receiving the same procedure. Medicare began reimbursing this particular procedure as of this year. So data collection started this year, and was supposed to continue for ~6 months, but again, ended prematurely. Longer term follow-up data on the same set of participants may be collected, but it sounds like the small N will continue to be a limiting factor.

The propensity score adjustment idea sounds very interesting – I’ll look into that for another project I’m working on.

Just out of curiosity, however, why can I not use the methods I described above (e.g., exact logistic regression, Firth approach)? I’ve seen both applied in even smaller samples. Are you and Frank stating that those approaches are invalid in themselves?

Thanks again.

Is there any other relevant outcome data?

There are some secondary (continuous) endpoints, yes.


Needless to say, comparing Medicare versus commercial introduces all kinds of confounding, not just because of discrete age differences (>=65 versus <65), but you also likely have a number of relevant comorbidities at a higher prevalence in the Medicare group that further complicates comparisons. In other words, age and gender adjustments are not likely to be sufficient, unless you case matched on other characteristics a priori.

Since you have a lack of data, you cannot effectively control for that, so it really comes down to what can you actually say about the observed differences between the groups, to some extent, in a vacuum.

It may come down to not making the formal inference that the event rate in the Medicare group is “significantly” higher than in the commercial group, which would reasonably be expected. Instead, the question becomes whether the event rate in the Medicare group is acceptable, within some margin of error, which, given your sample size, may be too large at present.

For example, for the 3 out of 31, which is 9.7%, you would have a 95% confidence interval of 3.3% to 24.9% (using the Wilson score interval). That is a wide range.
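That interval can be checked in base R. A minimal sketch: `prop.test()` without the continuity correction returns the Wilson score interval, and the exact (Clopper-Pearson) interval from `binom.test()` is added only as a comparison that was not part of the original reply.

```r
# Wilson score interval for 3 events out of 31.
# prop.test() with correct = FALSE returns the Wilson interval.
ci_wilson <- prop.test(x = 3, n = 31, correct = FALSE)$conf.int
round(100 * ci_wilson, 1)

# Exact (Clopper-Pearson) interval, shown for comparison only
ci_exact <- binom.test(x = 3, n = 31)$conf.int
round(100 * ci_exact, 1)
```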

How did you manage to get the same number of patients in each group? Did you plan to collect a consecutive series of patients in each group over time, until you enrolled the same number of patients in each, or is there some other mechanism involved?

The key question embodied in the above is whether or not you also have to consider patient selection bias in the two groups, that may further confound your comparisons. In other words, if you did not enroll a consecutive series, is there something different about the patients that were eligible for the study, but were not enrolled, as compared to the patients that were enrolled, that further complicates your comparisons.

With respect to the methods for small samples, I am not saying that they are invalid at all. However, all formal statistical methods have underlying assumptions, and one of the key general assumptions is having a sufficient sample size to obtain stable values of the parameters that you are estimating, within an acceptable level of uncertainty.

The fact that, even after applying those specific methods, you still have “large” (for some definition of large) standard errors, which in turn likely means overly wide confidence intervals for your odds ratios, shows that those methods are essentially still saying, “get more data”.

Bear in mind, that formalized statistical methods are about engaging in inference. You have some, presumably random, sample from a population. That population may be real or theoretical.

Using the sample, you are then attempting to make statements about the characteristics of the population from which the sample is drawn.

If your sample is too small, your estimates of various parameters will embody a level of uncertainty that is not likely acceptable.

As an example, you have a coin that you presume to be fair. You toss it four times, with an a priori expectation of two heads and two tails, but you get four heads. The probability of getting four heads in a row, if the coin is fair, is 6.25%. Would you now reject the hypothesis that the coin is fair, or would you toss it a larger number of times to observe the results and reduce the uncertainty?
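To put numbers on the coin example, here is a small base R sketch. The 40-toss case is a made-up contrast (it is not from the post) to show how the interval tightens with more tosses.

```r
# Probability of 4 heads in 4 tosses of a fair coin
p_four_heads <- dbinom(4, size = 4, prob = 0.5)
p_four_heads

# Exact two-sided test and 95% interval for P(heads) after 4 heads in 4 tosses
test_4 <- binom.test(x = 4, n = 4, p = 0.5)
test_4$p.value
test_4$conf.int

# Hypothetical contrast: the same observed proportion after 40 tosses
test_40 <- binom.test(x = 40, n = 40, p = 0.5)
test_40$conf.int
```

With four tosses the exact interval still contains 0.5, so the data cannot rule out a fair coin; with the hypothetical 40 tosses it no longer does.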

With your limited dataset, you are essentially saying, you want to make a decision about the fairness of the coin after only four tosses.

You can still narrowly describe what was observed in the sample empirically, but you will have trouble making inferences about the population via formalized statistical methods.

That is why, in the aborted study that I referenced above, we did not engage in any formal statistical analyses. We simply reported what was observed, and in the discussion of the study, were sure to mention that the study was aborted, that the enrollment was far short of what was planned, and we were only reporting the observed results empirically, given the uncertainty in the results. In essence, we treated it as if it were a pilot study.


To emphasize Marc’s comments, this study is not analyzable and it is not worth reporting either, unless you report only confidence intervals for overall event probability (using Wilson confidence intervals).


In general, with such small sample sizes, I would be tempted to look into Bayesian methods with informative priors based on external/historical data. However, given the extreme confounding, I doubt that is worth doing on this particular occasion.

@MSchwartz: Thanks again for your advice. The fact that the groups are evenly sized is merely a coincidence. No other mechanism involved. All patients who met the criteria were included.

Thanks, @Bjoern. I actually do have data on other confounding variables (comorbidities, comorbidity indices), however, differences between the two groups were minor for those elements. I am interested in your comment about Bayesian methods for a separate project though. Could you point me in the direction of an example of where this is done using R? I’m new to Bayesian methods and would like to learn more about them.

@pmbrown: Yes, complications is already a composite of a few different serious adverse events; however, I do also have minor adverse events that I could potentially combine for a composite ‘any adverse events’ outcome. Thanks for that idea.

In R, my first stop would be the rstanarm package (followed by the more complicated but more powerful brms followed by hand-coding something in rstan, where just about anything is possible - the main difficulty being discrete parameters). The great thing about rstanarm is that syntax familiar from glm or lme4 essentially works (see e.g. https://mc-stan.org/rstanarm/articles/rstanarm.html or https://cran.r-project.org/web/packages/rstanarm/vignettes/binomial.html) while allowing us to set prior distributions in a relatively flexible but relatively easy to understand way.
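As a rough illustration of how close the syntax is to `glm`, here is a minimal sketch. The data frame is reconstructed from the counts in the question (group and outcome only, no covariates), and the priors are arbitrary placeholders chosen for the example, not something derived from external or historical data. The fit only runs if rstanarm is installed.

```r
# Data reconstructed from the counts reported in the question
# (group and outcome only; the covariates are not available here).
dat <- data.frame(
  group   = rep(c(1, 0), each = 31),
  outcome = c(rep(c(1, 0), times = c(3, 28)),
              rep(c(1, 0), times = c(1, 30)))
)

if (requireNamespace("rstanarm", quietly = TRUE)) {
  fit <- rstanarm::stan_glm(
    outcome ~ group,
    family = binomial(link = "logit"),
    data = dat,
    # Illustrative priors only: not derived from any external data
    prior = rstanarm::normal(location = 0, scale = 1),
    prior_intercept = rstanarm::normal(location = -2.5, scale = 1),
    chains = 2, iter = 1000, seed = 123, refresh = 0
  )
  # Posterior 95% interval for the log odds ratio of group
  print(rstanarm::posterior_interval(fit, prob = 0.95, pars = "group"))
}
```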

More generally, I love Bayesian Data Analysis (see http://www.stat.columbia.edu/~gelman/book/ where a pdf is available) and I know many people are also big fans of Statistical Rethinking (https://www.routledge.com/Statistical-Rethinking-A-Bayesian-Course-with-Examples-in-R-and-STAN/McElreath/p/book/9780367139919), which I have not worked through.

Of course, going down a Bayesian route does not remove confounding etc. and many of the same things that can be done to adjust for confounding in a frequentist setting (e.g. covariate adjustment, propensity score matching) will make sense for a Bayesian analysis.

My main worry with that is if you put too much reliance on priors when you don’t have enough data it can get pretty scary.
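One way to see how much the prior can drive the answer with this little data is a conjugate Beta-binomial calculation for the 3 events out of 31. This is only a sketch: the three priors below are hypothetical and were picked to show sensitivity, not taken from any historical data.

```r
events <- 3
n_obs  <- 31

# Beta(a, b) prior + binomial data -> Beta(a + events, b + n_obs - events) posterior
posterior_summary <- function(a, b) {
  post_a <- a + events
  post_b <- b + n_obs - events
  c(mean  = post_a / (post_a + post_b),
    lower = qbeta(0.025, post_a, post_b),
    upper = qbeta(0.975, post_a, post_b))
}

# Hypothetical priors, chosen only to illustrate sensitivity
priors <- list(
  flat        = c(a = 1,  b = 1),   # uniform on [0, 1]
  optimistic  = c(a = 1,  b = 49),  # prior mean 2%
  pessimistic = c(a = 10, b = 40)   # prior mean 20%
)

res <- t(sapply(priors, function(p) posterior_summary(p[["a"]], p[["b"]])))
round(100 * res, 1)
```

The two informative priors each carry the weight of 50 pseudo-observations, more than the 31 real ones, so the posterior mean moves a long way depending on which one is assumed.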


I think one thing to clarify is that you may be actually able to calculate things with limited data, but as stressed by others, the numbers are basically unreliable. Medical literature is rife with this problem often as a result of not involving a statistician in a research project and then the journal not having adequate statistical review.

The issue you will have is that even if you have information on comorbidities, your data set can’t reliably support calculations that incorporate this information, even if software gives you numbers.

Yes and you don’t go far enough. In many projects the result is that the overall incidence of a simple event cannot even be estimated (i.e., the simplest estimate where all covariate effects are assumed to be zero).

Yes, this is something I’m working on addressing in my dept (the importance of a proper power analysis and statistical analysis plan prior to collecting data).

Thank you. Quick clarification – by “overall”, are you suggesting that I only present the proportion of participants with an event with a Wilson CI for the entire sample (Medicare + commercial)? I was thinking I could just present them by insurance status (no comparative tests) and state that the study was not powered to detect group differences for this specific outcome.

Power is a more complicated concept than you need to discuss. You can give as many confidence intervals as you want, and each is self-contained as to its limitations.
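A small sketch of what that could look like in R, giving a Wilson interval per group and overall. It assumes, as an earlier reply in the thread does, that the 3 events are in the Medicare group and the 1 event is in the commercial group; the helper name `wilson` is made up for this example.

```r
# Wilson score interval via prop.test() without continuity correction,
# returned as percentages alongside the raw counts.
wilson <- function(x, n) {
  ci <- prop.test(x, n, correct = FALSE)$conf.int
  c(events = x, n = n, pct = 100 * x / n,
    lower = 100 * ci[1], upper = 100 * ci[2])
}

# Assumed allocation of events: 3 Medicare, 1 commercial
ci_table <- rbind(
  Medicare   = wilson(3, 31),
  Commercial = wilson(1, 31),
  Overall    = wilson(4, 62)
)
round(ci_table, 1)
```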
