Hello

I was wondering whether the cost of, say, a certain surgical procedure can be modeled with a negative binomial distribution. To my understanding, cost is a continuous variable and is not a count.

Searching for the phrase “cost model negative binomial” in Google returns several papers on the topic where comparisons have been made among several models including NB.

Any thoughts on this would be appreciated.

Thanks

drezap
July 9, 2019, 11:57pm
#2
I’m interested.

In class, etc., the negative binomial distribution has been used to model the number of trials until a failure.

So estimating the parameters of a negative binomial distribution using existing data to model cost of surgery may not give you the results you want.

But what are you interested in? Maybe there’s another distribution one could use as a likelihood to model the cost of a surgical procedure.

No experience here, but it might suffice to use a symmetric real-valued distribution such as the normal or Student’s t.

What existing literature have you found that models the cost of surgical procedures, and what likelihood functions did it use?

Specifically, my question is whether using the negative binomial is valid for modeling cost. I suspect it is not, but in the literature I’ve found a few examples of cost data being modeled with NB.

For example

PC Austin, WA Ghali and JV Tu,
Statistics in Medicine, 15 Sep 2003

Investigators in clinical research are often interested in determining the association between patient characteristics and cost of medical or surgical treatment. However, there is no uniformly agreed upon regression model with which to analyse cost data. The objective of the current study was to compare the performance of linear regression, linear regression with log-transformed cost, generalized linear models with Poisson, negative binomial and gamma distributions, median regression, and proportional hazards models for analysing costs in a cohort of patients undergoing CABG surgery. The study was performed on data comprising 1959 patients who underwent CABG surgery in Calgary, Alberta, between June 1994 and March 1998. Ten of 21 patient characteristics were significantly associated with cost of surgery in all seven models. Eight variables were not significantly associated with cost of surgery in all seven models. Using mean squared prediction error as a loss function, proportional hazards regression and the three generalized linear models were best able to predict cost in independent validation data. Using mean absolute error, linear regression with log-transformed cost, proportional hazards regression, and median regression to predict median cost, were best able to predict cost in independent validation data. Since the models demonstrated good consistency in identifying factors associated with increased cost of CABG surgery, any of the seven models can be used for identifying factors associated with increased cost of surgery. However, the magnitude of, and the interpretation of, the coefficients vary across models. Researchers are encouraged to consider a variety of candidate models, including those better known in the econometrics literature, rather than begin data analysis with one regression model selected a priori. 
The final choice of regression model should be made after a careful assessment of how best to assess predictive ability and should be tailored to the particular data in question.

A Akbarzadeh Baghban, A Pourhoseingholi, F Zayeri, S Ashtari and MR Zali,
Arab Journal of Gastroenterology: the official publication of the Pan-Arab Association of Gastroenterology, Dec 2013

Recent studies have shown that the high prevalence and the various clinical presentations of gastro-oesophageal reflux disease (GERD) and dyspepsia impose an enormous economic burden on society. Economic cost data have unique characteristics: they are counts, and they have zero inflation. Therefore, these data require special models. Poisson regression (PR), negative binomial regression (NB), zero inflated Poisson (ZIP) and zero inflated negative binomial (ZINB) regression are the models used for analysing cost data in this paper. In this study, a cross-sectional household survey was distributed to a random sample of individuals between May 2006 and December 2007 in the Tehran province of Iran to determine the prevalence of gastrointestinal symptoms and disorders and their related factors. The cost associated with each item was calculated. PR, NB, ZIP and ZINB models were used to analyse the data. The likelihood ratio test and the Voung test were used to conduct pairwise comparisons of the models. The log likelihood, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) were used to compare the performances of the models. According to the likelihood ratio test and the Voung test and all three criteria used to compare the performance of the models, ZINB regression was identified as the best model for analysing the cost data. Sex, age, smoking status, BMI, insurance status and education were significant predictors. Because the NB model demonstrated a better fit than the PR and ZIP models, over-dispersion was clearly only due to unobserved heterogeneity. In contrast, according to the likelihood ratio test, the ZINB model was more appropriate than the ZIP model. The ZINB model for the cost data was more appropriate than the other models.

A more sensible approach would be to use a gamma distribution under the GLM framework, since cost data are continuous, positive and right-skewed.
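To make the gamma GLM idea concrete, here is a minimal sketch in plain Python. It is not taken from either paper: the data are simulated, the single covariate and all parameter values are hypothetical, and a log link is assumed. For the gamma family with a log link the IRLS weights are all equal to 1, so each iteration reduces to an ordinary least-squares fit on a working response.

```python
import math
import random


def simulate_costs(n, intercept, slope, shape, seed=0):
    """Simulate right-skewed costs: gamma with mean exp(intercept + slope * x)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.random()  # hypothetical covariate in [0, 1)
        mu = math.exp(intercept + slope * x)
        # gammavariate(alpha, beta) has mean alpha * beta, so beta = mu / shape
        ys.append(rng.gammavariate(shape, mu / shape))
        xs.append(x)
    return xs, ys


def fit_gamma_log_glm(xs, ys, n_iter=25):
    """Fit log(E[y]) = b0 + b1 * x by IRLS (unit weights for gamma + log link)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # Start from a constant-mean model.
    b0, b1 = math.log(sum(ys) / n), 0.0
    for _ in range(n_iter):
        etas = [b0 + b1 * x for x in xs]
        # Working response: z = eta + (y - mu) / mu, with mu = exp(eta)
        zs = [eta + y / math.exp(eta) - 1.0 for eta, y in zip(etas, ys)]
        z_bar = sum(zs) / n
        # Simple-regression OLS of z on x
        b1 = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs)) / sxx
        b0 = z_bar - b1 * x_bar
    return b0, b1


# Hypothetical "true" values: intercept 8.0, slope 1.0, gamma shape 2.0
xs, ys = simulate_costs(n=5000, intercept=8.0, slope=1.0, shape=2.0, seed=1)
b0, b1 = fit_gamma_log_glm(xs, ys)
print(f"intercept ~ {b0:.2f}, slope ~ {b1:.2f}")
```

With the log link, exp(b1) reads as a multiplicative effect on mean cost per unit of the covariate. In practice one would use an existing GLM routine rather than hand-rolled IRLS; the loop is only there to show what such a fit does.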

drezap
July 10, 2019, 12:52am
#4
No, it’s generally used to model the count of failures until success.
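As a small illustration of that definition, here is a sketch of the negative binomial probability mass function, assuming the "number of failures before the r-th success" parameterisation (other parameterisations exist). The values of `r` and `p` are hypothetical. The point is that all of the probability sits on the non-negative integers, so a cost such as 1234.56 gets zero mass.

```python
import math


def neg_binomial_pmf(k, r, p):
    """P(K = k): k failures before the r-th success, success probability p."""
    if k < 0 or k != int(k):
        return 0.0  # no probability mass off the non-negative integers
    k = int(k)
    return math.comb(k + r - 1, k) * p**r * (1.0 - p) ** k


r, p = 3, 0.4  # hypothetical values
total = sum(neg_binomial_pmf(k, r, p) for k in range(500))
mean = sum(k * neg_binomial_pmf(k, r, p) for k in range(500))

print(total)                            # close to 1
print(mean, r * (1 - p) / p)            # both close to 4.5
print(neg_binomial_pmf(1234.56, r, p))  # 0.0 for a non-integer value
```

Fitting NB to cost data therefore means treating costs as integers (for example, rounding to whole currency units), which is a modeling choice rather than something the distribution supports directly.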

drezap
July 10, 2019, 1:00am
#5
Search the statistics and mathematics literature instead, and see what the motivation for the NB distribution was. Don’t search the clinical literature.

drezap
July 10, 2019, 1:14am
#6
Sure, seems possible. But as with any model, you would need to assess model fit with whatever tools you have. No model should be taken for granted in any application. Even for the same application, different data might require adjusting the model.

@eraheem

I thought of an example where a non-gaussian likelihood is used.

In actuarial science, because insurance claims data are zero-inflated and over-dispersed (lots of 0’s, high variance and sometimes multimodality), people use the “Tweedie” distribution as a likelihood function. For a power parameter between 1 and 2 it is a compound Poisson-gamma distribution: a Poisson-distributed number of gamma-distributed amounts, summed. It wouldn’t make sense to model such data with a Gaussian likelihood.
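A rough simulation sketch of why this fits claims-like data, reading the Poisson-gamma description above as a compound Poisson-gamma sum (the Tweedie case with power parameter between 1 and 2). All parameter values are hypothetical. The draws have a point mass at exactly zero (no claims) plus a continuous, right-skewed positive part.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw from Poisson(lam) with Knuth's multiplication method (small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def compound_poisson_gamma(rng, lam, shape, scale):
    """Total cost = sum of N gamma-distributed amounts, with N ~ Poisson(lam)."""
    n_claims = poisson_draw(rng, lam)
    return sum(rng.gammavariate(shape, scale) for _ in range(n_claims))


rng = random.Random(42)
lam, shape, scale = 0.5, 2.0, 1000.0  # hypothetical values
draws = [compound_poisson_gamma(rng, lam, shape, scale) for _ in range(20000)]

zero_share = sum(d == 0 for d in draws) / len(draws)
mean_cost = sum(draws) / len(draws)

print(zero_share, math.exp(-lam))      # share of exact zeros vs. exp(-lam)
print(mean_cost, lam * shape * scale)  # sample mean vs. lam * shape * scale
```

A Gaussian likelihood cannot reproduce the spike at zero or the long right tail, which is the motivation given above for using a Tweedie likelihood on this kind of data.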

Any actuarial textbook will have examples. There are a bunch of interesting slides on the Casualty Actuarial Society’s website, too. It could possibly have clinical applications, if that kind of data arises.
