Specifically, my question is whether using a negative binomial model is valid for modeling cost. I suspect it is not, but I’ve found a few examples in the literature of cost data being modeled with NB.
PC Austin, WA Ghali and JV Tu,
Statistics in Medicine, 15 Sep 2003
Investigators in clinical research are often interested in determining the association between patient characteristics and cost of medical or surgical treatment. However, there is no uniformly agreed upon regression model with which to analyse cost data. The objective of the current study was to compare the performance of linear regression, linear regression with log-transformed cost, generalized linear models with Poisson, negative binomial and gamma distributions, median regression, and proportional hazards models for analysing costs in a cohort of patients undergoing CABG surgery. The study was performed on data comprising 1959 patients who underwent CABG surgery in Calgary, Alberta, between June 1994 and March 1998. Ten of 21 patient characteristics were significantly associated with cost of surgery in all seven models. Eight variables were not significantly associated with cost of surgery in all seven models. Using mean squared prediction error as a loss function, proportional hazards regression and the three generalized linear models were best able to predict cost in independent validation data. Using mean absolute error, linear regression with log-transformed cost, proportional hazards regression, and median regression to predict median cost, were best able to predict cost in independent validation data. Since the models demonstrated good consistency in identifying factors associated with increased cost of CABG surgery, any of the seven models can be used for identifying factors associated with increased cost of surgery. However, the magnitude of, and the interpretation of, the coefficients vary across models. Researchers are encouraged to consider a variety of candidate models, including those better known in the econometrics literature, rather than begin data analysis with one regression model selected a priori. 
The final choice of regression model should be made after a careful assessment of how best to assess predictive ability and should be tailored to the particular data in question.
A Akbarzadeh Baghban, A Pourhoseingholi, F Zayeri, S Ashtari and MR Zali,
Arab Journal of Gastroenterology: the official publication of the Pan-Arab Association of Gastroenterology, Dec 2013
Recent studies have shown that the high prevalence and the various clinical presentations of gastro-oesophageal reflux disease (GERD) and dyspepsia impose an enormous economic burden on society. Economic cost data have unique characteristics: they are counts, and they have zero inflation. Therefore, these data require special models. Poisson regression (PR), negative binomial regression (NB), zero inflated Poisson (ZIP) and zero inflated negative binomial (ZINB) regression are the models used for analysing cost data in this paper. In this study, a cross-sectional household survey was distributed to a random sample of individuals between May 2006 and December 2007 in the Tehran province of Iran to determine the prevalence of gastrointestinal symptoms and disorders and their related factors. The cost associated with each item was calculated. PR, NB, ZIP and ZINB models were used to analyse the data. The likelihood ratio test and the Vuong test were used to conduct pairwise comparisons of the models. The log likelihood, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) were used to compare the performances of the models. According to the likelihood ratio test and the Vuong test and all three criteria used to compare the performance of the models, ZINB regression was identified as the best model for analysing the cost data. Sex, age, smoking status, BMI, insurance status and education were significant predictors. Because the NB model demonstrated a better fit than the PR and ZIP models, over-dispersion was clearly only due to unobserved heterogeneity. In contrast, according to the likelihood ratio test, the ZINB model was more appropriate than the ZIP model. The ZINB model for the cost data was more appropriate than the other models.
A more sensible approach would be to use a gamma distribution within the GLM framework, since cost data are continuous, positive and right-skewed, whereas the negative binomial is a distribution for counts.
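To make the gamma suggestion concrete, here is a minimal sketch of a gamma GLM with a log link, fitted by iteratively reweighted least squares using only the Python standard library. Everything in it is an assumption for illustration: the data are simulated, the single covariate `x`, the true coefficients and the function name `fit_gamma_glm_log_link` are made up, and the log link is one common choice rather than something stated above. In practice you would more likely use an existing routine, for example `glm(cost ~ x, family = Gamma(link = "log"))` in R.

```python
import math
import random


def fit_gamma_glm_log_link(x, y, n_iter=25):
    """Fit E[y] = exp(b0 + b1 * x) as a gamma GLM with log link via IRLS.

    Hypothetical helper for illustration: one covariate plus an intercept.
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Start from the intercept-only fit: the log of the mean cost.
    b0, b1 = math.log(sum(y) / n), 0.0
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response. For the gamma family with a log link the IRLS
        # weights (dmu/deta)^2 / V(mu) = mu^2 / mu^2 are all 1, so each
        # step is an ordinary least-squares fit of z on x.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        z_bar = sum(z) / n
        sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
        b1 = sxz / sxx
        b0 = z_bar - b1 * x_bar
    return b0, b1


# Simulated right-skewed, strictly positive "costs" (made-up parameters).
random.seed(0)
TRUE_B0, TRUE_B1, SHAPE = 8.0, 0.5, 2.0
x = [2.0 * random.random() for _ in range(2000)]
# gammavariate(alpha, beta) has mean alpha * beta, so scale = mean / shape.
y = [random.gammavariate(SHAPE, math.exp(TRUE_B0 + TRUE_B1 * xi) / SHAPE)
     for xi in x]

b0_hat, b1_hat = fit_gamma_glm_log_link(x, y)
print(f"intercept: {b0_hat:.3f}, slope: {b1_hat:.3f}")
print(f"multiplicative effect on mean cost per unit of x: {math.exp(b1_hat):.3f}")
```

With a log link, `exp(b1_hat)` reads as the ratio of expected costs for a one-unit increase in `x`, which is the same interpretation you would get from the count models, but without treating cost as a count.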