BBR Session 2: Probability and Descriptive Statistics

You can summarize an ordinal variable with k distinct categories by computing the proportion of values at each level no matter how large k is. You can also summarize the variable using cumulative proportions (empirical cumulative distribution function). It’s just that if k is large there are a lot of proportions to display.
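As a small illustration of the two summaries, here is a sketch in Python using only the standard library. The function name `summarize_ordinal` and the scores are made up for the example; they are not from the course material.

```python
from collections import Counter
from itertools import accumulate

def summarize_ordinal(values, levels):
    """Return per-level proportions and cumulative proportions (the ECDF)."""
    counts = Counter(values)
    n = len(values)
    level_counts = [counts.get(level, 0) for level in levels]
    props = [c / n for c in level_counts]
    # Accumulate the integer counts first so the last value is exactly 1.0.
    cum_props = [c / n for c in accumulate(level_counts)]
    return props, cum_props

# Hypothetical scores on a 0-4 ordinal scale (made-up data).
scores = [0, 0, 1, 1, 1, 2, 2, 3, 4, 4]
levels = [0, 1, 2, 3, 4]
props, cum_props = summarize_ordinal(scores, levels)
for level, p, c in zip(levels, props, cum_props):
    print(f"level {level}: proportion {p:.2f}, cumulative {c:.2f}")
```

With a large k the same code works unchanged; the output table just gets long, which is where a plot of the cumulative proportions tends to be easier to read.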


Zero is truly the devil’s number! It requires investigation in its own right. Do zeros originate from the same process that generated the rest of the data? Do all zeros originate from the same process? And does an absence of zeros indicate a sampling problem? There’s scope for a fun paper on The Eternal Zero!

That has been addressed in Heckman’s two-stage model. Fit a binary logistic model for Y=0 vs Y>0 and then an ordinal or count model conditional on Y>0 and compare the covariate effects in the two models.
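A minimal sketch of the two-part fit described in that post, under strong simplifying assumptions that are mine rather than the poster's: a single binary covariate (so both models are saturated and can be fitted by hand without a regression library), and a zero-truncated Poisson as the count model for Y>0. The data and function names are hypothetical.

```python
import math

def log_odds_any(y):
    """Part 1: log odds of Y > 0 within one group."""
    pos = sum(1 for v in y if v > 0)
    return math.log(pos / (len(y) - pos))

def truncated_poisson_rate(y):
    """Part 2: MLE of the Poisson rate using only the positive counts.

    For a zero-truncated Poisson the MLE solves
    lam / (1 - exp(-lam)) = mean of the positive counts.
    """
    positives = [v for v in y if v > 0]
    mean = sum(positives) / len(positives)
    if mean <= 1:
        raise ValueError("mean of positive counts must exceed 1")
    lo, hi = 1e-9, mean  # the solution always lies below the truncated mean
    for _ in range(200):  # bisection on an increasing function
        mid = (lo + hi) / 2
        if mid / (1 - math.exp(-mid)) < mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical visit counts for two groups (made-up data).
control = [0, 0, 0, 0, 1, 1, 2, 3]
treated = [0, 0, 1, 2, 2, 3, 4, 4]

# Covariate effect on "any vs none" (what a binary logistic model would estimate).
log_or_any = log_odds_any(treated) - log_odds_any(control)
# Covariate effect on "how many, given at least one".
log_rate_ratio = math.log(
    truncated_poisson_rate(treated) / truncated_poisson_rate(control)
)
print(f"log odds ratio for Y>0:       {log_or_any:.3f}")
print(f"log rate ratio given Y>0:     {log_rate_ratio:.3f}")
```

The two printed effects are estimated separately, so they are free to differ in size or even in sign, which is the comparison the post suggests making. With real data and several covariates you would fit the two parts with a regression package instead.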


But what if the process isn’t a clean two-stage? If you look at visits to the doctor in the previous year, the zeros comprise people who visit the doctor but didn’t visit in the last year plus people who never visit the doctor (zero inflated).
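To make the two sources of zeros concrete, here is a sketch of a zero-inflated Poisson probability function. The parameter values are invented for illustration: `pi` is the assumed share of people who never visit, and `lam` is the assumed mean yearly visits among everyone else.

```python
import math

def zip_pmf(k, pi, lam):
    """P(Y = k) under a zero-inflated Poisson: a point mass at zero mixed with Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    structural = pi if k == 0 else 0.0  # "never visits" contributes only to k = 0
    return structural + (1 - pi) * poisson

pi, lam = 0.3, 2.0  # hypothetical values, not estimates from any data
p_zero = zip_pmf(0, pi, lam)
share_never = pi / p_zero  # fraction of observed zeros that are "never visit" zeros
print(f"P(Y = 0) = {p_zero:.3f}")
print(f"share of zeros from people who never visit = {share_never:.3f}")
```

The observed zeros are a blend of both groups, and nothing in a single zero tells you which group it came from; only the model (or extra information about the data generation process) separates them.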

And in water quality testing, zeros are sample zeros – no bugs in 100 ml. This doesn’t mean the water is bug free. It makes more sense to treat zero as a limit of detection in this case, and use interval regression.
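A quick way to see why a sample zero is not a true zero: if organisms are assumed to be scattered at random, so that the count in a sample is Poisson with mean equal to concentration times volume (my assumption for this sketch), then a zero count only bounds the concentration from above. This is not the interval regression itself, just the arithmetic behind treating zero as "below a detection limit".

```python
import math

def zero_count_upper_bound(volume_ml, confidence=0.95):
    """Largest concentration (per ml) still compatible with a zero count.

    Solves exp(-concentration * volume_ml) = 1 - confidence.
    """
    return -math.log(1 - confidence) / volume_ml

upper = zero_count_upper_bound(100)
print(f"95% upper bound: {upper:.4f} organisms per ml")
print(f"that is up to about {upper * 1000:.0f} organisms per litre")
```

So a clean 100 ml sample is compatible with anything from zero up to roughly 3 organisms per 100 ml, which is why an interval rather than a point value of zero is the more honest representation.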

There is a similar case to be made for floor effects. Skewed data with a lot of zeros may indicate a test floor effect problem rather than true zerosity.

I sometimes think that data analysis has two questions: what determines whether we see something rather than nothing, and what determines the something that we see.

So you really need to find out about the data generation process before deciding what your zeros are.


Heckman’s two-stage approach makes so few assumptions that it would, I think, allow for what you’re worried about. It allows the reasons for some vs. no visits to be different than the reasons for many vs. a few visits.

For those interested in Heckman’s two-stage model: “The Common Structure of Statistical Models of Truncation, Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models” by James J. Heckman

I have used zero-inflated Poisson models and hurdle models in the past, although when the proportion of zeros in the dataset is very high (>50% of the data), I have had a difficult time implementing them.
