Statistics in the era of big data


#1

I have a question about statistics at large for health services research/epidemiological studies. Say I have data on all residents of a state. I do a comparative effectiveness (or safety) observational study of two treatments and get my point estimates of absolute/relative risk with regression adjustment or other advanced methods. In that case, why can't I simply compare my point estimates without any statistical tests, see whether the difference exceeds a minimal clinically important difference, and draw a conclusion? Say it was national data covering all individuals. Why do I need to calculate a confidence interval around it, since it is the whole population and not a sample, thereby obviating sampling variability? Am I still supposed to consider it a sample from a hypothetical population? Sorry if the question sounds stupid, and thank you.


#2

Only a fraction of the residents would be exposed to the treatment, and by the time you're analysing the data it is already out of date (even the statistical methodology you are using will soon be out of date). There is also immigration, incomplete data, etc., and record linkage is not entirely error-free; I think I've seen bootstrapping used to account for that. I used to work for the Australian Bureau of Statistics, and I guess we reported percentages; otherwise, expect CIs…
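The bootstrapping idea mentioned above can be sketched with synthetic data. This is a minimal illustration, not anything from the study in the OP: the group sizes and event rates below are made-up numbers, and the estimand is a simple risk difference with a nonparametric percentile bootstrap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical registry data: 1 = event, 0 = no event, two treatment groups.
# The 10% and 13% event rates are assumptions for illustration only.
treated = rng.binomial(1, 0.10, size=5000)
control = rng.binomial(1, 0.13, size=5000)

def risk_difference(a, b):
    """Absolute risk difference between two 0/1 outcome arrays."""
    return a.mean() - b.mean()

point = risk_difference(treated, control)

# Nonparametric bootstrap: resample each group with replacement and
# recompute the estimate, to quantify uncertainty in the point estimate.
boots = []
for _ in range(2000):
    t = rng.choice(treated, size=treated.size, replace=True)
    c = rng.choice(control, size=control.size, replace=True)
    boots.append(risk_difference(t, c))

# 95% percentile bootstrap interval around the risk difference.
lo, hi = np.percentile(boots, [2.5, 97.5])
```

Even if the registry covers "everyone", the bootstrap interval here reflects variability in who happened to be treated and recorded, which is one way of operationalizing the hypothetical-population view from the OP.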


#3

Another way to rationalize using confidence intervals etc. is to consider the sampling variability introduced by enrolling subjects during only one slice of time.
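The time-slice point can be made concrete with a small simulation. Everything here is assumed for illustration: a state population of 50,000 observed completely each quarter, with an underlying event risk that drifts slightly from quarter to quarter.

```python
import numpy as np

rng = np.random.default_rng(2)

true_risk = 0.12   # assumed long-run event risk
n = 50000          # assumed state population, fully observed each quarter

# Observe the *entire* population in each of 20 quarters, but let the
# underlying risk drift a little quarter to quarter (process variability).
quarterly_rates = []
for q in range(20):
    risk_q = float(np.clip(true_risk + rng.normal(scale=0.005), 0, 1))
    events = rng.binomial(n, risk_q)
    quarterly_rates.append(events / n)

spread = max(quarterly_rates) - min(quarterly_rates)
# Despite a complete census every quarter, the observed rate varies,
# so one enrollment window behaves like a draw from an ongoing process.
```

This is the "sample from a hypothetical population" framing: the population you happened to observe in one window is one realization of a process that could have come out differently.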


#4

This might be implied by Paul Brown’s “incomplete data” comment, but surely the study referenced in the OP did not manage to measure its variables perfectly. Part of the uncertainty in an estimate is measurement error.
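One standard consequence of measurement error can be shown in a few lines. This is a generic sketch, not the OP's study: a continuous exposure measured with classical (additive, independent) error, with an assumed true slope of 2.0 and noise variance equal to the signal variance, which attenuates an OLS slope toward zero by roughly half.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000

# True exposure and outcome; the slope 2.0 is an assumption for illustration.
x_true = rng.normal(size=n)
y = 2.0 * x_true + rng.normal(size=n)

# Observed exposure = truth + classical measurement error
# (error variance chosen equal to the exposure variance).
x_obs = x_true + rng.normal(scale=1.0, size=n)

def ols_slope(x, y):
    """Simple least-squares slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

b_true = ols_slope(x_true, y)  # close to the true slope of 2.0
b_obs = ols_slope(x_obs, y)    # attenuated toward zero, roughly 1.0
```

So even with every individual in the registry, imperfectly measured variables distort the point estimate itself, which is another reason a raw comparison against a minimal clinically important difference can mislead.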