Statistics in the era of big data


I have a question about statistics at large for health services research/epidemiological studies. Say I have data on all residents of a state. I do a comparative effectiveness (or safety) observational study of two treatments and get my point estimates of absolute/relative risk with regression adjustment or other advanced methods. In this case, why can’t I directly compare my point estimates without any statistical tests, see whether the difference exceeds a minimal clinically important difference, and draw a conclusion? Let’s say it was national data on all individuals. Why do I need to calculate a confidence interval around the estimate, since it is the whole population and not a sample, which would seem to obviate sampling variability? Am I still supposed to consider it a sample from a hypothetical population? Sorry if the question sounds stupid, and thank you.


Only a fraction of the residents would be exposed to the treatment, and by the time you’re analysing the data it is already out of date (even the statistical methodology you are using will soon be out of date). Add immigration, incomplete data, etc., and the fact that record linkage is not entirely error-free. I think I’ve seen bootstrapping used for this sort of thing. I used to work for the Australian Bureau of Statistics, and I guess we reported %s there, but otherwise expect CIs…
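To make the bootstrapping idea above concrete, here is a minimal sketch of a percentile bootstrap for a crude risk difference between two treatment groups. Everything is simulated and hypothetical: the cohort, the risks (10% vs. 13%), and the resample count are all made up for illustration, not taken from any study in the thread.

```python
# Percentile bootstrap for a risk difference, on simulated data.
import random

random.seed(0)

# Simulated cohort: (treated, outcome) pairs. In a real linked registry
# these would come from the data, with linkage errors adding uncertainty
# on top of what the bootstrap captures.
cohort = [(1, random.random() < 0.10) for _ in range(2000)] + \
         [(0, random.random() < 0.13) for _ in range(2000)]

def risk_difference(data):
    """Crude outcome risk in treated minus risk in untreated."""
    treated = [y for t, y in data if t == 1]
    control = [y for t, y in data if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Resample subjects with replacement and re-estimate many times.
boot = []
for _ in range(1000):
    resample = random.choices(cohort, k=len(cohort))
    boot.append(risk_difference(resample))
boot.sort()
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(f"RD = {risk_difference(cohort):.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

The percentile interval is the simplest bootstrap CI; in practice one would use many more resamples and often a bias-corrected variant.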


Another way to rationalize using confidence intervals etc. is to consider the sampling variability that comes from enrolling subjects during only one slice of time.


This might be implied by Paul Brown’s “incomplete data” comment, but surely the study referenced in the OP did not measure its variables perfectly. Part of the uncertainty in an estimate is measurement error.
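The measurement-error point can be illustrated with a toy simulation (all numbers are assumptions for illustration, not from any real study): the population and its true risk are fixed, yet because each outcome is misclassified with some probability, repeated "measurements" of the same complete population yield different estimates, and a symmetric error also biases the observed risk.

```python
# Even with every resident in the data, imperfect outcome measurement
# makes the estimate a random quantity. The true population is fixed;
# each measurement flips outcomes with 5% probability, so repeated
# measurements both scatter around and shift away from the true risk.
import random

random.seed(1)
N = 20000
true_outcomes = [random.random() < 0.10 for _ in range(N)]  # fixed population

def measured_risk(outcomes, error=0.05):
    """Observed risk when each outcome is misclassified with prob `error`."""
    flipped = [(not y) if random.random() < error else y for y in outcomes]
    return sum(flipped) / len(flipped)

estimates = [measured_risk(true_outcomes) for _ in range(200)]
true_risk = sum(true_outcomes) / N
print(f"true risk {true_risk:.3f}, "
      f"measured range {min(estimates):.3f} to {max(estimates):.3f}")
```

Note that with symmetric 5% misclassification the observed risk is pulled toward 0.5 (expected value 0.10·0.95 + 0.90·0.05 = 0.14), so measurement error contributes bias as well as variance; a confidence interval only reflects the latter.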