Sampling, bootstrapping for confidence intervals and hypothesis testing using tidyverse

Sanne · October 24, 2023, 1:10pm

Hi,

I was brushing up my sampling and bootstrapping skills in R, and wanted to apply “best” practices from a tidyverse point of view.

I came across the book Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. The book introduces the infer package for statistical inference using (modern) sampling methods. The book, with code examples, can be found online here: https://moderndive.com/

The first 6 chapters are introductory in nature, covering simple and multiple regression. Chapters 7-10 are the heart of the matter. The book builds up the material from the very start, in a visual and non-mathematical way. Excellent for those who are new to this field and/or hesitant about the mathematics.

Taking a step back, it is still too early to say whether the infer package makes sense from a production perspective. The techniques are in a way so elementary that the question arises whether one needs a library like infer for them in the first place. I do think that it is good to have such a library as a focal point for needs, ideas and questions. So yes, I will give it a try.
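To make the "elementary" point concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a mean with no inference library at all. It is written in plain Python only to keep it dependency-free; the function name `bootstrap_percentile_ci` and the data are made up for illustration, and the index-based quantiles are deliberately crude.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat=statistics.mean, reps=2000, level=0.95, seed=1):
    """Percentile bootstrap CI: resample with replacement, recompute, take quantiles."""
    rng = random.Random(seed)
    n = len(data)
    # One statistic per bootstrap resample, sorted so quantiles can be read off by index.
    boot_stats = sorted(stat(rng.choices(data, k=n)) for _ in range(reps))
    alpha = 1 - level
    lo_idx = int((alpha / 2) * reps)          # crude empirical quantile, no interpolation
    hi_idx = int((1 - alpha / 2) * reps) - 1
    return boot_stats[lo_idx], boot_stats[hi_idx]


# Made-up sample, purely illustrative.
sample = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.9, 12.5, 9.4, 11.8]
lower, upper = bootstrap_percentile_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The resample, recompute, summarise pattern is the whole technique, which is why a library here is mostly about a consistent interface rather than about hard computation.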

Cheers,

Sanne

f2harrell · October 24, 2023, 2:36pm

I highly recommend against using the tidyverse for reasons of personal productivity. The antidote is here. More to your points, most practitioners don’t know that the bootstrap is a pretty approximate approach, and in many cases bootstrap confidence intervals are not very accurate. So I wouldn’t go all-in for using the bootstrap for inference. I view the bootstrap as what you should consider using if you are avoiding Bayes.
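One rough way to check the accuracy claim is a small coverage simulation: draw many small samples from a known skewed distribution, build a percentile bootstrap interval for the mean each time, and count how often the interval contains the true mean. The sketch below is only an illustration in plain Python; the lognormal model, the sample size of 10, the replication counts and the helper names are all assumptions chosen to keep it fast.

```python
import math
import random
import statistics


def percentile_ci(data, rng, reps=400, level=0.95):
    """Percentile bootstrap CI for the mean, using crude index-based quantiles."""
    n = len(data)
    means = sorted(statistics.fmean(rng.choices(data, k=n)) for _ in range(reps))
    alpha = 1 - level
    return means[int(alpha / 2 * reps)], means[int((1 - alpha / 2) * reps) - 1]


def coverage(n=10, n_sims=400, seed=2023):
    """Fraction of simulated samples whose interval contains the true mean."""
    rng = random.Random(seed)
    true_mean = math.exp(0.5)  # mean of a lognormal with mu=0, sigma=1
    hits = 0
    for _ in range(n_sims):
        sample = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        lo, hi = percentile_ci(sample, rng)
        hits += lo <= true_mean <= hi
    return hits / n_sims


cov = coverage()
print(f"nominal 0.95, observed coverage = {cov:.3f}")
```

With small samples from a skewed distribution like this one, the observed coverage tends to fall short of the nominal 95%. Note that the check itself only works because the true model is known.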

Sanne · October 24, 2023, 3:34pm

Thanks for the feedback.

I think that tidyverse, and also tidymodels, provide some quite nice APIs. But there is the catch: is it possible to cover all needs in an API? And what if your needs fall outside it?

The book, by the way, is quite cavalier, stating at the end of chapter 9: “As we’ll show in this section, any theory-based method is ultimately an approximation to the simulation-based method. The theory-based method we’ll focus on is known as the two-sample t-test for testing differences in sample means.”
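For context, the usual simulation-based counterpart to a two-sample t-test is a permutation test on the difference in means. Below is a minimal sketch in plain Python, not the book's code: the function name `permutation_test`, the two groups and the number of shuffles are made up for illustration.

```python
import random
import statistics


def permutation_test(x, y, reps=5000, seed=7):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = statistics.fmean(x) - statistics.fmean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(reps):
        # Under the null hypothesis the group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_x]) - statistics.fmean(pooled[n_x:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (reps + 1)


# Made-up data, purely illustrative.
group_a = [5.1, 6.3, 5.8, 7.0, 6.1, 5.5]
group_b = [4.2, 5.0, 4.8, 5.3, 4.1, 4.9]
observed_diff, p_value = permutation_test(group_a, group_b)
print(f"difference in means = {observed_diff:.3f}, permutation p-value = {p_value:.4f}")
```

The t-test replaces the shuffled reference distribution with a t distribution, which is the sense in which the book calls it an approximation.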

Cheers!

f2harrell · October 24, 2023, 4:12pm

If the book calls the bootstrap a simulation method, that is misleading in that particular context. It’s nowhere near as accurate as true Monte Carlo simulation (which of course isn’t so useful because you have to know the true model). And the bootstrap can’t fix a poorly specified model (e.g., analyzing Y when you should have been analyzing log(Y+constant)).