Bootstrap on regression models

I ran a multiple regression model and found a p value that was not significant (0.06). After bootstrapping it, I found a p value of 0.03. Can I report the p value after bootstrapping? What are the advantages and disadvantages of doing that?

Theresa

Without knowing more, one might think that you performed the bootstrap because of the value of p you originally observed. This would not be a good reason to “jump methods” and implies that you believe that statistical “significance” using p-value thresholds is still a good thing (it’s not). Step back and write a statistical analysis plan. If you describe the data and goals we may be able to help. The plan must not be informed by the actual observed data. It may involve selecting a robust statistical model.

5 Likes

My initial plan did not include bootstrapping. At the risk of jumping statistical methods, I am intrigued by how bootstrapping can cause this much of a “significance” difference. My initial statistical plan was an inferential multiple linear regression model with splines. What am I supposed to look out for in planning an inference model, with or without the bootstrap?

“I am intrigued at how bootstrapping can cause this much of a ‘significance’ difference.”

If you transform the P value into bits of information using the surprisal -\log_2(p): when p=0.06, your data provided about 4 bits of information against the asserted test hypothesis; p=0.03 provides about 5 bits against it. So there isn’t as big a difference as the raw p values (and intuition) indicate.
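If you want to see that arithmetic, here is a tiny Python sketch of the S-value (surprisal) transformation, applied to the two p values from the original post:

```python
import math

# Surprisal / S-value: bits of information against the test hypothesis.
for p in (0.06, 0.03):
    s = -math.log2(p)
    print(f"p = {p:.2f} -> about {s:.2f} bits against the test hypothesis")
# p = 0.06 -> about 4.06 bits
# p = 0.03 -> about 5.06 bits
```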

P values are noisy, and there is nothing particularly special about the 0.05 “significance” level. Your alpha needs to be judged in the context of power and prior odds.

If you use the probit transformation z = \Phi^{-1}(1-p), that gives a random variable with a standard error of 1 (it is N(0,1) under the test hypothesis); jump to pages 4–6 for the justification.

When p=0.06, \Phi^{-1}(1-p) \approx 1.56 \pm 1. Conditional on the default model holding, a future p could be as low as 0.005 (i.e. \Phi^{-1}(1-p) \approx 2.56) or as high as 0.29 (\Phi^{-1}(1-p) \approx 0.56).
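A small sketch of those numbers, assuming the one-sided probit transform z = \Phi^{-1}(1-p) described above (scipy is used only for the normal CDF and quantile function):

```python
from scipy.stats import norm

p = 0.06
z = norm.ppf(1 - p)          # probit transform of the p value
print(f"z = {z:.2f}")        # about 1.55

# A rough +/- 1 standard-error interval for a replication z, translated
# back to the p scale (conditional on the model holding):
for z_rep in (z + 1, z - 1):
    print(f"z = {z_rep:.2f} -> p = {1 - norm.cdf(z_rep):.3f}")
# roughly p = 0.005 and p = 0.29, matching the numbers above
```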

5 Likes

This paper could be helpful in understanding the (in)significance of the change in p-values:

Gelman A, Stern H. The Difference between “Significant” and “Not Significant” Is Not Itself Statistically Significant. Am Stat. 2006;60(4):328–31. Available from: http://www.jstor.org/stable/27643811 (behind paywall)

Here is a version from the author (not behind a paywall):
http://www.stat.columbia.edu/~gelman/research/published/signif4.pdf

2 Likes

If you’re not familiar with the bootstrap, this introduction might help you understand what is going on (essentially, you use a different method to estimate sampling distributions, which p values rely on):
https://psyarxiv.com/kxarf/

For more in-depth coverage, including a dance of confidence intervals, there is more here:
https://psyarxiv.com/h8ft7/

Depending on the number of bootstrap samples, you might get a different p value if you run the same analysis again, because of random sampling (a small sketch of this follows below). But irrespective of the p value and confidence interval, you need to describe your effect sizes (slopes in your case) and put them in context. Here are two very useful resources for reporting frequentist results:
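On the first point, that rerunning the resampling can shift the p value, here is a minimal sketch of one common bootstrap p-value construction (a normal approximation using the bootstrap standard error) for a regression slope, run with two different seeds. The data, model, seeds, and number of resamples are all made up for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2024)           # illustrative data-generating seed
n = 60
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)            # true slope 0.3, plus noise

def slope(x, y):
    # Ordinary least squares slope with an intercept.
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

def boot_p(seed, n_boot=2000):
    # Normal-approximation bootstrap p value for the slope
    # (one of several possible bootstrap p-value constructions).
    r = np.random.default_rng(seed)
    boots = np.empty(n_boot)
    for b in range(n_boot):
        idx = r.integers(0, n, n)            # resample rows with replacement
        boots[b] = slope(x[idx], y[idx])
    z = slope(x, y) / boots.std(ddof=1)
    return 2 * norm.sf(abs(z))

print(boot_p(seed=1), boot_p(seed=2))        # slightly different p values
```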

4 Likes

Excellent information. I wonder to what extent bootstrap p-values have been validated for accuracy. Bootstrap confidence intervals are often not accurate, i.e., they have poor non-coverage probabilities in at least one of the two tails.

2 Likes

As you would expect, most bootstrap validations are focused on false positive rates, then on confidence intervals. I’ve seen very little on p values, but of course CIs and p values are directly related to each other. Teaching bootstrap methods is a great way to illustrate that each combination of CI-forming method and estimator behaves differently, and that some combinations give long-term results far from the expected behaviour. The simplest demonstration is to use a percentile bootstrap to get a confidence interval for the population mean when sampling from a skewed distribution. It’s very striking for students to realise that a so-called 95% CI can actually be an 88% CI.
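Here is a minimal simulation along those lines, assuming lognormal data, n = 20, and a percentile bootstrap CI for the population mean; the exact coverage you see will depend on the sample size and on how skewed the population is:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sim, n_boot = 20, 2000, 1000
true_mean = np.exp(0.5)                      # mean of a lognormal(0, 1)

covered = 0
for _ in range(n_sim):
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    # Percentile bootstrap: resample, take the mean, keep the 2.5% and 97.5% quantiles.
    boot_means = rng.choice(sample, size=(n_boot, n), replace=True).mean(axis=1)
    lo, hi = np.quantile(boot_means, [0.025, 0.975])
    covered += (lo <= true_mean <= hi)

print(f"empirical coverage of the 'nominal 95%' CI: {covered / n_sim:.3f}")
# Typically well below 0.95 for small n with this much skew.
```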

1 Like

Yes, the nonparametric bootstrap percentile method has awful confidence coverage. I’ll bet you’d also be disappointed in bootstrap p-value accuracy. I would rather go with a Bayesian model that has a non-normality parameter and a variance ratio parameter (for the two-sample t-test setup) to get exact inference.
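For readers who want to experiment, here is a minimal sketch of such a model for the two-sample setup, written with PyMC purely as an illustration: a Student-t likelihood whose degrees-of-freedom parameter plays the role of the non-normality parameter, and separate group SDs giving a variance-ratio parameter. The priors, data, and variable names are all illustrative assumptions, not a prescription:

```python
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(7)
y1 = rng.normal(0.0, 1.0, size=40)           # hypothetical group 1
y2 = rng.normal(0.5, 2.0, size=40)           # hypothetical group 2
pooled_sd = np.concatenate([y1, y2]).std()

with pm.Model() as robust_model:
    mu = pm.Normal("mu", mu=0.0, sigma=10 * pooled_sd, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=10 * pooled_sd, shape=2)
    nu = pm.Exponential("nu_minus_one", 1 / 29) + 1    # non-normality parameter
    pm.Deterministic("nu", nu)
    pm.StudentT("obs1", nu=nu, mu=mu[0], sigma=sigma[0], observed=y1)
    pm.StudentT("obs2", nu=nu, mu=mu[1], sigma=sigma[1], observed=y2)
    pm.Deterministic("mean_diff", mu[1] - mu[0])
    pm.Deterministic("variance_ratio", (sigma[1] / sigma[0]) ** 2)
    idata = pm.sample(2000, tune=2000, target_accept=0.9)

print(az.summary(idata, var_names=["mean_diff", "variance_ratio", "nu"]))
```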

1 Like

I think there are a lot of educational benefits in teaching the bootstrap, if only to learn about simulations and to really understand sampling distributions, but if the goal is to compare two means, this Bayesian approach using t distributions works very well:

1 Like

But beware. The bootstrap does not mimic sampling distributions in general, and simulations using resampling of one dataset are different from Monte-Carlo simulations of different datasets, which create true sampling distributions. I think of the bootstrap as an exceptionally handy approximate method that is better than not using the bootstrap, but that doesn’t get us all the way there.
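To make that distinction concrete, here is a small contrast of the two, both aimed at the sampling distribution of a mean: the bootstrap resamples the one observed dataset, while the Monte-Carlo simulation draws fresh datasets from the population. Only a line changes between them; the distribution and sample size are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_rep = 30, 5000

# One observed dataset from a skewed "population".
sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)

# Bootstrap: resample the single observed dataset with replacement.
boot_means = rng.choice(sample, size=(n_rep, n), replace=True).mean(axis=1)

# Monte Carlo: draw fresh datasets from the population itself.
mc_means = rng.lognormal(mean=0.0, sigma=1.0, size=(n_rep, n)).mean(axis=1)

# The bootstrap distribution is centred on the observed sample mean,
# the Monte-Carlo (true) sampling distribution on the population mean.
print("bootstrap:   mean %.3f, sd %.3f" % (boot_means.mean(), boot_means.std()))
print("monte carlo: mean %.3f, sd %.3f" % (mc_means.mean(), mc_means.std()))
```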

3 Likes

Absolutely: once students understand the bootstrap and what it tries to achieve, it is much easier to introduce Monte-Carlo simulations. In simple examples, very little code needs to be changed to switch from bootstrap to a proper simulation. Also, dealing with the full bootstrap distributions (as opposed to only looking at the p values and the confidence intervals) is a great stepping stone to Bayesian posterior distributions. Even if they are not the same, it is a radical change of perspective to consider a full distribution of plausible parameters as opposed to a few summary statistics and the temptation of arbitrary dichotomies.

2 Likes