BBR Session 4: Hypothesis Testing, Branches of Statistics, P-values

Same issue here - I cannot open the .mkv file
thanks

VLC Media player is a free program that played it for me.


Figured that out, but now to find someone with admin rights to my machine to download it
thanks!

After viewing the lecture video, I get the impression that frequentist statistics do not answer the question we want to ask, or perhaps answer it only in a roundabout way (by disproving the null rather than estimating the probability that the alternative is true). I also find the correct definitions of p-values and confidence intervals rather unintuitive (“in the long run”), since the data we have are usually the only data we will ever have.

I can imagine that there must be other reasons out there sustaining the frequentist vs Bayesian debate that keep people in the frequentist camp. Could someone enlighten me on this?

You’ll see very long discussions about that on stats.stackexchange.com. In my personal opinion, the reasons people stay frequentist are:

  1. Stat courses teach mainly what the instructor thinks the students will need in the future, and most instructors were taught by the previous generation of instructors who felt the same way.
  2. Fisher had a huge PR machine, traveled all over the world, and had lots of students to spread the word.
  3. Bayes died before any of his work was known, and he had no students.
  4. Frequentists feel that their methods are more objective. Bayesians are certain that frequentist methods are more subjective.
  5. When frequentism was on the rise, the computational tools needed to do Bayesian statistics were not available. Some frequentist methods can be done without a computer.
  6. Many statisticians are afraid of change just like other people. I feel there is a correlation between clinging to outdated statistical software such as SAS and clinging to frequentism.
  7. Some people don’t want to change because they feel they are in effect criticizing their former selves. I just think “Onward and upward!”.

So, would science become better if more people converted to Bayesian statistics, just as some frequentists have proposed making it better with a lower P-value threshold?

And for those who remain reluctant to convert to the Bayes camp, how can they use P-values and confidence intervals correctly? It is hard to think of a way now, since they have such a counter-intuitive meaning.

It is extremely difficult to use confidence intervals (better called compatibility intervals) and p-values correctly in my opinion. Some attempts are here. I firmly believe that science would be far better by attempting to answer the right question instead of more easily answering the wrong question. On that point see my take on Tukey’s classic statement here.

Then, would it be fair to say that the ease of getting p-values and CIs enables their mindless use in scientific research?

That’s a contributing factor.

Hi there!
Late arrival here, trying to catch up with the course - which is great, by the way, thank you!
I found the 4th session on statistical inference fascinating but also challenging, despite the fact that this is not the first time I am trying to wrap my head around frequentist vs Bayesian. I have a couple questions regarding this lecture:

  1. Concerning your often-repeated “bon mot” about the conclusions one can draw from a p-value above the significance threshold (“The money was spent”): I agree that the p-value alone says very little in that case. However, if a study had a correct sample size estimation, made reasonable assumptions about distributions, chose appropriate tests and then arrived at a non-significant p-value, surely there must be more one can conclude. Those previous steps could of course also be faulty, but that possibility would be just as real had the p-value been significant. What would you suggest authors of such well-conducted studies do when faced with a non-significant p-value?
  2. I have listened several times but still can’t really wrap my head around why a type I error or α is not really an error - the chance of erroneously rejecting a true null-hypothesis seems to me like probability of making a mistake. Could you maybe provide a reference where this is elaborated on in more detail?
  3. Despite the fact that I have spent much time already reading up on these issues, I still find it difficult to convey the gist of it to my wife who is a full-time, busy researcher with the classical graduate-school education in basic frequentist statistics. Could you maybe point me to a reference which one could use as a “Come to the dark side!” introduction? :slight_smile:
    Thank you very much!
    Best regards,
    Aron Kerenyi, MD, PhD

The p-value is a post-study quantity but the other things you listed are pre-study concepts that do not directly apply once the study is over. So it is not possible to conclude anything other than “with the current sample size the data were unable to overcome the supposition of no effect”.

Think of what you do when a test statistic exceeds an arbitrary critical value or the p-value is less than an arbitrary cutoff. Your interpretation is fully conditional on there truly being no effect. So type I error probability, while possible to think of as an “alarm probability”, cannot tell you anything about the error you are interested in: a treatment not working when you interpret the evidence as saying it is working. You can’t assume X is true and then do a calculation about the probability that X is true. See the smoke alarm analogy in Statistical Thinking - My Journey from Frequentist to Bayesian Statistics.
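
To make the smoke-alarm point concrete, here is a minimal simulation sketch; the numbers (10% of candidate treatments truly work, a 0.5 SD effect when they do, 50 patients per arm, two-sided α = 0.05) are illustrative assumptions, not from the lecture:

```python
# Illustrative simulation: P(significant | no effect) vs. P(no effect | significant).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_per_arm, alpha = 20000, 50, 0.05
works = rng.random(n_trials) < 0.10        # assume 10% of treatments truly work
effect = np.where(works, 0.5, 0.0)         # true standardized effect when they do

sig = np.empty(n_trials, dtype=bool)
for i in range(n_trials):
    a = rng.normal(0.0, 1.0, n_per_arm)              # control arm
    b = rng.normal(effect[i], 1.0, n_per_arm)        # treated arm
    sig[i] = stats.ttest_ind(b, a).pvalue < alpha    # the "alarm" goes off

# What alpha controls: P(significant | truly no effect), about 0.05
print(sig[~works].mean())
# The error you actually care about: P(truly no effect | significant), about 0.4 here
print((~works[sig]).mean())
```

The two printed numbers differ a lot, which is the point: the first is fixed by design at about α, while the second depends on the prior chance that the treatment works and on the power of the study.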

Look at Introduction to Bayes for Evaluating Treatments and Richard McElreath’s Statistical Rethinking book.

Thanks for the terrific questions and please follow-up with more questions so I can try to make this clearer.

Thank you for your detailed answer, Frank! It has been enlightening to read your journey towards Bayes, and with many great references for further reading!

Regarding the non-significant p problem, I think I understand what you mean, i.e. having collected the data only affects p; all the other aspects I mentioned were decided pre-study. I still think that when faced with a large p-value, it might be worth re-iterating those pre-study decisions, i.e.:
If the true effect size were at least as large as the threshold pre-defined in the power calculation, then out of 20 repetitions of the experiment about 4 would be expected to yield similarly non-significant p-values (with a power of 0.8, i.e. β = 0.2). However, if there were truly no effect, about 19 out of 20 experiments would yield similarly non-significant results (at α = 0.05).
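
A quick simulation sketch of that replication arithmetic (the design is hypothetical and only for illustration: a two-sample t-test with 64 patients per arm and unit standard deviation, which gives roughly 80% power for a true standardized effect of 0.5 at a two-sided α of 0.05):

```python
# Hypothetical design: fraction of replicated experiments that come out
# "non-significant" under a true effect at the powered-for size vs. no effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def frac_nonsignificant(true_effect, n_per_arm=64, reps=5000, alpha=0.05):
    """Fraction of simulated experiments with a two-sided p-value >= alpha."""
    nonsig = 0
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n_per_arm)          # control arm
        b = rng.normal(true_effect, 1.0, n_per_arm)  # treated arm
        if stats.ttest_ind(b, a).pvalue >= alpha:
            nonsig += 1
    return nonsig / reps

print(frac_nonsignificant(0.5))  # ~0.2: about 4 in 20 replications (beta)
print(frac_nonsignificant(0.0))  # ~0.95: about 19 in 20 replications
```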

Another idea would be to have a rescue, a Bayesian backup plan for non-significant p-values. I guess the problem is that you can’t really define reasonable priors having already looked at your data. But if one made such a “contingency plan” beforehand, could this be a way to re-interpret the data with more meaning? Or maybe just use flat priors? (Of course, if one buys into Bayesian thinking then one would probably not do a frequentist analysis to begin with. But let’s assume a very frequentist journal editor for the sake of the argument.)

Hi Aron. I still think that bringing pre-study design characteristics into the interpretation is not that helpful. As an aside, what (little) meaning confidence intervals have, they have independently of those pre-study parameters.

I don’t think the idea of Bayesian “rescues” is very strong from the standpoint of scientific rigor, unless possibly when a very skeptical prior distribution still results in a meaningful posterior distribution for the question at hand. We should plan Bayesian analyses prospectively as either the primary analysis or a planned parallel secondary analysis as we did for the recently reported ISCHEMIA study.
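
As a minimal sketch of how a skeptical prior changes the picture relative to a nearly flat one (a conjugate normal approximation; the observed difference of -8 mmHg with a standard error of 5 and the prior SDs are purely hypothetical numbers, not from ISCHEMIA):

```python
# Conjugate-normal sketch: posterior under a skeptical vs. a nearly flat prior.
import numpy as np
from scipy import stats

def normal_posterior(est, se, prior_mean, prior_sd):
    """Posterior mean and sd for a normal likelihood with a normal prior."""
    w_data, w_prior = 1 / se**2, 1 / prior_sd**2
    post_var = 1 / (w_data + w_prior)
    post_mean = post_var * (w_data * est + w_prior * prior_mean)
    return post_mean, np.sqrt(post_var)

# Hypothetical result: observed mean difference of -8 mmHg, standard error 5
for label, prior_sd in [("skeptical", 3.0), ("nearly flat", 100.0)]:
    m, s = normal_posterior(est=-8.0, se=5.0, prior_mean=0.0, prior_sd=prior_sd)
    p_benefit = stats.norm.cdf(0.0, loc=m, scale=s)   # P(true effect < 0 | data)
    print(f"{label}: posterior mean {m:.1f}, sd {s:.1f}, P(benefit) {p_benefit:.2f}")
```

With these made-up numbers the nearly flat prior gives a probability of benefit around 0.95 while the skeptical prior pulls it down to roughly 0.8, which is the sense in which a conclusion that survives a skeptical prior is more convincing.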

Thank you again, Frank; good point about the CIs.

And thank you for pointing me to the ISCHEMIA study; it was fascinating to read the published methods paper and then see the results presentation. As far as I understand, if a Bayesian parallel analysis had not been performed, the study’s conclusion (“… did not reduce the overall rate…”) could also have been called into question, i.e. for not stating “our study could not reject the null hypothesis”. And since virtually flat priors were used, I guess a similar Bayesian analysis could also be performed for all the trials that “overclaim” and in most cases would lead to the same conclusion.

My point is that even though it is statistically incorrect to conclude anything other than “the money was spent” from a non-significant p-value, physicians are still likely to draw the probably correct clinical conclusion that the treatment effect is small, if it exists at all. Maybe the population of non-significant trials is not the best suited for demonstrating the advantages of Bayesian statistics.

ps: When reading the section on primary outcome your message about using time-to-event instead of binary outcomes resonated well :slight_smile:

Not quite, otherwise you could just use N=2 for clinical trials. Abandon the p-value. If not using Bayes then emphasize the compatibility intervals.

Hi Frank & others,
Just want to clarify some thoughts. In the section on confidence intervals (or compatibility intervals), it was mentioned that if the compatibility interval includes both a large positive effect and a large negative effect, then you really cannot draw any conclusion about the treatment effect (not the exact wording as I remember it, but that’s the impression I got). What if the compatibility interval is larger on one side but includes the estimate of no effect? For example, for a blood pressure treatment A, say the compatibility interval was consistent with an increase in systolic blood pressure as large as 2 mmHg, as well as with a decrease as large as 30 mmHg. One would tend to interpret this as a useful treatment (if no other options are available). I tried to choose my words carefully so as to separate the statistical statement from the clinical interpretation. Would this kind of thinking be reasonable? Or should the interpretation be the same as in the example in the video, that we still can’t really draw any conclusions about the treatment?
Many thanks.
andy

Good question, and I hope that others weigh in. There is a danger in converting compatibility-interval thinking into p-value thinking, i.e. in looking hard at whether an interval includes zero. But to your question, I’d say the data are incompatible with a large harm and don’t say much about either small or large benefit.
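
For concreteness, here is a rough sketch of reading Andy’s hypothetical interval (a decrease of up to 30 mmHg through an increase of up to 2 mmHg) as a compatibility statement; the back-calculated point estimate and standard error assume an approximately normal estimate and are purely illustrative:

```python
# Reading a 95% interval as "which true effects are the data compatible with?"
from scipy import stats

lo, hi = -30.0, 2.0                 # hypothetical 95% interval (mmHg change)
est = (lo + hi) / 2                 # implied point estimate: -14 mmHg
se = (hi - lo) / (2 * 1.96)         # implied standard error: ~8.2 mmHg

for h0 in [-40, -30, -14, 0, 2, 10]:            # hypothesized true effects
    z = (est - h0) / se
    p = 2 * stats.norm.sf(abs(z))               # two-sided p-value against h0
    print(f"true effect {h0:+d} mmHg: p = {p:.3f}")
```

Effects inside the interval (e.g. 0 or -14 mmHg) have p ≥ 0.05 and so remain compatible with the data, while a 10 mmHg rise in blood pressure (a large harm) does not, which matches the verbal interpretation above.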

Yes, that makes sense: interpret the obvious and don’t over-interpret it. Thank you.

Catching up on these lectures. Thanks again @f2harrell, they are such a delight to watch.
Could I ask for follow-up information on your multiplicity discussion? You describe it in terms of “backward time ordering of information”; do you have any pointers to where I can read more on this view of multiplicity?

The only thing I can think of right now is this.