The textbooks rarely explain this well, but the “hypothesis test” is a decision procedure (reject when p < \alpha) for making an educated guess about a course of action with a fixed error probability. This can be useful in the context of the “inductive behavior” of an individual agent, but the value of these individual decisions as measures of evidence for a community of scientists is questionable.
Another point: the textbook calculations condition on the null (they proceed as if it were true), but outside of certain special contexts we should not assume that the null is in any sense “true.” The null is a point of reference that can be made to hold, at least approximately, via randomization. A purist Bayesian might question the need for randomization, but will still comprehend the result.
A two-sided p value can be interpreted as a measure of surprise: indirect evidence for an alternative (i.e., that the observed sign is the “true” sign in the population). This continuous interpretation of p values permits different studies to be combined, and is better suited to the needs of a community of scientists.
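As a concrete illustration of combining studies, here is a minimal sketch of Fisher’s combination method, one standard way to pool independent p values (the function name and example values are mine, not from the links below). Under a shared null, X = -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom, and for an even number of degrees of freedom the survival function has a closed form:

```python
import math

def fisher_combine(pvals):
    """Fisher's method: combine k independent p values into one.

    Under the shared null, X = -2 * sum(ln p_i) is chi-square
    distributed with 2k degrees of freedom.
    """
    k = len(pvals)
    x = -2.0 * sum(math.log(p) for p in pvals)
    # Chi-square survival function for even df = 2k has the closed form
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))

# Three individually unimpressive studies pool into stronger evidence:
print(fisher_combine([0.08, 0.12, 0.05]))  # ≈ 0.018
```

Note that no single study here clears the conventional 0.05 threshold twice over, yet the combined p value does; this is the kind of evidential pooling a dichotomous reject/accept report throws away.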
Some key papers on p values I’ve found helpful:
https://www.sciencedirect.com/science/article/pii/S002224961600002X
The authors give a Bayesian perspective on p values that is critical for proper interpretation.
Senn describes Fisher’s motivation for calculating p values vs. the Bayesian use of prior distributions. For this simple problem, the amount of prior weight placed on the null makes a world of difference when testing a precise null vs. a dividing one; for the frequentist calculations, it matters much less.
P values, S-values, and relations to measures of information
The link to Sander Greenland and Zad Rafi’s paper is also critical; it connects statistical methods to information theory. More useful than a single \alpha level or p value for a specific parameter is the set of intervals obtained as \alpha ranges from 0 to 1. This set of “confidence intervals” goes by the equally misleading name of “confidence distribution”.
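The S-value transform at the heart of the Greenland/Rafi paper is just the Shannon surprisal of the p value, s = -log2(p), measured in bits. A quick sketch (function name mine):

```python
import math

def s_value(p):
    """Shannon surprisal of a p value, in bits: s = -log2(p).

    s bits of information against the test hypothesis is as surprising
    as seeing s heads in a row from a fair coin.
    """
    return -math.log2(p)

for p in (0.05, 0.01, 0.005):
    print(f"p = {p}: about {s_value(p):.1f} bits")
# p = 0.05 gives about 4.3 bits -- roughly as surprising as
# four heads in a row, which is hardly overwhelming.
```

The coin-flip reading makes the scale tangible: conventional “significance” at 0.05 carries only a little over four bits of information against the null.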
https://rss.onlinelibrary.wiley.com/doi/10.1111/rssa.12647
Shafer’s recent paper frames hypothesis testing as gambling, a useful way to think about statistical problems and methods.
FWIW, I like the perspective Shafer takes, but I am not terribly enthusiastic about his proposed measure for doing away with p values. A similar game-semantic framework could be developed using the Greenland/Rafi S-value, along with the Bayesian machinery of Berger et al.
After studying the first two papers, you will find that none of the commonly used “significance levels” constitutes “strong evidence” when interpreted as a Bayes factor or as bits of information.
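One way to see this is the well-known Sellke–Bayarri–Berger lower bound on the Bayes factor in favor of the null, BF ≥ -e·p·ln(p) for p < 1/e. A sketch (function name mine):

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor in favor
    of the null: BF >= -e * p * ln(p), valid for p < 1/e."""
    assert 0 < p < 1 / math.e
    return -math.e * p * math.log(p)

for p in (0.05, 0.01, 0.005):
    bf = min_bayes_factor(p)
    print(f"p = {p}: BF bound {bf:.3f} (odds against null at most {1 / bf:.1f}:1)")
# p = 0.05 gives a bound of about 0.41: even in the best case, the data
# shift the odds against the null by no more than roughly 2.5 to 1.
```

So a “significant” p = 0.05 corresponds, at best, to odds of about 2.5:1 against the null, and p = 0.01 to about 8:1 — far short of what anyone would call strong evidence.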