I’ve found many of @Sander_Greenland’s papers on this issue helpful. I’ve posted a number of open-access links to his papers in this thread. For example:
A p-value, from a purely logical perspective, is a continuous, quantitative measure of compatibility with (or surprise at) an asserted, conjectural value. After data collection, it is a percentile in a hypothetical distribution. Where this hypothetical distribution comes from depends on the context: it can come from a population model or, as you mention in point 2, second bullet, above:
> single non-random recruitment of a group of subjects, followed by repeated randomization of these same subjects to one treatment or another (i.e., multiple experiments conducted on the same group of subjects)?
This is exactly how permutation tests work. Under a sharp null assumption of no effect (and the assumption that the groups are exchangeable), you re-label the data repeatedly to build an empirical null distribution, then compare your observed result to that distribution. There are no additional modelling assumptions: the assumptions are known to hold because they are built in by design, which makes this a powerful method for detecting effects.
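The re-labelling idea can be sketched in a few lines. This is a minimal illustration with made-up data (the group values, the number of permutations, and the difference-in-means test statistic are all my choices, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcomes for two groups (made-up numbers for illustration)
treated = np.array([12.1, 9.8, 11.4, 13.0, 10.2])
control = np.array([9.5, 10.1, 8.7, 9.9, 10.4])

observed = treated.mean() - control.mean()

pooled = np.concatenate([treated, control])
n_treated = len(treated)

# Under the sharp null of no effect, the labels are arbitrary:
# re-label (permute) the pooled data to build the empirical null distribution
n_perm = 10_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_treated].mean() - shuffled[n_treated:].mean()

# Two-sided p-value: where does |observed| fall in the null distribution?
p_value = np.mean(np.abs(perm_diffs) >= np.abs(observed))
print(p_value)
```

The p-value here is exactly the percentile interpretation described above: the proportion of re-labelled datasets whose statistic is at least as extreme as the one observed.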