I’ll be interested in Frank’s reply. Mine:
- I would never limit a vague term like “evidence” to model comparisons, especially when the bulk of the Fisherian literature (including Fisher himself) treats small p as evidence against its generative model - and it is evidence by any ordinary-language meaning of “evidence”, with no single comparator.
- I would also modify your phrase to say “surprising if the assumed model had generated the data”, to emphasize the hypothetical nature of the model. In ‘soft-science’ modeling we rarely have “true” models in the ordinary meaning of “true”, so labeling them as such invites confusion; but we can still run thought experiments with models to gauge how far the data fall from them, which is the source of P-values and frequentist arguments.
- If p<1 there are always “better” models, such as saturated models that fit the data perfectly, with p=1 and the maximum likelihood possible under background constraints. I think this illustrates how P-values and likelihoods alone are insufficient for model selection: at the very least we need penalization for complexity (as in AIC) and for implausibility (as with informative priors). See the first sketch after this list.
- Finally, a point needing clarification: why do you want to convert a P-value in the way you indicated? If you have an alternative comparable to your test model, compute the P-value and S-value for that alternative to get the information against it (the second sketch below illustrates this). And if you want to compare models, then use a measure built for that purpose - which again is not a P-value, since a model comparison needs to account for complexity and plausibility.
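To make the saturated-model point concrete, here is a minimal numerical sketch. The multinomial counts and the equal-probability “test model” are hypothetical, chosen only for illustration: the saturated model attains p=1 and the maximum attainable likelihood, yet AIC can still favor the simpler model once complexity is penalized.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical multinomial counts in k = 4 categories (illustrative only)
counts = np.array([18, 30, 24, 28])
n, k = counts.sum(), len(counts)

def multinomial_loglik(counts, probs):
    """Multinomial log-likelihood, up to the constant combinatorial term."""
    return float(np.sum(counts * np.log(probs)))

# Test model: equal cell probabilities (no estimated parameters)
ll_test = multinomial_loglik(counts, np.full(k, 1 / k))

# Saturated model: cell probabilities set to the observed proportions,
# so the fit is perfect and the likelihood is the maximum possible
ll_sat = multinomial_loglik(counts, counts / n)

# Deviance goodness-of-fit statistic and P-value for the test model
G2 = 2 * (ll_sat - ll_test)
p_test = chi2.sf(G2, df=k - 1)
p_sat = 1.0  # saturated model: zero deviance by construction, so p = 1

# AIC penalizes the saturated model's k - 1 free parameters
aic_test = -2 * ll_test
aic_sat = -2 * ll_sat + 2 * (k - 1)
print(f"test model:      p = {p_test:.2f}, AIC = {aic_test:.1f}")
print(f"saturated model: p = {p_sat:.2f}, AIC = {aic_sat:.1f}")
# With these counts the saturated model has p = 1 and the larger likelihood,
# yet AIC prefers the simpler test model once complexity is penalized.
```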
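And a minimal sketch of the S-value point: from the same data one can compute a P-value and S-value (S = -log2(p), the Shannon information against the model supplying p) for both a test model and an alternative. The estimate, standard error, and the effect values 0 and 1 below are hypothetical, used only to show the mechanics.

```python
import numpy as np
from scipy.stats import norm

def s_value(p):
    """S-value: Shannon information (bits) against the model yielding p."""
    return -np.log2(p)

# Hypothetical two-sided z-test: estimate and standard error are assumptions
estimate, se = 0.9, 0.4

# P- and S-value testing the null model (effect = 0)
p_null = 2 * norm.sf(abs((estimate - 0.0) / se))

# P- and S-value testing an alternative model (effect = 1.0), same machinery
p_alt = 2 * norm.sf(abs((estimate - 1.0) / se))

print(f"effect = 0: p = {p_null:.3f}, S = {s_value(p_null):.2f} bits")
print(f"effect = 1: p = {p_alt:.3f}, S = {s_value(p_alt):.2f} bits")
# Here the data carry about 5.4 bits of information against the null model
# but under 1 bit against the alternative at effect = 1.
```

Neither S-value compares the two models; each measures information against one model at a time, which is why model comparison needs a different tool.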