A problem here is that we are making proposals with no experimental evidence to back up our claims about what would be best to do under research realities. Hence we can only appeal to reasoning and experiential anecdotes, a combination which (as anyone involved in real-world applications can attest) is best used with caution until more controlled study is available.
Here’s my anecdote for today: What you wrote is the same argument I got from professors a half-century ago, when as a student I complained that the statistics were failing to capture uncertainty. Their answer was always along the lines of “we all know that, but they do capture uncertainty within [or given] the model”; some would even mention this in lectures. The half-century since has shown that this rationalization for applying “uncertainty” to such vastly incomplete measures did nothing to stop the treatment of conventional statistics as if they captured all uncertainty.
Not all researchers made that mistake, but vast amounts of the literature proceeded as if p<0.05 or a 95% CI excluding the null was grounds for high certainty about an effect (a criterion editors also used for publication), and as if p>0.05 or a 95% CI including the null was grounds for high certainty that there was no effect. So by the beginning of the present century I could only conclude that the kind of proposal you presented (which has a long history) had failed very badly in practice.
This conclusion led me first into promotion of multiple-bias analysis, which tries to enter into the analysis model all uncertainty sources seen as potentially important, as discussed in Greenland, S. (2005). Multiple-bias modeling for analysis of observational data, with discussion. Journal of the Royal Statistical Society, Series A, 168, 267-308. As some discussants noted, however, and as soon became clear, that approach required far too much effort and sophistication to ensure proper use by most researchers.
I thus turned to how to improve interpretation of conventional statistics, which was plagued by misinterpretations such as those catalogued in
Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N., Altman, D.G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. The American Statistician, 70, supplement 1, 1-12, available at https://amstat.tandfonline.com/doi/suppl/10.1080/00031305.2016.1154108/suppl_file/utas_a_1154108_sm5368.pdf.
The authors of that paper diverged about how best to proceed, ranging from some arguing conservatively (along your lines) for retaining traditional treatments stated with more care, to others arguing for a complete change to Bayesian methods.
One reform item they all could accept was resurrecting the original, more modest compatibility interpretations of conventional statistics, as seen long before Fisher, Neyman, and others fused statistical reporting with the modern statistical theory of survey and experimental design. The premodern exemplars seemed content to use P-values and interval estimates in a role describing the relation of data to models and to hypotheses within models, as with Pearson’s use of compatibility when presenting goodness-of-fit P-values (a descriptor that Fisher also used at times)*.
In that premodern era, only a rarefied elite participated in research and reporting, and they were under none of the intense time and funding pressures that became the norm in the second half of the 20th century. Yet even in that era we can find complaints about misinterpretation of large P-values (“nonsignificance”) as if they supported there being no difference*. Thus we should not be surprised that misinterpretations became the norm when, after WWII, science opened up to a flood of newly minted researchers subject to the pressures of modern academia, including pressures to downplay the uncertainties about their own results.
In sum, writers have been lamenting these misinterpretations for well over a century, and the old advice to retain usage of “uncertainty given the model” appears to have done nothing to stop over-interpretations of conventional statistics. Hence my push to take the not-so-radical step of retreating to the more modest treatments of statistics found before the rigidly formalized approaches of Neyman and others took control of theory, and well before the caricatures of those approaches called “NHST” and “confidence interval estimation” took control of practice. How much this reform effort will improve reporting will only be seen if it is adopted in place of traditional treatments that insist on treating conventional statistics as uncertainty measures.
*For citations central to this history, see sec. 3 of
Greenland, S. (2023). Divergence vs. decision P-values: A distinction worth making in theory and keeping in practice (with discussion). Scandinavian Journal of Statistics, 50(1), 1-35, https://arxiv.org/ftp/arxiv/papers/2301/2301.02478.pdf, https://onlinelibrary.wiley.com/doi/10.1111/sjos.12625, discussion 50(3), 899-933, corrigendum 51(1), 425.