Musing on how to design "statistical functions/methods 2.0"


(Applied) statisticians and knowledgeable consumers of statistical methods know that most (if not all) methods come at a price: they have assumptions.

Unfortunately, not all consumers of statistical methods, or indeed of their products (publications and "bold" media/press releases to the public; think of, but not restricted to, the tonnes of Covid-19 (pre-)publications being churned out under "less stringent", aka "accelerated publication track", peer-review processes):

(1) Know about the assumptions of those methods, or, if they do,
(2) Know how to test for and interpret those assumptions, and
(3) Once tested, know where to go next if the data fail to meet those assumptions.

We largely, but not exclusively, have point-and-click statistical software to thank for that!


As you all know, there is an entire universe of product-safety and labelling legal requirements in most parts of the world:

If you buy a bottle of wine (most of them at least), there will be a footnote telling you it contains sulphites etc.

If you buy a cigarette, you not only get a written warning, but a lovely graphical accompaniment.

If you get a drug, it comes with information on side effects etc.

If you buy a car, safety features such as seat belts come by default instead of as an added option.

However, if you run a statistical function/method, e.g. Cox or logistic regression, you are expected to find your own way (assuming that you can) to testing its assumptions.

In my (albeit naive) view, statistical methods are the gatekeepers (i.e. even more important!) of safety, in terms of "novel findings", "confirmatory/replication research" and "policy-changing studies", within research.

Statistical functions/methods 2.0?

Therefore, how about inverting how we design and implement statistical products? Instead of the designers and programmers of statistical programs/functions passing the burden of testing model assumptions on to their consumers, they take responsibility and test them by DEFAULT, and then make it optional for consumers to turn this off. That is, the consumer has to work a little harder to suppress this part of the statistical function's output, increasing the chances of them having read it in the first place, rather than the other way round.
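To make the opt-out idea concrete, here is a minimal, hypothetical sketch in Python (chosen only for illustration; nothing here is an existing library's API). The function name `fit_line` and its crude curvature diagnostic are invented for this example: it fits a least-squares line and, by default, warns if the residuals trend with curvature in `x`, so the consumer must explicitly pass `check_assumptions=False` to silence the check.

```python
import warnings
from statistics import mean

def _corr(a, b):
    """Pearson correlation of two equal-length sequences (0 if degenerate)."""
    ma, mb = mean(a), mean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (sum((ai - ma) ** 2 for ai in a) *
           sum((bi - mb) ** 2 for bi in b)) ** 0.5
    return num / den if den else 0.0

def fit_line(x, y, check_assumptions=True):
    """Least-squares line fit that diagnoses itself by DEFAULT.

    The diagnostic here is deliberately crude (a check for quadratic
    lack of fit); a real implementation would run proper residual
    diagnostics. The design point is that silencing the check requires
    an explicit check_assumptions=False from the consumer.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    if check_assumptions:
        # Residuals are orthogonal to x by construction, so correlate
        # them with (x - mean(x))**2 to detect curvature / non-linearity.
        curvature = _corr([(xi - mx) ** 2 for xi in x], residuals)
        if abs(curvature) > 0.5:
            warnings.warn(
                "Linearity assumption looks doubtful (curvature in "
                "residuals); inspect residual plots before trusting "
                "the fit, or pass check_assumptions=False to silence.")
    return slope, intercept
```

Usage: `fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])` returns the fit quietly, while fitting a straight line to clearly quadratic data emits the warning unless the consumer actively turns it off.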

Yes, we can say that "consumers of statistical methods, or of the products resulting thereof, need to be aware of these conditions beforehand, or need to consult a statistician", but that is far removed from an ideal world. We have seen published (and, luckily for some minority, retracted) papers that don't even bother to mention whether they tested model assumptions, yet lots of work builds upon those findings (especially if they are the "first" or the "largest" or, worse, the "first and largest" study to show an effect).

An extreme example would be letting industry pass the burden of "consumer beware" on to customers: not requiring them to print "Cigarettes/alcohol/drugs might cause …" on the packaging of their products, but instead expecting consumers to find these things out for themselves, or to consult experts to tell them.

Of course, there is no silver bullet that can solve bad research, but this might be one way to mitigate it: no consumer should be able to say (1) "I did not know about the assumptions" or (2) "I did not know how to test for and/or interpret those assumptions".

Condition (3), knowing where to go next once the data fail to meet the assumptions, might be an optional recommendation for statistical software designers to implement.

What do you think? What other aspects could be improved on?


I am just a junior medic, with limited (non-existent actually, compared to most people on this forum) statistical/programming knowledge & experience but with tremendous respect and admiration for the field of statistics and how it impacts our world and medical practice.


There's a certain pragmatism where assumptions are ignored by convention, e.g. Cox regression is used by default when a shared frailty model would acknowledge the non-independence of siblings etc. Or you see, in the best medical journals, researchers contort their data to apply one of the few stats tools they know, e.g. GEE modelling or Cox. We ought to expect some developments in stats software in the way you describe, but it seems we are going the other way, moving to more rarefied software such that modifications to the analysis will require access to a statistical technician, and these people are scarce and overworked in academia. Let's hope for some new software to change habits …


RE: Assumptions behind widely used statistical procedures.

Classical nonparametric methods should minimize the assumption-validity problems mentioned in the OP, but very few introductory applied texts have presented the underlying decision analysis that examines their properties relative to the more common parametric approaches.

I particularly like the intro stats book written by Gottfried Noether, Introduction to Statistics: The Nonparametric Way, which compares and contrasts frequentist parametric and nonparametric theory, and comes to the same conclusions you can find in BBR.

Suffice it to say, I was very happy to see a top statistical expert (our host Dr. Harrell) combine the advantages of nonparametrics with the Bayesian POV. I had not been clear on how to reconcile the two approaches. I've learned a lot from studying his posts and following up on his citations.


I’ve always loved Noether’s book, but also think that a model-based approach leads to the most downstream possibilities.