Choosing statistical paradigms by studying the quality of decisions to which they lead

The overarching goal of statistics is to make decisions in the face of uncertainty. Debates continue to rage about how to actually do this, with some of the choices being

  • frequentist null hypothesis testing
  • frequentist confidence limits
  • likelihood ratios (the likelihood school is similar to Bayes without priors)
  • likelihood support intervals (using relative likelihood)
  • Bayesian hypothesis tests
  • Bayes factors
  • Bayesian posterior probabilities (from subjective, skeptical, or objective Bayes)
  • Full Bayes decisions, maximizing expected utility (the integral of the utility function over the posterior distribution)
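
To make the last item concrete, here is a minimal sketch of an expected-utility decision. Everything in it is an assumption for illustration: the posterior is a made-up Normal(0.3, 0.5) for a treatment effect, the cost of 0.1 is arbitrary, and the names `expected_utility`, `utilities`, and `COST` are hypothetical. The integral is approximated by averaging the utility over posterior draws.

```python
import random
import statistics

def expected_utility(utility, posterior_draws):
    """Monte Carlo estimate of the integral of utility over the posterior."""
    return statistics.fmean([utility(theta) for theta in posterior_draws])

# Hypothetical posterior for a treatment effect theta (benefit > 0),
# represented by draws from Normal(0.3, 0.5). In practice these draws
# would come from a fitted Bayesian model.
rng = random.Random(1)
draws = [rng.gauss(0.3, 0.5) for _ in range(20_000)]

# Hypothetical utilities: treating yields the effect minus a fixed cost
# (side effects, expense); not treating yields zero.
COST = 0.1
utilities = {
    "treat": lambda theta: theta - COST,
    "do not treat": lambda theta: 0.0,
}

# Score each action by its expected utility and pick the best one.
scores = {action: expected_utility(u, draws) for action, u in utilities.items()}
best_action = max(scores, key=scores.get)
print(scores, best_action)
```

The decision depends on the whole posterior and on the utility function, not on whether any threshold was crossed; changing `COST` can flip the choice without changing the data.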

What seems to be missing is head-to-head comparisons of approaches to see which ones optimize utility/loss/cost functions where such functions reflect real, concrete goals.

This is not a comparative study, but Don Berry had a wonderful paper showing how to design a vaccine clinical trial for a Native American reservation in which the objective function was to maximize health of the entire reservation, not just those persons enrolled in the trial.

Does anyone know of comparative studies that inform us of the value of two or more statistical approaches when the goal is making the best decisions?

See this NY Times article for a nice non-statistical description of decision theory.

There’s another Don Berry paper in Statistics in Medicine (1993): paper

I haven’t read it in years, but the abstract says “This paper describes a Bayesian approach to the design and analysis of clinical trials, and compares it with the frequentist approach.”

If only there were :)
I suspect that there will never be enough information to do such a comparison.

That’s a pity, given the arguments that have raged round false positive risks. It’s a really important topic, but statisticians have had roughly zero effect on practice, at least for studies that don’t have the benefit of a professional statistician (that’s the vast majority of them).

As a follow-up question, is there much work on assessing or formalizing current decision-making practice and utility trade-offs (for example, between risk of side effects and chance of improvement) currently made by physicians? Is there a sense in which risk preferences are also tailored to patient preferences/does this appear in data?

So glad you joined the site David. On your “not enough information” point, I wonder if even somewhat artificial, stylized simulation studies would help. They would have to be convincing about the utility function.

There’s a good deal of this in the medical decision making literature, e.g. Steve Pauker’s work. But it’s not done often enough. I did see an excellent utility analysis by the Duke Clinical Research Institute on anticoagulant treatment for atrial fibrillation.

Certainly there is a long history of testing rival estimators by simulation. My 3rd real paper, in 1969, was about comparing estimators for the two-parameter Langmuir binding curve. In those days the simulations were run on an Elliott 803, using Algol, on punched paper tape. Now museum pieces.
http://www.onemol.org.uk/colquhoun-lsfits-1969.pdf

Much later I came back to a similar problem but this time with a very complicated likelihood function and 14 free parameters, http://www.onemol.org.uk/c-hatton-hawkes-03.pdf

In these cases there was a physics-based model for the observations, so simulation could generate realistic ‘data’. In the case of clinical trials, there is no physical model for the data, so the realism of simulated ‘observations’ must be to some extent in doubt. Similarly loss functions must, I’d imagine, be pretty ill-defined in most cases. Nevertheless it might be interesting to test a range of values.

It would be really useful if someone could settle the much simpler question of which of the methods that you list at the start correctly predicts your risk of making a fool of yourself by claiming that you’ve discovered an effect of an intervention when in fact there is nothing but chance behind the observations. I tried to address this much simpler question by simulation in 2014: http://rsos.royalsocietypublishing.org/content/1/3/140216 Needless to say, not everyone agreed with the assumptions that I made, despite their being pretty conventional (point null, simple alternative). One conclusion was that if you assumed a prior of 0.5 then an observation of a p value of 0.0475 would imply a false positive risk of 26%. If that is right it has serious consequences for the reproducibility crisis. Other people contest this value. If we can’t get consensus in answering such a simple question, what hope is there for the much more complex problem that you have posed?
[More recent stuff listed at http://www.onemol.org.uk/?page_id=456 ]
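
The kind of calculation behind a false positive risk can be sketched as follows. This is not the simulation from the paper linked above: it is an analytic version under a normal (z-test) approximation, and the design (two groups of 16, a true effect of 1 SD under the alternative) is assumed here for illustration. The function name `false_positive_risk` and its parameters are hypothetical.

```python
from statistics import NormalDist

def false_positive_risk(p_value, prior_h1, noncentrality):
    """P(H0 | observed p) for a two-sided z-test, point null vs. simple alternative."""
    std = NormalDist()
    z = std.inv_cdf(1 - p_value / 2)   # |z| that gives this two-sided p value
    like_h0 = 2 * std.pdf(z)           # density of |z| under the null
    # Density of |z| under the alternative, where z has mean `noncentrality`.
    like_h1 = std.pdf(z - noncentrality) + std.pdf(z + noncentrality)
    posterior_odds_h1 = (like_h1 / like_h0) * prior_h1 / (1 - prior_h1)
    return 1 / (1 + posterior_odds_h1)

# Assumed design: two groups of 16, true effect of 1 SD under the alternative,
# so the z statistic has mean 1 / sqrt(2 / 16) under H1.
delta = 1 / (2 / 16) ** 0.5
fpr = false_positive_risk(0.0475, prior_h1=0.5, noncentrality=delta)
print(round(fpr, 2))
```

Under these assumptions the result lands in the same region as the figure quoted above, not on it exactly, since the approximation and the conditioning on the observed p value differ from the original simulation. Lowering `prior_h1` raises the risk, which is why the choice of prior is the contested part.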

I tend to not want to put a point probability mass at exactly zero.

I still cling to the idea that there are realistic examples where we can study highly understandable consequences of decisions/loss functions.

I’ve seen Bayesian re-analyses (rather than data simulations) regarding when to stop a trial. E.g. I was reading this today: paper1, and there are others: paper2
Some say we should evaluate the operating characteristics of the design; does this imply that Bayesian and frequentist decisions regarding when to stop will be synchronised?

I realise that Bayesians tend not to like the point null, but I’ve never been able to understand why. It has a long history in statistics and it makes perfect sense to experimenters. One reason for using it is that it’s familiar to users. If practice is ever to be changed, it’s really important to stick to ideas that experimenters can understand (I’m talking about users who don’t have professional statistical help).
I tried to justify this point of view in section 4 of https://arxiv.org/ftp/arxiv/papers/1802/1802.04888.pdf
Please feel free to tear it down. All opinions are welcome.

I don’t really think that prior distributions with a discontinuity at zero have a long history except for cases in which writers are attempting to mimic null hypothesis testing. There are reasons why I personally don’t believe there should be discontinuities in priors.

  1. It does not reflect the true state of knowledge except in rare cases, usually having to do with physics, such as the existence of ESP.
  2. I don’t believe in statistical tests in general (even though I still use them for expediency, I’ll admit), and I don’t like tests against a single point that result in a researcher arguing for any other effect value they please.
  3. When you believe that prior knowledge is such that there should be a nonzero probability of exactly one value of the unknown parameter, the entire analysis is too sensitive to the probability you assign to that one value. A probability of 0.5 is often used by default, without fully thinking through its implications.
  4. In medicine, the null hypothesis cannot strongly be argued to be exactly true in any case I know of, and John Tukey stated that it’s safest to assume that null hypotheses are never true. So if the null hypothesis is that an effect is exactly zero, don’t even bother to collect data.
  5. I think we will make better decisions by just considering the probability that an effect is positive, where a failed treatment can either be one with no effect or one with an effect in the wrong direction. Null hypotheses seem to forget that treatments can actually harm patients, not just fail to help them.
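
Point 5 can be sketched with a continuous prior and no point mass at zero. The example below assumes a Normal prior and a Normal likelihood so that the update has a closed form; the trial numbers (estimate 0.4, standard error 0.2) and the skeptical prior SD of 0.5 are invented, and `prob_effect_positive` is a hypothetical helper name.

```python
from statistics import NormalDist

def prob_effect_positive(estimate, std_error, prior_mean=0.0, prior_sd=1.0):
    """P(theta > 0 | data) with a Normal prior and a Normal likelihood (conjugate update)."""
    prior_prec = 1 / prior_sd ** 2
    data_prec = 1 / std_error ** 2
    post_var = 1 / (prior_prec + data_prec)
    # Posterior mean is the precision-weighted average of prior mean and estimate.
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return 1 - NormalDist(post_mean, post_var ** 0.5).cdf(0.0)

# Hypothetical trial result: estimated effect 0.4 with standard error 0.2,
# combined with a skeptical prior centred on zero.
p_benefit = prob_effect_positive(0.4, 0.2, prior_mean=0.0, prior_sd=0.5)
p_harm_or_nothing = 1 - p_benefit
print(round(p_benefit, 3), round(p_harm_or_nothing, 3))
```

Here harm and absence of benefit are counted together as `1 - p_benefit`, so the possibility that a treatment makes patients worse is part of the answer instead of being hidden behind a two-sided test.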

You are using slightly indirect reasoning IMHO. I don’t really want to know the chance that I was fooled when I believed something. I want to know the chance that I should believe something or the chance that something is true.

They will not be synchronized. Frequentist sequential analysis pays a price for multiple looks, which makes the frequentist approach conservative. Expected sample size is minimized by looking at the data as often as possible with Bayes.
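
The price that frequentist designs pay for multiple looks can be illustrated with a small simulation. This is a sketch under simple assumptions (Normal data with known SD 1, equally spaced looks, an unadjusted two-sided threshold at each look), and it only shows the frequentist side of the argument; `type_one_error` and its parameters are hypothetical names.

```python
import random
from statistics import NormalDist

def type_one_error(n_looks, n_per_look=20, n_trials=4000, alpha=0.05, seed=0):
    """Fraction of null trials declared significant at any of n_looks unadjusted looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_trials):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # Observations have mean 0 and known SD 1, so the null is true.
            total += sum(rng.gauss(0.0, 1.0) for _ in range(n_per_look))
            n += n_per_look
            if abs(total / n ** 0.5) > z_crit:  # z statistic at this look
                rejections += 1
                break
    return rejections / n_trials

single_look = type_one_error(n_looks=1)
five_looks = type_one_error(n_looks=5)
print(single_look, five_looks)
```

With one look the error rate stays near the nominal 5%, while five unadjusted looks push it well above that. Frequentist sequential methods restore the nominal rate by making each look's threshold stricter, which is the conservatism referred to above.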

I know that statisticians often say this, but I don’t think that it makes much sense to experimenters. If you test a series of drug congeners you would be very lucky if 10 percent of them had appreciable beneficial effects, never mind 50 percent. Ben Goldacre put it a bit differently when he pointed out that most bright ideas turn out to be wrong. I don’t know of any attempt to get numerical estimates but I’d guess that in a majority of experiments the effects are close to zero.

I think that there are two ways to deal with the fact that we don’t know the prior. One is to calculate the prior that would be needed to achieve a specified false positive risk. The other is to calculate the false positive risk (FPR) for a prior of 0.5. This I call the minimum FPR, on the grounds that it would be foolish (in the absence of hard evidence) to suppose that the chance of your hypothesis being right was any better than 50:50.

At first I preferred the Reverse Bayes approach (calculate the prior). But it has two disadvantages for non-expert users: (1) the idea of a prior is unfamiliar to most people, and people are suspicious about subjective probabilities; and (2) it runs the risk of FPR = 0.05 becoming another bright line, in place of p = 0.05.

Although accompanying the p value with the minimum FPR would still be overoptimistic for hypotheses with priors less than 0.5, it would still be a great improvement on present practice, IMO.
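
The two options can be written down in a few lines once a likelihood ratio in favour of a real effect is available. The sketch below takes that likelihood ratio as given (the value 3 is invented), and the function names `fpr_from_likelihood_ratio` and `reverse_bayes_prior` are hypothetical; it only rearranges Bayes' theorem for a point null against a simple alternative.

```python
def fpr_from_likelihood_ratio(likelihood_ratio, prior_h1):
    """P(H0 | data) given the likelihood ratio L(H1)/L(H0) and the prior P(H1)."""
    posterior_odds_h1 = likelihood_ratio * prior_h1 / (1 - prior_h1)
    return 1 / (1 + posterior_odds_h1)

def reverse_bayes_prior(likelihood_ratio, target_fpr):
    """Prior P(H1) needed for the false positive risk to equal target_fpr."""
    return (1 - target_fpr) / (1 - target_fpr + target_fpr * likelihood_ratio)

# Hypothetical likelihood ratio of 3 in favour of a real effect.
LR = 3.0
minimum_fpr = fpr_from_likelihood_ratio(LR, prior_h1=0.5)    # 1 / (1 + 3)
required_prior = reverse_bayes_prior(LR, target_fpr=0.05)    # 0.95 / 1.10
print(minimum_fpr, round(required_prior, 3))
```

With this made-up likelihood ratio, a prior of 0.5 gives a minimum FPR of 0.25, and reaching an FPR of 0.05 would need a prior of about 0.86 that the effect is real, which shows how demanding a target of 0.05 can be.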

I could not possibly disagree more. The vast majority of drugs have small nonzero benefits.

This is not consistent with probabilistic thinking that leads to optimal decisions. I simply want to know, with direct Bayesian modeling, the probability that the drug effect is in the right direction given the data and my prior. You’re making it too complex IMHO.

  1. How on earth can you know that the “vast majority of drugs have small nonzero benefits”? I’m certainly not aware of any reason to believe that.
  2. A small non-zero effect might as well be zero for practical purposes. Nor does it invalidate use of a point null (Berger and Delampady, 1987).

That’s pretty much identical with my aim. I want to know the Bayesian probability that there is a non-zero effect. (I can easily see if it’s in the right direction by looking at the sign.)