When is testing superior to estimation?

Can anybody give me a real world example where testing is superior to estimating?

Say we compare two different treatments (A and B). I find it hard to see under what circumstances testing would be superior to an estimation that gives me two posterior distributions (for A and B).
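To make concrete what I mean by "estimation that gives me two posterior distributions", here is a minimal sketch; the trial numbers and the uniform Beta priors are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial data (invented numbers, purely for illustration):
# treatment A: 30 successes out of 100; treatment B: 40 successes out of 100.
succ_a, n_a = 30, 100
succ_b, n_b = 40, 100

# With uniform Beta(1, 1) priors, the posteriors are Beta distributions.
post_a = rng.beta(1 + succ_a, 1 + n_a - succ_a, size=100_000)
post_b = rng.beta(1 + succ_b, 1 + n_b - succ_b, size=100_000)

# Estimation gives the whole posterior of the difference, not a verdict.
diff = post_b - post_a
print("Pr(B better than A) =", (diff > 0).mean())
print("95% credible interval for B - A:", np.quantile(diff, [0.025, 0.975]))
```

The output contains everything a test would tell me and more, which is why I struggle to see when the test alone would be preferable.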

Can anyone give an example or explain what I am missing here? Thanks. Links to papers/articles/websites are also appreciated.

Edit: if this question is overly broad, please tell me and I’ll close/delete this item.

Continuing the discussion from Most reasonably hypothesised effects cannot be exactly zero?:


I think this is an excellent question. My career has been devoted to clinical trials and health services/outcomes research. Even though I use hypothesis testing for expediency, I do not recall a single example where hypothesis testing was the best way to meet a research project’s goals. Since I became Bayesian, what I want to see is the entire posterior distribution for the unknown effect, the probability that the effect is in the right direction, and the probability that the effect is more than trivial, for a reasonable choice of ‘trivial’. More on that here.
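Those two posterior summaries are trivial to read off once you have the posterior; a hedged sketch (the posterior shape, its parameters, and the ‘trivial’ threshold below are all invented, not from any real trial):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend the posterior for a treatment effect (say, a log hazard ratio,
# where negative means benefit) is approximately normal; the mean, sd, and
# 'trivial' threshold below are invented for illustration.
posterior = rng.normal(loc=-0.2, scale=0.1, size=200_000)
trivial = 0.05

print("Pr(effect in the right direction) =", (posterior < 0).mean())
print("Pr(effect more than trivial)      =", (posterior < -trivial).mean())
```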


What about drug development with generics? I guess equivalence testing is estimation, but really we don’t care what the estimate is as long as it falls within certain boundaries; thus it is the statistical testing that matters, a kind of go vs. no-go decision.

I don’t see that. Go/no-go decisions also need to make use of evidence about the magnitude of effects.

The only advantage I can think of is a psychological one: the fact that the conclusion was the result of a procedure makes it look more ‘objective’. “Our magic test says that Treatment A is superior” sounds more convincing to some than “Looking at the posterior distribution, one sees that it is more probable that Treatment A is superior”.


Though it is possible to use such testing wisely, in reality a procedure that results in positive/negative, reject/not reject, etc. provides only the illusion of objectivity and in effect attempts to move the thinking component from the researcher to a black box.


It still puzzles me how and why NHST got its prominent place in science and medicine. The method is (afaik) good for:

  • frequent decision making
  • in a closed context
  • where an occasional wrong decision is not fatal
  • but where one wants to minimize the number of mistakes

Example: quality control in a screw factory. Based on samples, the quality manager decides whether a batch of screws is deemed “good” or “faulty”. Minimize Type I and Type II errors; a single wrong decision is not fatal.

Contrast this to science in general.

  • the goal is not to make decisions but to gain knowledge
  • the context is usually open; it’s not just H0 vs H1.

Or, when we go to decision making, let’s look at medical decisions:

  • many once in a lifetime decisions (have a bone marrow transplant yes or no?)
  • long-term error rates are not relevant; each individual decision counts. “I don’t care that 80% of all patients survive, I want to know your best calculation of MY chances of survival”
  • and of course a patient is only interested in the relevant probability: the probability of the hypothesis given the data. “Given these data, there is a 95% probability that you have this disease”, rather than the probability of the data given some hypothesis.
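The last point is the textbook Bayes’ theorem calculation; the sensitivity, specificity, and prevalence below are invented numbers for illustration:

```python
# Pr(hypothesis | data) versus Pr(data | hypothesis): a Bayes' theorem
# sketch for a diagnostic test, with invented numbers.
sensitivity = 0.95   # Pr(positive test | disease)
specificity = 0.90   # Pr(negative test | no disease)
prevalence = 0.02    # prior Pr(disease)

# Total probability of a positive test, then Bayes' theorem.
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

# Despite the accurate test, the probability the patient actually cares
# about is surprisingly low because the disease is rare.
print(f"Pr(disease | positive test) = {p_disease_given_positive:.3f}")
```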

As you all probably know, many of these objections were voiced long ago, e.g. by Sir Ronald Fisher in 1955. https://www.phil.vt.edu/dmayo/personal_website/Fisher-1955.pdf

Or am I missing the point here? Thnx.


I don’t think you are missing the point. But even with the screw factory, the question is “how good”, and estimation seems in order. I’m not convinced that a point null is the way to go. A Bayesian analysis would emphasize the probability that the parameter representing “bad” exceeds a given tolerance.
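A minimal sketch of that Bayesian emphasis, assuming invented batch data and a uniform prior on the defect rate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented batch sample: 3 defective screws out of 200 inspected.
defects, n = 3, 200
tolerance = 0.02  # batch counts as 'bad' if the true defect rate exceeds 2%

# A uniform Beta(1, 1) prior gives a Beta posterior for the defect rate.
posterior = rng.beta(1 + defects, 1 + n - defects, size=100_000)

# Instead of a point-null test, report how likely the batch is out of spec.
print("Pr(defect rate > tolerance) =", (posterior > tolerance).mean())
```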


There is a real-life example, but not from the context of comparing two treatments. See section 1.5.1 of my book

Newcombe RG. Confidence intervals for proportions and related measures of effect size. Chapman & Hall/CRC Biostatistics Series, Taylor & Francis, Boca Raton, FL, 2012. ISBN: 978-1-4398-1278-5.

An example where a p-value really is more informative
Occasionally, the issue really is one of deciding between two hypotheses, and it would not be particularly helpful to estimate a parameter. Roberts et al (1988) reported uroporphyrinogen activity among 91 individuals in 10 kindreds affected by familial porphyria cutanea tarda. Ostensibly, families 1 to 6 showed the classic pattern in which the distribution of enzyme activity is bimodal, with low activity in about half the family members, including those already affected. Families 7 to 10 showed an apparently quite different pattern, in which no members had low levels of activity, irrespective of disease phenotype. It seemed highly plausible that the relationship between disease and enzyme was different in these kindreds, but how should this be quantified?
Several different models were fitted to the data. Each model tested the null hypothesis H0 that the association between disease and enzyme was the same in all 10 families against the alternative H1 that a different relationship applied in the two groups of families, 1 to 6 and 7 to 10. Each model fitted yielded a p-value of the order of 10^-6, indicating that H0 should be rejected in favour of H1. However, these analyses took no account of the fact that this particular split of the 10 families into two groups was chosen on a posteriori grounds. There are 2^(10-1) = 512 possible ways of segregating a set of 10 families into two groups. Clearly we do not want to consider the ‘split’ in which all 10 families are in the same group, so this reduces to 511. We can allow for the multiplicity of possible ways of segregating the 10 families into two groups by calculating a modified p-value p* = 1 - (1-p)^511. The effect of doing so is approximately the same as grossing up the p-value by a factor of 511. When we do so, it remains very low, at around 0.001 or below. So our data present robust evidence that the four aberrant families really do present a different relationship between enzyme and inheritance of disease.
This is an example of a situation in which a p-value is unequivocally the most appropriate quantity on which to base inferences. An interaction effect may be calculated, with a confidence interval, to compare the difference in mean log enzyme level between affected and unaffected family members between the two groups of families. However, this does not directly address the simpler issue of whether the disease should be regarded as homogeneous or heterogeneous. The p-value, adjusted in a simple way for multiplicity of comparisons, most closely addresses this issue. But this is the exception rather than the rule. Another possible example is the dysphagia assessment example in section 12.1. Also, see section 9.3 for a generalisation of single-tail-based p-values to characterise conformity to prescribed norms. Even for the drug side effect example in section 11.5.3, the most informative analysis demonstrates the existence of a plateau less than 100% by estimating it, rather than by calculating the p-value comparing the fit of plateau and non-plateau models. For nearly all the datasets referred to in this book, it is much more informative to calculate an appropriate measure of effect size with a confidence interval.
Roberts AG, Elder GH, Newcombe RG, Enriquez de Salamanca R, Munoz JJ. Heterogeneity of familial porphyria cutanea tarda. Journal of Medical Genetics 1988; 25: 669-76.
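The multiplicity adjustment in the excerpt is easy to check numerically, taking p of the order of 10^-6 as stated:

```python
# Check of the multiplicity adjustment from the excerpt: each model gave a
# p-value of the order of 10^-6, and there are 511 admissible two-group
# splits of the 10 families.
p = 1e-6
p_star = 1 - (1 - p) ** 511  # approximately 511 * p for small p
print(f"adjusted p* = {p_star:.6f}")
assert p_star < 0.001  # 'around 0.001 or below', as stated
```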


Thanks very much for this example. But I’m still not convinced. I’d rather have a continuous measure of per-family distribution shape, degree of bimodality, dispersion, etc. I want to describe distribution characteristics along a continuum, and obtain measures of uncertainty for same. What is the dispersion across families of within-family dispersion, for example?