There is a real-life example, though not from the context of comparing two treatments. See section 1.5.1 of my book:
Newcombe RG. Confidence intervals for proportions and related measures of effect size. Chapman & Hall/CRC Biostatistics Series, Taylor & Francis, Boca Raton, FL, 2012. ISBN: 978-1-4398-1278-5.
An example where a p-value really is more informative
Occasionally, the issue really is one of deciding between two hypotheses, and it would not be particularly helpful to estimate a parameter. Roberts et al (1988) reported uroporphyrinogen activity among 91 individuals in 10 kindreds affected by familial porphyria cutanea tarda. Ostensibly, families 1 to 6 showed the classic pattern in which the distribution of enzyme activity is bimodal, with low activity in about half the family members, including those already affected. Families 7 to 10 showed an apparently quite different pattern, in which no members had low levels of activity, irrespective of disease phenotype. It seemed highly plausible that the relationship between disease and enzyme was different in these kindreds, but how should this be quantified?
Several different models were fitted to the data. Each model tested the null hypothesis H0 that the association between disease and enzyme was the same in all 10 families against the alternative H1 that a different relationship applied in the two groups of families, 1 to 6 and 7 to 10. Each model fitted yielded a p-value of the order of 10^-6 indicating that H0 should be rejected in favour of H1. However, these analyses took no account of the fact that this particular split of the 10 families into two groups was on a posteriori grounds. There are 2^(10-1) = 512 possible ways of segregating a set of 10 families into two groups. Clearly we do not want to consider the ‘split’ in which all 10 families are in the same group, so this reduces to 511. We can allow for the multiplicity of possible ways of segregating the 10 families into two groups here, by calculating a modified p-value p* = 1 - (1-p)^511. The effect of doing so is approximately the same as grossing up the p-value by a factor of 511. When we do so, it remains very low, at around 0.001 or below. So our data present robust evidence that the four aberrant families really do present a different relationship between enzyme and inheritance of disease.
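The arithmetic of this adjustment can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and p = 10^-6 is a stand-in for "of the order of 10^-6", not one of the p-values actually obtained from the fitted models.

```python
import math


def count_two_group_splits(n_families: int) -> int:
    """Ways to divide n labelled families into two unlabelled groups,
    excluding the 'split' that puts every family in the same group."""
    return 2 ** (n_families - 1) - 1


def adjusted_p(p: float, m: int) -> float:
    """Modified p-value p* = 1 - (1 - p)^m.

    Computed via log1p/expm1 so that very small p does not lose precision.
    """
    return -math.expm1(m * math.log1p(-p))


m = count_two_group_splits(10)   # 511 possible splits of 10 families
p = 1e-6                         # illustrative value only (assumption)
p_star = adjusted_p(p, m)

print(f"number of splits m = {m}")
print(f"adjusted p* = {p_star:.3e}")
print(f"simple grossing-up m * p = {m * p:.3e}")
```

With this illustrative p, the adjusted value is about 5.1 x 10^-4, almost identical to 511 x p and consistent with the "around 0.001 or below" quoted above.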
This is an example of a situation in which a p-value is unequivocally the most appropriate quantity on which to base inferences. An interaction effect may be calculated, with a confidence interval, to compare the difference in mean log enzyme level between affected and unaffected family members between the two groups of families. However, this does not directly address the simpler issue of whether the disease should be regarded as homogeneous or heterogeneous. The p-value, adjusted in a simple way for multiplicity of comparisons, most closely addresses this issue. But this is the exception rather than the rule. Another possible example is the dysphagia assessment example in section 12.1. Also, see section 9.3 for a generalisation of single-tail-based p-values to characterise conformity to prescribed norms. Even for the drug side effect example in section 11.5.3, the most informative analysis demonstrates the existence of a plateau less than 100% by estimating it, rather than by calculating the p-value comparing the fit of plateau and non-plateau models. For nearly all the datasets referred to in this book, it is much more informative to calculate an appropriate measure of effect size with a confidence interval.
Roberts AG, Elder GH, Newcombe RG, Enriquez de Salamanca R, Munoz JJ. Heterogeneity of familial porphyria cutanea tarda. Journal of Medical Genetics 1988; 25: 669-76.