Language for communicating frequentist results about treatment effects

The meaning of words tends to change over time, and it may be that ‘significant’ was initially used by statisticians with the second definition as the intended meaning. But to communicate effectively what we are thinking, it seems best to use words that will be interpreted with the meaning we intend.

Unfortunately, we now have to combat the widespread assumption of definition 3 from that web page: “3. Statistics. Of or relating to observations that are unlikely to occur by chance and that therefore indicate a systematic cause.”

1 Like

Words are often used with slightly different meanings by different people and I doubt there is anything that can change this. Hence, ‘significant’ to one person might suggest more certainty than it does to another person.

I personally like to use terms such as ‘weak evidence’, ‘strong evidence’, ‘very weak evidence’, etc. to describe the results of an analysis. For example, saying that ‘very weak evidence’ has been found for an effect is one way I try to convey the true uncertainty of a result when there really is a lot of uncertainty.

It is certainly difficult to convey uncertainty in a way that will leave the reader appropriately uncertain!

Define “evidence”… it’s a much more complicated thing than one might expect.
Substituting one misunderstood word for another may not be a good solution. I see a similar problem with Frank’s suggestion to give a Bayesian posterior probability (or odds) for the direction of an effect. Probability is a beast, as even statisticians seem to disagree on its meaning, and I foresee that statements like “there is a 92% probability of a positive effect” will (wrongly) be taken to mean “in 92 out of 100 such studies, one would observe a positive effect”…

That is a good point. In the context of summarising a study’s results and giving my own interpretation, the meaning I intend by the word “evidence” here is along the lines of “some information that might be combined with other relevant information that can then potentially be used to help guide a decision”.

The meaning I intend is intentionally vague with the hope that it implies some uncertainty is present in the results, and is less likely to be interpreted otherwise, at least compared to other words that might be used. But I could be wrong and will always consider alternative viewpoints on how words might be commonly interpreted.

Nevertheless, those who study semantics or ‘meaning in language’ point out that words are often ambiguous or vague in their meaning. Hence, interpretations of study conclusions will often vary in terms of the strength of evidence implied, at least a little.

In fact, this feature of language may have evolved because of a necessary tradeoff between our need to communicate quickly and easily and our need for the information to be accurately interpreted (see hdl.handle.net/1721.1/102465). As a result, communication relies heavily on context and often retains some ambiguity. Put another way, a lot of meaning in language is inferred from context, and these inferences will depend on the reader’s knowledge, past experiences, opinions, etc.

1 Like

Yes, I agree, particularly with the note that language is dynamic and that words can be vague and ambiguous, with their meaning usually inferred from context. But… this is the reason why we should use technical terminology (jargon) with well-defined words (a strictly given context, a narrow and explicit meaning). It is difficult, particularly in fields that cross different disciplines (like medicine, which crosses anatomy, physiology, biochemistry, chemistry, physics and mathematics). Each discipline has developed its own jargon, and there are many pitfalls, like similar words with different meanings in different disciplines (and even in sub-disciplines). The term “statistical significance” is, to my knowledge, well defined in statistics, and this jargon-specific definition was simply ignored outside the discipline, which I think is one of the reasons for the current fuss. Sadly, even scientists within a particular discipline don’t use (or understand) their own jargon well enough. Definitions of terms or concepts are unknown or misunderstood (for example, the majority of stats teachers in psychology who were asked for the definition of a p-value gave wrong answers [Haller & Krauss, 2002]).

We spend a lot of time fighting over “right” and “wrong” ways of doing analyses and interpreting results. I think we should first get our jargon right. Many of the problems will disappear, and the relevant questions we have to solve will become much clearer.

That would be a major distortion of what probability means for one-time events. To me we need to spend our effort educating everyone about probability, then quit using “inference” and so many ill-defined words to state results. To me it’s about playing the odds, if we’re not ready for formal decision analysis. Let the reader act on such short statements as “the treatment probably (0.92) lowers blood pressure”.

And that only if the study were actually replicated. Seeing probability as some kind of frequency (either an actual or a hypothetical [limiting] frequency) is a faulty concept that eventually leads to logical problems.

See:
Alan Hájek, 1996
Alan Hájek, 2009

4 Likes

For anybody wondering how this little experiment went, here are some excerpts from the rejection letter (obviously).

Statistical analyses should have been undertaken to investigate if any differences between groups were observed at baseline

in Fig 3, it would be better if results were shown with a focus only on where significant differences lie.

Although conclusions accurately reflect findings, unfortunately there is a lack of discussion/focus on data that show any statistically significant differences.

In summary, this study shows a number of original and novel features regarding study design, aims and aspects of methodology however does not properly highlight or discuss in depth data which is of statistical significance.

So, no consideration was given to the fact that this is a pilot study and that p-values can’t really be used for much in the first place (which we state explicitly, with several references, in the methods section), and we are asked to focus on statistically significant findings. Back to square one, then.

2 Likes

Too bad the reviewers are so entrenched in the status quo.

2 Likes

Yes, but I am not really surprised and expect some opposition in this for some time. We’ll see how it goes in the next round, but we have made sure to make these points even clearer now.

3 Likes

I wanted to bump this older, but still very important, thread with a link to a recently published paper by @zad and @Sander on the language used to discuss the classical textbook frequentist methods that are so routinely misused that it led to this discussion. An earlier draft was mentioned above, but it looks like important improvements have been made that make it worth further study.

5 Likes

Would the following be considered valid for insertion in a paper for example?
Compared to A, the estimated effect of B indicated improvement in blood pressure (mean difference -4; 95% CI -13, 5) but with limited evidence against our model hypothesis of equivalence between A and B.

Should “limited evidence” refer only to the P-value, or also to the clinical importance of the difference? If the mean difference was, say, -12 mmHg (-30, 6) and P = 0.2, would this statement need any additions?

Good questions. There can be strong evidence for a nonzero treatment effect but weak evidence for a clinically noticeable effect, so being explicit is key. If the minimal clinical effect of interest is -4 mmHg and the prior distribution is smooth and fairly symmetric, then with your example data having a difference of -4, the probability of clinically meaningful or greater efficacy is near 0.5, which helps put the finding in context.

But to your questions: it would be nice if we didn’t have to include the p-value. Assuming the journal forces us to include it, your initial statement is good, or you could say: Compared to A, we observed improvement in blood pressure (mean difference -4; 0.95 CI [-13, 5] mmHg) but with limited evidence (p=0.2, assuming our statistical model) against the supposition of zero difference at the current sample size.

Note the frequentist limitation of having to concentrate on evidence against vs. evidence for. I would rather see \Pr(\Delta < 0), \Pr(\Delta < -4), \Pr(-4 \leq \Delta \leq 4) with transparency about the prior.
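To see what those quantities look like numerically, here is a minimal sketch (mine, not from the posts above) for the example in this exchange. It back-calculates a standard error from the 95% CI of -13 to 5 and assumes a flat prior with a normal approximation to the posterior; with an informative prior the numbers would of course shift, which is exactly why transparency about the prior matters.

```python
# Minimal sketch: posterior probabilities for the blood-pressure example above,
# assuming a flat prior and a normal approximation, so the posterior for the
# mean difference is roughly Normal(-4, SE). The flat prior and the
# back-calculated SE are simplifying assumptions, not part of the original analysis.
from scipy.stats import norm

est, lo, hi = -4.0, -13.0, 5.0          # mean difference and 95% CI limits (mmHg)
se = (hi - lo) / (2 * 1.96)             # back-calculated standard error, about 4.59

post = norm(loc=est, scale=se)          # approximate posterior for the difference

p_any        = post.cdf(0)              # Pr(delta < 0): any reduction in blood pressure
p_meaningful = post.cdf(-4)             # Pr(delta < -4): clinically meaningful reduction
p_trivial    = post.cdf(4) - post.cdf(-4)  # Pr(-4 <= delta <= 4): near-equivalence

print(f"Pr(delta < 0)        = {p_any:.2f}")         # ~0.81
print(f"Pr(delta < -4)       = {p_meaningful:.2f}")  # ~0.50, the 'near 0.5' mentioned above
print(f"Pr(-4 <= delta <= 4) = {p_trivial:.2f}")     # ~0.46
```

Under these assumptions the probability of a clinically meaningful or greater reduction comes out at about 0.5, matching the figure quoted above.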

2 Likes

Thanks, yes, the current sample size is important so that clarifies things. Will see how the clinical journal editors respond - hopefully they will accept (with the covering letter as above) and not insist on the status quo.

What do you think about limiting the use of the term “evidence” strictly to discussions of the relative fit of two models to data (as is done via the likelihood ratio) and referring to the information contained in the p-value as “surprisal”, as @sander recommends? I’m beginning to think that would clear up much confusion. If I had to describe results with high surprisals, I’d use the language “sufficiently surprising if the assumed model is true.”

AFAICT, the surprisal is inversely associated with the likelihood ratio in the simple case of an asserted test hypothesis H_a and its complement H_{\neg a}, in the sense that a high surprisal (low p) indicates that there (probably) exists a better model (a model with higher likelihood) for the current data.

The term “surprisal” works when p-values are scaled to units of information relative to the assumed model, but I struggle to come up with a term when converting them to units that assume the alternative (i.e. the probit transform \Phi^{-1}(1-p)).
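As a concrete illustration, here is a small sketch (my own) comparing the two scalings for a few p-values: the S-value -log2(p) in bits, and a probit transform. Treating the p-value as one-sided and using \Phi^{-1}(1-p) is my reading of the conversion discussed above, not a quotation of it.

```python
# Sketch of the two rescalings discussed above. The S-value expresses a p-value
# as bits of information ("surprisal") against the test hypothesis given the
# assumed model; the probit form puts a one-sided p-value on the scale of a
# standard-normal test statistic (my assumption about the intended conversion).
import math
from scipy.stats import norm

def s_value(p: float) -> float:
    """Surprisal in bits: -log2(p), the number of consecutive heads from a fair
    coin that would be as surprising as this p-value under the assumed model."""
    return -math.log2(p)

def probit_evidence(p_one_sided: float) -> float:
    """Probit-scale statistic Phi^{-1}(1 - p) for a one-sided p-value."""
    return norm.ppf(1 - p_one_sided)

for p in (0.2, 0.05, 0.005):
    print(f"p = {p:<6}  S = {s_value(p):5.2f} bits   "
          f"probit (p taken as one-sided) = {probit_evidence(p):5.2f}")
```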

1 Like

I’ll be interested in Frank’s reply. Mine:

  1. I would never limit a vague term like “evidence” to model comparisons, especially when the bulk of Fisherian literature (including Fisher) refers to small p as evidence against its generative model - and it is evidence by any ordinary-language meaning of “evidence”, with no single comparator.
  2. I would also modify your phrase to say “surprising if the assumed model had generated the data”, to emphasize the hypothetical nature of the model - in ‘soft-science’ modeling we rarely have “true” models in the ordinary meaning of “true” and so labeling them such invites confusion; but we can still do thought experiments with models to gauge how far off the data are from them, which is the source of P-values and frequentist arguments.
  3. If p<1 there are always “better models” such as saturated models which fit the data perfectly, having p=1 and likelihood=maximum possible under background constraints. I think this illustrates how P-values and likelihoods are insufficient for model selection (at the very least we need penalization for complexity, as in AIC, and for implausibility, as with informative priors).
  4. Finally, a point needing clarification: Why do you want to convert a P-value in the way you indicated? If you have an alternative comparable to your test model, take the P-value and S-value for the alternative to get the information against that alternative. And if you want to compare models, then use a measure for that purpose (which again is not a P-value; a model comparison needs to account for complexity and plausibility).
3 Likes

I would never limit a vague term like “evidence” to model comparisons, especially when the bulk of Fisherian literature (including Fisher) refers to small p as evidence against its generative model - and it is evidence by any ordinary-language meaning of “evidence”, with no single comparator.

While I find Richard Royall’s arguments for likelihood pretty strong, the distinction between “surprise” and “evidence” is made by others who work with Fisher’s methods.

The term “evidence” seems overloaded when it refers to p-values and likelihoods. It leads to unproductive discussions on how p-values “overstate” the evidence. If a distinction is made between surprise and direct measures of evidence via likelihood or Bayes’ Factors, we can prevent the p-value from being blamed for a simple mistake in interpretation.

Finally, a point needing clarification: Why do you want to convert a P-value in the way you indicated?

Kulinskaya, Morgenthaler, and Staudte wrote a text on meta-analysis that emphasizes transforming the p-value (or t-statistic) to the probit scale, i.e. \Phi^{-1}(1-p), where \Phi is the standard normal cumulative distribution function. Also writing in the Fisher tradition, they make the distinction between p-values as measures of surprise and “evidence”, which they measure in probits, connecting Fisher and the Neyman-Pearson school of thought.

On another thread, you posted a link to a video where you, Jim Berger, and Robert Matthews discussed p-values. I’ve found all of the suggestions – the -e \times p \times \ln(p) bound, the Bayesian Rejection Ratio, the S-value, and the Reverse-Bayes method – valuable.
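For anyone who wants to try the first of those on their own results, here is a minimal sketch of the -e \times p \times \ln(p) calibration as I understand it (Sellke, Bayarri & Berger): for p < 1/e it gives a lower bound on the Bayes factor for the null versus any alternative in a broad class, i.e. an upper bound on how much evidence against the null a p-value can represent. The equal-prior-odds posterior probability is shown only to illustrate the bound, not as a recommended prior.

```python
# Sketch of the -e*p*ln(p) calibration: a lower bound on the Bayes factor
# (null : alternative) for 0 < p < 1/e; the 1:1 prior odds used below are an
# illustrative assumption only.
import math

def bf01_lower_bound(p: float) -> float:
    """Lower bound -e * p * ln(p) on the Bayes factor (null : alternative)."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

for p in (0.05, 0.01, 0.005):
    b01 = bf01_lower_bound(p)
    post_h0 = b01 / (1.0 + b01)   # P(H0 | data) under 1:1 prior odds, for illustration
    print(f"p = {p:<6}  BF(null:alt) >= {b01:.3f}   P(H0 | data) >= {post_h0:.3f}")
```

For p = 0.05 the bound is about 0.41, so with even prior odds the null retains a posterior probability of at least roughly 0.29, which is one way of seeing why a p-value near 0.05 is weaker evidence than it is often taken to be.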

I’ve given much thought to how to use these techniques when reading not only primary studies but also meta-analyses where aggregate effect sizes are naively computed. I posted a re-examination of a recent meta-analysis of predicting athletes at higher risk for ACL injury in this thread:

I like everything that’s being said. And I like Sander’s distinction of a model from the model. I like surprise or relative evidence as terms. In terms of how p-values overstate (relative) evidence, we need to footnote p-values thus: P-values are probabilities about data extremes under a putative generative model, not about effects.

3 Likes

I recently thought of this, as well, and so am glad to see your independent—and earlier—advocacy to use the term “discernible”! Here’s my own take: “Significant? You Really Mean Discernible | Two common wrong phrases about statistical significance” https://towardsdatascience.com/significant-you-really-mean-detectable-b3e8819e3491?sk=e521f9c6fb98e7a953876581c58f10e5

I don’t think that the author of the link you referenced does a much better job. In his “correct interpretations” he still conveys the idea that statistical significance is about an effect. It is not. It is about the data under an assumed model (with a particular restriction on that “effect” [which might be just zero, or any other value]). He could write that the observed data (or a test statistic) are statistically significant under the restricted model. It’s the first step into the trap of confusing the observed data, or an observed test statistic (be it a difference, an odds ratio, or a slope), with a model parameter. It is not, and cannot be, the parameter value that is statistically significant.

This misconception seems to lead the author to the wrong (well, at least problematic) statement that a statistically significant result would indicate that the statistic “calculated from the sample is a good estimate of the true, unknown [effect]” (emphasis mine). It does not. The quality of the estimation is not addressed by a test, or it is addressed only to the minimal requirement that the estimate is discernible from the value at which the model parameter was restricted. This is no indication of a “good” or “bad” estimate. It is simply good enough to decide whether we may claim that the value of the parameter is above or below the restricted value (which may be “good enough” for the purpose, though). The actual point estimate can still be way off the mark (something that could better be assessed by a confidence [or compatibility] interval).

Then I see another problem in the part “Detailed Reasoning”. Here the author more correctly states that the p-value is a statistic calculated from the data under the restriction (although he is not very clear about that), but then he writes that the p-value is just an approximation. To what? As the data can be seen as the realization of a random variable, the p-value calculated from such data can be seen as the realization of a random variable; let’s call this variable P. If the test is correctly specified, we know that the distribution of P is uniform under the restricted model, and that it is right-skewed under the desired violation of that model. This is all we can say.

Any observed p-value is just a realization (not an approximation) of P - but we don’t know the actual distribution of P. A p close to zero is taken as evidence that the distribution of P is right-skewed and that the restricted model is violated / inappropriate to describe the data. In particular, under the correctly specified model, the p-values are expected to jump around wildly over the entire domain of P (between 0 and 1), no matter how large the sample size is. There is nothing the p-value “approximates”. Only if the model is violated in the desired way might we see the p-value as an approximation - of the value zero (the limiting value that would be approached as the sample size goes to infinity).
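A quick simulation (mine, not from the post above) illustrates that behaviour with a two-sample t-test; the group size of 50, the 0.5 SD true difference, and the number of repetitions are arbitrary choices for illustration. Under the restricted zero-difference model the p-values are roughly uniform on (0, 1), while under a violation of that model their distribution piles up near zero.

```python
# Sketch: distribution of two-sample t-test p-values under the restricted
# (zero-difference) model versus a violation of it. Sample size, effect size,
# and number of simulations are arbitrary illustrative choices.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2024)
n_per_group, n_sim = 50, 5000

def simulate_pvalues(true_diff: float) -> np.ndarray:
    """p-values from repeated two-sample t-tests on normal data with the given true mean difference."""
    pvals = np.empty(n_sim)
    for i in range(n_sim):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_diff, 1.0, n_per_group)
        pvals[i] = ttest_ind(a, b).pvalue
    return pvals

for diff in (0.0, 0.5):
    p = simulate_pvalues(diff)
    print(f"true difference = {diff}: share of p < 0.05 = {np.mean(p < 0.05):.3f}, "
          f"quartiles = {np.round(np.percentile(p, [25, 50, 75]), 3)}")

# With diff = 0 the quartiles come out near 0.25 / 0.50 / 0.75 (uniform) and about
# 5% of p-values fall below 0.05; with diff = 0.5 the distribution is right-skewed
# and piles up near zero. In the first case no single p-value "approximates" anything.
```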

3 Likes