Has this JAMA Patient Page at last threaded the needle of explaining p-values?


#1

In a recent JAMA Oncology Patient Page [1], oncologist H. Jack West and biostatistician Suzanne Dahlberg explain several aspects of cancer clinical trials. I was struck in particular by the description they crafted of p-values:

Statistics Let Us Describe Differences in Outcomes Between Groups

Results for different end points are compared in a few ways to assess whether one strategy is better than another. Differences are considered statistically significant if the probability of the differences seen occurring by chance, based on mathematical calculations, is below a threshold usually defined as 5%: this is also reflected as a P value of .05 or lower (1.00 representing 100% probability of chance explaining the findings).

If one must describe P values in other than scourging terms, this seems to me about as good a job as one could do. Have I overlooked something? It does not escape me how delicate this description is. I note for example how it would collapse with the subtlest change of verb tense—from “occurring” to “having occurred”. Also, I appreciate there’s an awful lot packed into the deus ex machina of “based on mathematical calculations.” But still … wasn’t this as good a job as can be done with this thankless task?

How would you describe P values to a cancer patient advocate who is (as many are) laboring to read the trial literature and make sense of it?

  1. West H, Dahlberg S. Clinical Trials, End Points, and Statistics—Measuring and Comparing Cancer Treatments in Practice. JAMA Oncol. September 2018. doi:10.1001/jamaoncol.2018.3708

#2

A p-value is a sample statistic. It varies from sample to sample. Under a mathematical model that captures all relevant aspects of the data-generating process except the difference (e.g. between groups), we expect such p-values to have a uniform distribution on [0,1]. If the model included a non-zero difference, the distribution of the p-values would become right-skewed, that is, smaller p-values would be more likely than larger ones. A small observed p-value from your sample indicates that your data are either an unexpected or unlikely result of the model that does not consider a difference, or a more likely result of a model that does consider a difference.

Having a (very) small p-value indicates that the information in the data at hand w.r.t. the model not considering a difference is (clearly) sufficient to demonstrate that the model is unable to explain the data well (and that the data are better explained by the model including a difference).

There is nothing that needs to “occur by chance”. It’s only about the likelihood of data under different models.
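
The claim about how p-values are distributed under the two models can be checked by simulation. What follows is a minimal sketch, not a definitive analysis: it assumes two groups of 50 with normal outcomes and known standard deviation 1, a two-sided z-test, and a hypothetical true difference of 0.5. All of these numbers are illustrative choices and come from no trial.

```python
import random
from statistics import NormalDist

def simulate_p_values(true_diff, n_per_group=50, n_trials=5000, seed=1):
    """Two-sided z-test p-values from repeated simulated two-group trials.

    Assumes normal outcomes with known standard deviation 1 in both groups.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = (2.0 / n_per_group) ** 0.5  # standard error of the difference in means
    p_values = []
    for _ in range(n_trials):
        mean_a = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(true_diff, 1.0) for _ in range(n_per_group)) / n_per_group
        z = (mean_b - mean_a) / se
        p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return p_values

def decile_shares(p_values):
    """Fraction of p-values falling in each tenth of [0, 1]."""
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1
    return [c / len(p_values) for c in counts]

null_p = simulate_p_values(true_diff=0.0)  # model without a difference
alt_p = simulate_p_values(true_diff=0.5)   # hypothetical non-zero difference

print("no difference:", [round(s, 2) for s in decile_shares(null_p)])
print("difference   :", [round(s, 2) for s in decile_shares(alt_p)])
```

Under the no-difference model each tenth of [0,1] should hold roughly 10% of the p-values; with the assumed difference, most of them pile up in the lowest tenth.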


#3

“seen occurring by chance” can be improved. This is a bit confusing.


#4

At a purely linguistic level, I agree one could read “seen” as modifying either “differences” or “occurring.” Would “probability of chance occurrence of differences as large as those seen” resolve the confusion?


#5

This ignores the fact that the P-value is calculated assuming no uncontrolled confounding, selection bias or measurement error exists that might also explain the findings.


#6

@TimWatkins, I think, has put a finger on the most bewildering statement. The commonsense interpretation of the statement is that chance was absolutely certainly the cause of the findings. It ignores the fact that the probability of any single finding is vanishingly small in many cases. The probability in question is that of observing an effect size at least as big as the one actually observed.
If the P-value is 1, then all possible effect sizes must be greater than or equal to the observed effect size, which must then be zero.
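
To make the point about P = 1 concrete, here is a small sketch under a normal approximation. The function name `two_sided_p` and the z values are illustrative assumptions, not anything taken from the JAMA page.

```python
from statistics import NormalDist

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic z.

    This is the probability, under the no-difference model, of a statistic
    at least as far from zero as the one observed.
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

for z in (0.0, 0.5, 1.96, 3.0):
    print(f"z = {z:4.2f}  ->  p = {two_sided_p(z):.4f}")
```

The p-value reaches 1 only at z = 0, that is, when the observed difference is exactly zero. It describes how extreme the data are, not how probable it is that chance explains them.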

So how do we explain this to a patient? I find it hard enough to explain it to researchers!


#7

Agree! This does stick out like a sore thumb now that you point it out. Worse, this parenthetical remark lets the past tense slip in here. That invites interpretation as a (subjective) Bayesian posterior probability, precipitating the very same “collapse” I mentioned in my OP.


#8

Ronan, I like the way you’ve dug into this. I’m reminded of the common trope, “It’s Not The Crime, It’s The Cover-Up.” (This is why, with sympathy for authors West & Dahlberg, I described their undertaking as a “thankless task!”) It’s almost as if, no matter what you do, explaining p-values inevitably degenerates into a ‘conceptual cover-up’, and you’re bound to trip up and get caught somehow.


#9

I think the underlying deficiencies of indirect inference come back to bite us when trying to explain to a layperson. People are really good at understanding risk when it comes to direct arguments such as chance of winning a poker hand or your team winning a game. Frequentist statistics did what was computationally feasible at the time, and had to simplify the model in order to do that. I see the null hypothesis as a simplifying assumption. But back to the question at hand, something like the following may work.

The p-value of 0.049 indicates that were the two treatments equivalent (interchangeable), the data observed in the study were somewhat surprising in favor of the new treatment. Specifically, we’d expect 1000 repetitions of clinical trials with equivalent treatments to provide data more surprising than this study’s data in only 49 trials. This doesn’t prove the treatments have different effects on patients in general, or that they don’t, but indicates some degree of evidence against the assumption they are equivalent.

Now do a survey and ask 200 patients to select the more interpretable statement of these two (after better crafting the wording):

  1. If treatments A and B result in the same life expectancy for patients with cancer, one would expect 49 of 1000 repeats of the clinical trial to result in data more favorable to treatment B than what was observed in this study.
  2. If one were to assume before the study that treatment B is just as likely to shorten lifespan as it is to lengthen it, and that the probability that the treatment more than doubles lifespan is only 1 in 20, the study’s data indicate a chance of 0.97 (970 of 1000) that treatment B is beneficial for this cancer. [Could also add: The data also indicate a chance of 0.8 that B increases lifespan by more than 3 months.]
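
The prior described in statement 2 can be written down explicitly. The sketch below is one possible reading, under assumptions of my own: a normal prior on the log of the lifespan ratio (B relative to A), a normal approximation to the trial estimate, and a made-up estimate and standard error. It is not meant to reproduce the illustrative 0.97 in the statement.

```python
import math
from statistics import NormalDist

# Prior on the log lifespan ratio (B relative to A), assumed normal.
# Mean 0 encodes "just as likely to shorten lifespan as to lengthen it".
# The sd is chosen so that P(ratio > 2) = 1/20.
prior_sd = math.log(2.0) / NormalDist().inv_cdf(0.95)
prior = NormalDist(mu=0.0, sigma=prior_sd)

# Hypothetical trial summary (NOT from any study): estimated log ratio and its SE.
estimate, std_error = 0.30, 0.15

# Conjugate normal-normal update: precisions add, means are precision-weighted.
prior_precision = 1.0 / prior_sd ** 2
data_precision = 1.0 / std_error ** 2
post_var = 1.0 / (prior_precision + data_precision)
post_mean = post_var * (data_precision * estimate)  # prior mean is 0
posterior = NormalDist(mu=post_mean, sigma=math.sqrt(post_var))

prob_beneficial = 1.0 - posterior.cdf(0.0)
print(f"prior sd of log ratio: {prior_sd:.3f}")
print(f"P(B lengthens lifespan | data): {prob_beneficial:.3f}")
```

The output is a direct probability statement about the treatment effect, which is the kind of statement the survey would be comparing against the repeated-trials wording of statement 1.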

#10

As veterinarians, we don’t see as many large RCTs as MDs do. I am somewhat confused when I read a synopsis like the one below. It seems to me that they have conflated statistical significance with clinical significance in the way they word their conclusions.

In 56 recruited SISO horses, either flunixin meglumine (1.1 mg/kg, i.v., q12h) or firocoxib (0.3 mg/kg, i.v. loading dose; 0.1 mg/kg, i.v., q24h) was given in the post-operative period in three university hospitals from 2015 to 2017. COX-2 selectivity was confirmed by a relative lack of inhibition of the COX-1 prostanoid TXB2 by firocoxib and significant inhibition by flunixin meglumine (P = 0.014). Both drugs inhibited the COX-2 prostanoid PGE2 . There were no significant differences in pain scores between groups (P = 0.2). However, there was a 3.23-fold increased risk (P = 0.04) of increased plasma sCD14 in horses treated with flunixin meglumine, a validated biomarker of equine endotoxaemia.


#11

I don’t see them discussing clinical significance per se, but they are caught up in traditional unhelpful language, with a subtle implication that they are making hay of one effect being “insignificant” in contrast to one that is “significant”. “No significant difference” is near the top of the bullshit level. A suggested rewording: The data do not support the equivalence of firocoxib and flunixin meglumine (0.95 compatibility interval [x, y], p=0.014). The data provided little evidence against the assumption of identical pain scores (0.95 compatibility interval [x, y], p=0.2) but modest evidence against the assumption of equal risk of increased plasma sCD14 (risk ratio 3.23, 0.95 compatibility interval [x, y], p=0.04).

This language would be better if we could find a way to remove all judgments about evidence levels.

Note that there is a major flaw in the statistical analysis. Plasma sCD14 should not be merely judged as increasing or decreasing. The compatibility interval, point estimate, and p-value should be based on the amount of change. Risk ratios should not be used here.


#12

Thank you.

This was the explanation for the evaluation of plasma sCD14:

As plasma markers of endotoxaemia, the incidence of sCD14 exceeding the established reference range and TNFa exceeding zero were recorded and compared (Table 3) [7]. There were no significant differences in TNFa levels between the two groups at any time period (Fig 4a, P = 0.50) or in the proportion of horses with detectable plasma TNFa in each group (Fig 4b, P = 0.44). In addition, there were no significant differences in mean sCD14 levels between the two groups at any time period (Fig 4a, P = 0.5). However, at 48 h post-operatively, 26.9% of horses treated with flunixin meglumine had sCD14 levels exceeding the upper limit of the reference range as compared with 8.33% of horses treated with firocoxib (Fig 4c, d, P = 0.044). This corresponds to a 3.23-fold increased risk of increased biomarker of endotoxaemia in horses treated with flunixin meglumine as compared with those treated with firocoxib (CI 0.86–12.94, P = 0.04).
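
For readers who want to see where a figure like 3.23 comes from, here is a hedged sketch of a risk ratio with a Wald interval on the log scale. The counts 7 of 26 and 2 of 24 are an assumption: they are not reported in the excerpt and were picked only because they reproduce the quoted 26.9% and 8.33%. The paper's interval may have been computed by a different method, so the interval below need not match it.

```python
import math
from statistics import NormalDist

def risk_ratio_wald(events_a, n_a, events_b, n_b, level=0.95):
    """Risk ratio (group A vs group B) with a Wald interval on the log scale."""
    rr = (events_a / n_a) / (events_b / n_b)
    # Standard large-sample standard error of log(RR).
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper

# Ratio of the two quoted percentages.
print(round(26.9 / 8.33, 2))

# ASSUMED counts (hypothetical): 7/26 flunixin horses vs 2/24 firocoxib horses.
rr, lower, upper = risk_ratio_wald(7, 26, 2, 24)
print(f"RR = {rr:.2f}, 0.95 interval [{lower:.2f}, {upper:.2f}]")
```

With so few events the interval is very wide, which is part of why reducing sCD14 to above or below a cutoff throws away information compared with analysing the amount of change.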


#13

Not an appropriate statistical analysis IMHO


#14

So perhaps we should stop for futility. Indeed, given how contentious p-values have become, any attempt to present them as uncontroversial must necessarily mislead laypersons. Pondering the feedback on this thread, I now think it better to convey the scientific purpose served by p-values without attempting to detail the underlying ‘theory’ such as it is. Thus, I’m currently thinking along these lines:

In describing the scientific attitude, physicist Richard Feynman famously said, “The first principle is that you must not fool yourself—and you are the easiest person to fool.” When comparing different treatments in a trial, medical scientists know the easiest way to fool themselves would be to hand-pick which patients get which treatments. So they take a hands-off approach and let random numbers make these decisions instead. Randomizing treatment assignment in this way helps medical scientists avoid fooling themselves, but it does create a new risk: They might get fooled by chance itself, simply from ‘bad luck’ in the randomization. Fortunately, Statistics—the ‘science of randomness’—can help manage such risk, at least over the long run.

In the medical sciences, we traditionally aim to let chance fool us into spuriously ‘finding something new’ in at most 5% (0.05) of trials. The statistical procedure for establishing this long-run limit on our ‘bad luck’ involves calculating a special type of statistic called a p-value. This is a number between 0 and 1, and depends both on the data that we collected, and on how we designed our trial. The more the outcomes differ between treatments in our trial, and the better our trial’s design kept randomness in check, the smaller the p-value will be. P-values very close to 0 indicate that the differences seen in our data were too large to be explained away as ‘pure chance’. Customarily, when the p-value is greater than 0.05, scientists will acknowledge that the differences seen between the treatment groups could very well be due to chance. But when p < 0.05, scientists commonly call the differences a “statistically significant finding.”

The justification for this procedure has grown increasingly controversial, with many scientists and statisticians arguing that it is prone to abuse, and that even some of the associated language (like “statistical significance”) has become confusing and harmful. The most important thing for you to know as a patient reading the medical literature, is that p-values are a tool specifically to guard against being fooled by randomness, but that there are other ways to get fooled that p-values don’t protect us from.

My provisional conclusion, then, is that it may make no more sense to explain the theory of p-values than to explain any other recondite technicality of statistical practice, such as how pseudorandom number generation is done. What I have up there still seems awfully wordy, and I would prefer to take the size of West & Dahlberg’s paragraph as a target.
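
For anyone who does want to see the machinery behind "being fooled by randomization", a re-randomization (permutation) test is probably the most direct illustration. This is a minimal sketch with made-up outcome data; the function name, the number of shuffles, and the data are all illustrative assumptions.

```python
import random

def randomization_p_value(treated, control, n_shuffles=10000, seed=0):
    """Monte Carlo re-randomization p-value for a difference in means.

    Asks: if treatment labels had no effect, how often would random
    reassignment alone give a difference at least as large as observed?
    """
    rng = random.Random(seed)
    n_t, n_c = len(treated), len(control)
    observed = abs(sum(treated) / n_t - sum(control) / n_c)
    pooled = list(treated) + list(control)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # re-run the "coin flips" that assigned treatment
        diff = abs(sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value strictly above zero.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical outcomes (e.g. months of follow-up), not from any trial.
treated = [14.1, 9.8, 12.5, 16.0, 11.2, 13.7, 15.3, 10.9]
control = [10.2, 8.7, 11.9, 9.5, 12.8, 7.6, 10.1, 9.0]
print(f"p = {randomization_p_value(treated, control):.4f}")
```

Nothing here relies on a distributional formula: the only randomness is the treatment assignment itself, which is the source of chance the draft text above describes.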


#15

As far as it goes - as far as it can go - JAMA’s wording isn’t bad. Point noted re ‘occurring’ rather than ‘having occurred’. There are two closely related, important omissions.
First, ‘as extreme as, or more extreme than, the observed outcome’. This isn’t necessarily trivial to define - it arose in the continuous case, where it is clear, but for things like Fisher’s exact test there’s more than one way to order the possible data outcomes.
Second, nowadays P-values are assumed by default to be two-sided unless the context makes that obviously inappropriate, viz. equivalence / non-inferiority.
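
The ambiguity in Fisher's exact test can be shown with a few lines. The sketch below compares two common two-sided conventions on a hypothetical 2x2 table (8, 2 / 1, 5) that I made up for illustration: summing all tables no more probable than the observed one, versus doubling the smaller one-sided tail.

```python
from math import comb

def fisher_p_values(a, b, c, d):
    """One-sided and two flavours of two-sided Fisher exact p-values.

    Table layout: rows (a, b) and (c, d); margins are treated as fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Hypergeometric probability of each possible top-left cell value k.
    probs = {k: comb(row1, k) * comb(row2, col1 - k) / total
             for k in range(k_min, k_max + 1)}
    lower = sum(p for k, p in probs.items() if k <= a)
    upper = sum(p for k, p in probs.items() if k >= a)
    one_sided = min(lower, upper)
    # Convention 1: all tables at most as probable as the observed one.
    by_probability = sum(p for p in probs.values() if p <= probs[a] * (1 + 1e-9))
    # Convention 2: double the smaller tail, capped at 1.
    doubled = min(1.0, 2.0 * one_sided)
    return one_sided, min(1.0, by_probability), doubled

one_sided, by_probability, doubled = fisher_p_values(8, 2, 1, 5)
print(f"one-sided {one_sided:.4f}, by probability {by_probability:.4f}, doubled {doubled:.4f}")
```

For this table the two conventions give about 0.035 and 0.049 respectively, so "as extreme or more extreme" really does depend on how the outcomes are ordered.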


#16

This definition of the p-value is wrong. “A p-value of 0.05 means that there is only a 5% chance that these findings occurred by chance” equals (imho): “We are 95% sure that this effect is real”.

However, it is not possible to conclude whether an effect is real based on data alone. See for instance this epistemological paper:
https://quod.lib.umich.edu/p/phimp/3521354.0016.007/1/--why-i-am-not-a-likelihoodist