In a recent JAMA Oncology Patient Page [1], oncologist H. Jack West and biostatistician Suzanne Dahlberg explain several aspects of cancer clinical trials. I was struck in particular by the description they crafted of p-values:
Statistics Let Us Describe Differences in Outcomes Between Groups
Results for different end points are compared in a few ways to assess whether one strategy is better than another. Differences are considered statistically significant if the probability of the differences seen occurring by chance, based on mathematical calculations, is below a threshold usually defined as 5%: this is also reflected as a P value of .05 or lower (1.00 representing 100% probability of chance explaining the findings).
If one must describe P values in other than scourging terms, this seems to me about as good a job as one could do. Have I overlooked something? It does not escape me how delicate this description is. I note for example how it would collapse with the subtlest change of verb tense, from "occurring" to "having occurred". Also, I appreciate there's an awful lot packed into the deus ex machina of "based on mathematical calculations." But still ... wasn't this as good a job as can be done with this thankless task?
How would you describe P values to a cancer patient advocate who (as many do) labors to read the trial literature and make sense of it?
West H, Dahlberg S. Clinical Trials, End Points, and Statistics: Measuring and Comparing Cancer Treatments in Practice. JAMA Oncol. September 2018. doi:10.1001/jamaoncol.2018.3708
A p-value is a sample statistic: it varies from sample to sample. Under a mathematical model that captures all relevant aspects of the data-generating process except the difference of interest (e.g. between groups), we expect such p-values to have a uniform distribution on [0,1]. If the model instead included a non-zero difference, the distribution of the p-values would become right-skewed; that is, smaller p-values would be more likely than larger ones. A small observed p-value therefore indicates that your data are an unexpected or unlikely result under the model without a difference, and a more likely result under a model that includes one.
A (very) small p-value indicates that the information in the data at hand is (clearly) sufficient to demonstrate that the no-difference model explains the data poorly (and that the data are better explained by the model that includes a difference).
There is nothing that needs to "occur by chance". It's only about the likelihood of the data under different models.
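To make the uniform-versus-right-skewed claim concrete, here is a small simulation sketch. Everything in it is my own invention for illustration: a two-group z-test with known SD 1 and 30 observations per arm.

```python
import math
import random

def two_sided_p(x, y):
    """Two-sided z-test p-value for equal means, known SD = 1."""
    n = len(x)
    se = math.sqrt(2.0 / n)
    z = (sum(y) / n - sum(x) / n) / se
    # Standard normal CDF via the error function
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def simulate(diff, n=30, trials=4000, rng=random.Random(1)):
    """Collect p-values from repeated trials with a true mean difference `diff`."""
    ps = []
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(diff, 1.0) for _ in range(n)]
        ps.append(two_sided_p(x, y))
    return ps

null_ps = simulate(0.0)  # model with no difference
alt_ps = simulate(0.5)   # model with a true difference of 0.5 SD

# Under the no-difference model, p-values are (approximately) uniform:
print(sum(p < 0.05 for p in null_ps) / len(null_ps))  # close to 0.05
print(sum(p < 0.50 for p in null_ps) / len(null_ps))  # close to 0.50
# Under the difference model the distribution is right-skewed:
print(sum(p < 0.05 for p in alt_ps) / len(alt_ps))    # well above 0.05
```

Under the no-difference model, any threshold t catches roughly a fraction t of the simulated p-values (uniformity); under the difference model, small p-values become far more common.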
At a purely linguistic level, I agree one could read "seen" as modifying either "differences" or "occurring." Would "probability of chance occurrence of differences as large as those seen" resolve the confusion?
This ignores the fact that the P-value is calculated assuming no uncontrolled confounding, selection bias or measurement error exists that might also explain the findings.
I think @TimWatkins has put a finger on the most bewildering statement. The commonsense interpretation of the statement is that chance was absolutely certainly the cause of the findings. It ignores the fact that the probability of any single finding is, in many cases, vanishingly small. The probability is that of observing an effect size at least as big as the observed effect size.
Since the P-value is 1, all possible effect sizes must be greater than or equal to the observed effect size, which must, then, be zero.
So how do we explain this to a patient? I find it hard enough to explain it to researchers!
Agree! This does stick out like a sore thumb now that you point it out. Worse, this parenthetical remark lets the past tense slip in here. That invites interpretation as a (subjective) Bayesian posterior probability, precipitating the very same "collapse" I mentioned in my OP.
Ronan, I like the way you've dug into this. I'm reminded of the common trope, "It's Not The Crime, It's The Cover-Up." (This is why, with sympathy for authors West & Dahlberg, I described their undertaking as a "thankless task"!) It's almost as if, no matter what you do, explaining p-values inevitably degenerates into a "conceptual cover-up", and you're bound to trip up and get caught somehow.
I think the underlying deficiencies of indirect inference come back to bite us when trying to explain to a layperson. People are really good at understanding risk when it comes to direct arguments such as chance of winning a poker hand or your team winning a game. Frequentist statistics did what was computationally feasible at the time, and had to simplify the model in order to do that. I see the null hypothesis as a simplifying assumption. But back to the question at hand, something like the following may work.
The p-value of 0.049 indicates that, were the two treatments equivalent (interchangeable), the data observed in the study were somewhat surprising in favor of the new treatment. Specifically, among 1000 repetitions of clinical trials with equivalent treatments, we'd expect data more surprising than this study's data in only 49 trials. This doesn't prove the treatments have different effects on patients in general, or that they don't, but it indicates some degree of evidence against the assumption that they are equivalent.
Now do a survey and ask 200 patients to select the more interpretable statement of these two (after better crafting the wording):
If treatments A and B result in the same life expectancy for patients with cancer, one would expect 49 of 1000 repeats of the clinical trial to result in data more favorable to treatment B than what was observed in this study.
If one were to assume before the study that treatment B is just as likely to shorten lifespan as to lengthen it, and that the probability of the treatment more than doubling lifespan is only 1 in 20, the study's data indicate a chance of 0.97 (970 of 1000) that treatment B is beneficial for this cancer. [Could also add: The data also indicate a chance of 0.8 that B increases lifespan by more than 3 months.]
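The first statement can itself be checked by brute force. Here is a hedged sketch with a wholly hypothetical trial (normal outcomes, 40 patients per arm, and an observed z-score chosen so the one-sided p is about 0.049):

```python
import math
import random

rng = random.Random(7)
n = 40          # hypothetical patients per arm (my choice, not from the thread)
z_obs = 1.655   # observed z-score corresponding to a one-sided p of about 0.049

# Repeat the trial many times assuming the treatments are truly equivalent
repeats = 20000
more_favorable = 0
for _ in range(repeats):
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = (sum(b) / n - sum(a) / n) / math.sqrt(2.0 / n)
    if z > z_obs:  # data even more favorable to treatment B than observed
        more_favorable += 1

frac = more_favorable / repeats
print(frac)  # close to 0.049, i.e. about 49 in 1000 repeats
```

The simulated fraction of equivalent-treatment trials producing data more favorable to B hovers around 49 in 1000, which is exactly what the reworded statement promises.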
As a veterinarian I don't see as many large RCTs as MDs do. I am somewhat confused when I read a synopsis like the one below. It seems to me that they have conflated statistical significance with clinical significance in the way they word their conclusions.
In 56 recruited SISO horses, either flunixin meglumine (1.1 mg/kg, i.v., q12h) or firocoxib (0.3 mg/kg, i.v. loading dose; 0.1 mg/kg, i.v., q24h) was given in the post-operative period in three university hospitals from 2015 to 2017. COX-2 selectivity was confirmed by a relative lack of inhibition of the COX-1 prostanoid TXB2 by firocoxib and significant inhibition by flunixin meglumine (P = 0.014). Both drugs inhibited the COX-2 prostanoid PGE2. There were no significant differences in pain scores between groups (P = 0.2). However, there was a 3.23-fold increased risk (P = 0.04) of increased plasma sCD14 in horses treated with flunixin meglumine, a validated biomarker of equine endotoxaemia.
I don't see them discussing clinical significance per se; they are just caught up in traditional unhelpful language, with a subtle implication that they are making hay of one effect being "insignificant" in contrast to one that is "significant". "No significant difference" is near the top of the bullshit level. A suggested rewording: The data do not support the equivalence of firocoxib and flunixin meglumine (0.95 compatibility interval [x, y], p=0.014). The data provided little evidence against the assumption of identical pain scores (0.95 compatibility interval [x, y], p=0.2) but modest evidence against the assumption of equal risk of increased plasma sCD14 (risk ratio 3.23, 0.95 compatibility interval [x, y], p=0.04).
This language would be better if we could find a way to remove all judgments about evidence levels.
Note that there is a major flaw in the statistical analysis. Plasma sCD14 should not be merely judged as increasing or decreasing. The compatibility interval, point estimate, and p-value should be based on the amount of change. Risk ratios should not be used here.
This was the explanation given for the evaluation of plasma sCD14:
As plasma markers of endotoxaemia, the incidence of sCD14 exceeding the established reference range and TNFa exceeding zero were recorded and compared (Table 3) [7]. There were no significant differences in TNFa levels between the two groups at any time period (Fig 4a, P = 0.50) or in the proportion of horses with detectable plasma TNFa in each group (Fig 4b, P = 0.44). In addition, there were no significant differences in mean sCD14 levels between the two groups at any time period (Fig 4a, P = 0.5). However, at 48 h post-operatively, 26.9% of horses treated with flunixin meglumine had sCD14 levels exceeding the upper limit of the reference range as compared with 8.33% of horses treated with firocoxib (Fig 4c, d, P = 0.044). This corresponds to a 3.23-fold increased risk of increased biomarker of endotoxaemia in horses treated with flunixin meglumine as compared with those treated with firocoxib (CI 0.86-12.94, P = 0.04).
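For what it's worth, the 3.23-fold risk can be reproduced if one assumes denominators of 26 and 24 horses (26.9% ≈ 7/26, 8.33% = 2/24); those counts are my inference, not stated in the excerpt. A plain Wald interval on the log risk ratio will not match the paper's 0.86-12.94 exactly, which presumably came from another method:

```python
import math

# Counts inferred from the reported percentages (26.9% ~ 7/26, 8.33% = 2/24);
# these denominators are an assumption, not stated in the excerpt.
a, n1 = 7, 26   # flunixin meglumine: horses above the sCD14 reference range
b, n2 = 2, 24   # firocoxib

rr = (a / n1) / (b / n2)  # risk ratio
se_log = math.sqrt(1/a - 1/n1 + 1/b - 1/n2)  # Wald SE of log(RR)
lo = math.exp(math.log(rr) - 1.96 * se_log)
hi = math.exp(math.log(rr) + 1.96 * se_log)

print(round(rr, 2))                # matches the reported 3.23-fold risk
print(round(lo, 2), round(hi, 2))  # a plain Wald interval, wider than 0.86-12.94
```

Note that both the Wald interval and the paper's own interval include 1, which sits oddly beside the quoted P = 0.04 and suggests the p-value and the interval came from different procedures.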
So perhaps we should stop for futility. Indeed, given how contentious p-values have become, any attempt to present them as uncontroversial must necessarily mislead laypersons. Pondering the feedback on this thread, I now think it better to convey the scientific purpose served by p-values without attempting to detail the underlying "theory", such as it is. Thus, I'm currently thinking along these lines:
In describing the scientific attitude, physicist Richard Feynman famously said, "The first principle is that you must not fool yourself, and you are the easiest person to fool." When comparing different treatments in a trial, medical scientists know the easiest way to fool themselves would be to hand-pick which patients get which treatments. So they take a hands-off approach and let random numbers make these decisions instead. Randomizing treatment assignment in this way helps medical scientists avoid fooling themselves, but it does create a new risk: they might get fooled by chance itself, simply from "bad luck" in the randomization. Fortunately, Statistics, the "science of randomness", can help manage such risk, at least over the long run.
In the medical sciences, we traditionally aim to let chance fool us into spuriously "finding something new" in at most 5% (0.05) of trials. The statistical procedure for establishing this long-run limit on our "bad luck" involves calculating a special type of statistic called a p-value. This is a number between 0 and 1 that depends both on the data we collected and on how we designed our trial. The more the outcomes differ between treatments in our trial, and the better our trial's design kept randomness in check, the smaller the p-value will be. P-values very close to 0 indicate that the differences seen in our data were too large to be explained away as "pure chance". Customarily, when the p-value is greater than 0.05, scientists will acknowledge that the differences seen between the treatment groups could very well be due to chance. But when p < 0.05, scientists commonly call the differences a "statistically significant finding".
The justification for this procedure has grown increasingly controversial, with many scientists and statisticians arguing that it is prone to abuse, and that even some of the associated language (like "statistical significance") has become confusing and harmful. The most important thing for you to know as a patient reading the medical literature is that p-values are a tool specifically to guard against being fooled by randomness, but there are other ways to get fooled that p-values don't protect us from.
My provisional conclusion, then, is that it may make no more sense to explain the theory of p-values than to explain any other recondite technicality of statistical practice, such as how pseudorandom number generation is done. What I have up there still seems awfully wordy, and I would prefer to take the size of West & Dahlberg's paragraph as a target.
As far as it goes - as far as it can go - JAMA's wording isn't bad. Point noted re "occurring" rather than "having occurred". Two closely related, important omissions:
"As extreme as, or more extreme than, the observed outcome". This isn't necessarily trivial to define - it arose in the continuous case where it is clear, but for things like Fisher's exact test, there's more than one way to arrange the possible data outcomes.
Also nowadays P-values are assumed by default to be two-sided unless the context makes it obviously inappropriate, viz. equivalence / non-inferiority.
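To illustrate the first point, here is a sketch for a hypothetical 2x2 table [[3, 1], [1, 9]] (my own example), showing that two common definitions of a two-sided Fisher exact p-value ("double the smaller tail" versus "sum all tables no more probable than the one observed") need not agree:

```python
from math import comb

def fisher_ps(a, b, c, d):
    """One-sided and two two-sided p-values for a 2x2 table [[a, b], [c, d]]."""
    r1, c1, n = a + b, a + c, a + b + c + d
    denom = comb(n, c1)
    # Hypergeometric probability of every table with the same margins
    lo_k, hi_k = max(0, c1 - (n - r1)), min(r1, c1)
    probs = {k: comb(r1, k) * comb(n - r1, c1 - k) / denom
             for k in range(lo_k, hi_k + 1)}
    upper = sum(p for k, p in probs.items() if k >= a)
    lower = sum(p for k, p in probs.items() if k <= a)
    doubled = min(1.0, 2.0 * min(upper, lower))
    # "Min-likelihood" rule: sum all tables no more probable than the one seen
    minlike = sum(p for p in probs.values() if p <= probs[a] * (1 + 1e-9))
    return upper, doubled, minlike

one_sided, doubled, minlike = fisher_ps(3, 1, 1, 9)
print(round(one_sided, 4))  # upper-tail (one-sided) p
print(round(doubled, 4))    # two-sided by doubling the smaller tail
print(round(minlike, 4))    # two-sided by summing "as or less probable" tables
```

For this table the two two-sided conventions give different answers (about 0.082 versus 0.041), so "as extreme as" genuinely has to be chosen, not merely computed.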
This definition of the p-value is wrong. "A p-value of 0.05 means that there is only a 5% chance that these findings occurred by chance" equals (imho): "We are 95% sure that this effect is real".