Which error is more beneficial to society?

We are intensely focused on the p-value to avoid type I error. But from a societal and patient perspective, if we have to choose between withholding a potentially effective therapy and giving a potentially ineffective one, isn’t the latter preferable? The question rests on two assumptions: 1) the safety profile is acceptable, and 2) although the societal cost of giving a potentially ineffective therapy is huge, that is a separate cost discussion. I am trying to focus here not on cost but on the concept of which mistake should be more acceptable to society. Whether we want to or not, we choose between these mistakes subconsciously all the time anyway; it’s just that because we have settled on an arbitrary p-value threshold of 0.05, we justify our decision either way (<0.05 means it ‘works’ and >0.05 means it ‘does not work’) without necessarily realizing that this mental satisfaction may only reflect a cognitive bias.

3 Likes

Javed, we don’t see this fundamental question asked often enough, so thanks very much for posting it. This has all kinds of ramifications. I’d like to start my response by describing an opposite type of problem.

Unlike the likelihoodist school of statistics, which has both type I and type II errors converging to zero as the sample size increases, the frequentist school fixes type I error (at, say, 0.05, for completely arbitrary reasons related to a table that Ronald Fisher constructed in the early 1900s). By fixing type I error at 0.05, the investigator is declaring that it’s OK to assert efficacy even when there is no efficacy fully 1/20th of the time, even in a 200,000-patient randomized clinical trial. The idea of allowing assertions of efficacy to be made that often under the null, when the sample size is huge (and power is high), is quite nonsensical to me.
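
As a minimal sketch of that point (a toy simulation with invented numbers, not data from any trial): when the type I error is fixed at 0.05, the rate of efficacy assertions under the null stays at about 5% whether the trial enrolls 100 patients per arm or 200,000.

```python
# Toy simulation (not from any real trial): two-arm comparison with zero true
# treatment effect. The proportion of "efficacy assertions" stays near alpha
# no matter how large the trial is.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def assertion_rate_under_null(n_per_arm, n_sims=100_000, alpha=0.05):
    se = np.sqrt(2.0 / n_per_arm)             # SE of the difference in arm means (unit-variance outcome)
    diff = rng.normal(0.0, se, n_sims)        # simulated mean differences under the null
    p = 2 * stats.norm.sf(np.abs(diff / se))  # two-sided p-values
    return float(np.mean(p < alpha))

for n in (100, 200_000):
    print(n, round(assertion_rate_under_null(n), 3))  # both come out near 0.05
```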

Now turn to the other direction you have brought up. Regarding type II error (1 - power), I have frequently criticized allowing the type II error to be 4 times the type I error (i.e., requiring power = 0.8 when type I error is set at 0.05), which assumes that a “false negative” is 4 times less bad than a “false positive” (these are in quotes because type I and type II errors are not the probabilities of false positive or false negative findings, but assertion probabilities under specific assumed efficacy values).
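
To make the implicit trade-off concrete (a sketch using an assumed standardized effect size of 0.3, purely for illustration): designing with type I error 0.05 and power 0.80 fixes the type II error at 0.20, four times the type I error, before anyone has asked which mistake actually costs more.

```python
# Conventional two-arm sample-size arithmetic and the beta/alpha ratio it bakes in.
from scipy import stats

alpha, power = 0.05, 0.80
beta = 1 - power
print(round(beta / alpha, 2))            # 4.0 -- the implicit "false negative is 4x less bad"

delta = 0.3                              # assumed standardized effect size, purely illustrative
z_a = stats.norm.ppf(1 - alpha / 2)      # 1.96
z_b = stats.norm.ppf(power)              # 0.84
n_per_arm = 2 * ((z_a + z_b) / delta) ** 2
print(round(n_per_arm))                  # about 174 patients per arm under these conventions
```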

I believe that the key issue is that the frequentist approach came about because it was easy to do and was (on the surface) objective. By creating a more automatic procedure, frequentism divorced itself from the more difficult task of optimum decision making. Many decision analyses are based on the idea of maximizing expected utility, i.e., minimizing expected loss or cost. The inputs to the expected utility are the probability distribution of the quantity of interest (the Bayesian posterior distribution) and the utility function. The utility function recognizes such things as

  • the disease is rare and the study may not be replicated
  • there are no other treatment options
  • the treatment is exceedingly costly to patients or society

So the best decision for a rare disease with no other treatment options may be to rely on weaker evidence, e.g., acting as if efficacy is there if the posterior probability exceeds 0.65 or if the p-value is less than 0.2.
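
As a minimal sketch of that kind of calculation (with an invented posterior probability and invented utilities, none of them taken from any trial), the decision is driven by the expected-utility comparison rather than by a p-value threshold, and changing the utilities can flip it:

```python
# Toy expected-utility decision: act on the therapy when acting has higher
# expected utility than withholding it. All numbers are invented for illustration.

def eu_act(p_eff, u_effective=1.0, u_ineffective=-0.2):
    # Treating: large gain if the drug works; small loss (cost of a safe but
    # useless drug) if it does not.
    return p_eff * u_effective + (1 - p_eff) * u_ineffective

def eu_withhold(p_eff, u_missed=-1.0, u_correctly_withheld=0.0):
    # Withholding: large loss if the drug actually works (rare disease, no
    # alternatives); no loss if it does not.
    return p_eff * u_missed + (1 - p_eff) * u_correctly_withheld

p_eff = 0.65  # assumed posterior probability of efficacy
print(eu_act(p_eff), eu_withhold(p_eff))  # 0.58 vs -0.65 -> act, even on "weak" evidence

# With good alternatives and a very costly drug, e.g.
# eu_act(p_eff, u_effective=0.3, u_ineffective=-1.5) = -0.33 versus
# eu_withhold(p_eff, u_missed=-0.2) = -0.13, the same 0.65 posterior
# favors withholding: the utilities, not a p-value cutoff, drive the decision.
```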

So your question is a very important one, and it points out the lack of attention to weighing risks, benefits, costs, etc. We also need to note that there should be different evidential requirements for different settings. If you have the option of prescribing either drug A or drug B, and both are readily available and cost about the same, you can “play the odds” and lean towards B if the evidence is weak but in the right direction. But a regulator who needs to make a lasting decision about whether to approve a drug will require a higher level of evidence. Using p < 0.05 does not provide the needed evidence even there; the risk/benefit profile, evidence for a major clinical effect, and other considerations also need to be weighed.

We know that we need to abandon the 0.05 cutoff for p-values, so we could generalize this discussion to refer to “smaller” p-values without invoking specific thresholds.

1 Like

I think that’s one of the deepest questions I’ve ever read. The idea that an error could ever be positive for society is of philosophical interest.
We are continually sending active drugs to the trash. That is a fact. But the question of how much one should be willing to pay for a drug given the uncertainty about it is also interesting. Clinically everything is more complicated, because regardless of what the statistics say, there will always be some patients who can benefit from a specific drug (e.g., those with specific molecular predictive factors).
My feeling is that the FDA and EMA, in their role as regulators, are horrified by type I error, but until recently, not so much by type II error.
Two very relevant clinical examples.
The RADIANT-2 trial was published in 2011: the use of everolimus was associated with a PFS improvement, with a hazard ratio of 0.77 (95% CI 0.59-1.00; one-sided log-rank p = 0.026). A post-hoc analysis was then published suggesting that the benefit was highly significant when the effect was adjusted for covariates. The sample size was 429 patients. Despite the shortage of treatment options, it took 5 years before 300 more patients were randomized in the RADIANT-4 trial to show a difference with a two-sided p-value < 0.05. At that point, everolimus was approved.
Faced with this rigidity in making dichotomous decisions, it is curious that the FDA has loosened its stance with immunotherapy, and we are now seeing the opposite.
The FDA provisionally approved nivolumab for refractory liver cancer on phase II data alone. Now the data from a phase III study have come out, and it could not demonstrate benefit according to the study design (HR 0.85; 95% CI 0.72-1.02; P = 0.0752). But again, nivolumab is probably active…
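
To put a rough number on “probably active” (a back-of-the-envelope calculation under my own assumptions of an approximately normal log hazard ratio and a flat prior, not anything reported by the trial), the reported HR of 0.85 with 95% CI 0.72 to 1.02 corresponds to roughly a 97% chance that the true HR is below 1:

```python
# Back-of-the-envelope only: reconstruct the SE of the log HR from the reported
# 95% CI, then, assuming approximate normality and a flat prior, read off the
# probability that the true HR is below 1.
import numpy as np
from scipy import stats

hr, lo, hi = 0.85, 0.72, 1.02                # reported point estimate and 95% CI
log_hr = np.log(hr)
se = (np.log(hi) - np.log(lo)) / (2 * 1.96)  # CI width on the log scale spans 2 x 1.96 SEs

p_hr_below_1 = stats.norm.cdf(-log_hr / se)  # approximate P(true HR < 1)
print(round(p_hr_below_1, 2))                # about 0.97
```

This is only a crude reinterpretation of the reported summary, but it illustrates the gap between “P = 0.0752, benefit not demonstrated” and what the same data suggest about the direction of the effect.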

1 Like

But someone must pour many millions into the development of a drug, and the longer it takes to discard an ineffective drug, the bigger the loss. I remember how all the big companies dropped their insulin inhalers when MannKind’s results looked weak. They didn’t want to miss out if their competitors were pursuing it, but once Novo (I think it was) abandoned their development, the others dropped like flies.

Edit: I just checked MannKind’s share price: from around $100 a share in 2006 to $1 today.

1 Like

The answer lies not in the evolution of company share prices, but in what a good regulator should do to maximize the benefit to patients, and to society as a whole, taking into account the nature of the errors at stake. It is a subtle and profound question.

It is not only regulators who influence, post hoc, what reaches the public; drug development is relevant to the question, naturally. Such discussion is often aloof from those realities, though. For example, a new re-analysis of antidepressants in BMJ Open suggests that SSRIs are a placebo with adverse effects, yet there is an ever-present repository of more than 20 SSRIs. That is the effect of strategic drug development and marketing, however. We cannot hold a retrospective, philosophical discussion and remain aloof from that reality.

3 Likes

Dear all

Thank you for your comments; I am learning a lot. All the issues raised are real: drug costs, marketing, profit motive, regulatory decision making, etc. Whether more rational decision making would actually lower drug prices can be argued, but we don’t know. On rational decision making: while my first question focused on ‘negative’ trials that missed the p-value threshold, one can also ask whether post-trial methods should be used for ‘positive’ trials to determine who actually benefited, rather than giving the drug to everyone just because the p-value hit some predetermined arbitrary level. So there is no escape from uncertainty, to a degree.

But coming back to the question at hand: company influence, shares, marketing, etc. come after the regulators, and the regulators come after the scientists. So at the first step - the scientists, i.e., statisticians and clinical trialists - unless we look at the options that are most beneficial to society and patients, for both positive and negative trials, the current paradigm will not change. Again, I am absolutely not oblivious to cost issues, but this notion of declaring a trial positive or negative based on an arbitrary p-value is really difficult to justify.

The main problem is that if you approve a “bad” drug, you eventually find out. If you fail to approve a “good” drug, you never do, and there are no consequences - and this asymmetry drives human behavior at all levels of evidence generation.

1 Like

Should some be condemned not to receive active therapy for refractory tumors so that others do not take expensive placebos for depression? It’s not easy. That’s why I was fascinated by your question; Frank has answered it masterfully. Another aspect to keep in mind is that evidence is gradual, but decisions (approve, do not approve) are dichotomous. Perhaps the matter would be alleviated by also introducing gradual decisions (conditional or partial approvals, etc.).

“Cannot demonstrate” in the eyes of some, but not all (absence of evidence is not evidence of absence …).

I very much appreciate your statement “Should some be condemned not to receive active therapy for refractory tumors so that others do not take expensive placebos for depression?”. I also agree about gradual approval, and there is some movement in that direction.

But the big issue is that after years of logical scientific progression of a hypothesis - from test tube to small animal to large animal to early-phase positive signals of a biology that makes sense - when the phase III trial is neutral, the interpretation in the abstract’s conclusion is that the intervention does not work. Most high-caliber journals would not allow a more nuanced conclusion, and if a journal does, some on Twitter will call it ‘spin’. Now, if the trial is totally neutral or negative, that’s one thing; but if it is not, and merely misses the p-value threshold, then in the vast (vast) majority of cases - even though there may be many lessons from the trial (inclusion and exclusion criteria, trial conduct, adherence, crossover, loss to follow-up, etc.) - NO ONE will repeat the experiment and the intervention is gone. I wish either that trials were less expensive, so we could learn to target better in the next trial, or that post-hoc logical assessment were not taboo in trials.

1 Like

Words are subtly loaded by the devil, but in this case, I’m not sure I expressed myself so badly (I didn’t say that nivolumab doesn’t work).

1 Like

There are many oncological examples of this. I will mention one: the AVAGAST trial. My hypothesis is that antiangiogenic therapy does have some effect in gastric cancer, but the AVAGAST trial was not able to gather sufficient statistical evidence. I think the problem there was that treatment was stopped too early, with a subsequent angiogenic rebound in the bevacizumab arm after its discontinuation, but that is not proven. In any case, the strategy was abandoned.

1 Like

I think you’re considering mostly one side of this, i.e., missing out on an important drug due to a stringent alpha?

Well, despite negative results, they test and test and test acupuncture. Here is the latest in JAMA from July: “acupuncture on the DAM as adjunctive treatment to antianginal therapy showed superior benefits in alleviating angina”.

But as ScienceBasedMedicine.org put it: “They have no basis in anatomy, physiology, biology, or reality. They are based entirely on prescientific notions of life energy. … We also have thousands of clinical studies which overwhelmingly show no difference in effect from doing acupuncture in different locations.”

And then there is all the talk about, e.g., over-medicated children with ADHD, and the drug industry constantly redefining the cut-offs for disease and expanding the catch-all… My immediate worry would not be a dearth of drugs.

1 Like

Completely agree - but there are two issues. First, while for some diseases like diabetes there may be many drugs, for others like HFpEF there are none, so there is a mismatch. Second, when there are many drugs for one condition, each new one is always assessed incrementally on top of the others, and we never know when the base therapy is no longer needed. Hence the cost-driven reduction in the number of trials overall makes it difficult to effectively and cost-effectively devise patient management strategies.

1 Like