Information, Evidence, and Statistics for critical research appraisal

In another post about understanding p-values, ESMD made an important observation:

ESMD wrote:
I suspect that much of the confusion among physicians learning critical appraisal stems from the fact that our teachers are often (?usually) practising MDs, who might not understand the historical foundations of statistics well enough to address questions like this. As a result, these types of questions get glossed over.

In that thread, @Sander_Greenland mentioned Richard Royall – the biostatistician and author of an important philosophical text Statistical Evidence: A Likelihood Paradigm (published in 1997) where he wrote in the Preface:

…Standard statistical methods regularly lead to the misinterpretation of scientific studies. The errors are usually quantitative, when the evidence is judged to be stronger (or weaker) than it really is. But sometimes they are qualitative – sometimes one hypothesis is judged to be supported over another when the opposite is true. These misinterpretations are not a consequence of scientists misusing statistics. They reflect instead a critical defect in current theories of statistics.

The subtle differences between the Neyman-Pearson and Fisher interpretations of p-values that Sander describes are elaborated on in this article, which I highly recommend:

Royall’s monograph was published only 1 year after the widely cited Sackett et al. paper on Evidence Based Medicine (EBM):

Evidence based medicine is the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients.

Later evolution of EBM involved elevating statistical fallacies into so-called “hierarchies of evidence” that are logically incoherent when looked at carefully. Philosophers have had a field day tearing apart the simplistic “hierarchies of evidence”, which is how I ended up here way back in 2019!

I independently discovered Stegenga’s argument from thinking about “evidence” as closely related to the economic problems of Welfare and Social Choice Theory, where Kenneth Arrow’s famous result seemed applicable. Stegenga’s paper preceded my discovery by a few years, but his impossibility proof was very similar to my own thinking.

A more “statistical” critique is found here:

The fundamental problem of EBM hierarchies and checklists is that they take frequentist pre-data design criteria and apply them in a post-data context. This violation of the likelihood principle is true of the Neyman-Pearson perspective, not the Fisherian one that Sander alluded to. Casella describes the problem in this paper:

Goutis, C., & Casella, G. (1995). Frequentist Post-Data Inference. International Statistical Review / Revue Internationale de Statistique, 63(3), 325–344.

The end result of an experiment is an inference, which is typically made after the data have been seen (a post-data inference). Classical frequency theory has evolved around pre-data inferences, those that can be made in the planning stages of an experiment, before data are collected. Such pre-data inferences are often not reasonable as post-data inferences, leaving a frequentist with no inference conditional on the observed data. We review the various methodologies that have been suggested for frequentist post-data inference, and show how recent results have given us a very reasonable methodology. We also discuss how the pre-data/post-data distinction fits in with, and subsumes, the Bayesian/frequentist distinction

These critical defects have so deformed the “peer reviewed” literature (as well as clinical practice guidelines based upon them) that large areas of scholarship are practicing cargo-cult (from Feynman) or “pathological” science (from psychologist and psychometrician Joel Michell).

A snarky Bayesian might describe it as “researchers pretending to know the difference between their priors and their posteriors.”

My quasi-formal analysis of this in the context of parametric vs. ordinal models was posted here:


Hi Robert

Lots of interesting links and different ideas in your post- some related to statistical methods and others to the validity of core EBM concepts. I just wanted to pick up on one thread from what you’ve written and to offer a physician’s perspective.

Regarding evidence “hierarchies”: maybe this is an unpopular opinion, but I feel pretty strongly that they do actually have a role in medicine, just not in the way they are often perceived by non-statisticians and non-clinical people. I think most practising physicians would agree that the traditional pyramid (RCT at the top) applies in assessments of treatment efficacy but not necessarily in assessment of treatment safety. I’ve noticed that those who tend to argue against an efficacy pyramid are often those in non-clinical professions.

The idea of first “doing no harm” really is a North Star in medical practice. Since it’s possible for a treatment that lacks intrinsic efficacy to have side effects, we strive, as often as possible, to apply therapies that have proven themselves to have meaningful intrinsic efficacy. Applying “dud” therapies to patients has at best a neutral effect, but more likely a net detrimental effect. Even if a therapy lacking intrinsic efficacy were to have very few side effects, there would usually be a cost to the patient or healthcare system in terms of time and money.

Those who extol the virtues of observational evidence point out that patients enrolled in clinical trials are often quite different from those who subsequently receive the treatment in the postmarket setting. This is true. They argue that we have no “guarantee” that patients who would not have fulfilled the inclusion criteria for the pivotal trial(s) (on whose basis the therapy was approved) will benefit from the treatment. This is also true. But if our threshold for every clinical decision we made were a prior demonstration that someone exactly like the patient in front of us had been documented to “respond” to the treatment, we would be waiting forever, since no two patients are exactly the same. We’d spend every day in the office paralyzed with indecision. Very often, the best we can do is to ensure that we are not prescribing an inert/dud therapy- this is the “bet” we make as doctors when we write a prescription. We are saying, “since this drug has shown that it has the ability to improve outcomes in at least some people for this indication, it is not unreasonable to bet that it might also improve the outcome for the patient in front of us.” Whether the particular patient in front of us will end up “responding” to the therapy is something we often can’t accurately predict in advance, and in fact may never actually learn (especially in the case of many preventive therapies like statins, as compared with treatments like bronchodilators, which are used to treat symptomatic conditions).

In short, the over-riding priority of physicians is to avoid exposing patients to dud treatments.

A well-designed RCT will 1) reveal a treatment’s intrinsic efficacy, if it exists; and 2) minimize the chance that we will infer intrinsic efficacy in situations where it doesn’t actually exist. In medicine, there are many examples of observational assessments of efficacy (involving either non-drug therapies or approved drugs that had been used “off-label”), which did not stand up to subsequent experimental testing. When results of observational studies and RCTs of efficacy conflict, physicians will virtually always preferentially trust the RCT (provided it was well-done).

Ultimately, I don’t think that there will ever come a time when observational studies are considered by clinicians to be as compelling as RCTs for demonstrating efficacy. I never really bought into the common argument: “but we can’t do an RCT for everything, so some evidence is better than nothing…isn’t it?” Actually, sub-optimal evidence often IS worse than no evidence, at least where consequential decisions for patients and healthcare systems are concerned. Sub-optimal evidence is insufficiently reliable and lends an undue veneer of certainty in situations where certainty is unwarranted. In turn, unwarranted certainty can profoundly affect clinical decision-making (specifically, the mental risk/benefit calculations that are performed many times every day by practising physicians), potentially with serious consequences for the patient.

Finally, and at the risk of going down a rabbit hole that nobody wants to go down, it’s important to flag a notable exception to the above framework. These are situations in which it’s essential to apply a treatment even in the absence of RCT evidence. The scenarios in question are public health emergencies, during which risk/benefit calculations are often best viewed through the lens of physics (and common sense), rather than human biology and efficacy “pyramids” (e.g., the decision to recommend masking to decrease the spread of an airborne illness).

Will have to write a separate paper on the topic at some point, but the reason such evidence hierarchies are problematic is this: RCTs yield (in theory) valid p-values and frequentist CIs for the differences between cohorts, but they do not offer an “apples-to-apples” comparison unless we block (the less accurate term is “stratify”) in the design and adjust for strong prognostic factors in the statistical model.

Conversely, there are meticulously detailed and accurate case reports or case series that have yielded more information, in terms of bits of surprisal, describing patients very similar to the ones I may encounter in clinic. All this of course keeping in mind that my clinical practice is highly skewed towards rare and aggressive kidney cancers. But there is certainly a fair number of RCTs in oncology that are uninformative or, as @f2harrell sometimes puts it, the money was wasted. A common reason is that trialists may falsely believe that random treatment assignment is enough on its own to save a poorly conducted RCT. Thus key aspects of internal validity are ignored, including patient adherence, accurate data collection, as well as proper blocking and covariate adjustment.

This does not invalidate the tremendous value that RCTs can provide in certain contexts. A good strategy is to use whatever source of information works best depending on the problem at hand. For some, this may plausibly consist predominantly of RCTs.

Yes, like any other type of study, RCTs have to be well designed in order to be valuable. You are essentially highlighting the fact that RCTs aren’t valuable if they lack assay sensitivity. And if you select your trial subjects poorly, your trial will lack assay sensitivity- it will fail to identify the intrinsic drug efficacy that is actually present.

If you enrol patients with tumours that are being driven by wildly variable biologic pathways into a clinical trial of a therapy that is designed only to affect a specific biologic pathway, you won’t be surprised if your trial doesn’t identify the intrinsic efficacy of that therapy, since only a small fraction of trial subjects even stood a chance to respond. You have wasted your money and have falsely concluded that the therapy “lacks efficacy,” when the problem was instead that you didn’t think carefully enough about the design of your trial. The root failure here, though, is not randomization as a concept, but rather a failure to sufficiently understand the disease you are trying to treat before you start testing treatments for it. This was the underlying point in all those other threads about the poor track record of RCTs for critical care “syndromes.”

As you say, oncology is an outlier compared with other branches of medicine, in the sense that the spectrum of underlying pathology being treated is likely far more complex and variable than the pathology we see in other fields (e.g., cardiology). At the end of the day, though, if you, as an oncologist, have a choice between basing your decision to use a new therapy on a case report involving a single patient who survived much longer than expected, versus an RCT involving patients with similar tumour biology, in which the same therapy was tested against standard of care, I expect that you will consider the RCT results to be more compelling (?)

I’m glad you expressed this, because this common attitude encouraged by evidence hierarchies is neither ethically optimal, nor statistically correct.

Richard Royall (who I mentioned above) wrote an excellent paper discussing the ethics and scientific issues around the ECMO trial for infants. I cannot express the logic better than by offering some quotes and recommending reading it.

Some key quotes:

  1. From the abstract:

We urge that the view that randomized clinical trials are the only scientifically valid means of resolving controversies about therapies is mistaken, and we suggest that a faulty statistical principle is partly to blame for this misconception.

  2. In discussing the 1985 ECMO trial, where the investigators felt “compelled to conduct a prospective, randomized study” on infants with severe respiratory distress without equipoise, Royall has this to say:

This is particularly disturbing to me as a statistician, because it is we statisticians who are largely responsible for creating attitudes and assumptions that compelled this study… In the next section, I propose that the above doctrine springs, at least in part, from adherence to a faulty statistical principle.

  3. In presenting alternatives to the randomized “play the winner” design, he says:

Science desires randomized clinical trials, it does not demand them. [my emphasis] Moreover, the importance of randomization is exaggerated when historical controls are the only alternative discussed…many of the weaknesses of historical controls can be avoided by using concurrent, nonrandomized controls. And of course, comparing treatment to historical and concurrent controls can provide even greater evidence of efficacy.

  4. In discussing the randomization principle (i.e., that only randomized experiments provide a basis for inference), he writes:

Statistical theory explains why the randomization principle is unacceptable. It does this in terms of the concepts of conditionality and likelihood…the conditional randomization distribution is degenerate, assigning one to the actual allocation used … the only “inference” on the observed data is “I saw what I saw”

I think we’ll have to agree to disagree about this- that’s okay. Unfortunately, I don’t have a list of citations readily available to support my view, but I’m pretty sure that it’s not unique. The priority placed on RCT evidence as the basis for drug approval by regulatory agencies around the world for many decades now provides indirect evidence that this is not a minority opinion.

It’s always possible to find experts on either side of an issue- I guess ultimately we all have to “pick sides…”

Excellent question at the heart of this discussion. The honest practical answer is that such a choice is unnecessary. Why not just use both instead of choosing between them? If we have a well conducted internally and externally valid RCT relevant to our patient in clinic then it would be prudent to take advantage of all the statistical inferential machinery built around RCTs to help us generate plausible uncertainty estimates. But if there is also an additional detailed case report with surprising findings refuting strong hypotheses relevant to our patient then it would be prudent to use that information as well in our final inference. We can think of case reports as patient stories backed by evidence that can be used deductively to refute theories and are most informative when they are anomalies. This is why researchers writing such case reports should treat them with the utmost respect and rigorously provide as much empirical evidence on the details of the story as possible, especially on the aspects of the case that are surprising and incompatible with current theories.

What if the hypothetical RCT and case report are contradictory? Then we would need to probe why that is and which source of information is more relevant to our current patient.

On a theoretical / philosophical level, the dilemma you posed corresponds to one of pure randomization inference (perfect robustness) versus full conditioning (perfect patient relevance). This is a special case of the variance / bias trade-off and it is reasonable to think of this trade-off as part of a continuum and not a simple dichotomy. However, if strongly pressed, I have already nailed my colors to the mast by claiming that patient relevance yielded by biological insights is more fundamental than robustness for patient-centered inferences. This perhaps is personal preference and is not surprising given my day job as a wet lab-based physician scientist.


What’s that line from The Godfather? “Just when I thought I was out…they pull me back in.” I’m trying to get a mountain of charts done this afternoon, Pavlos, but you keep writing interesting things and I feel I must respond… :wink:

I totally agree re the strong evidentiary value of some case reports- when they are well-reported, and in the right clinical situation, they can be incredibly compelling. Examples include scenarios where 1) the disease in question has a strongly predictable adverse outcome in the absence of treatment (e.g., death is inevitable within a very short period of time- with rabies or very aggressive cancers, for example), or 2) when there is evidence of “positive dechallenge” and/or “positive rechallenge.”

If I saw a patient survive for 1 year with a condition that was previously known to be fatal in 100% of patients within 3 months, I too would jump for joy- such a finding would likely be compelling enough to change practice. But of course, these types of scenarios are not the norm in medicine, except perhaps in the context of aggressive infectious diseases and cancers (and maybe certain degenerative neurologic diseases). When death in a short time is a clinical certainty, there is very little or no downside to trying new therapies outside the confines of RCTs. But as soon as we move to analyzing case reports involving diseases with any wider “spectrum” of prognosis than certain death in a very short time, the potential downsides of acting on anecdotal evidence manifest themselves (unless positive dechallenge/rechallenge can be demonstrated). If this were not true, then why do we even bother to do clinical trials in oncology at all? Why don’t we simply start applying new therapies based on detailed understanding of tumour biology alone and skip the trials?

Good point. As you mentioned above, we can do trials even in one patient challenging them with a certain therapy to learn more about how their disease and constitution react to it. In general, trials and other experiments / interventions tend to be more informative for my applications than observational data. The advantage of RCTs over other experimental evidence is a brilliant inferential framework meticulously developed over the last century to provide plausible and statistically robust estimates of relative treatment effect uncertainty from RCTs.

A practical example I encounter regularly with my patients is when and whether to offer nephrectomy in metastatic clear cell renal cell carcinoma (the most common kidney cancer). The only contemporary RCT addressing this question was published in the New England Journal of Medicine and is notoriously uninformative: it used a non-inferiority design (with all its limitations) and had a slew of internal validity issues, including the fact that many patients did not receive the treatment assigned to their cohort (see Figure 1). This is the reason why the per-protocol analysis was never presented (it would have been inconclusive, as evidenced by the weird semi-per-protocol analyses called PP1 and PP2 buried in the supplemental results). The per-protocol analysis was not shown even though, in the same journal a year prior, it was recommended that non-inferiority trials show both intention-to-treat and per-protocol analyses, and that concerns should be raised if the two analyses are inconsistent with each other.

I do think the authors should be commended for attempting such an RCT. Reading it you can feel the blood, sweat, and tears needed to get it to the finish line. And publishing it in a high impact journal can be considered a reasonable motivation for others to continue designing and conducting ambitious RCTs. But from a practical standpoint, there is little actionable information yielded from this RCT. Thankfully there are many other sources of information that do allow us to make informed recommendations for our patients regarding nephrectomy.

This happens often in various situations and probably not only in oncology. Those not familiar with the nuances of this RCT will simply take it as “level I evidence” or something like that, which can bias recommendations. But this is the value of knowing the subject matter well. Impossible to do for everything and I have the utmost respect for physicians dealing on a regular basis with multiple diverse conditions. The practical value of guidelines, often based on evidential hierarchies, is that they provide an often reasonable roadmap for generalists.

Regulatory authorities have a more nuanced approach to this issue than simplistic hierarchies would lead you to believe.

This is a question of logic and mathematics that the EBM literature simply gets wrong.

For controlled trials (of which RCTs are a proper subset):
The CONSORT Statement – explanation and elaboration (pdf), p. 9, Box 2, states the following regarding the adaptive allocation method known as minimization:

Nevertheless, in general, trials that use minimization are considered methodologically equivalent to randomized trials, even when a random element is not incorporated.

From Doug Altman’s 2005 article:

Minimisation is a valid alternative to ordinary randomisation and has the advantage, especially in small trials, that there will be only minor differences between groups in those variables used in the allocation process.

I could point to similar statements by the FDA and EMA.

My point: randomization is (very) useful but not necessary for causal inference.
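
For anyone who has not seen minimization in action, here is a minimal sketch of the deterministic version described in the quotes above. Everything in it is an assumption made for illustration: the function name `minimization_assign`, the two prognostic factors, and the made-up patients are hypothetical, and real implementations usually add factor weights and a random element for tie-breaking.

```python
def minimization_assign(new_patient, enrolled, arms=("A", "B")):
    """Pick the arm that minimizes total marginal imbalance.

    new_patient: dict mapping factor name -> level, e.g. {"sex": "F"}
    enrolled:    list of (patient_dict, arm) pairs already allocated
    """
    imbalance = {}
    for candidate in arms:
        total = 0
        for factor, level in new_patient.items():
            # Count earlier patients in each arm who share this factor level.
            counts = {arm: 0 for arm in arms}
            for patient, arm in enrolled:
                if patient[factor] == level:
                    counts[arm] += 1
            # Pretend the new patient joins the candidate arm.
            counts[candidate] += 1
            total += max(counts.values()) - min(counts.values())
        imbalance[candidate] = total
    # Ties go to the first arm listed; real trials usually break ties at random.
    return min(arms, key=lambda arm: imbalance[arm])


# Hypothetical enrolment so far (made-up patients, for illustration only).
enrolled = [
    ({"sex": "F", "age": "<60"}, "A"),
    ({"sex": "M", "age": "<60"}, "A"),
    ({"sex": "F", "age": ">=60"}, "B"),
]
new_patient = {"sex": "F", "age": "<60"}
print(minimization_assign(new_patient, enrolled))
```

In this toy example the new patient goes to arm B, because arm A already holds both of the earlier patients under 60. Note that no random number is drawn anywhere: the allocation is fully determined by the covariates of the patients enrolled so far, which is the design feature the CONSORT and Altman quotes treat as methodologically acceptable.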

A stronger case for skepticism of observational reports is that the allocation mechanism is not controlled by the experimenter. While true, this is something that can be overcome, and should be judged on a case by case basis, not decided against a priori simply by a design feature that becomes irrelevant after the data are in.

I’ll be the first person to admit that I’m no stats expert. But I’m pretty familiar with the type of evidence that regulatory agencies expect in making decisions around drug approval and withdrawal- I worked for one for 10 years (!) The job provided a considerable education in how observational evidence is handled by regulatory authorities. Is there a long list hidden somewhere of drugs that were approved without evidence from RCTs?

I’m not specifically talking about drugs, but the logic of causal inference for any intervention. Randomization simply isn’t necessary, and this is clear from decision theory.

A case study:

This meta-analysis of 6 small sample studies of immobilization in external rotation vs internal rotation for initial dislocation of the shoulder “failed to find a significant difference.”

The CI is 0.42-1.14 with p=0.15. The data are compatible with a pretty large improvement or a very small increase in risk of recurrence.
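
A quick way to see how little this result says against an effect is to back out the implied estimate and re-express the p-value as bits of surprisal. The sketch below assumes (this is my assumption, not something stated in the meta-analysis) that the interval is a 95% Wald-type interval for a ratio measure, symmetric on the log scale; the function name `summarize_ratio_ci` is made up for illustration.

```python
import math
from statistics import NormalDist


def summarize_ratio_ci(lower, upper, level=0.95):
    """Recover estimate, z, p, and S-value from a ratio-scale CI.

    Assumes the interval is symmetric on the log scale (Wald-type).
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    log_lo, log_hi = math.log(lower), math.log(upper)
    log_est = (log_lo + log_hi) / 2          # midpoint on the log scale
    se = (log_hi - log_lo) / (2 * z_crit)    # implied standard error
    z = log_est / se                         # test statistic against ratio = 1
    p = 2 * NormalDist().cdf(-abs(z))        # two-sided p-value
    return {
        "estimate": math.exp(log_est),
        "se_log": se,
        "z": z,
        "p": p,
        "s_value_bits": -math.log2(p),       # surprisal of p, in bits
    }


result = summarize_ratio_ci(0.42, 1.14)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under those assumptions the implied point estimate is a ratio of roughly 0.69, the recovered p-value lands close to the reported 0.15, and the surprisal is under 3 bits, which is less information against "no difference" than seeing three heads in a row from a fair coin. That is a long way from evidence that the two positions are equivalent.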

Do we really need a study with 200+ subjects in each arm (for a reliable RCT), or can this be studied using techniques that would increase power via matching on covariates?
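
To give a feel for where a number like "200+ per arm" can come from, here is a rough sketch using the textbook normal-approximation formula for comparing two proportions. The inputs are assumptions on my part: a risk ratio of 0.69 (the geometric midpoint of the quoted interval 0.42-1.14) and a control-arm recurrence risk of 40%, which is a hypothetical figure chosen for illustration and not taken from the meta-analysis. The function name `n_per_arm` is also made up.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, risk_ratio, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-proportion comparison.

    Uses the unpooled normal approximation, with no covariate adjustment.
    """
    p_treat = p_control * risk_ratio
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided type I error
    z_beta = NormalDist().inv_cdf(power)            # desired power
    variance_sum = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    n = (z_alpha + z_beta) ** 2 * variance_sum / (p_control - p_treat) ** 2
    return math.ceil(n)


# Hypothetical baseline risks; only the 0.69 ratio is tied to the quoted CI.
for baseline in (0.25, 0.40):
    print(baseline, n_per_arm(baseline, 0.69))
```

With a 40% baseline risk this comes out a little over 200 per arm, and it grows quickly as the baseline risk falls. The calculation ignores covariates entirely, which is the point of the question: blocking, matching, or adjusting for strong prognostic factors reduces outcome variance and can bring the required numbers down, though by how much depends on how prognostic the covariates really are.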

Fortunately, I don’t think that someone like me (a family doctor) would ever presume to make treatment recommendations for a patient with metastatic cancer :slightly_smiling_face: So you don’t need to worry about the potential consequence if I were to misinterpret the results of that particular NEJM trial. Nor would I expect that the results of a trial with the limitations you have outlined would necessarily be put into practice by the specialist community (the group in the best position to identify such limitations).

Nobody is arguing that the results of RCTs should automatically be put into practice without deep scrutiny of their methods and examination for potentially fatal flaws, ideally by those with the deepest subject matter expertise. Only if there is agreement by those with both subject matter and methodologic expertise (and sometimes the Venn diagram doesn’t entirely overlap) should a strong recommendation make its way into a Clinical Practice Guideline. In other words, it’s important not to conflate the idea of an efficacy “hierarchy” with the idea that RCTs are “infallible”- as is true with any study design, RCT results always require deep critical analysis.


Perhaps some additional background reading for this topic could be S. Piantadosi, Clinical Trials 3/e, chapter 2, especially Sec. 2.2.4 “Clinical and Statistical Reasoning Converge in Research” and Sec. 2.3.4, “Experiments Can Be Misunderstood” but also the rest of the chapter.

Excellent and relevant book on this topic indeed. Section 2.2.4 is also pertinent to the original random sampling vs random allocation thread.

Agreed, I should have thought of it earlier!

Wonderful post and discussion following it. On this one statement I’d offer that finding treatments that are truly effective must be at least equal in importance to a physician. A “dud” can be compensated for by switching treatments, but if there are no effective treatments to switch to, the patient is out of luck.
And in oncology I see too much excitement about costly, minimally effective treatments.

Worth digging deeper into this statement: the concept of cost relates to the trade-offs used to make decisions. It may help to distinguish between descriptions, inferences, and decisions, e.g., as we describe using oncology examples here. Often in this forum I focus on statistical inferences, highlighting the role of biological and other mechanisms that could generate the observed data. However, in the statistical methodology literature our team’s contributions are, at least superficially, closest to the Bayesian decision-theoretic framework that converts inferences into decisions.

Along these lines, we recently published methodology to personalize treatment decisions in oncology based on covariate-specific utility functions from RCT datasets here. In a conceptually similar manner, we also recently developed a Bayesian treatment selection approach that capitalizes on the advantages of random allocation and concurrent control.
