I fail to see how that follows from what Sander wrote. The quote from Ernst in the Permutation Methods paper (who was also a co-author of the initial paper I linked to) described the problem: parametric assumptions that require large sample properties to hold for our finite, often very small sample, but we don’t know how quickly we converge to the limit.
Philip Good, in his book Permutation, Parametric, and Bootstrap Tests of Hypotheses (3rd ed., pp. 153-154), gave a real-world example (categorical data examined via chi-square statistics) where the permutation p-values and the large-sample approximations differed by a factor of 10! The permutation test detected differences that the asymptotic approximations did not.
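To make the contrast concrete, here is a minimal sketch comparing the large-sample chi-square p-value with the exact permutation p-value for a small 2x2 table. The counts are hypothetical (they are not Good's data), the asymptotic version uses no continuity correction, and the function names are my own. The size and direction of the discrepancy depend on the table.

```python
from math import comb, erfc, sqrt

def chi2_stat(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    if 0 in (r1, r2, c1, c2):
        return 0.0
    return n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

def asymptotic_p(a, b, c, d):
    """Large-sample p-value: upper tail of a chi-square with 1 df."""
    return erfc(sqrt(chi2_stat(a, b, c, d) / 2))

def permutation_p(a, b, c, d):
    """Exact permutation p-value: all tables with the observed margins,
    weighted by their hypergeometric probability."""
    r1, r2, c1 = a + b, c + d, a + c
    observed = chi2_stat(a, b, c, d)
    total = comb(r1 + r2, c1)
    p = 0.0
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):  # x = top-left cell
        stat = chi2_stat(x, r1 - x, c1 - x, r2 - (c1 - x))
        if stat >= observed - 1e-12:
            p += comb(r1, x) * comb(r2, c1 - x) / total
    return p

# Hypothetical counts: rows = two groups, columns = outcome yes/no.
table = (7, 1, 2, 6)
p_asym = asymptotic_p(*table)
p_perm = permutation_p(*table)
print(f"asymptotic p = {p_asym:.4f}, permutation p = {p_perm:.4f}")
```

With only 16 observations the two answers differ noticeably, which is the finite-sample concern raised above: the chi-square reference distribution is a limit, and nothing in the data tells us how close we are to it.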
I think it is unfortunate that permutation methods are invariably perceived as tests; interval estimates and entire compatibility distributions (more commonly called “confidence” distributions) can be created from them. In the case of RCTs, they replace an unverifiable assumption with a design-based proposition that can be taken as correct by construction (if you accept the research report as described).
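As a rough illustration of getting an interval rather than a test, the sketch below inverts an exact randomization test over a grid of candidate effects. It assumes a constant additive treatment effect (a shift model), uses made-up outcomes for a tiny 5 vs 5 trial, and all names are hypothetical. The shifts that are not rejected at level alpha form the interval; the full curve of p-values over the grid is the compatibility distribution.

```python
from itertools import combinations

def perm_p_value(treated, control):
    """Exact two-sided randomization p-value for a difference in means."""
    pooled = treated + control
    k, n = len(treated), len(pooled)
    total = sum(pooled)

    def diff(sum_t):
        return sum_t / k - (total - sum_t) / (n - k)

    observed = abs(diff(sum(treated)))
    count = extreme = 0
    for idx in combinations(range(n), k):  # every possible allocation
        count += 1
        if abs(diff(sum(pooled[i] for i in idx))) >= observed - 1e-9:
            extreme += 1
    return extreme / count

def compatibility_interval(treated, control, grid, alpha=0.05):
    """Range of shifts delta that the randomization test does not reject."""
    kept = [d for d in grid
            if perm_p_value([y - d for y in treated], control) > alpha]
    return min(kept), max(kept)

# Hypothetical outcomes, for illustration only.
treated = [12.1, 14.3, 15.0, 16.2, 17.8]
control = [9.5, 10.2, 11.0, 12.4, 13.1]
grid = [i / 10 for i in range(-20, 101)]  # candidate effects from -2.0 to 10.0

p_null = perm_p_value(treated, control)
lo, hi = compatibility_interval(treated, control, grid)
print(f"p-value for no effect: {p_null:.4f}")
print(f"95% compatibility interval for the shift: ({lo}, {hi})")
```

Full enumeration is only feasible for small trials; for larger ones the same idea works with a random sample of re-allocations, at the cost of some Monte Carlo error.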
I see this as a separate argument in favor of randomization procedures - they behave well for finite sized data sets compared to asymptotic methods (which is not at all surprising), in situations when the former even exist. This is a different argument than the “match your analysis to your design” principle that has consumed much of the oxygen on this thread.
Nonetheless, the Berger et al quote fails on this line of reasoning as well. Consider the example of stratified risk ratio, mentioned earlier. There is no exact confidence interval for it that has been published (at least last I checked). Others have offered at least two solutions for the asymptotic inference including Greenland and Robins (1985) and Gart and Nam (1988). One cannot simply dismiss these as “without merit” when the competing randomization-based inference isn’t even known to exist. This gets back to my point about nihilism - let’s appreciate people who are actually trying to solve these problems, rather than just those who wave around the one-true-ideology.
Some of the promoters of permutation “methods” are themselves responsible for this fallacy. The seminal Edgington/Onghena book I cited is titled Randomization Tests. See also the title of the Philip Good book cited by @R_cubed. However, his point remains sound - as alluded to above, I’m familiar with this part of the literature, but failed to make the connection earlier on this thread, since much of the polemics deals with p-values, as we’ve seen. I stand corrected.
The George Cobb paper posted by @R_cubed is very intriguing. Cobb was right in calling out our profession for exactly what @ESMD noticed - failing to justify to students what I called the “bait and switch” between two inferential frameworks. That part of his critique remains sound today. However, it is only in retrospect (his paper appeared 10 years before Lock Morgan’s) that we can see that Cobb’s proposed solution need not be the only one. Indeed, the Lock family themselves (see their intro textbook I cited above) break free of the traditions Cobb was upset with by centering their text on both bootstrap and randomization inference concepts, not solely on randomization inference, as Cobb would have preferred. I do not know if the Lock^5 book directly addresses the justification for the “bait and switch”, but given Lock Morgan’s research on this very topic, I wouldn’t be surprised if it did. I don’t necessarily endorse the Lock^5 book (I’ve never actually inspected a copy) but present it as one possible alternative to Cobb’s proposal; I’m sure others could be imagined.
Finally, on a different thread @Pavlos_Msaouel reminded us that Piantadosi’s Clinical Trials (3d ed., Wiley), sec. 2.2.4, provides additional relevant discussion of the topic of this thread.
I agree about the George Cobb paper linked above- he was a great communicator. He provides a “big picture” view of frequentist inference and explains how and why stats came to be taught the way it has been for so many years. I sure wish that someone had presented this perspective back in 1990 as I suffered through an eye-wateringly confusing undergraduate stats course…
A few excerpts that stood out for me:
“My thesis is that both the content and the structure of our introductory curriculum are shaped by old history. What we teach was developed a little at a time, for reasons that had a lot to do with the need to use available theory to handle problems that were essentially computational… Intellectually, we are asking our students to do the equivalent of working with one of those old 30-pound Burroughs electric calculators with the rows of little wheels that clicked and spun as they churned out sums of squares.
Now that we have computers, I think a large chunk of what is in the introductory statistics course should go… Our curriculum is needlessly complicated because we put the normal distribution, as an approximate sampling distribution for the mean, at the center of our curriculum, instead of putting the core logic of inference at the center…”
“… First, consider obfuscation. A huge chunk of the introductory course, at least a third, and often much more, is devoted to teaching the students sampling distributions…Why is this bad? It all depends on your goal. The sampling distribution is an important idea, as is the fact that the distribution of the mean converges to a normal. Both have their place in the curriculum. But if your goal is to teach the logic of inference in a first statistics course, the current treatment of these topics is an intellectual albatross…”
“…There’s a vital logical connection between randomized data production and inference, but it gets smothered by the heavy, sopping wet blanket of normal-based approximations to sampling distributions…”
“…We may be living in the early twenty-first century, but our curriculum is still preparing students for applied work typical of the first half of the twentieth century…”
“…What I have trouble getting down my persnickety throat is the use of the sampling model for data from randomized experiments. Do we want students to think that as long as there’s any sort of randomization, that’s all that matters? Do we want them to think that if you choose a random sample but don’t randomize the assignment of treatments, it is still valid to conclude that the treatment caused the observed differences? Do we want them to think that because the assignment of treatments was randomized, then they are entitled to regard the seven patients in the example as representative of all patients, regardless of age or whether their operation was a tonsillectomy or a liver transplant? Do we want students to leave their brains behind and pretend, as we ourselves apparently pretend, that choosing at random from a large normal population is a good model for randomly assigning treatments?..”
“… I’ve become convinced that a huge chunk of statistical theory was developed in order to compute things, or approximate things, that were otherwise out of reach. Until very recently, we had no choice but to rely on analytic methods. The computer has offered to free us and our students from that, but our curriculum is at best in the early stages of accepting the offer…”
“…We need a new curriculum, centered not on the normal distribution, but on the logic of inference. When Copernicus threw away the old notion that the earth was at the center of the universe, and replaced it with a system that put the sun at the center, his revolution brought to power a much simpler intellectual regime. We need to throw away the old notion that the normal approximation to a sampling distribution belongs at the center of our curriculum, and create a new curriculum whose center is the core logic of inference.
What is that core logic? I like to think of it as three Rs: randomize, repeat, reject. Randomize data production; repeat by simulation to see what’s typical and what’s not; reject any model that puts your data in its tail.
The three Rs of inference: randomize, repeat, reject
1. Randomize data production
• To protect against bias
• To provide a basis for inference
- random samples let you generalize to populations
- random assignment supports conclusions about cause and effect
2. Repeat by simulation to see what’s typical
• Randomized data production lets you re-randomize, over and over, to see which outcomes are typical, which are not.
3. Reject any model that puts your data in its tail…”
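For anyone who wants to see the three Rs in action, here is a minimal sketch of a re-randomization test. The seven outcomes and the allocation are invented for illustration (they are not the data from Cobb's paper), and the function name is my own.

```python
import random

def rerandomization_p_value(outcomes, assignment, reps=10_000, seed=0):
    """Randomize, repeat, reject: Monte Carlo randomization test
    for a difference in means between two arms."""
    rng = random.Random(seed)

    def mean_diff(labels):
        t = [y for y, a in zip(outcomes, labels) if a == 1]
        c = [y for y, a in zip(outcomes, labels) if a == 0]
        return sum(t) / len(t) - sum(c) / len(c)

    observed = mean_diff(assignment)
    labels = list(assignment)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(labels)  # repeat the random allocation under "no effect"
        if abs(mean_diff(labels)) >= abs(observed) - 1e-12:
            extreme += 1
    # Count the observed allocation itself so the p-value is never zero.
    return observed, (extreme + 1) / (reps + 1)

# Hypothetical outcomes for seven patients; 1 = treatment, 0 = control.
outcomes = [14, 18, 21, 16, 27, 30, 25]
assignment = [0, 0, 0, 0, 1, 1, 1]

observed, p = rerandomization_p_value(outcomes, assignment)
print(f"observed difference = {observed:.2f}, p = {p:.4f}")
```

The reference distribution comes entirely from the allocation mechanism: no normal approximation and no random sampling from a population is invoked. By the same token, the result only speaks to these seven patients unless further assumptions are added.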
I agree that this is a good way to understand frequentist inference, but I think it doesn’t do justice to the Bayesian perspective that places conditioning on what is known as fundamental.
It is easy to generate useless procedures with frequency guarantees. Sensible use of frequentist methods relies on the likelihood principle in the background. For example:
I don’t know if it is possible, but I wish there was a course that took the perspective of Herman Chernoff – develop an initial Bayesian decision theoretic perspective, then develop frequentist procedures guided by those constraints.
The original podcast I was referring to in this thread was accidentally not saved. Rumor has it that this topic was so boring that everyone fell asleep and forgot to press “save”.
We thus did a second take here (titled “Episode 207: Editorial on COSMIC-313”). Thankfully, we recorded this one. The topic is still the same, i.e., explaining some of the concepts described in my editorial here on this trial. But this time the discussion was much less technical, and I do not think I even mentioned random sampling vs random allocation anywhere.
However, this more recent video interview does go into more details on this and other topics related to patient care and research. It was motivated by a real-life extremely unusual case we treated that was popularized here. Additional discussion with the actual patient here.
Directly related and linking to this discussion is my just published commentary at the Harvard Data Science Review on the role of sampling in medicine.
This is in response to the recent paper by Michael Bailey on strategies to improve political forecasting. The ideas he discusses (including in his forthcoming book) have implications across science, including medicine where we are similarly tasked to forecast outcomes at the population and individual levels based on best available data and tools.
Thanks- really interesting! The Bailey paper discusses how the “Meng equation” might be useful to improve the accuracy of election polling, accounting more thoroughly for the non-random nature of polls. Many people who answer calls from pollsters will simply hang up. A key problem is that the willingness of people to identify their political affiliation when contacted randomly by pollsters can be correlated with the opinion itself. If a higher proportion of people who hang up will end up voting for political party A versus party B, the poll will be biased.
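A quick back-of-the-envelope sketch of that last point, with made-up numbers: if supporters of party A and party B answer the phone at different rates, the expected share of A among respondents drifts away from the true share, no matter how randomly people were contacted. The function name and response rates below are illustrative assumptions, not figures from the Bailey paper.

```python
def expected_poll_share(true_share_a, response_rate_a, response_rate_b):
    """Expected share of respondents supporting party A when everyone is
    contacted at random but response rates differ by party."""
    responders_a = true_share_a * response_rate_a
    responders_b = (1 - true_share_a) * response_rate_b
    return responders_a / (responders_a + responders_b)

# True support for A is 50% in both scenarios.
equal = expected_poll_share(0.5, 0.10, 0.10)    # both sides respond at 10%
unequal = expected_poll_share(0.5, 0.05, 0.10)  # A supporters hang up more often

print(f"equal response rates:   expected poll share for A = {equal:.3f}")
print(f"unequal response rates: expected poll share for A = {unequal:.3f}")
```

Contacting more people does not fix the second scenario; it only shrinks the sampling noise around the wrong number.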
Bailey notes that true “random sampling” (which occurs when most people who are contacted randomly agree to respond) is the ideal type of sampling for election polling. However, true random sampling virtually never occurs these days because so few people who are randomly contacted are willing to actually respond. Instead, the process that ends up occurring is better described as “random contact.”
Since medical RCTs also deal with “non-representative” samples (specifically, they use “convenience” samples), it’s reasonable to ask whether ideas to improve election polling accuracy could somehow be extrapolated to medical research.
A few questions/comments about your paper, the Bailey paper, and the Meng 2018 paper:
From Box 1 in your article:
“Medical RCTs primarily focus on making comparative causal inferences applicable not to an external population but to all possible replications of the study with similar sampling patterns.”
Can you elaborate on the meaning of the bold phrase?
Also from your paper:
“Bailey (2023) also shows the value of other strategies such as random contact, which produce nonrandom samples but have inferential advantages compared with standard convenience sampling.”
I didn’t see where Bailey discussed convenience sampling (i.e., the type of sampling used in RCTs) in his paper. Your phrasing above might not be implying that he did mention convenience sampling, but rather that you are the one who is contrasting convenience sampling with random contact (?)
In RCTs, we certainly don’t have random sampling (the ideal scenario in the world of election polling), but neither do we have random contact (?) For example, in order to collect a truly random sample of patients with colon cancer, for inclusion in an RCT, we would need a “master list” of all the patients (?in the world) with colon cancer- this would be our overall “population.” We would then need to be able to randomly “pluck” patients off this list, assigning some to the intervention arm of our trial, and others to the control arm. And in this ideal scenario, everybody who was “plucked” would agree to participate in the trial. As we know, this never occurs. In order for the random contact scenario to apply to RCTs, we’d still need the master list of cancer patients that we could use to perform random contact, offering the chance to participate in the trial- in this scenario though, some might decline to participate; only those who agreed to participate would be enrolled. But again, if there’s no master list, there can be no possible random contact (?)
So where does this leave us with the Meng equation? Are you suggesting that the equation could be applied to the results of RCTs, in order to gauge the degree to which the trial’s result (i.e., the between-arm comparison) might be generalizable (?is this the correct term) to the overall population of patients with the disease in question? Or are you wondering if the result could be useful to justify “transporting,” the RCT result to patients who would not have fulfilled inclusion criteria for the trial?
I’m very hazy on what the “R” is supposed to represent in the “data defect correlation” term of the Meng equation. Bailey described it as the “Response Mechanism.” The Meng 2018 paper (which looks like it requires a PhD in statistics to understand) describes it as follows:
“Here the letter “R”, which leads to the R-mechanism, is used to remind ourselves of many possible ways that a sample arrived at our desk or disk, most of which are not of a probabilistic sampling nature. For Random sampling, $R \equiv \{R_1, \ldots, R_N\}$ has a well-specified joint distribution, conditioning on the sample size $\sum_{j=1}^{N} R_j = n$. This is the case when we conduct probabilistic sampling and we are able to record all the intended data, typically unachievable in practice, other than with Monte Carlo simulations (see Section 5.3).
For many Big Data out there, however, they are either self-Reported or administratively Recorded, with no consideration for probabilistic sampling whatsoever. Even in cases where the data collector started with a probabilistic sampling design, as in many social science studies or governmental data projects, in the end we have only observations from those who choose to Respond, a process which again is outside of the probabilistic sampling framework. These “R-mechanisms” therefore are crucial in determining the accuracy of $G_n$ as an estimator of $G_N$; for simplicity, hereafter the phrase “recorded” or “recording” will be used to represent all such R-mechanisms.”
I cringe while saying this (because it’s probably completely off base), but I’m assuming that the process of convenience sampling would somehow have to be represented mathematically by the “R” term (??)
Finally, what would we use for the “N” term and the “n” term in the equation when discussing RCTs?
Apologies in advance if I’ve completely lost the plot here…
Great questions. Appreciate the opportunity to elaborate further. Briefly:
> Can you elaborate on the meaning of the bold phrase?
This refers to the features of the hypothetical population that is being sampled, including the centers/locations participating in the RCT, the time period, the willingness of patients to consent, eligibility criteria (especially exclusion criteria) etc. This population will by definition be different than the population that the RCT inference will be used for (for example your patients in clinic). Even if all else is equal, the time period will be different. This is a key argument in favor of concurrent controls in RCTs.
The data defect correlation in the Meng equation ($\rho_{R,Y}$) is useful when thinking about all non-probability samples, including convenience sampling. It is true that my commentary provides a distinct viewpoint that supplements the Bailey article by approaching these methodologies, originally developed for social science applications, from the perspective of experimentalists. The challenges in generalizing and transporting inferences apply to RCTs as much as in other fields.
> I’m assuming that the process of convenience sampling would somehow have to be represented mathematically by the “R” term
R denotes whether a patient would be selected into the sample. In the Bailey paper, when R = 1 a person would be selected into the sample because they would be willing to respond to the poll. When R = 0 a person would not be willing to respond to the poll and thus, by definition, we cannot know the outcome Y of that person based only on the specific polling study. It will have to be supplemented by external information. When R correlates with Y, the data defect correlation $\rho_{R,Y}$ is nonzero and thus contributes to the error ($\bar{Y}_n - \bar{Y}_N$) between the sample mean and the population mean.
Random sampling drives $\rho_{R,Y}$ to approximately zero, thus essentially nullifying $\bar{Y}_n - \bar{Y}_N$. But in non-probability sampling, including convenience sampling, $\rho_{R,Y}$ can differ from zero, thus increasing $\bar{Y}_n - \bar{Y}_N$. In polling, this can happen when individuals who refuse to answer a poll are more likely to vote for a specific response Y. In medicine, this can happen when individuals who either refuse or cannot enroll in an RCT are more likely to have a good or bad outcome Y.
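For readers who want to see the pieces, the Meng (2018) identity can be written as $\bar{Y}_n - \bar{Y}_N = \rho_{R,Y} \times \sqrt{(N-n)/n} \times \sigma_Y$, where $\sigma_Y$ is the population standard deviation of Y. The sketch below checks this numerically on a simulated finite population. Everything in it is an assumption for illustration: the binary outcome, the enrollment probabilities, the population size, and the function name.

```python
import random
from math import sqrt

def meng_terms(y, r):
    """Return (error, rho, quantity, sigma) for outcomes y and selection flags r."""
    N, n = len(y), sum(r)
    f = n / N  # sampling fraction
    mean_pop = sum(y) / N
    mean_sample = sum(yi for yi, ri in zip(y, r) if ri) / n
    sigma = sqrt(sum((yi - mean_pop) ** 2 for yi in y) / N)
    cov = sum((ri - f) * (yi - mean_pop) for yi, ri in zip(y, r)) / N
    rho = cov / (sqrt(f * (1 - f)) * sigma)  # data defect correlation
    return mean_sample - mean_pop, rho, sqrt((N - n) / n), sigma

rng = random.Random(1)
N = 20_000
# Hypothetical population: Y = 1 for a good outcome, with 40% prevalence.
y = [1 if rng.random() < 0.4 else 0 for _ in range(N)]

# Selective enrollment: patients with good outcomes are twice as likely to enroll.
selective = [1 if rng.random() < (0.02 if yi else 0.01) else 0 for yi in y]

# Simple random sample of the same size, for comparison.
chosen = set(rng.sample(range(N), sum(selective)))
simple_random = [1 if j in chosen else 0 for j in range(N)]

for label, r in (("selective enrollment", selective),
                 ("simple random sample", simple_random)):
    error, rho, quantity, sigma = meng_terms(y, r)
    print(f"{label}: error = {error:+.4f}, rho = {rho:+.4f}, "
          f"rho * quantity * sigma = {rho * quantity * sigma:+.4f}")
```

The identity holds exactly for any selection pattern, so it cannot by itself tell us what $\rho_{R,Y}$ is in a given trial; its use is in showing how even a small correlation gets multiplied by $\sqrt{(N-n)/n}$ when the target population is much larger than the sample.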
> Finally, what would we use for the “N” term and the “n” term in the equation when discussing RCTs?
While n applies to the sampled population, I propose to choose N depending on the population we wish to make group-specific inferences for.
It will not be easy. Some believe it is impossible. But we have to at least try to attack this problem using our best available and emerging tools. I believe the data defect correlation can be one useful framework towards this goal.
My commentary had two key goals:
1. To highlight that this challenge is certainly not solved by providing standard group-specific estimates from RCTs. This is uncontroversial in the methodology world but not necessarily communicated enough in medical practice. In oncology, we have even reached the point where high-profile “randomized non-comparative” trials are being published. To methodologists this is an oxymoron, akin to conducting a randomized non-randomized trial, telling a true myth, or sharing an original copy.
2. To suggest exploring the frameworks being developed around the data defect correlation, with the goal of generalizing and transporting RCT inferences.
As an additional update, our article on interpreting RCTs is now out after more than a year of writing. It addresses various topics often discussed in this forum, including sampling vs allocation, and is freely available to the community along with easy to use tools for analysis and visualization.
Pavlos et al…WOW!!! You did it!! Not only did you tackle the challenge set out in the above block quote, but you and your colleagues have somehow managed to distill, with incredible nuance, the entire Datamethods forum into a 45-page publication. Sometimes, enough compression produces a diamond. And the best part is that you have presented very tricky concepts in language that will be eminently accessible to clinicians! You should all be very proud.
This paper makes me indescribably happy, partly because so many opaque concepts became much clearer in my mind, but also because I sense that it could really have a big impact on medical research practice and analysis. This is the way that statistical concepts need to be presented to those without much statistical background, in order to promote understanding. I would go so far as to say that this publication should be required reading for every medical student/resident/researcher.
I wonder if this oxymoronic phrase could help understand the culture of Anomal Pharm, where an oddly non-statistical form of randomization seems to be advanced, in language such as:
“And so the purpose of randomization is not necessarily to establish that one dose of the drug is superior to another dose in a statistical way, or establish statistical non-inferiority between the two doses, but really to have trials that are sufficiently sized so that we can interpret these dose-efficacy and dose-toxicity relationships, and use that information to guide overall decision-making.”
— Mirat Shah
Perhaps a similar spirit is revealed in a paper by Iasonos & O’Quigley [1], where we are instructed that
The goal of randomization is to reduce biases due to known but uncontrollable factors such as prior treatment, age, gender and/or comorbidities.
Yeah, in our recent comment on using randomization for dose optimization we emphasize that it should be done to compare doses. It does indeed appear that the misconception regarding what randomization does runs deep even within the methodology community.