Random sampling versus random allocation/randomization- implications for p-value interpretation

I agree about the George Cobb paper linked above; he was a great communicator. He provides a “big picture” view of frequentist inference and explains how and why stats came to be taught the way it has been for so many years. I sure wish that someone had presented this perspective back in 1990 as I suffered through an eye-wateringly confusing undergraduate stats course…

A few excerpts that stood out for me:

“My thesis is that both the content and the structure of our introductory curriculum are shaped by old history. What we teach was developed a little at a time, for reasons that had a lot to do with the need to use available theory to handle problems that were essentially computational… Intellectually, we are asking our students to do the equivalent of working with one of those old 30-pound Burroughs electric calculators with the rows of little wheels that clicked and spun as they churned out sums of squares.

Now that we have computers, I think a large chunk of what is in the introductory statistics course should go… Our curriculum is needlessly complicated because we put the normal distribution, as an approximate sampling distribution for the mean, at the center of our curriculum, instead of putting the core logic of inference at the center…”

“… First, consider obfuscation. A huge chunk of the introductory course, at least a third, and often much more, is devoted to teaching the students sampling distributions…Why is this bad? It all depends on your goal. The sampling distribution is an important idea, as is the fact that the distribution of the mean converges to a normal. Both have their place in the curriculum. But if your goal is to teach the logic of inference in a first statistics course, the current treatment of these topics is an intellectual albatross…”

“…There’s a vital logical connection between randomized data production and inference, but it gets smothered by the heavy, sopping wet blanket of normal-based approximations to sampling distributions…”

“…We may be living in the early twenty-first century, but our curriculum is still preparing students for applied work typical of the first half of the twentieth century…”

“…What I have trouble getting down my persnickety throat is the use of the sampling model for data from randomized experiments. Do we want students to think that as long as there’s any sort of randomization, that’s all that matters? Do we want them to think that if you choose a random sample but don’t randomize the assignment of treatments, it is still valid to conclude that the treatment caused the observed differences? Do we want them to think that because the assignment of treatments was randomized, then they are entitled to regard the seven patients in the example as representative of all patients, regardless of age or whether their operation was a tonsillectomy or a liver transplant? Do we want students to leave their brains behind and pretend, as we ourselves apparently pretend, that choosing at random from a large normal population is a good model for randomly assigning treatments?..”

“… I’ve become convinced that a huge chunk of statistical theory was developed in order to compute things, or approximate things, that were otherwise out of reach. Until very recently, we had no choice but to rely on analytic methods. The computer has offered to free us and our students from that, but our curriculum is at best in the early stages of accepting the offer…”

“…We need a new curriculum, centered not on the normal distribution, but on the logic of inference. When Copernicus threw away the old notion that the earth was at the center of the universe, and replaced it with a system that put the sun at the center, his revolution brought to power a much simpler intellectual regime. We need to throw away the old notion that the normal approximation to a sampling distribution belongs at the center of our curriculum, and create a new curriculum whose center is the core logic of inference.

What is that core logic? I like to think of it as three Rs: randomize, repeat, reject. Randomize data production; repeat by simulation to see what’s typical and what’s not; reject any model that puts your data in its tail.

The three Rs of inference: randomize, repeat, reject

1. Randomize data production

• To protect against bias
• To provide a basis for inference

- random samples let you generalize to populations
- random assignment supports conclusions about cause and effect

2. Repeat by simulation to see what’s typical

• Randomized data production lets you re-randomize, over and over, to see which outcomes are typical, which are not.

3. Reject any model that puts your data in its tail…”
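The three Rs in the excerpt above can be sketched as a small re-randomization test. This is only an illustration under stated assumptions: the function name `rerandomization_p_value` is hypothetical, the seven outcome values are made up (they are not Cobb's data), and the difference in means is just one possible test statistic.

```python
import random


def rerandomization_p_value(treated, control, n_repeats=10_000, seed=0):
    """Two-sided re-randomization p-value for a difference in means.

    Hypothetical helper for illustration; not taken from Cobb's paper.
    """
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    n_treated = len(treated)

    def mean_difference(values):
        first, second = values[:n_treated], values[n_treated:]
        return sum(first) / len(first) - sum(second) / len(second)

    # The observed allocation is the first n_treated values of `pooled`.
    observed = mean_difference(pooled)

    # Repeat: re-randomize the treatment labels and recompute the statistic.
    at_least_as_extreme = 0
    for _ in range(n_repeats):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(mean_difference(shuffled)) >= abs(observed) - 1e-12:
            at_least_as_extreme += 1

    # Reject (or not): share of re-randomizations in the tail.
    # The +1 counts the observed allocation itself, so p is never exactly 0.
    p_value = (at_least_as_extreme + 1) / (n_repeats + 1)
    return observed, p_value


# Made-up outcomes for seven randomized patients (4 treated, 3 control).
treated = [19, 22, 25, 26]
control = [11, 14, 15]
observed, p_value = rerandomization_p_value(treated, control)
print(f"observed difference in means: {observed:.2f}")
print(f"re-randomization p-value: {p_value:.3f}")
```

With seven patients there are only 35 ways to assign four of them to treatment, so the allocations could also be enumerated exactly instead of simulated. Either way, the reference distribution comes from the random assignment itself, so the conclusion concerns these seven patients rather than a wider population they were never randomly sampled from.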


Brilliant. Thanks so much for posting this, which I had not seen before.

I agree that this is a good way to understand frequentist inference, but I think it doesn’t do justice to the Bayesian perspective that places conditioning on what is known as fundamental.

It is easy to generate useless procedures with frequency guarantees. Sensible use of frequentist methods relies on the likelihood principle in the background. For example:

I don’t know if it is possible, but I wish there was a course that took the perspective of Herman Chernoff – develop an initial Bayesian decision theoretic perspective, then develop frequentist procedures guided by those constraints.

Could you link the podcast?


The original podcast I was referring to in this thread was accidentally not saved. Rumor has it that this topic was so boring that everyone fell asleep and forgot to press “save” 🙂

We thus did a second take here (titled “Episode 207: Editorial on COSMIC-313”). Thankfully, we recorded this one. The topic is still the same, i.e., to explain some of the concepts described in my editorial here on this trial. But this time the discussion was much less technical, and I do not think I even mentioned random sampling vs random allocation anywhere.

However, this more recent video interview does go into more detail on this and other topics related to patient care and research. It was motivated by a real-life, extremely unusual case we treated that was popularized here. Additional discussion with the actual patient here.


Directly related and linking to this discussion is my just-published commentary at the Harvard Data Science Review on the role of sampling in medicine.

This is in response to the recent paper by Michael Bailey on strategies to improve political forecasting. The ideas he discusses (including in his forthcoming book) have implications across science, including medicine where we are similarly tasked to forecast outcomes at the population and individual levels based on best available data and tools.


Hope @Stephen will be able to look at Bailey’s paper.


Tremendously looking forward to his insights!

Hi Pavlos

Thanks, really interesting! The Bailey paper discusses how the “Meng equation” might be useful to improve the accuracy of election polling, accounting more thoroughly for the non-random nature of polls. Many people who answer calls from pollsters will simply hang up. A key problem is that the willingness of people to identify their political affiliation when contacted randomly by pollsters can be correlated with the opinion itself. If a higher proportion of people who hang up will end up voting for political party A versus party B, the poll will be biased.

Bailey notes that true “random sampling” (which occurs when most people who are contacted randomly agree to respond) is the ideal type of sampling for election polling. However, true random sampling virtually never occurs these days because so few people who are randomly contacted are willing to actually respond. Instead, the process that ends up occurring is better described as “random contact.”

Since medical RCTs also deal with “non-representative” samples (specifically, they use “convenience” samples), it’s reasonable to ask whether ideas to improve election polling accuracy could somehow be extrapolated to medical research.

A few questions/comments about your paper, the Bailey paper, and the Meng 2018 paper:

  1. From Box 1 in your article:

“Medical RCTs primarily focus on making comparative causal inferences applicable not to an external population but to all possible replications of the study with similar sampling patterns.”

Can you elaborate on the meaning of the bold phrase?

  2. Also from your paper:

“Bailey (2023) also shows the value of other strategies such as random contact, which produce nonrandom samples but have inferential advantages compared with standard convenience sampling.”

I didn’t see where Bailey discussed convenience sampling (i.e., the type of sampling used in RCTs) in his paper. Your phrasing above might not be implying that he did mention convenience sampling, but rather that you are the one who is contrasting convenience sampling with random contact (?)

In RCTs, we certainly don’t have random sampling (the ideal scenario in the world of election polling), but neither do we have random contact (?) For example, in order to collect a truly random sample of patients with colon cancer for inclusion in an RCT, we would need a “master list” of all the patients (?in the world) with colon cancer; this would be our overall “population.” We would then need to be able to randomly “pluck” patients off this list, assigning some to the intervention arm of our trial and others to the control arm. And in this ideal scenario, everybody who was “plucked” would agree to participate in the trial. As we know, this never occurs. In order for the random contact scenario to apply to RCTs, we’d still need the master list of cancer patients that we could use to perform random contact, offering the chance to participate in the trial. In this scenario, though, some might decline to participate; only those who agreed to participate would be enrolled. But again, if there’s no master list, there can be no possible random contact (?)

So where does this leave us with the Meng equation? Are you suggesting that the equation could be applied to the results of RCTs, in order to gauge the degree to which the trial’s result (i.e., the between-arm comparison) might be generalizable (?is this the correct term) to the overall population of patients with the disease in question? Or are you wondering if the result could be useful to justify “transporting” the RCT result to patients who would not have fulfilled inclusion criteria for the trial?

I’m very hazy on what the “R” is supposed to represent in the “data defect correlation” term of the Meng equation. Bailey described it as the “Response Mechanism.” The Meng 2018 paper (which looks like it requires a PhD in statistics to understand) describes it as follows:

“Here the letter “R”, which leads to the R-mechanism, is used to remind ourselves of many possible ways that a sample arrived at our desk or disk, most of which are not of a probabilistic sampling nature. For Random sampling, $R \equiv \{R_1, \ldots, R_N\}$ has a well-specified joint distribution, conditioning on the sample size $\sum_{j=1}^{N} R_j = n$. This is the case when we conduct probabilistic sampling and we are able to record all the intended data, typically unachievable in practice, other than with Monte Carlo simulations (see Section 5.3).

For many Big Data out there, however, they are either self-Reported or administratively Recorded, with no consideration for probabilistic sampling whatsoever. Even in cases where the data collector started with a probabilistic sampling design, as in many social science studies or governmental data projects, in the end we have only observations from those who choose to Respond, a process which again is outside of the probabilistic sampling framework. These “R-mechanisms” therefore are crucial in determining the accuracy of $G_n$ as an estimator of $G_N$; for simplicity, hereafter the phrase “recorded” or “recording” will be used to represent all such R-mechanisms.”

I cringe while saying this (because it’s probably completely off base), but I’m assuming that the process of convenience sampling would somehow have to be represented mathematically by the “R” term (??)

Finally, what would we use for the “N” term and the “n” term in the equation when discussing RCTs?

Apologies in advance if I’ve completely lost the plot here…


Great questions. Appreciate the opportunity to elaborate further. Briefly:

“Can you elaborate on the meaning of the bold phrase?”

This refers to the features of the hypothetical population that is being sampled, including the centers/locations participating in the RCT, the time period, the willingness of patients to consent, eligibility criteria (especially exclusion criteria), etc. This population will by definition be different from the population that the RCT inference will be used for (for example, your patients in clinic). Even if all else is equal, the time period will be different. This is a key argument in favor of concurrent controls in RCTs.

The data defect correlation in the Meng equation ($\rho_{R,Y}$) is useful when thinking about all non-probability samples, including convenience sampling. It is true that my commentary provides a distinct viewpoint that supplements the Bailey article by approaching these methodologies, originally developed for social science applications, from the perspective of experimentalists. The challenges in generalizing and transporting inferences apply to RCTs as much as in other fields.

“I’m assuming that the process of convenience sampling would somehow have to be represented mathematically by the “R” term”

R denotes whether a patient would be selected into the sample. In the Bailey paper, when R = 1, a person would be selected into the sample because they would be willing to respond to the poll. When R = 0, a person would not be willing to respond to the poll and thus, by definition, we cannot know the outcome Y of that person based only on the specific polling study. It will have to be supplemented by external information. When R correlates with Y, the data defect correlation $\rho_{R,Y}$ is different from zero and thus contributes to the error ($\bar{Y}_n - \bar{Y}_N$) between the sample mean and population mean.

Random sampling converts $\rho_{R,Y}$ to approximately zero, thus essentially nullifying $\bar{Y}_n - \bar{Y}_N$. But in non-probability sampling, including convenience sampling, $\rho_{R,Y}$ can be different from zero, thus increasing $\bar{Y}_n - \bar{Y}_N$. In polling, this can happen when individuals who refuse to answer a poll are more likely to vote for a specific response Y. In medicine, this can happen when individuals who either refuse or cannot enroll in an RCT are more likely to have a good or bad outcome Y.
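To make this concrete, here is a minimal sketch that checks the Meng (2018) identity numerically: error = ρ(R,Y) × sqrt((1 − f)/f) × σ(Y), with f = n/N. Everything specific in it is an assumption for illustration: the function name `data_defect_decomposition` is hypothetical, the population is simulated, and the recording probabilities (1% for everyone versus 1.2% and 0.8% depending on Y) are made up.

```python
import math
import random


def data_defect_decomposition(y, r):
    """Return (error, rho, dropout_odds_term, sigma_y) for a finite population.

    y: outcomes for all N units; r: 1 if the unit is recorded, else 0.
    Hypothetical helper; all moments are finite-population (divide by N).
    """
    N = len(y)
    n = sum(r)
    f = n / N  # sampling fraction
    mean_population = sum(y) / N
    mean_sample = sum(yi for yi, ri in zip(y, r) if ri) / n
    sigma_y = math.sqrt(sum((yi - mean_population) ** 2 for yi in y) / N)
    sigma_r = math.sqrt(f * (1 - f))
    covariance = sum((ri - f) * (yi - mean_population) for yi, ri in zip(y, r)) / N
    rho = covariance / (sigma_r * sigma_y)  # data defect correlation
    return mean_sample - mean_population, rho, math.sqrt((1 - f) / f), sigma_y


rng = random.Random(1)
N = 100_000
# Simulated binary outcome Y (e.g. votes for party A, or a good outcome).
y = [1 if rng.random() < 0.5 else 0 for _ in range(N)]

# Random recording: every unit has the same 1% chance, regardless of Y.
r_random = [1 if rng.random() < 0.01 else 0 for _ in range(N)]
# Selective recording: units with Y = 1 are more likely to be recorded.
r_selective = [1 if rng.random() < (0.012 if yi else 0.008) else 0 for yi in y]

results = {}
for label, r in [("random", r_random), ("selective", r_selective)]:
    error, rho, dropout, sigma_y = data_defect_decomposition(y, r)
    results[label] = (error, rho, dropout, sigma_y)
    print(f"{label}: error={error:+.4f}  rho={rho:+.5f}  "
          f"rho*sqrt((1-f)/f)*sigma={rho * dropout * sigma_y:+.4f}")
```

Because the identity is algebraic, the two printed quantities agree for any recording mechanism. What changes is the size of ρ(R,Y): even a correlation that looks tiny gets multiplied by sqrt((1 − f)/f), which is large when the recorded fraction is small.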

“Finally, what would we use for the “N” term and the “n” term in the equation when discussing RCTs?”

While n applies to the sampled population, I propose to choose N depending on the population we wish to make group-specific inferences for.


Thanks for those clarifications. I guess a key question is how we figure out the value of “R” when we’re talking about a convenience sample…


It will not be easy. Some believe it is impossible. But we have to at least try to attack this problem using our best available and emerging tools. I believe the data defect correlation can be one useful framework towards this goal.

My commentary had two key goals:

  1. To highlight that this is a challenge that is certainly not solved by providing standard group-specific estimates from RCTs. This is uncontroversial in the methodology world but not necessarily communicated enough in medical practice. In oncology, we have even reached the point of having high-profile “randomized non-comparative” trials being published. This is an oxymoron to methodologists, akin to conducting a randomized non-randomized trial, telling a true myth, or sharing an original copy.

  2. To suggest exploring the frameworks being developed around the data defect correlation towards the goal of generalizing and transporting RCT inferences.


As an additional update, our article on interpreting RCTs is now out after more than a year of writing. It addresses various topics often discussed in this forum, including sampling vs allocation, and is freely available to the community along with easy-to-use tools for analysis and visualization.


Pavlos et al…WOW!!! You did it!! Not only did you tackle the challenge set out in the above block quote, but you and your colleagues have somehow managed to distill, with incredible nuance, the entire Datamethods forum into a 45-page-long publication. Sometimes, enough compression produces a diamond. And the best part is that you have presented very tricky concepts in a language that will be eminently accessible to clinicians! You should all be very proud.

This paper makes me indescribably happy, partly because so many opaque concepts became much clearer in my mind, but also because I sense that it could really have a big impact on medical research practice and analysis. This is the way that statistical concepts need to be presented to those without much statistical background, in order to promote understanding. I would go so far as to say that this publication should be required reading for every medical student/resident/researcher.

Congratulations once again: a real magnum opus.


I wonder if this oxymoronic phrase could help understand the culture of Anomal Pharm, where an oddly non-statistical form of randomization seems to be advanced, in language such as:

“And so the purpose of randomization is not necessarily to establish that one dose of the drug is superior to another dose in a statistical way, or establish statistical non-inferiority between the two doses, but really to have trials that are sufficiently sized so that we can interpret these dose-efficacy and dose-toxicity relationships, and use that information to guide overall decision-making.”
— Mirat Shah

Perhaps a similar spirit is revealed in a paper by Iasonos & O’Quigley [1], where we are instructed that

The goal of randomization is to reduce biases due to known but uncontrollable factors such as prior treatment, age, gender and/or comorbidities.

  1. Iasonos A, O’Quigley J. Randomised Phase 1 clinical trials in oncology. Br J Cancer. 2021;125(7):920-926. doi:10.1038/s41416-021-01412-y

Good gosh. Would have been more accurate to say that the goal is to reduce bias due to unknown factors.


Yeah, in our recent comment on using randomization for dose optimization we emphasize that it should be done to compare doses. It does indeed appear that the misconception regarding what randomization does runs deep even within the methodology community.

Not sure if you’re just being diplomatic, Pavlos, but frankly all of this looks too willful to be anything so innocent as a pure ‘misconception’. The key difficulty for OCE would seem to be the 3rd of your 4 recommendations in the abstract of your comment: enroll an adequate sample size.

In my own open-access «ahem» article in the upcoming CPT: PSP special themed issue on Dose Optimization, I estimate a lower bound on the required sample sizes which I have good reason to believe OCE finds politically infeasible.

Norris DC. How large must a dose‐optimization trial be? CPT: Pharmacometrics & Systems Pharmacology. Published online September 4, 2023. doi:10.1002/psp4.13041 (See also this lay explainer, with an embedded Observable app.)


The abovementioned CPT:PSP issue also includes a Perspective from FDA CDER’s Office of Clinical Pharmacology (OCP). This piece de-emphasizes “randomized parallel dosage comparisons” (which were positively hyped by OCE’s Project Optimus), mentioning them just once in passing, as if an afterthought:

Controlled backfill, strategically planned expansion cohorts, and randomized parallel dosage comparisons can yield additional clinical data to increase confidence that dosages carried forward into registration trials have been optimized.

Conversely, the piece opens the door to dose individualization — in what I believe to be a true first for FDA:

Patient factors (i.e., age, race and ethnicity, and organ impairment), disease factors (i.e., genomic and molecular variation in the purported drug target), and their interactions can result in the need for individualized dosages at the patient- or subpopulation-level and may necessitate different trial designs and analytical approaches from those of the past.

I have posted a comment on this Perspective at PubPeer.


David, as a former member of the CIRB for early phase cancer trials, I have a special appreciation for your work and persistence on this initiative. My hope is that applying individualized titration study methods to identify the phase 2 starting dose will increase the participant’s prospect for benefit, decrease the risk of SAE – and by so doing improve study accrual across all studies.
