I thought what you meant by this was a form of sensitivity analysis, to show to what extent an estimate of power or sample size could be affected by different estimates of the MCID. I was unable to find your calculation in the link that you provided to confirm this, so I tried to do the same using my BP example to see if this is what you meant in the quote. Clearly I had misunderstood. Sorry.
The BP example seemed to be different, but maybe I missed something. You can think of this as a sensitivity analysis, or better still as a replacement for one that doesn't have the subjectivity of how you are influenced by a sensitivity analysis (pick the worst case? the median case?).
I have been trying to unpick the source of my misunderstanding. I am more familiar with the concept of asking individual patients about what outcome(s) they fear from a diagnosis (e.g. premature death within y years). The severity of the disease postulated by the diagnosis has an important bearing on the probabilities of these outcomes of course. I therefore consider estimates of the probability of the outcome conditional on disease severity with and without treatment (e.g., see Figures 1 and 2 in https://discourse.datamethods.org/t/risk-based-treatment-and-the-validity-of-scales-of-effect/6649?u=huwllewelyn ).
I then discuss at what probability difference the patient would accept the treatment. Initially this would be in the absence of cost and adverse effects, which would be discussed later, perhaps in an informal decision analysis. If the patient's choice was 0.22 - 0.1 = 0.12 (e.g. at a score level of 100 in Figures 1 and 2 above), then this difference could be regarded as the minimum clinically important probability difference (MCIpD) for that particular patient. The corresponding score of 100 would be regarded as the minimum clinically important difference (MCID) in the diagnostic test result (e.g. BP) or multivariate score.
There will be a range of MCIpDs and corresponding MCIDs for different patients, making up a distribution of probabilities and scores with upper and lower 2 SD limits of the score on the X axis on which the probabilities are conditioned. The lower 2 SD limit could be regarded as the upper end of a reference range that replaces the current 'normal' range. This lower 2 SD limit could be chosen as the MCID for a population with the diagnosis for RCT planning. For the sake of argument I used such an (unsubstantiated and imaginary) BP difference from zero as an example MCID in my sensitivity analysis. I am aware that there are many different ways of choosing MCIDs of course.
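As a small illustration of that last step, here is a minimal Python sketch, with entirely invented patient-elicited scores, of taking the lower 2 SD point of the distribution of individual MCIDs as the population-level MCID for trial planning:

```python
import numpy as np

# Hypothetical patient-elicited minimum clinically important scores
# (e.g. BP differences), invented purely for illustration.
patient_mcids = np.array([8, 10, 12, 9, 11, 14, 10, 13, 9, 12], dtype=float)

# Lower 2 SD point of the distribution, taken as the population MCID for planning.
population_mcid = patient_mcids.mean() - 2 * patient_mcids.std(ddof=1)
print(round(population_mcid, 1))
```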
In my 'power calculations for replication' I estimate subjectively what I think the probability distribution of a study would be by estimating the BP difference and SD (without considering an MCID). I then calculate the sample size needed to give a chosen power of replication in the second replicating study. If this estimate were a huge and unrealistic number I might reconsider the RCT design or not do it! The sample size should be triple the conventional Frequentist estimate for the first study. Once some interim results of the first study become known, these can be used to estimate the probability of replication in the second study by using the difference and SD observed so far in that first study and applying twice its variance. Some stopping rule can be applied based on the probability of replication, as suggested in the paper flagged by @R_cubed (Power Calculations for Replication Studies, projecteuclid.org). The original estimated prior distribution could be combined in a Bayesian manner with the result of the first study to estimate the mean and CI of a posterior distribution. However, if I did the same when estimating the probability of replication in the second study, I might over-estimate it. I would be grateful for advice about this.
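To make the "twice its variance" step concrete, here is a minimal Python sketch of a predictive probability of replication, assuming a flat prior, a known SD and Gaussian approximations; the function name and the illustrative numbers are mine, not taken from the projecteuclid paper:

```python
import numpy as np
from scipy.stats import norm

def prob_replication(d_obs, sd, n1, n2, alpha=0.05):
    """Predictive probability that a second study of size n2 reaches two-sided
    P <= alpha, given an observed difference d_obs (with SD sd) from a first
    study of size n1, a flat prior and Gaussian approximations: the posterior
    variance of the true mean and the second study's sampling variance add."""
    sem1 = sd / np.sqrt(n1)                      # SEM of the first (interim) result
    sem2 = sd / np.sqrt(n2)                      # SEM of the planned second study
    threshold = norm.ppf(1 - alpha / 2) * sem2   # difference needed for P <= alpha
    pred_sd = np.sqrt(sem1**2 + sem2**2)         # convolution of the two distributions
    return 1 - norm.cdf(threshold, loc=d_obs, scale=pred_sd)

# Illustrative use with made-up interim values (not from any real trial):
print(prob_replication(d_obs=5.0, sd=10.0, n1=100, n2=200))
```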
I will offer an example of the principles discussed in my previous post that outlines a difficult problem faced by primary care physicians in the UK. There is a debate taking place about the feasibility of providing the weight-reducing drug Mounjaro (tirzepatide) on the NHS. People with a BMI of 30 or above and no existing complications of obesity were recruited into an RCT [1]. The average BMI of those in the trial was 38. On a Mounjaro dose of 5 mg weekly there was a 15% BMI reduction on average over 72 weeks; on a dose of 15 mg, there was a 21% BMI reduction. The primary care physicians in the UK are concerned about the numbers of patients that would meet this criterion of a BMI of at least 30 and that their demand for treatment might overwhelm the NHS for questionable gain.
The decision of patients to accept treatment might depend on the beneficial cosmetic effect of weight reduction. It would be surprising if the NHS could support Mounjaro's use for this purpose alone. However, it could support a reduction in the risk of the various complications of obesity that might reduce quality or duration of life and potential for employment. This information is not available directly, as the BMI was used as a surrogate for it. The black line in Figure 1 is a personal 'Bayesian' estimate (pending availability of updating data) of the probabilities, conditional on BMI, of at least one complication of obesity occurring within 10 years in a 50-year-old man with no diabetes and no existing complication attributable to obesity. Figure 1 is based on a logistic regression model.
The blue line in Figure 1 shows the effect on the above probabilities of Mounjaro 5 mg injections weekly for 72 weeks reducing the BMI by an average of about 6 at each point on the curve (i.e. 15% at a BMI of 38), as found in the trial. This dose shifts the blue line by a BMI of 6 to the right for all points on the curve. The red line shows the effect on these probabilities of Mounjaro 15 mg reducing the BMI by 8 at each point on the curve (i.e. by 21% at the average BMI of 38, as found in the trial). Shifting the curves by a constant distance at each point gives the same result as applying the odds ratios for the two doses at a BMI of 38 to each point on the placebo curve.
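The equivalence claimed in the last sentence holds whenever the model's logit is linear in BMI: shifting the curve sideways by a fixed amount is the same as applying one fixed odds ratio at every BMI. A small Python sketch with hypothetical coefficients (not the fitted model behind Figure 1) illustrates this:

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical logistic model for the 10-year risk of >= 1 obesity complication
# as a function of baseline BMI (coefficients invented for illustration only).
a, b = -8.0, 0.18
bmi = np.arange(30, 51)

p_placebo = expit(a + b * bmi)           # black line: no treatment
p_treated = expit(a + b * (bmi - 6))     # blue line: curve shifted 6 BMI units

# The same treated curve obtained by applying a constant odds ratio instead.
odds_ratio = np.exp(-b * 6)              # OR implied by a 6-unit BMI reduction
odds = p_placebo / (1 - p_placebo)
p_treated_or = odds * odds_ratio / (1 + odds * odds_ratio)

print(np.allclose(p_treated, p_treated_or))   # True: the two constructions agree
```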
Figure 1
Figure 2
Figure 2 shows the expected risk reduction on Mounjaro 5 and 15 mg weekly at each baseline BMI. The greatest point risk reduction, 0.18, is at a BMI of 38. At a BMI of 30, the risk reduction is 0.03; at a BMI of 35, it is about 0.12. The dotted black lines in Figures 1 and 2 indicate an estimated 'Bayesian' probability distribution (pending updating data) of BMI in the population. Moving the threshold for treatment from a BMI of 30 to 35 would substantially reduce the population treated. There will be stochastic variation about these points of course.
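With the same hypothetical coefficients as in the sketch above, the absolute risk reduction implied by a constant 6-unit BMI shift can be read off at any baseline BMI. The numbers will not match Figure 2 (whose fitted model I do not have), but the qualitative pattern of a small reduction at the lower end of the curve and a larger one nearer its middle is the same:

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

a, b = -8.0, 0.18          # same invented coefficients as in the previous sketch
for bmi in (30, 35, 38):
    p0 = expit(a + b * bmi)          # risk with no treatment at this baseline BMI
    p1 = expit(a + b * (bmi - 6))    # risk after a 6-unit BMI reduction (5 mg dose)
    print(bmi, round(p0 - p1, 3))    # absolute risk reduction at this baseline BMI
```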
Curves such as those in Figures 1 and 2 would have to be developed for each complication of obesity. If a decision to take Mounjaro is shared using a formal decision analysis, the probability of each complication conditional on the individual patient's BMI, and its utility, have to be considered, as well as the demands of weekly injections, possibly for life. In the USA, this would also involve the cost of medication and medical supervision. The decision analysis would have to compare the expected utilities of Mounjaro, lifestyle modification and no intervention at all.
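For a single complication, the structure of that expected-utility comparison might look like the following sketch, in which every probability and utility is invented purely to show the arithmetic, not to represent real values:

```python
# Expected-utility comparison for one complication and three options.
# All probabilities and utilities below are invented for illustration only;
# a real analysis would elicit them per patient and per complication.
options = {
    # option: (10-year probability of the complication,
    #          utility if it occurs,
    #          utility if it does not occur, net of treatment burden and cost)
    "no intervention":        (0.40, 0.30, 1.00),
    "lifestyle modification": (0.35, 0.30, 0.98),
    "tirzepatide":            (0.25, 0.30, 0.95),
}

for name, (p, u_event, u_no_event) in options.items():
    expected_utility = p * u_event + (1 - p) * u_no_event
    print(f"{name}: {expected_utility:.3f}")
```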
Is this a fair representation of the difficult problem faced by primary care physicians in the UK when trying to interpret the result of the Mounjaro RCT?
1. Jastreboff AM, et al. Tirzepatide Once Weekly for the Treatment of Obesity. N Engl J Med 2022;387:205-216.
Regarding this topic, I have just posted the reply below to @Stephen on Deborah Mayo's Error Statistics blog. I would be grateful for comments.
Thank you, Stephen, for making time to reply. I think that we agree on the points that you make. I will summarise the thinking in my paper (https://arxiv.org/pdf/2403.16906) to save you time for when you (or anyone else) can read more:
My understanding of the replication crisis is that
A. As you say, basing the frequency of replication on one second study of the same size gives a pessimistic result; the estimated frequency should be based on a second study of infinite sample size to simulate a 'long view' (see the first paragraph below).
B. My reasoning leads me to suspect that the current method of estimating required sample sizes overestimates power and underestimates the sample size required (by half), thus compounding the replication crisis (see the 2nd paragraph below).
C. An underestimate of sample size, and regarding a single identical study as a basis for replication, may explain the replication crisis (see the 3rd paragraph below).
If a real first study result was based on 205 observations and its distribution happened to have a mean / delta of 1.96 and an SD of 10, then, assuming a uniform prior, the Gaussian probability distribution of the possible true means conditional on these data has a mean also of 1.96 and an SEM of 10/√205 = 0.698. If we consider all the possible likelihood distributions of the observed results of the second replicating study conditional on each possible true value, this represents the convolution of two Gaussian distributions. Therefore, the resultant probability distribution of the observed values expected in the second replicating study has a variance of 0.698² + 0.698² = 0.975 and an SD of √0.975 = 0.988. Based on these calculations, the probability of getting P ≤ 0.05 again in the second replicating study is 0.5, as suggested by Goodman. However, if the sample size of the second study is infinitely large, as suggested by you, Stephen, to simulate a 'long view', then the SEM of the second replicating study is zero, the overall variance is 0 + 0.698² = 0.488, and the SEM is 0.698, giving a probability of 0.8 of replication with P ≤ 0.05 again.
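In symbols, the convolution step above, assuming a flat prior and a known SD $\sigma$ (my notation, restating rather than adding to the calculation), is

$$\bar{x}_2 \mid \bar{x}_1 \sim N\!\left(\bar{x}_1,\; \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}\right),$$

so for a second study of the same size ($n_2 = n_1$) the predictive SD is $\sqrt{2}\,\sigma/\sqrt{n_1}$ (0.988 in the example), and as $n_2 \to \infty$ it falls back to the single SEM $\sigma/\sqrt{n_1}$ (0.698).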
Conventional sample size calculations use a point estimate of the true mean / delta (e.g. of 1.96) and an estimate of an SD (e.g. of 10) of the likelihood distribution of the imagined data in the 'thought experiment' on which the delta and SD were based. We can then estimate the sample size needed to get the necessary SEM to achieve a specified P value for the thought experiment data and the real study. However, instead of a likelihood distribution of the thought experiment data conditional on a single point estimate of a true mean of 1.96, we regard the above estimated distribution as a Gaussian probability distribution of the possible true means conditional on the thought experiment data. Then, the SEM of the data distribution in the real study, after a convolution of two Gaussian distributions, will be √2 times the above SEM (e.g. 0.698 × √2 = 0.988 as in the preceding paragraph), giving a probability of 0.5 of getting P ≤ 0.05 in the real study of the same size as the 'thought experiment' (the same as replication for a study of the same size in the preceding paragraph). Therefore, to expect P ≤ 0.05 in the real study with a probability of 0.8, we need a sample size of approximately twice 205 = 410. This sample size will then provide a correct probability of 0.8 of also getting P ≤ 0.05 in the real study based on two variances. If a replicating study is theoretically of infinite sample size, as suggested by you, Stephen, the probability of replication will also be 0.8. In this situation, there is no apparent replication crisis, provided that the power calculation is based on two variances.
The real study must have the same sample size as that estimated for the 'thought experiment'. This suggests that a conventional sample size estimate of 205 will be underpowered, providing only a probability of about 0.5 of getting P ≤ 0.05 in the real study. The probability of also getting P ≤ 0.05 in the real study's replicating study, conditional on the thought experiment and an underestimated sample size of 205, will depend on 3 variances and will be approximately 0.367, which is what was found in the Open Science Collaboration study. This suggests that the currently conventional estimates of sample size (e.g. giving a sample size of 205 as in the first paragraph) will cause real studies to be underpowered (e.g. at about 50% instead of the desired 80%), and the frequency of replication of a single study of the same size will be alarmingly, but incorrectly, low (e.g. about 0.367). This underestimate of sample size, and regarding a single identical study as the basis for replication, may explain the replication crisis.
This avoids your excellent questions, but I feel that the use of null hypotheses is at the center of the replication crisis. A Bayesian sequential design like this one IMHO avoids much of the problem by demanding that 3 criteria be met:
- high probability of effect > 0
- moderately high probability of effect > trivial
- sufficient precision in estimating the treatment effect
Bayesian sequential designs also invite us to think about study continuation rather than study replication.
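A minimal sketch of how such criteria might be checked against a normal approximation to the posterior at an interim look; the posterior values, the 'trivial' threshold and the probability cut-offs below are all illustrative assumptions, not those of any published design:

```python
from scipy.stats import norm

# Hypothetical normal posterior for the treatment effect at an interim look
# (mean and SD invented for illustration only).
post_mean, post_sd = 0.8, 0.35
trivial_effect = 0.2        # assumed smallest effect worth having
precision_target = 0.4      # assumed required posterior SD

p_gt_zero    = 1 - norm.cdf(0.0, post_mean, post_sd)             # P(effect > 0)
p_gt_trivial = 1 - norm.cdf(trivial_effect, post_mean, post_sd)  # P(effect > trivial)

meets_criteria = (p_gt_zero > 0.975 and        # high probability of any benefit
                  p_gt_trivial > 0.80 and      # moderately high probability of non-trivial benefit
                  post_sd < precision_target)  # sufficient precision
print(round(p_gt_zero, 3), round(p_gt_trivial, 3), meets_criteria)
```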
Thank you, Frank.
The probability of getting P ≤ 0.05 or P ≤ 0.025 based on the null hypothesis threshold can of course be replaced in my reasoning by a probability of > 0.975 of the true result being less extreme than the null hypothesis, or a probability of > 0.95 that the true result will lie between a lower limit less extreme than the null hypothesis and an upper limit 2 × 1.96 SEMs less extreme than the null hypothesis (i.e. corresponding to a 95% confidence interval). I can also replace the null hypothesis by some other, less trivial, threshold that takes account of clinical significance in the form of a greater required effect in the real study that is about to take place.
When the real result of a study has been observed, we can estimate the probability of replication with the same sample size for our chosen threshold without simply doubling the variance. The alternative calculation would be based on the convolution of the 'Bayesian-like' first thought experiment distribution used in planning with the second, observed distribution. The sample sizes would be the same in both distributions, but the resulting mean of the final distribution should be different from that obtained by doubling the variance. It would also be possible to apply Bayes' rule to combine the thought experiment distribution with the real result's distribution, doubling the number of observations in the 'real' result and halving the variance of the second distribution to be combined with the initial thought experiment in the convolution calculation.
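A small sketch of the precision-weighted combination mentioned in the last sentence (standard conjugate normal updating; the numbers are placeholders, not results from any study):

```python
import numpy as np

def combine_normals(mean1, sd1, mean2, sd2):
    """Combine two independent normal sources of information about the same
    mean by precision weighting (conjugate normal updating)."""
    w1, w2 = 1.0 / sd1**2, 1.0 / sd2**2
    combined_mean = (w1 * mean1 + w2 * mean2) / (w1 + w2)
    combined_sd = np.sqrt(1.0 / (w1 + w2))
    return combined_mean, combined_sd

# Placeholder values: the planning ("thought experiment") distribution and the
# distribution based on the observed result of the real study.
print(combine_normals(mean1=1.96, sd1=0.698, mean2=1.50, sd2=0.698))
```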
Perhaps all the above could be used as part of a Bayesian sequential analysis too. I have not read the Bayesian Clinical Trial Design paper in detail but does it implicitly apply double the sample size obtained from a conventional Frequentist calculation in the way that I have done?
The more I read through this thread, the less clear it is to me what is being conditioned on (i.e. treated as fixed) and what is being allowed to vary.
If by 'replicating' a study you mean a future estimate is 'close' (with 'close' being undefined for now) to a previously reported estimate, that would require some way of down-weighting the observed result, because treating a sample estimate as the true parameter almost certainly overestimates the confidence we should have in our estimate of a treatment effect. This is easy to do in a Bayesian framework, but less clear (to me at least) how to do so in a frequentist sense.
If by 'replicate' you mean 'obtain p < 0.05 and the estimates have the same sign', you face a similar problem. For the sake of argument, we will ignore that. Conditioning on the observed estimate (i.e. treating the estimate as the true parameter value), we should expect at least half of our future studies to fail to achieve the same p-value (although they will likely have the same sign relative to $N(0,1)$), since the one-tailed p-value of the MLE is 0.5 (i.e. the 50th percentile) in the $N(\theta, 1)$ scenario, where $\theta \ne 0$.
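A quick simulation of that point, treating an observed z-statistic as the true parameter (the observed value is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
z_obs = 2.3   # observed z-statistic from the original study (illustrative)

# Condition on the estimate, i.e. treat z_obs as the true parameter, and draw
# the z-statistics of many same-sized future studies: z_new ~ N(z_obs, 1).
z_new = rng.normal(loc=z_obs, scale=1.0, size=100_000)

# About half of the future studies fall short of the original z (and hence of
# its p-value), because the MLE sits at the 50th percentile of N(z_obs, 1).
print((z_new < z_obs).mean())
```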
The fundamental problem is granting the p-value excess importance. It is the estimate that is a sufficient statistic, not the p-value, which is directly related to sample size and can change from sample to sample. It seems strange to define 'replication' in terms of the realization of a uniformly distributed random quantity.