Yes. But instead of b_repl, I am interested in the possible outcomes of all studies with the same s but sample size n2 (which can be, and in replication studies usually is, the same as n1, but can be very different, e.g. theoretically infinite), conditional on all the possible values of beta. How does your b_repl relate to beta?
@HuwLlewelyn: It looks like you are trying to derive the formula for a prediction interval. Study the following and tell me if this is helpful.
Just to be sure, do we agree on the following set-up?
We start with beta which is the unknown true effect. We have two studies; the "original" and the "replication" that both target beta. Let's assume they have the same sample size n and standard deviation s. Suppose they yield estimates b and b_repl which are unbiased with the same standard error se=s/sqrt(n). In other words, conditionally on beta, b and b_repl are independent, normally distributed with mean beta and standard deviation se.
You asked:
How does your b_repl relate to beta?
As I just wrote, conditionally on beta, b and b_repl are independent, normally distributed with mean beta and standard deviation se.
Now, in an actual field of research (such as the clinical trials in the Cochrane database) there is an association between the true effects (beta's) and the standard errors (se's). This is due to the practice of sample size calculations. For that reason, I prefer to divide everything by se and work with SNR=beta/se, z=b/se and z_repl=b_repl/se instead of beta, b and b_repl. If you find this confusing, you can also assume that se is always 1. Is that OK with you?
I'm not sure where you are going with this, but se = 1 whenever n = s^2.
Yes in principle, but this discussion so far focuses on the interval z_repln > 1.96.
I'm not sure where you are going with this
We'll get there!
We start with beta which is the unknown true effect. We have two studies; the "original" and the "replication" that both target beta. Let's assume they have the same sample size n and standard deviation s. Suppose they yield estimates b and b_repl which are unbiased with the same standard error se=s/sqrt(n). In other words, conditionally on beta, b and b_repl are independent, normally distributed with mean beta and standard deviation se.
Now define SNR=beta/se, z=b/se and z_repl=b_repl/se. It follows that conditionally on SNR, z and z_repl are independent, normally distributed with mean SNR and standard deviation 1.
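A quick way to see this set-up in numbers is to simulate it. The sketch below is only an illustration: the function name `simulate_pair` and the values beta = 0.2, s = 1, n = 100 are assumptions, not taken from the thread. For a fixed beta it draws many (b, b_repl) pairs, rescales them by se, and reports the mean and standard deviation of z and z_repl together with their covariance.

```python
import random
from statistics import fmean, pstdev

def simulate_pair(beta, s, n, draws, seed=1):
    """Draw (z, z_repl) pairs for one fixed true effect beta."""
    rng = random.Random(seed)
    se = s / n ** 0.5
    # Conditionally on beta, b and b_repl are independent N(beta, se).
    z = [rng.gauss(beta, se) / se for _ in range(draws)]
    z_repl = [rng.gauss(beta, se) / se for _ in range(draws)]
    return se, z, z_repl

# Hypothetical values chosen for illustration only.
se, z, z_repl = simulate_pair(beta=0.2, s=1.0, n=100, draws=50_000)
snr = 0.2 / se

mz, mr = fmean(z), fmean(z_repl)
cov = fmean((a - mz) * (b - mr) for a, b in zip(z, z_repl))

print("SNR:", snr)
print("mean, sd of z:     ", round(mz, 3), round(pstdev(z), 3))
print("mean, sd of z_repl:", round(mr, 3), round(pstdev(z_repl), 3))
print("covariance of z and z_repl:", round(cov, 3))
```

Both means should land close to the SNR, both standard deviations close to 1, and the covariance close to 0, up to simulation noise.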
All agreed?
OK. So what are the distributions of b and b_repl conditional on beta?
I've said a few times: Conditionally on beta, b and b_repl are independent, normally distributed with mean beta and standard deviation se.
Also, conditionally on SNR, z and z_repl are independent, normally distributed with mean SNR and standard deviation 1.
What's not clear?
I thought that you were about to give some example values, e.g. b=2, se = 1, SNR, etc., as you suggested using se = 1 in your numerical examples. Please carry on with numerical examples that make it easier for me to understand the differences from my approach.
We now agree on the following set-up: We start with beta which is the unknown true effect. We have two studies; the "original" and the "replication" that both target beta. Let's assume they have the same sample size n and standard deviation s. Suppose they yield estimates b and b_repl which are unbiased with the same standard error se=s/sqrt(n). In other words, conditionally on beta, b and b_repl are independent, normally distributed with mean beta and standard deviation se. Now define SNR=beta/se, z=b/se and z_repl=b_repl/se. It follows that conditionally on SNR, z and z_repl are independent, normally distributed with mean SNR and standard deviation 1.
We are interested in the conditional probability of a statistically significant replication, given the result of the first study. So, that's P(z_repl > 1.96 | b,se).
If I understand correctly, you claim that if we assume the (improper) uniform (or "flat") prior for beta, then the conditional distribution of z_repl given b and se is normal with mean z/sqrt(2) and standard deviation 1? In other words:
z_repl | b,se ~ N(z/sqrt(2),1).
Note that this conditional distribution depends on b and se only through z.
Check: If I use R to calculate P(z_repl>1.96 | z) according to your formula for a few values of z, then I get your numerical results:
z=c(0.67,1.04,1.64,1.96,2.17,2.58,2.81,3.29)
1 - pnorm(1.96,z/sqrt(2),1)
0.07 0.11 0.21 0.28 0.34 0.45 0.51 0.64
So, is this indeed what you claim?
No. For z=1.96 that expression, if I read correctly as 1-pnorm(1.96-1.96)/sqrt(2), 1), gives 0.5, not 0.28 (sorry. I inserted 2 instead of 1.96 the first time to get 0.5113).
1 - pnorm(1.96,1.99/sqrt(2),1) = 0.29.
or, if you prefer,
1 - pnorm(1.96 -1.99/sqrt(2),0,1) = 0.29.
Make sure you mind the brackets!
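To make the bracket issue concrete, here is a small sketch in Python rather than R. The helper `pnorm` is an assumption written for this example so that the calls read like the R ones above (argument order q, mean, sd). It evaluates the two equivalent forms just given, plus the version with the bracket moved so that the whole difference is divided by sqrt(2).

```python
from math import sqrt
from statistics import NormalDist

def pnorm(q, mean=0.0, sd=1.0):
    """Normal CDF with the same argument order as R's pnorm(q, mean, sd)."""
    return NormalDist(mean, sd).cdf(q)

z = 1.99

# Mean passed as the second argument.
a = 1 - pnorm(1.96, z / sqrt(2), 1)
# Mean subtracted by hand from the quantile: the same quantity as `a`.
b = 1 - pnorm(1.96 - z / sqrt(2), 0, 1)
# Bracket moved: (1.96 - z) is divided by sqrt(2). A different quantity.
c = 1 - pnorm((1.96 - z) / sqrt(2))

print(round(a, 2), round(b, 2), round(c, 3))  # 0.29 0.29 0.508
```

The first two values agree because pnorm(q, mean, 1) equals pnorm(q - mean, 0, 1); the third differs only because of where the division by sqrt(2) is applied.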
But 1-pnorm(1.96-1.96)/sqrt(2)) = 0.5 and 1-pnorm(1.96-1.99)/sqrt(2) = 0.508 and 1-pnorm(1.96-2)/sqrt(2) = 0.511. However, pnorm(1.96/sqrt(2) - 1.96) = 0.283
But 1-pnorm(1.96-1.96/sqrt(2)) = 0.5
No,
1-pnorm(1.96-1.96/sqrt(2)) = pnorm(1.96/sqrt(2) - 1.96) = 0.28
This is silly.
Yes it is what I claim. However, what I am used to seeing is p(z_repln>1.96|z) = pnorm((1.96-z)/sqrt(2)). So I'm sorry that I missed the subtle difference where you replaced pnorm((1.96-z)/sqrt(2)) with 1-pnorm(1.96-z/sqrt(2)). As you say, this gives the same result as my version of pnorm(1.96/sqrt(2) - 1.96), which is a simplified version of p(z_repln>1.96 | b, s, n1, n2) = pnorm(b/sqrt(n1) +s/sqrt(n))-1.96). Your modification and the results in the quote are the same as those shown in my Table 1 earlier (which appeared to surprise you by corresponding to the result of your 2022 paper with Goodman when I first showed it). So it seems that at last we are in agreement with what I had postulated in my pre-print.
My main expression (in notation this time again for clarity) is
p(z_repln>1.96 | b, s, n1, n2) = Φ(b/√(s/√n1+s/√n2)-1.96).
This allows the sample size of the planned replicating study to be varied in a "what if" way. By inserting a very large number as n2, we can test @Stephen's reply in 2002 to Stephen Goodman's paper of 1992. It also implies that in addition to @Stephen's suggestion about postulating an infinitely large n2, the replication crisis could be resolved by doubling the sample sizes suggested by current power calculations.
PS. This might stop the FDA from asking for two positive trial results.
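As a rough numerical sketch of this "what if" use, the snippet below varies n2 while holding b, s and n1 fixed. It rests on an assumption: the expression is read as b divided by the square root of the summed variances s^2/n1 + s^2/n2, which reduces to b/(se*sqrt(2)) when n1 = n2. The function name `p_repl_significant` and the input values (b = 0.196, s = 1, n1 = 100, so that b/se = 1.96) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def p_repl_significant(b, s, n1, n2, crit=1.96):
    """Phi(b / sqrt(s^2/n1 + s^2/n2) - crit), under the assumed reading."""
    return NormalDist().cdf(b / sqrt(s**2 / n1 + s**2 / n2) - crit)

# Hypothetical first study: se = s/sqrt(n1) = 0.1, so z = b/se = 1.96.
b, s, n1 = 0.196, 1.0, 100

for n2 in (100, 200, 400, 10**9):
    print(n2, round(p_repl_significant(b, s, n1, n2), 3))
```

With n2 = n1 the value is about 0.28, and with a very large n2 it approaches 0.5, because the second variance term vanishes and the argument becomes b/se - 1.96.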
Yes it is what I claim.
Excellent! We're making progress! We already agreed on the following set-up:
We start with beta which is the unknown true effect. We have two studies; the "original" and the "replication" that both target beta. Let's assume they have the same sample size n and standard deviation s. Suppose they yield estimates b and b_repl which are unbiased with the same standard error se=s/sqrt(n). In other words, conditionally on beta, b and b_repl are independent, normally distributed with mean beta and standard deviation se. Now define SNR=beta/se, z=b/se and z_repl=b_repl/se. It follows that conditionally on SNR, z and z_repl are independent, normally distributed with mean SNR and standard deviation 1. We are interested in the conditional probability of a statistically significant replication, given the result of the first study. So, that's P(z_repl > 1.96 | b,se).
Within this set-up, we have now also established Llewelyn's claim:
If we assume the (improper) uniform (or "flat") prior for beta, then
z_repl | b,se ~ N(z/sqrt(2),1).
Now, there are several reasons why I don't agree with this claim. The first reason is relatively minor. As I explained, the standard error se has information about beta due to the practice of sample size calculations. To account for this, we would need a joint prior for beta and se. However, we can finesse this difficulty by dividing everything by se. So, we arrive at Llewelyn's modified claim:
If we assume the (improper) uniform (or "flat") prior for SNR, then
z_repl | z ~ N(z/sqrt(2),1).
Is this still a fair representation of your claim?
I need to reflect on this during the day and to understand better where you are going with it. Is this expression derived from Goodman's 1-pnorm((z*-z)/sqrt(2)) or the latest 1-pnorm(z*-z/sqrt(2))?
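For reference, the two expressions named in this question can be tabulated side by side. The sketch below takes z* = 1.96 and uses the z values from the R check earlier in the thread; the function names `goodman_form` and `latest_form` are labels invented for this example.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf   # standard normal CDF, like R's pnorm(x)
Z_STAR = 1.96

def goodman_form(z):
    """1 - pnorm((z* - z)/sqrt(2)): the whole difference is divided by sqrt(2)."""
    return 1 - Phi((Z_STAR - z) / sqrt(2))

def latest_form(z):
    """1 - pnorm(z* - z/sqrt(2)): only z is divided by sqrt(2)."""
    return 1 - Phi(Z_STAR - z / sqrt(2))

for z in (0.67, 1.04, 1.64, 1.96, 2.17, 2.58, 2.81, 3.29):
    print(z, round(goodman_form(z), 2), round(latest_form(z), 2))
```

At z = 1.96 the first form gives 0.5 and the second gives 0.28; the second column reproduces the values 0.07, 0.11, 0.21, 0.28, 0.34, 0.45, 0.51, 0.64 quoted above.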
Llewelyn's modified claim: If we assume the (improper) uniform (or "flat") prior for SNR, then
z_repl | z ~ N(z/sqrt(2),1).
This would imply
P(z_repl > 1.96 | z) = 1 - pnorm(1.96 - z/sqrt(2),0,1)
and in particular
P(z_repl > 1.96 | z=1.96) = 1 - pnorm(1.96 - 1.96/sqrt(2),0,1) = 0.28.
Sorry for the delay in responding. I have been reflecting in much detail about the difference between Goodman's expression and the one used by me and trying to tease out the various implications.
Firstly, the simplified version of my expression is not
P(z_repl > 1.96 | z) = 1 - pnorm(1.96 - z/sqrt(2)) when z = b/se
but my simplified version is
P(z_repl > 1.96 | b, se) = 1 - pnorm(1.96 - b/(se*sqrt(2)))
or preferably
P(z_repl > 1.96 | b, se) = pnorm(b/(se*sqrt(2)) - 1.96)
So in my case, z = b/(se*sqrt(2)) not z = b/se
Secondly, why do you state "modified claim"?
Firstly, the simplified version of my expression is not
P(z_repl > 1.96 | z) = 1 - pnorm(1.96 - z/sqrt(2))
Strange that you should say that, because the formula does give your numerical results:
z=c(0.67,1.04,1.64,1.96,2.17,2.58,2.81,3.29)
1 - pnorm(1.96 - z/sqrt(2),0,1)
0.07 0.11 0.21 0.28 0.34 0.45 0.51 0.64
In particular, the formula gives P(z_repl > 1.96 | z=1.96) = 0.28 as you have repeatedly claimed.
However, you are now introducing a different formula:
P(z_repl > 1.96 | z) = 1 - pnorm(1.96 - b/sqrt(v1+v2),sqrt(2)) when v1=v2.
For example, suppose b=1.96, v1=1 and v2=1. Then the z-statistic of the first study is z=b/sqrt(v1)=1.96, but your new formula does not yield 0.28.
Secondly, why do you state "modified claim"?
Please read my comment 136.