The more I read through this thread, the less clear it is to me what is being conditioned upon (i.e., treated as fixed) and what is being allowed to vary.
If by “replicating” a study you mean that a future estimate is “close” (with “close” left undefined for now) to a previously reported estimate, that would require some way of down-weighting the observed result, because treating a sample estimate as the true parameter almost certainly overstates the confidence (in a frequentist sense) we should have in our estimate of a treatment effect. Down-weighting is easy to do in a Bayesian framework, but it is less clear (to me at least) how to do so in a frequentist sense without a lot of data.
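To make "down-weighting" concrete, here is a minimal sketch of one way it could look in a Bayesian framework: a conjugate normal-normal update that pulls the reported estimate toward a prior mean. The function name `shrink_estimate` and the prior (mean 0, standard deviation 1) are my own assumptions for illustration, not something proposed in this thread.

```python
def shrink_estimate(estimate, std_error, prior_mean=0.0, prior_sd=1.0):
    """Normal-normal conjugate update (hypothetical helper).

    Assumes estimate ~ N(theta, std_error^2) and a prior
    theta ~ N(prior_mean, prior_sd^2); both are illustrative choices.
    Returns the posterior mean and posterior standard deviation.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    # Precision-weighted average of the prior mean and the observed estimate.
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * estimate)
    return post_mean, post_var ** 0.5


# Example: a reported estimate of 2.5 with standard error 1,
# under the assumed N(0, 1) prior.
post_mean, post_sd = shrink_estimate(2.5, 1.0)
print(post_mean, post_sd)
```

With equal prior and data precision the estimate is pulled halfway toward the prior mean. How much shrinkage is appropriate depends entirely on the prior, which is the part that would need to be argued for in any real application.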
See this thread for a closely related discussion:
If by “replicate” you mean “obtain p < 0.05 and the estimates have the same sign”, you face a similar problem. For the sake of argument, we will ignore that. Conditioning on the observed estimate (i.e., treating the estimate as the true parameter value), we should expect at least half of our future studies to fail to achieve a p-value as small as the original, although they will likely have the same sign. The reason is that, in the N(\theta,1) scenario with non-centrality (or shift) parameter \theta \ne 0, the MLE sits at the 50th percentile of the sampling distribution of future estimates: its one-tailed p-value relative to N(\theta,1) is 0.5, even though its p-value relative to N(0,1) may be small.
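The argument above can be checked with a small simulation. This is only a sketch under stated assumptions: the observed z-scale estimate `theta_hat = 2.5` is a made-up value, it is assumed positive, it is treated as the true \theta, and future estimates are drawn from N(\theta,1). The function name `replication_rates` is hypothetical.

```python
import random
from statistics import NormalDist


def replication_rates(theta_hat, n_sims=100_000, seed=0):
    """Simulate future z-estimates from N(theta_hat, 1), theta_hat > 0.

    Treats the observed estimate as the true parameter (the conditioning
    described above) and reports three long-run fractions.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()  # reference distribution N(0, 1)
    # One-tailed p-value of the original estimate relative to N(0, 1).
    p_original = 1.0 - std_normal.cdf(theta_hat)

    smaller_p = same_sign = significant = 0
    for _ in range(n_sims):
        z = rng.gauss(theta_hat, 1.0)        # a future estimate
        p_future = 1.0 - std_normal.cdf(z)   # its one-tailed p-value
        smaller_p += p_future <= p_original  # at least as extreme as original
        same_sign += z > 0                   # same sign as theta_hat
        # One-tailed p < 0.025 matches two-tailed p < 0.05 with positive sign.
        significant += p_future < 0.025

    return {
        "p_at_least_as_small": smaller_p / n_sims,
        "same_sign": same_sign / n_sims,
        "significant_same_sign": significant / n_sims,
    }


rates = replication_rates(theta_hat=2.5)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the first fraction should settle near 0.5 regardless of `theta_hat`, the second near \Phi(\hat\theta), and the third near \Phi(\hat\theta - 1.96), where \Phi is the standard normal CDF. So even with the estimate taken at face value, a future study matches or beats the original p-value only about half the time.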
The fundamental problem is granting the p-value excess importance. It is the estimate that is a sufficient statistic, not the p-value. The p-value is directly related to sample size and can change from sample to sample under repeated sampling even if the sample size is kept constant. It seems strange to define “replication” in terms of the realization of a random quantity that is uniformly distributed under the null.