If you could fix one topic in Modern Statistics, what would it be?

Imagine you could erase one misconception or one bad practice in modern statistics with a magic wand: Which one would you choose?

My answer is trivial: treating non-significant results as failures makes me doubt the entire research ecosystem :confused:

https://x.com/JohnHolbein1/status/1985838196647694845

4 Likes

Having everyone understand that \alpha is not the probability of being in error.

9 Likes

Much like yours, Uriah: Treating and reporting results with p > 0.05 (or whatever cutoff was chosen) as “supporting the null”, “showing no effect”, “demonstrating safety” [in studies of harms], and all the other nullistic interpretations of “statistically nonsignificant” results.

The earliest article I know warning against such misinterpretations is Pearson K (1906) Note on the significant or non-significant character of a subsample drawn from a sample. Biometrika 5:181-183. Despite such early warning, from what I can tell the misinterpretations only got worse and more diverse over the ensuing century, especially after NHST came to dominate research reporting.

7 Likes

For those looking for a further elaboration of this argument about \alpha, see this thread:

If you study closely these 3 papers:

  1. Evans, M. (2016). Measuring statistical evidence using relative belief. Computational and Structural Biotechnology Journal, 14, 91-96. (HTML)

  2. Bayarri, M. J., Benjamin, D. J., Berger, J. O., & Sellke, T. M. (2016). Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology, 72, 90-103. (HTML)

  3. Rafi, Z., & Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20(1), 244. (HTML)

you can see (with some simple algebra) that \alpha is a ratio: what Evans calls the relative belief ratio divided by frequentist power. Bayarri et al. make a similar argument. This makes \alpha a measure of discrepancy from a particular reference point (\theta = 0) prior to the data being seen. After the data, the p value is the observed discrepancy (or information), obtained by placing the sample estimate in relation to some quantile of an assumed distribution.
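To sketch that algebra in one line (my shorthand for the Bayarri et al. “rejection odds” identity, with odds quoted in favor of the alternative; not their exact notation): the posterior odds of H_1 given rejection are approximately the prior odds of H_1 times (1-\beta)/\alpha, so

\alpha \;\approx\; (1-\beta) \times \frac{\text{prior odds of } H_1}{\text{posterior odds of } H_1 \text{ given rejection}}

If the prior odds are instead quoted against H_1 (i.e., their reciprocal), this becomes the power divided by the product of the prior and posterior odds.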

It should not take 3 papers to discuss these relationships!

There was an extensive discussion on p values in this video that deserves review for those of us who were not formally trained in statistics. James Berger in the first half hour describes how to use Bayesian reasoning to justify your \alpha level, and how this improved replication in a certain area of genetic research.

More videos related to p values are found in this thread:

What I wish statistics courses made clear are the important relationships between Bayesian and Frequentist methods. I’d go as far as to say (based on Wald and Duanmu’s complete class theorems) that Frequentist methods are important special cases of Bayesian methods.

Likewise, Frequentist methods (via simulation) can aid us in checking the validity and impact of our assumptions before taking action based upon an analysis. Frequentists have figured out a number of computational methods to use when doing a direct Bayesian calculation is not easy or obvious.

For some food for thought, read a few of the papers that discuss stat foundations, which can be found here: Bayesian, Fiducial and Frequentist = BFF = Best Friend Forever

4 Likes

I don’t agree that frequentist statistics is a special case of Bayesian stats. Frequentist uses sampling distributions (Bayes doesn’t) and computes probabilities that are night and day different from what Bayes computes. Also, \alpha cannot possibly be a function of frequentist power, since power is a function of the effect size not to miss and \alpha is computed under the assumption that the effect size is exactly zero.

\alpha is most easily understood as a long-run relative frequency of a smoke alarm going off when there is no smoke. It is an assertion trigger probability, almost completely devoid of consideration of decision error probabilities (see Controlling α vs. Probability of a Decision Error – Biostatistics for Biomedical Research).
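To make the smoke-alarm reading concrete, here is a minimal simulation sketch (the sample size, test, and threshold below are arbitrary choices of mine, purely for illustration): generate many studies in which there truly is no smoke and count how often the alarm goes off.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05              # assertion threshold
n_sim, n = 100_000, 30    # simulated null studies; observations per study

triggers = 0
for _ in range(n_sim):
    y = rng.normal(loc=0.0, scale=1.0, size=n)      # no smoke: true mean is 0
    p = stats.ttest_1samp(y, popmean=0.0).pvalue    # one-sample t-test
    triggers += p < alpha                           # did the alarm go off?

print(f"long-run alarm rate with no smoke: {triggers / n_sim:.4f} (alpha = {alpha})")
# The rate hovers near 0.05: alpha is the long-run frequency of triggering an
# assertion when there is nothing to detect. By itself it says nothing about
# the probability that a given triggered assertion is a decision error.
```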

1 Like

A few responses:

Note: I corrected the text for an algebra error. Since power is bounded from above by 1, and \alpha is a quantile in the 0-1 range, frequentist power needs to be divided by the product of the prior and posterior odds. For James Berger’s discussion of prior odds and frequentist procedures, start watching here at 18:08.

If you follow along with Berger’s manipulations in the video, \alpha works out to the frequentist power multiplied by the reciprocal of the posterior odds, divided by the prior odds. After the data are collected and the statistics are computed, the observed estimate should be far in the tail of the reference model centered at \theta = 0.

Why not? Sure, you can create “frequentist” procedures that make no sense but meet the criteria that frequentist statisticians put forth. But I’m referring to reasonable frequentist procedures that are within the admissible class of procedures that might be considered.

If you consider Jose Bernardo’s definition of reference priors, you see that they are merely used to maximize the weight of the observed data in a proper posterior.

In the first part of the following video, Michael I. Jordan discusses the fundamental unity of these two perspectives. Are you saying he is fundamentally mistaken in the section at 13:22 - 19:30?

“The Bayes risk is gotten by either the frequentist path or the Bayesian path.” (19:00 - 20:00)

Berger’s discussion of the GWAS example in this video is incomplete. The setting of the threshold, on its own, did not solve the multiple testing problem. The GWAS community further decided that no paper could be published unless both a “discovery” and a “validation” (i.e., confirmatory) study had been completed successfully, with both presented within the same paper. This moved GWAS closer to the “learn and confirm” paradigm that has been used in phased clinical trials for decades.

5 Likes

Would you find it unfair to say that the Bayesian decision analysis conducted in the GWAS example (justifying a very low \alpha for individual tests) assisted that scientific community in directing resources more efficiently when conducting validation studies? That sounds like a fair description to me.
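To spell out the kind of calculation I have in mind (the prior odds and power below are illustrative placeholders of mine, not numbers from Berger’s talk): with a huge number of candidate SNPs and very few true associations, the posterior odds attached to a “hit” depend dramatically on \alpha.

```python
# Illustrative numbers only -- not taken from Berger's talk or any actual GWAS.
prior_odds = 1 / 100_000   # assumed prior odds that a given SNP is truly associated
power = 0.5                # assumed power of the discovery scan against a real signal

def posterior_odds(alpha):
    """Approximate posterior odds of a true association given p < alpha,
    using posterior odds ~ prior odds * power / alpha."""
    return prior_odds * power / alpha

for alpha in (0.05, 1e-5, 5e-8):
    post = posterior_odds(alpha)
    print(f"alpha = {alpha:g}: posterior odds ~ {post:g}  "
          f"(P(true association) ~ {post / (1 + post):.3f})")

# With alpha = 0.05 almost every 'hit' is a false positive; pushing alpha down
# to the genome-wide 5e-8 range brings the posterior odds to roughly 100:1,
# which is the sense in which the low threshold directs validation resources
# toward candidates that are actually worth confirming.
```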

I’m not so swayed by the fact that you can find bridges between the two schools of thought for certain simple special cases. I put all my emphasis on the philosophy and practice involved in the two schools. Since the probabilities computed are unrelated to each other in the sense that one views the other as using transposed conditionals, one is not a special case of the other. Their philosophies are fundamentally different. One has visible subjectivity and the other has mysterious hidden subjectivity.

This is an extremely inefficient design, but is useful if you don’t trust the investigators to do a rigorous overfitting-corrected (debiased) efficient internal validation. “External” validation has led to an amazing waste of time while we wait for invalidation that could have been detected much earlier with rigorous resampling approaches.

1 Like

Your point is not obvious to me. One difference between clinical trials and GWAS is that the latter is a screening procedure, with the “discovery” phase as the hypothesis generating step, and the “validation” phase the testing of those hypotheses. Clinical trials are usually preceded by extensive preclinical development. So the analogy I drew is not ideal in that respect.

More generally, I must not be understanding what you mean by external validation wasting time. Can you please give an example? Even a good internal validation does not obviate the need for shoe leather.

1 Like

Perhaps @f2harrell’s very baffling comment arose because my earlier comments were also incomplete. The GWAS community implemented numerous reforms years ago, including the use of standard experimental protocols, QC procedures, and detection and correction of batch and center effects, in addition to what I said earlier. A high-level overview is given here:

It seems to me all of these reforms worked together to improve the credibility of the field. I would like to learn of any examples, post-reform, of GWAS studies where the “validation stage” created “an amazing waste of time”.

2 Likes

I’m not qualified to answer; we should ask someone who actually works in GWAS.

In this context, I don’t see any alternative to replicating the initial experiment. But it is deceptively hard to do well.

If we look at a scientific investigation from the perspective of a communication protocol, error detection and correction strategies depend upon the assumed error rate. Gerard Holzmann wrote a classic text, Design and Validation of Computer Protocols, in which he discusses error detection in chapter 3.

An error detection method can only work by increasing redundancy of messages in some well defined way … Apart from detecting errors, the receiver must be able to correct the errors. There are two ways in which this can be done: 1. Forward error control, 2. Feedback error control.

Replication studies could be considered a form of feedback error control. He goes on to make an interesting point that should be kept in mind regarding the protocol for validating GWAS studies:

If p [the probability of error] approaches one, a retransmission scheme would be a bad choice; almost every message would be hit.

This is consistent with the premise of the following article. There are a number of sources of uncertainty that need to be accounted for in order for a GWAS report and its replication to be credible.

From the abstract (with bold portions my emphasis):

With the advent of genome-wide association (GWA) studies, researchers are hoping that reliable genetic association of common human complex diseases/traits can be detected. Currently, there is an increasing enthusiasm about GWA and a number of GWA studies have been published. In the field a common practice is that replication should be used as the gold standard to validate an association finding. In this article, based on empirical and theoretical data, we emphasize that replication of GWA findings can be quite difficult, and should not always be expected, even when true variants are identified. The probability of replication becomes smaller with the increasing number of independent GWA studies if the power of individual replication studies is less than 100% (which is usually the case), and even a finding that is replicated may not necessarily be true. We argue that the field may have unreasonably high expectations on success of replication. We also wish to raise the question whether it is sufficient or necessary to treat replication as the ultimate and gold standard for defining true variants. We finally discuss the usefulness of integrating evidence from multiple levels/sources such as genetic epidemiological studies (at the DNA level), gene expression studies (at the RNA level), proteomics (at the protein level), and follow-up molecular and cellular studies for eventual validation and illumination of the functional relevance of the genes uncovered
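A quick back-of-the-envelope version of that replication point (the power values below are assumptions of mine for illustration): if each independent replication attempt has power 1-\beta, a true variant survives k attempts in a row with probability (1-\beta)^k.

```python
# Illustrative only: chance that a *true* association is successfully replicated
# in every one of k independent follow-up studies with identical power.
for power in (0.9, 0.8, 0.5):
    probs = ", ".join(f"{power ** k:.2f}" for k in range(1, 6))
    print(f"power = {power}: P(k consecutive successful replications), k = 1..5: {probs}")

# Even at 80% power per study, three consecutive successful replications occur
# only about half the time for a genuinely true variant -- so treating any
# failed replication as disproof of the original finding is far too harsh.
```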

The holdout-sample form of validation, frequently mislabeled as “external”, brings in a huge source of unaccounted-for variation when the effective sample size is < 20,000 (for example): the luck of the split. This is an unstable process.

Even though GWAS is a screening procedure and not a risk prediction exercise per se, the use of split samples in this context has been a disaster: it has lulled investigators into feeling that false discovery rates are much more important than false negative rates, and many important genetic variants are being missed in the desire for parsimony. Resampling easily exposes the difficulty of the feature selection task, as discussed here.
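As a self-contained sketch of what such a resampling exercise looks like (everything below is simulated; the sample size, number of candidate features, and effect sizes are arbitrary assumptions): redo the one-at-a-time selection on bootstrap resamples and watch the “winner” list change.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, n_true = 300, 200, 10               # subjects, candidate features, true signals
beta = np.zeros(p); beta[:n_true] = 0.3   # weak true effects on the first 10 features
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)         # continuous phenotype

def winners(Xb, yb, alpha=0.05 / p):      # one-feature-at-a-time, Bonferroni-style
    pvals = [stats.pearsonr(Xb[:, j], yb)[1] for j in range(p)]  # [1] is the p-value
    return set(np.flatnonzero(np.array(pvals) < alpha))

selections = []
for _ in range(20):                        # repeat the selection on bootstrap resamples
    idx = rng.integers(0, n, size=n)
    selections.append(winners(X[idx], y[idx]))

print("selected-set sizes across resamples:", [len(s) for s in selections])
print("features selected in every resample:", sorted(set.intersection(*selections)))
# The winner list shifts from resample to resample and rarely contains all ten
# true signals at once -- the selection is far less stable than a single
# split-sample analysis would suggest.
```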

I have challenged GWAS investigators for almost 20 years to do an exercise in which they use a state-of-the-art statistical method on the non-selected variants to see if the “losers” have more information about phenotypes than the forced-parsimonious list of winners. No one has taken up this challenge.

GWAS has led to trivialization of the genome and failure to achieve decent predictive discrimination by adopting a screening approach with a naive one-feature-at-a-time testing procedure.

In this context, I understood “external validation” as actually collecting a true independent sample. I understand your complaint to be about split-sample statistical approaches and incorrectly calling them “external.”

Aside from that, I think there is a lot of insight in your criticisms. I’m especially suspicious of their use of the Bonferroni correction, which is the most conservative adjustment available. Perhaps it can be defended for use in the initial discovery study, but I’d expect true replication studies to use more powerful adjustment methods.
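As an illustration of that difference (synthetic p-values only, nothing from an actual GWAS; the statsmodels call is just one convenient implementation), compare Bonferroni with a less conservative false-discovery-rate adjustment such as Benjamini-Hochberg:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)
m_null, m_alt = 950, 50
p_null = rng.uniform(size=m_null)        # p-values for tests with no real effect
p_alt = rng.beta(0.2, 30, size=m_alt)    # stand-in for tests with a real signal
pvals = np.concatenate([p_null, p_alt])
truth = np.concatenate([np.zeros(m_null, bool), np.ones(m_alt, bool)])

for method in ("bonferroni", "fdr_bh"):  # family-wise control vs. Benjamini-Hochberg
    reject = multipletests(pvals, alpha=0.05, method=method)[0]
    print(f"{method:10s}: {int((reject & truth).sum()):2d} true discoveries, "
          f"{int((reject & ~truth).sum())} false discoveries")

# Bonferroni guards against even one false positive and typically recovers fewer
# of the 50 real signals; BH controls the expected proportion of false discoveries
# and usually flags noticeably more of them -- a trade-off worth revisiting when a
# confirmatory stage is going to follow anyway.
```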

This leads me to a bigger question: why do poor methods remain so persistent in practice?

1 Like

Poor methods are persistent because journals have already set a precedent by allowing papers with incredibly poor methods to be published (they have incompetent methodological peer review), and because poor methods are easier to use than high-performance methods.

You’re right that some published research actually did truly external validation, but many studies just split one stream of patients. “External” validation is not even a great idea, though, because such validations are frequently done in other health care settings, other geographic regions, or other time spans. Failure to validate may represent nothing more than a need for time trends or geographic location to be included in a comprehensive model.

3 Likes

@f2harrell I see, thank you for the clarification. I agree that several of the points you raise are sound. The original point I was making is simply that Berger’s account is incomplete, ignores many, many nuances, and is potentially misleading to an uninformed viewer. It is good to see some of these nuances brought forth on this thread. The main point for me is that improving statistical methodology is not sufficient: there must be changes in the quality of measurement processes, experimental design and execution, and (as Xihong Lin’s above-linked article emphasizes) major changes in “incentives and culture”.

@R_cubed It is my understanding that the “validation” step is not a literal replication of the original study; rather, it is a confirmatory study that tests the hypotheses generated during the discovery step. (For example, @f2harrell said they don’t go back and look at the “losers” from the discovery step; instead the “validation” step focuses on the “winners” to see if they really are “winners”.) I was very careful not to use the term “replication” in my posts above.

2 Likes

I understood that to be true external validation – i.e., using a new sample. I presume that is what you meant as well, or perhaps I misunderstood.

My intuition was similar to yours – true external validation is necessary (using a new sample). But Frank’s points are also important. It is wasteful to validate a procedure that has inherently low power, as explained in the PLOS article I linked to above.

I’ve slowly discovered that a very large number of research areas could benefit from such a change, if the goal were honest inquiry, which I have come to seriously doubt.

1 Like

No, that’s not what I meant. The “validation” step also has a different study design than the “discovery” step, because it’s addressing a different question, as I alluded to with the “winners” and “losers” above. The concept somewhat resembles the one described by Mogil & Macleod, 2017: No publication without confirmation | Nature

I’m very sorry that the article is paywalled.

1 Like

I get that. My question is – is the information for validation coming from a new sample, or is it a re-analysis of the original data? I assumed new data would be collected, and only the markers that generated a signal in the initial study would be tested again.

I was able to find a short segment of the paper you referred to here:

No publication without confirmation