What is type I error?

Good questions, Drew. Come to think of it, I don’t like multiplicity corrections in general, or anything that derives from p-values. And is false discovery rate even the correct terminology? I’m forgetting my statistical history now; I can’t remember whether it attempts to estimate the proportion of true non-null effects or the proportion of non-null assertions. At any rate, it’s not really a rate but rather a proportion or probability.

Regulator’s regret is an interesting term that regulators have for too long assumed means type I error. But in fact, and apropos of the original posting above, it is really the condition of approving a drug that doesn’t work (there is also the opposite regret of missing a good drug). The probability of regulator’s regret is the probability that the treatment has no effect or a harmful effect, so it’s not type I error.

For large-scale problems my biggest concern with FDR is that it doesn’t actually work. It lulls researchers into a false sense of security, makes them miss real effects, and fails to recognize that the feature selection method being used has no chance of finding the “right” features.

1 Like

This paper could be helpful:

One of the very few that does not mix up Fisher’s significance tests (where there is no type-I error defined!) and Neyman’s acceptance tests (where you necessarily have two different types of errors [I _and_ II]).

Perezgonzalez made a nice statement in a note there (bold emphasis mine):

“As H0 is always true (i.e., it shows the theoretical random distribution of frequencies under certain parameters), it cannot, at the same time, be false nor falsifiable a posteriori. Basically, if at any point you say that H0 is false, then you are also invalidating the whole test and its results. Furthermore, because H0 is always true, it cannot be proved, either.”

I always argue that H0 is always false, because in reality any infinitely small difference from H0 means that H0 is false. But this is actually not the point here. The test assumes H0. This is at no time referring to something real. The p-value remains a statistic of the data assuming H0. We thus cannot wrongly reject H0. We may or may not reject H0. There is no “correct-or-false” property associated with this. From the perspective of the model (which assumes H0), a rejection is “false” by definition, and from the perspective of reality, a rejection is “correct” by definition (except, maybe, for some carefully constructed or theoretical cases).

4 Likes

Frank, thanks for starting this great discussion. As a physician, I wonder whether it would be fair to say that your point is similar to the following diagnostic problem:

A diagnostic test is often performed on a person suspected of the disease, rather than on a random person. So imagine a neurologist who typically orders an MRI to confirm a serious autoimmune brain disease that is already suggested by appropriate signs, symptoms, and a positive CT scan. Previous research has demonstrated that one in twenty disease-negative patients will have a false positive result. Despite this 5% probability, no experienced neurologist would expect 1 in 20 of the patients for whom she orders an MRI to have a false positive result (although admittedly, the neurologist might not be able to articulate it that explicitly). Her patients are far too disease-suspect for that. A positive MRI hardly adds any information in this situation; it is the occasional negative MRI that she is after. Again, it might not be easy for an experienced clinician to articulate it this way, but in my experience they still “know” it nonetheless.

Is my comparison (more or less) correct?

I’m not sure, but I think so. I do find discussions of “1 in 20” and false positives always confusing, though. I stick to actionable posterior probabilities that are predictive in nature, as required by decision makers. So my suggestion is to acquire a well-calibrated risk prediction tool for that clinical scenario and to use the estimated risks it produces, without labeling anything negative or positive.

Thanks, Frank! My main intention was to create a verbal, clinical analogy so I would be better able to explain your issue with alpha to my physician colleagues. Sorry that my intention wasn’t clear.

Of course it would be better to discuss it in proper probabilities. However, trying to transform an entire clinical department into a Bayesian reasoning machine is (at least for me) not going to succeed overnight. Verbal analogies help prime our clinical brains for future truly Bayesian reasoning.

I’ll try to sharpen my analogy a bit more, but it is somewhat reassuring that you did not immediately shoot a hole in it.

2 Likes

That does not need any Bayesian interpretation; it is only a matter of the reference population. The statement that the MRI gives “1 in 20 false positives” refers to the population of people who do not have the disease. But when, in clinical practice, MRIs are done only in people already showing several signs of the disease, the reference population is one with a considerable proportion of diseased people. Running the test on such a population will give you fewer than “1 in 20 false positives”. This can all be explained with a purely frequentist interpretation of probability.
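To make the reference-population point concrete, here is a minimal counting sketch in Python. The sensitivity, specificity, and the two prevalences are invented numbers for illustration, not figures from the MRI example above.

```python
# Minimal counting sketch: same test, two reference populations.
# Sensitivity, specificity, and prevalences are assumed values for illustration.

def positive_counts(n, prevalence, sensitivity=0.90, specificity=0.95):
    """Return (false positives, true positives) among n people tested."""
    diseased = n * prevalence
    healthy = n - diseased
    true_pos = diseased * sensitivity
    false_pos = healthy * (1 - specificity)   # the "1 in 20" applies only here
    return false_pos, true_pos

for label, prev in [("random population", 0.01), ("disease-suspect patients", 0.80)]:
    fp, tp = positive_counts(10_000, prev)
    print(f"{label}: {fp:.0f} false vs {tp:.0f} true positives "
          f"-> {fp / (fp + tp):.1%} of positive MRIs are false")
```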

1 Like

Yes, but the difference between the two reference populations is part of Bayes’ theorem: a false positive MRI in non-diseased people is the likelihood P(test = 1 | D = 0); a false positive MRI in people suspected of the disease refers to the posterior probability P(D = 0 | test = 1) in patients with a high prior probability due to signs, symptoms, and CT.
Hence I was referring to Bayesian reasoning, which is very compatible with frequentist statistics. I was not referring to Bayesian statistics.
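The same comparison can be pushed through Bayes’ theorem directly. A small sketch in Python, where the sensitivity of 0.90, specificity of 0.95, and the two prior probabilities (standing in for a disease-suspect patient and a random person) are assumptions for illustration:

```python
# P(D=0 | test=1) from the likelihood P(test=1 | D=0) and a prior P(D=1).
# Sensitivity, specificity, and the priors are assumed values for illustration.
def posterior_not_diseased(prior_d1, sensitivity=0.90, specificity=0.95):
    p_pos_given_d0 = 1 - specificity          # P(test=1 | D=0), the "1 in 20"
    p_pos_given_d1 = sensitivity              # P(test=1 | D=1)
    p_pos = p_pos_given_d1 * prior_d1 + p_pos_given_d0 * (1 - prior_d1)
    return p_pos_given_d0 * (1 - prior_d1) / p_pos

for prior in (0.80, 0.01):                    # disease-suspect vs. random person
    print(f"P(D=1) = {prior:.2f}  ->  P(D=0 | test=1) = "
          f"{posterior_not_diseased(prior):.3f}")
```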

Yes, that’s Bayes’ theorem. But applying Bayes’ theorem (to go from one population to another) is not Bayesian reasoning. It’s still about frequencies of events, not about the probability of a particular subject being diseased.

I think that in practice most clinicians are Bayesians “at heart”, as they think in terms of “what is the probability that this patient has the disease (given this and that)?” rather than “what is the probability of a (randomly sampled) patient having the disease (given this and that)?” (the latter being a frequentist-like question). As long as the probability statements refer to samples from a population, the philosophical background remains frequentist. Only when probabilities are assigned to features that cannot be subject to sampling is the thinking Bayesian.

2 Likes

On the other hand, the false positive probability is

P(Δ=0 | assert Δ≠0)

This false positive probability can be arbitrarily different from the type I error, and can only be obtained from a Bayesian argument.
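A hedged simulation of that difference, assuming (purely for illustration) that 90% of the tested effects are exactly zero and the rest share a modest true effect:

```python
# Type I error P(assert Δ≠0 | Δ=0) vs. false positive probability P(Δ=0 | assert Δ≠0).
# The mixture proportion, effect size, and sample sizes are assumptions for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_arm, alpha = 20_000, 50, 0.05
null_fraction, true_effect = 0.90, 0.3        # assumed: 90% of effects are exactly 0

is_null = rng.random(n_studies) < null_fraction
delta = np.where(is_null, 0.0, true_effect)

# one two-arm comparison per study, unit-variance outcomes
x_bar = rng.normal(delta, 1.0, size=(n_per_arm, n_studies)).mean(axis=0)
y_bar = rng.normal(0.0, 1.0, size=(n_per_arm, n_studies)).mean(axis=0)
z = (x_bar - y_bar) / np.sqrt(2 / n_per_arm)
assert_nonzero = np.abs(z) > stats.norm.ppf(1 - alpha / 2)

type_one_error = assert_nonzero[is_null].mean()      # P(assert Δ≠0 | Δ=0) ≈ alpha
false_pos_prob = is_null[assert_nonzero].mean()      # P(Δ=0 | assert Δ≠0)
print(f"type I error ≈ {type_one_error:.3f}")
print(f"P(Δ=0 | assert Δ≠0) ≈ {false_pos_prob:.3f}")
```

With these assumed inputs the first number stays near 0.05 while the second is far larger; changing the mixture or the power moves them independently, which is the sense in which the two quantities are arbitrarily different.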

How is the condition “assert Δ≠0” expressed in a Bayesian argument? Simply by reporting a non-zero summary statistic of the posterior distribution of Δ? For example, would reporting that Δ≠0 because the mean, median, or mode of the posterior of Δ is non-zero constitute an assertion that Δ≠0?

Excellent question. Most people believe that the state of knowledge/ignorance is continuous, so they don’t place any special meaning on \Delta=0 in the prior distribution. When this is the case, the Bayesian posterior probability that \Delta = 0 is zero, so P(\Delta \neq 0) = 1 and no data are needed. So instead of this, the majority of Bayesians compute P(\Delta > 0) as the degree of belief in efficacy in the right direction, and one minus that is the probability of ineffectiveness or harm. In other words, Bayesians avoid point null hypotheses. Clinical relevance involves P(\Delta > \epsilon) for some clinically relevant minimal efficacy \epsilon.

The beauty of probabilistic thinking is that until we go all the way and define a utility function to optimize in a formal decision analytic framework, we don’t need errors or assertions; we can just use the language of uncertainty to make our best statements, e.g., “Treatment B probably (0.96) lowers blood pressure when compared to treatment A, given the prior distribution for efficacy of …”
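As a concrete, if oversimplified, sketch of such statements: a conjugate normal–normal model in Python, where the prior, the observed blood pressure difference, its standard error, and the clinical threshold ε are all invented numbers rather than anything from the discussion above.

```python
# Posterior probabilities of efficacy under a normal prior and normal likelihood.
# Prior, observed effect, SE, and epsilon are assumed values for illustration;
# Δ > 0 is taken to mean "treatment B lowers blood pressure more than A".
from scipy import stats

prior_mean, prior_sd = 0.0, 10.0     # vague prior for Δ (mmHg)
obs_effect, obs_se = 6.0, 3.0        # observed mean difference and its standard error

# conjugate normal-normal update: precision-weighted average of prior and data
post_precision = 1 / prior_sd**2 + 1 / obs_se**2
post_sd = post_precision ** -0.5
post_mean = (prior_mean / prior_sd**2 + obs_effect / obs_se**2) / post_precision

epsilon = 2.0                        # assumed minimal clinically relevant effect
p_benefit = 1 - stats.norm.cdf(0, post_mean, post_sd)         # P(Δ > 0 | data)
p_relevant = 1 - stats.norm.cdf(epsilon, post_mean, post_sd)  # P(Δ > ε | data)

print(f"P(Δ > 0)             = {p_benefit:.3f}")
print(f"P(Δ > {epsilon:.1f})           = {p_relevant:.3f}")
print(f"P(harm or no effect) = {1 - p_benefit:.3f}")
```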

3 Likes

Thanks! That is a much more satisfying form of inference.

Your answers are always very informative and provide a solid foundation for understanding statistics. I remember you from ResearchGate as well. If you agree, I propose a minor contribution that I believe can help those who are not familiar with the differences between Fisher’s and Neyman’s approaches. I would write the sentence this way:

One of the very few that does not mix up Fisher’s significance tests (in which the type-I error is not formally defined to have a solid and rigid cutoff, and the type II error was not even regarded as possible) and Neyman’s acceptance tests (where you necessarily have two different types of errors [I _and_ II]).

I say that because someone could mix things up and believe that Fisher did not work on the “type I error” concept. In fact, that was my impression the first time I read your text. Fisher was reluctant to accept the idea of the type II error. In Fisherian reasoning it is not even required to define an alternative hypothesis.

This is my first message here, so I do not yet know the underlying rules of this community. If this is inappropriate, please remove or ignore it.

Thank you.

1 Like

Dear Luis,

Thank you for your valuable contribution. I certainly need some clarification here. I think my point is valid and important, but I may be wrong and am willing to learn. So let me respond:

I think your main point is that Fisher worked on the “type I error” concept, whereas I said that there is nothing like a “type I error” in the logic of Fisherian tests of significance.

My point is to stress that the calculation of statistical significance (the p-value) is actually not related to any kind of “error”. At no time is there a question of whether H0 would be wrongly rejected. H0 is known to be wrong even before collecting the data. The p-value is calculated as a standardized measure of whether there is already enough data to “make it obvious” that the data are incompatible with H0, allowing us to interpret the data relative to H0 (in a simple t-test this would mean that we may interpret the sign of (meanA − meanB) relative to the value specified by H0). Without using the p-value, we could interpret any result (mean difference), and in the worst case we would get the sign correct with a probability of 50%. Getting the wrong sign is what Andrew Gelman calls a “type-S error” (S for sign). Using some p cutoff, we won’t dare to interpret unclear data, and for the rest (the clear data) there is a lower probability of type-S errors. The only other “error” that could happen is the failure to reject H0, which means that there is not enough data even for such a minimalistic interpretation (the data are inconclusive w.r.t. H0).

This error is not a type-I error. By definition, a type-I error is to accept HB given that HA is actually true. In Fisher’s significance tests, the only hypothesis specified is H0 (in NHST it is unfortunately just taken to play the role of HA, and the alternative is not concrete; it’s just “not H0”, which is not a well-specified alternative that could play the role of HB!). So even if H0 is just taken to be HA, there is no HB, and rejecting H0 is not interpretable as “accepting HB” (again: there is no HB).

This is important because deciding on an acceptable type-I error requires a loss function, and such a loss function can be defined only with reference to a concrete alternative (HB). So there is no type-I error without HB and without a type-II error.

I think the confusion that “failing to reject H0” would be a type-I error is a great source of misconceptions about tests in general, such as the implicit, often indirect “acceptance of H0” when H0 cannot be rejected, or the mere “acceptance of ‘not H0’” without considering the “sign” (see above).

The confusion started with the NHST hybrid approach. If we stopped teaching this and instead taught only significance tests, there would be no place to mention type-I/II errors, only type-S and type-M errors, which are more relevant. Neyman/Pearson tests (A/B tests) can be addressed in specialized courses and for problems where one can define a sensible loss function (I don’t see this in research, however). Students would come back to the relevant questions and would not take a test as a “proof of an effect” (or, even worse, as a proof of the absence of an effect).
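As a small illustration of the type-S point above, here is a hedged simulation in Python; the true effect size, standard deviation, and sample size are assumptions chosen only to show the pattern:

```python
# Type-S errors: how often does the estimated sign of a small true effect
# point the wrong way, with and without a p < 0.05 filter?
# Effect size, SD, and n are assumed values for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_diff, sd, n, sims = 0.1, 1.0, 25, 100_000   # a small true difference

a = rng.normal(true_diff, sd, size=(sims, n))
b = rng.normal(0.0, sd, size=(sims, n))
t, p = stats.ttest_ind(a, b, axis=1)

wrong_sign = (a.mean(axis=1) - b.mean(axis=1)) < 0
significant = p < 0.05

print(f"type-S error, all results:           {wrong_sign.mean():.1%}")
print(f"type-S error, only p < 0.05 results: {wrong_sign[significant].mean():.1%}")
print(f"share of results passing the cutoff: {significant.mean():.1%}")
```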

Please let me know where I am wrong.

2 Likes

Your answers are always valuable and provide an opportunity to improve our knowledge of statistics and probability. Thanks. I see we agree on both comments. You’ve described the machinery that underlies the testing procedures (get data and test whether the data are compatible with H0), and I definitely agree that the hybrid approach requires us to navigate around a pool of quicksand all the time, trying to reconcile the Fisher and Neyman approaches. It seems this merged approach was first developed by Everett Lindquist, but I’m not sure of that.

My point was not that, however; I imagine I was not clear. From a historical perspective, although Fisher did not develop the type I / type II error framework, it seems that when he had to deal with these concepts he was more concerned with type I than with type II.

I was reading this recently (Serlin, R. C. (2010). “Fisher Was Right.” Journal of Modern Applied Statistical Methods, 9(1), Article 2. DOI: 10.22237/jmasm/1272686460), which motivated me to add this contribution.

1 Like

Thank you for the reference. I will read it, so it will take a while before I come back to you.

You’re welcome. Please let me know of any references you think I should also read to strengthen my knowledge of these historical aspects of statistics. I find the evolution of statistics absolutely fascinating, and sometimes I feel lost when trying to trace it.
Have a nice day!