What is type I error?

Setting the important but nettlesome issues in the mathematical logic aside, the combination of terms “type” and “error” created an essential misdirection, if not just a fundamental mistake. The persistent use of the term betrays the general human impulse to coerce a spectrum of uncertainty into a categorical framework. Language matters, and the terms “type” and “error” imposed a completely different (categorical) construct that stuck. Why this is so is explained to my satisfaction by Behavioral Economics.

Humans do this sort of thing routinely. When confronted with a perplexing problem, we relieve our aversion to ambiguity and enjoy cognitive ease by unwittingly answering a substitute question that is simpler or more tractable. Instead of estimating the probability of a certain complex outcome, we subconsciously substitute an estimate of another, less complex or more accessible outcome. We avoid, and never get around to answering, the harder question.

In designing and conducting experiments we are updating our uncertainty with data and information. But as humans, we are uncomfortable with, and inept at, innate probability calculus; and with our bias toward ambiguity relief, we were seduced too easily into Fisher’s finesse and Neyman’s beguiling categorical elaboration of that.

I am not sure that type I error can be rehabilitated with a reconceptualization or a refined definition. Perhaps the greater good is served by explicating for those unconvinced the merits of the Bayesian perspective of rigorous quantitative reallocation of uncertainty with new information.

I hope this perspective is helpful.


I fear that some of this may have been prompted in part by a recent Twitter discussion on type I error control with co-primary endpoints. While I understand the distinction between forward and backward probabilities here, somewhat akin to p-values, could you expand on how this affects type I error control strategies with multiple endpoints?

I’ve written about the multiple endpoint problem here. You may want to expand on your question Venk, but my feeling is that seeking to control type I error is doing exactly what Drew Levy described so eloquently—substituting an easy-to-compute quantity for the quantity that is really needed, to avoid having to work too hard. For example, frequentist statistics may try to compute the probability of getting data more extreme on any of 3 endpoints given no treatment effect on any endpoint, whereas a Bayesian might go right for the artery: what is the probability that the treatment benefited any 2 of the 3 endpoints? But if the type I error problem is already significant in the single-endpoint case, we don’t need to worry too much about how much worse it gets in the multiple-outcome case.
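To make the contrast concrete, here is a minimal simulation sketch (my own illustration, not from the post above; the three-endpoint setup, flat priors, normal approximations, and the particular effect estimates are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_sim = 0.05, 100_000

# Frequentist quantity: chance that AT LEAST ONE of 3 independent endpoints
# gives "significant" data when there is no treatment effect on any of them.
z = rng.standard_normal((n_sim, 3))
any_sig = np.mean((np.abs(z) > 1.96).any(axis=1))
print(f"P(extreme data on any of 3 endpoints | no effects) ~ {any_sig:.3f}")
print(f"Analytic familywise rate: {1 - (1 - alpha) ** 3:.3f}")   # ~0.143

# Bayesian quantity: with flat priors and a normal approximation, posterior
# draws for each effect are centered at its estimate (hypothetical values, SE = 1).
est = np.array([0.8, 1.5, -0.2])
post = est + rng.standard_normal((n_sim, 3))
print(f"P(treatment benefits at least 2 of 3 endpoints | data) ~ "
      f"{np.mean((post > 0).sum(axis=1) >= 2):.3f}")
```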

But there is one issue related to your question that always needs discussion. “Errors” (better labeled assertions) add up when you have more opportunities to make them and the opportunities are somewhat independent of each other. For example, when computing a product of several terms, the errors in the individual terms “add up” (more likely, they multiply). When doing a frequentist clinical trial, if you look at the data every 20 patients, you will see more data extremes. By giving the data more opportunities to be extreme, type I “error” increases. [Note that even in this simple example, other questions arise, e.g., why is it logical to adjust early looks for the fact that you plan to take a later look?]
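Here is a minimal simulation of that sequential-looks point; the one-sample z-test, standard normal data with no true effect, and unadjusted two-sided 0.05 tests at each look are my assumptions, chosen only to show the inflation:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sim, look_every, max_n = 50_000, 20, 100   # look after every 20 patients, up to 100

rejected = np.zeros(n_sim, dtype=bool)
for i in range(n_sim):
    x = rng.standard_normal(max_n)           # no true effect
    for n in range(look_every, max_n + 1, look_every):
        z = x[:n].mean() * np.sqrt(n)        # one-sample z statistic at this look
        if abs(z) > 1.96:                    # unadjusted two-sided 0.05 test
            rejected[i] = True
            break

print(f"P(at least one 'significant' look | no effect) ~ {rejected.mean():.3f}")
# With 5 unadjusted looks this comes out near 0.14, well above the nominal 0.05.
```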

Contrast the increase in type I “error” as you take multiple looks at the data, giving more opportunities for extremes, with the issue of asking more questions by attempting to assess evidence about each of several endpoint variables. In the frequentist world, type I error is at issue because looking at more than one outcome gives the data more chances to be extreme. Now consider what happens when you transition from computing probabilities about data to probabilities about parameters. The probability that treatment B positively affects endpoint number 2 is in no way affected by the probability that it affects endpoint number 1. Probability statements about \Delta_2 don’t need to refer to probability statements about \Delta_1. Questions I ask about parameters are fair game. Questions I ask about data need to consider the data generating/sampling process.

Returning to the sequential analysis of a single endpoint, the simplest way to see that multiplicity is irrelevant once you are on the right scale is this: a posterior probability of efficacy computed two years into the study supersedes the posterior probability of efficacy computed one year into the study, making it obsolete. The new 2-year calculation is self-contained and doesn’t need to know anything about the 1-year calculation. Bayesians condition on all available information, and we just have more information available at 2 years.
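A tiny conjugate sketch of that point, with made-up response counts and a flat Beta(1, 1) prior (both my assumptions): the 2-year posterior is the same whether you compute it fresh from all the data or update the 1-year posterior, so it simply replaces the earlier calculation.

```python
from scipy import stats

# Hypothetical counts: year 1 sees 30 responders out of 50 patients;
# year 2 adds 40 responders out of 70 more patients (70/120 overall).
a0, b0 = 1, 1                                           # flat Beta(1, 1) prior

post_1y     = stats.beta(a0 + 30, b0 + 20)              # year-1 data only
post_2y_all = stats.beta(a0 + 70, b0 + 50)              # all data, computed fresh at 2 years
post_2y_upd = stats.beta(a0 + 30 + 40, b0 + 20 + 30)    # 1-year posterior updated with year-2 data

print(post_2y_all.args, post_2y_upd.args)               # identical: (71, 51) both times
print("P(response probability > 0.5) at 1 year:", round(1 - post_1y.cdf(0.5), 3))
print("P(response probability > 0.5) at 2 years:", round(1 - post_2y_all.cdf(0.5), 3))
```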


Thank you for the detailed answer. I’ll have to read it a few more times to completely get it I suspect.

I do feel like I am beginning to grasp the issues with type I error rates in this framework, but I am still not sure concretely what this means if one is evaluating a study analyzed in a frequentist manner with multiple (even hundreds of) endpoints. Bonferroni and other approaches seemed mostly sensible to me. Is that not true?

Pre-study power? I thought post-study power in a frequentist framework (usually called post hoc power, right?) is a meaningless concept?

Yes I think post hoc power is meaningless. Just said “pre” to make it clear that’s the only one I’m tempted to use.

Think of it this way. If you had 1000 outcomes would you want the evidence for one endpoint to influence the evidence for another one?

That part of the FDR methods never sat well with me. On the other hand, testing many different hypotheses in the frequentist framework comes with a risk of a lot of problems. How would one do frequentist -omics without some sort of type I error control? For a moment, let’s say we have to stay with the frequentist paradigm.


Omics approaches are exploratory. Tests (significance or hypothesis tests) rarely make any sense here. P-values, just like log fold-changes, can simply be used to rank the list of genes, moving the “more interesting” candidates toward the top of the list. I find it awfully nonsensical to write that one has found “327 differentially expressed genes”, based on some FDR. What does this tell us? On what basis was the FDR chosen? Were undetected genes (or other molecule types), or measurements below some threshold or with insufficient quality, considered (as they could have been detected and would then impact the multiplicity correction)? Sometimes other criteria are used in combination (often logFC). To my knowledge it is unclear how to interpret P-values at all in such situations.


I share your dislike of FDR in this setting. Multiplicity problems largely vanish if a non-zero null hypothesis is used, and my colleagues Jeffrey Blume and Hakmook Kang have shown that simple likelihood ratio tests, i.e., the ratio of the likelihood at an impressive odds ratio to the likelihood at an odds ratio of, say, 1.1, reduce both type I and type II error.
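A rough sketch of the idea (not Blume and Kang’s actual method; the normal approximation to the log odds ratio likelihood and the summary numbers are assumptions I am adding for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical summary for one feature: estimated log odds ratio and its SE.
beta_hat, se = np.log(1.3), 0.20

def lik(beta):
    """Normal-approximation likelihood of the data at log odds ratio `beta`."""
    return stats.norm.pdf(beta_hat, loc=beta, scale=se)

# Likelihood at an "impressive" odds ratio of 2 versus a non-zero null of 1.1.
lr = lik(np.log(2.0)) / lik(np.log(1.1))

print(f"Likelihood ratio, OR = 2 vs OR = 1.1: {lr:.3f}")
# A large ratio favors the impressive effect; a small one favors the trivial effect.
# With the null moved to OR = 1.1, tiny effects no longer register as "discoveries".
```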

Dear Frank,

How can one say that type I and type II errors are decreased (or changed)?
You test different hypotheses, and you even conduct a different kind of test (a hypothesis test rather than a significance test). In a significance test you cannot make a type I error (= falsely accepting B when A is true, as there is no B); you can only falsely reject H0 under the null model. This is often misnamed a “type I error” but it is logically something different. There is no loss associated with this kind of error (in contrast to a type I error). There is also no type II error (= falsely retaining A when B is true; yet again, there is no B). So even if you fix an alternative to test against, there is no control of the type I and type II errors, because the sample size was not adjusted to control these errors. I would be grateful for an explanation.

The likelihood school of inference, unlike frequentism, does not fix \alpha as n \rightarrow \infty. \alpha can decrease with n. But best to go to expert Jeffrey Blume’s work, starting here. Magic happens when you get away from “zero” null hypotheses in the high-dimensional setting.


I think this is a really excellent, well said, and much needed perspective. When I think of alpha, I think of scientific integrity, not chance. Certainly, random error factors into these problems. But in an era of mega trials, precision medicine, and multi-million dollar budgets, we can’t ethically pretend “chance” is as much of an influence on findings as it’s believed to be.

A “false positive” doesn’t begin or end when a trial is declared “statistically significant”. When data are falsified to achieve a significant result, a false positive error has been committed. If the study design or study conduct exhibits a systematic departure from its intended purpose (bias), and this compromises the results, you have (in effect) declared something to be true which you have not directly assessed: in other words, any exaggerating bias increases type I error. Even when the null hypothesis is in fact false in the target population, if the magnitude and interval of the true effect are inconsistent with those found in the study, the study has incorrectly conveyed the promise of a new therapy or treatment to save lives. I’d be very angry if I elected for an aggressive chemotherapy that promised a 50% chance of remission in 3 months when in fact the chance of remission was only 10%.

I think this generous definition of a “false positive” reflects the present overreaching impact of research on policy, guidelines, information, and dissemination. It is nearly impossible to undo tempting and incorrect conclusions from scientific research; indeed, anyone who looks at Retraction Watch would believe we’re awash in incorrect science. Americans and Europeans are dying at staggering rates of preventable diseases because they’ve forgone vaccinations.


Would you please write out the probability statement that corresponds to your notion of the probability of a type I error? And by “When data are falsified to achieve a significant result” you are implying that a second study was negative. How do you know the second study is reliable if its sample size is not infinite? How do you know that the treatment in reality does not work?

One of the challenges of the “Quality Control” interpretation of the \alpha-level you give here is that it strongly relies on the actual independence and identical distribution of the units comprising the sample (and population). This is, of course, nothing like people or real life. That’s okay for the most part, but we have to realize that the generalizability of the trial results is merely to “other trials like it”. It’s a considerably more sophisticated task to predict individual outcomes… but maybe we should be doing that anyway. Perhaps that even underscores the whole issue with modern statistics, in a nutshell.


I really like Pavlos’ example.

For non-statistician clinicians (like myself), Steven Goodman’s article “The P Value Fallacy” also offers a great lay explanation of the topic with clinical examples.

https://www.ncbi.nlm.nih.gov/m/pubmed/10383371/

Here is an excerpt:

“Most researchers and readers think that a P value of 0.05 means that the null hypothesis has a probability of only 5%. In my experience teaching many academic physicians, when physicians are presented with a single-sentence summary of a study that produced a surprising result with P=0.05, the overwhelming majority will confidently state that there is a 95% or greater chance that the null hypothesis is incorrect. This is an understandable but categorically wrong interpretation because the P value is calculated on the assumption that the null hypothesis is true. It cannot, therefore, be a direct measure of the probability that the null hypothesis is false. This logical error reinforces the mistaken notion that the data alone can tell us the probability that a hypothesis is true.”
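A compact way to see why the “95% chance” reading cannot be right (a standard application of Bayes’ theorem, not something taken from Goodman’s article): the quantity readers actually want is P(H_0 \mid \text{data}), and it depends on the prior probabilities of the hypotheses and on how well an alternative explains the data, not on the p-value alone:

P(H_0 \mid \text{data}) = \frac{P(\text{data} \mid H_0)\,P(H_0)}{P(\text{data} \mid H_0)\,P(H_0) + P(\text{data} \mid H_1)\,P(H_1)}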


Dear Frank, you really leave me confused here…
To my knowledge (which is likely quite incomplete!), α is the size of the hypothesis test, and (1-α) is the confidence in accepting B. This is specified independent of the data and also independent of n. And it is defined specifically with respect to B. Without a specified B, what is the meaning of α?
If in a significance test a p-value is calculated, you may consider your data sufficiently significant to reject H0, but I really don’t see what this has to do with α. Maybe that promiscuous use of terms and symbols outside their specific definitions is part of the reason for the whole mess with testing we observe today?

Thank you for the link to Jeffrey Blume’s work. I am working through it.

Hope I’m answering your question. If you do a standard (frequentist) statistical test, you will get a p-value. This gives an indication of how extreme your data are. A p-value of 0.04, for instance, means: “If we assume that the null hypothesis of no effect is true, then your data are quite extreme. If we repeated your sampling many times, only 4% of the samples would give data more extreme than yours.”

Extreme data like this may be the result of wrong measurements with faulty instruments. They may also be caused by something else. But if we exclude those, we are left with two explanations:

  • either this is a coincidence. Extreme events do happen. People throw the dice and roll a series of ten 6s. It happens.
  • or your null hypothesis of no effect is not true.

Usually we have some threshold, such as 0.05 (or 5%). We say: “Hey, these data are so extreme that I don’t believe this is coincidence, I reject the null hypothesis of no effect. I know that I may be wrong, that the null hypothesis is true and that I reject it wrongly, but I am willing to accept that Type I error”.

We can rephrase the two explanations as follows:

  • I have extreme data and the null hypothesis is still true, or:
  • I have extreme data and the null hypothesis is not true

Or even shorter: “given these results, the null hypothesis may be true or not”. That’s all that frequentist statistics teaches us. The reason is that frequentist statistics gives statements about the data (“these data are extreme, assuming H0 is true”) and not about the hypotheses. If you want to know “given these data, how likely is it that H0 is true?”, then you need Bayesian statistics. Frequentist statistics can’t answer that one.
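To put numbers on the contrast between a statement about the data and a statement about the hypothesis, here is a minimal sketch; the effect estimate, standard error, the 50/50 prior, and the N(0, 0.5²) effect distribution under H1 are all assumptions I am adding for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical trial summary: observed effect estimate and its standard error.
est, se = 0.4, 0.2                        # z = 2.0, two-sided p ~ 0.046

# Statement about the DATA: how extreme are they if H0 (no effect) is true?
z = est / se
p_value = 2 * stats.norm.sf(abs(z))

# Statement about the HYPOTHESIS: P(H0 | data), which requires a prior.
# Assume H0 and H1 equally likely a priori; under H1 the effect ~ N(0, 0.5^2).
marg_h0 = stats.norm.pdf(est, loc=0, scale=se)
marg_h1 = stats.norm.pdf(est, loc=0, scale=np.sqrt(se**2 + 0.5**2))
p_h0_given_data = marg_h0 / (marg_h0 + marg_h1)

print(f"p-value                 = {p_value:.3f}")
print(f"P(H0 | data) with prior = {p_h0_given_data:.3f}")
# The two numbers answer different questions and need not be close.
```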

Example: a group of patients responds well to a certain medicine. Does this mean the medicine will be effective in the population? Statistical analysis shows that IF the medicine were ineffective, these results would be in the most extreme 2%. Now your reasoning is:

  • either this is coincidence and my patients improved spontaneously. Things like that happen. It is extreme, but it does happen
  • or: my null hypothesis of no effect is not true, and therefore I must conclude that this medicine IS effective

If your colleague asks: so, does the medicine work? You can’t answer that. All you can say is: “given the results, the medicine may work or not”.

If she asks: “How sure are you? What is your false positive rate?” then you can’t answer that one. “Either it works, or it doesn’t”.

A false positive rate (as used in daily life) would be: “of 100 medicines claimed to be effective, only 75 really work; the other 25 are false positives.”

This is different from the type I error rate, which says: “Assuming the medicine does not work, if we ran the study 100 times, we would wrongly claim that it DOES work about 5 times.”
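A worked example with numbers that are purely hypothetical: suppose 10% of the medicines we test truly work, the type I error rate is 5%, and power is 80%. Testing 1000 medicines then yields about 0.8 \times 100 = 80 true positives and 0.05 \times 900 = 45 false positives, so among the “discoveries”

P(\text{medicine does not work} \mid \text{claimed to work}) = \frac{45}{45 + 80} = 0.36,

i.e., a 36% false positive rate among the claims, even though the type I error rate was held at 5%.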

Hope this helps…if I’m stating the obvious, please say so and I will gladly remove this post. :grinning:


This doesn’t help very much, but I just wanted to note the broader context: pure likelihood inference doesn’t care about type I and II errors, but you can study frequentist operating characteristics even for non-frequentist procedures. In some cases likelihood beats frequentism at its own game.

Frank, in response to Jochen you stated, “I share your dislike of FDR in this setting.” Apropos of type I error, is there a setting in which you actually approve of the FDR? Elsewhere, you have used the apt term ‘regulator’s regret’ as an epithet for the FDR. Is the FDR a legitimate concern in any situation (other than a simulation) in which we are making inferences from data?