Assessment of @PeterM and @Sander:
While I agree with f2harrell that we should criticize ideas, not people, I would like to provide my assessment of both parties as an outside observer and discuss the points of claimed contention. I will proceed by summarizing posts and comments by each party and then providing my own thoughts.
@PeterM. He started this thread with three “controversial” theses to get fruitful discussion going. He also emphasized that it is an opportunity to learn. Finally, he talks about how he will expand in further comments and “answers to questions”. This final point will come in later.
@Sander. Proceeds to thoroughly respond to each thesis and the sub-components of said theses. He says things such as “it is unclear as to what is meant by ‘significance test’”, which implies a question and invites @PeterM to clarify. @Sander also takes the time to mention his definitions/stances, his writings, and the diversity of perspectives, generally establishing nuance for us to explore.
Given that the theses reference statistical/mathematical entities, Sander rightfully dives into more “technical details” which, assuming other individuals understand them, provide a necessary basis/context for the discourse to advance. I couldn’t imagine desiring a thoughtful discussion regarding p-values and statistical tests without comprehension and discussion of the mathematical machinery, among other things (e.g. the sociology of the science, science in practice).
@PeterM. “I’d like to reiterate that I’m interested in a discussion that is about possibly learning something (hence the explicit invitation to ask questions)—and that can’t really be achieved by a) simply finding one interpretation of what somebody else said that makes no sense and proceed to lambaste it and b) lengthily show off how clever we are without ever trying to find out what the other person actually means. Because then one ends up with heaps of text that have nothing to do with what that person is saying—an unfortunate state of affairs.”
We have a variety of statements from @PeterM which can be summarized as follows:
1. An expectation that the initial discussion would be people asking questions and trying to figure out what @PeterM means by his theses.
Personally, I don’t know why @PeterM didn’t initially put some, if not all, of this information in his initial post or, if there were space constraints, in subsequent comments. His initial post says he would in 1) separate comments and 2) answers to questions.
Furthermore, when choosing controversial topics like p-values and significance tests, it shouldn’t be surprising that people will include thoughts of their own. Finally, @Sander highlights the confusion regarding properly defining and interpreting p-values, and the lack of clarity regarding what is meant by significance tests.
2. Belief that @Sander’s response incorrectly interprets the original theses
As Sander correctly states, vagueness exists within the original “theses” (e.g. significance tests), which gives room for people to provide their thoughts instead of sticking to @PeterM’s latent definitional framework. Regardless, individuals, especially working statisticians who have written on said topics, should be able to share their own frameworks as well as what else exists in the large literature, historical and more recent.
Nevertheless, I can see why @PeterM feels misinterpreted. @PeterM draws from statements in Fisher’s “Design of Experiments”, which he quotes more fully on his blog. @PeterM focuses on some general aspects:
- Repetition or an extended record of data
- Data which come from a setting which rarely fails to provide “statistically significant results”. Obviously much has been said regarding the over-emphasis on chasing statistical significance. Maybe we can more charitably interpret this as designing a study which has the capacity to consistently tell us something about our phenomenon of interest. However, I think this is a vague and unhelpful notion since in practice things are challenging and complex, and we typically have less control than we would like. Also, we could simply crank up the sample size and consistently get statistically significant results, but I don’t want to be too unfair to Fisher and, by proxy, @PeterM, hence my more charitable reading.
- We can make logically valid choices (e.g. “which theory of two (or more) is the better/best one”). Per the linked Notturno quote, this is a modus tollens process (classified as deductive in logic) in which we identify an inconsistency between the falsity of one statement and the truth of other statements. To quote from the link:
The best that a logical argument can do is test the truth of a statement. But it can do this only by showing that its falsity is inconsistent with the truth of other statements that can only be tested and never proved. Our so-called ‘proof’ methods are really techniques for testing consistency. And the demonstration that a ‘conclusion’ follows from a ‘premise’ shows only that the falsity of the ‘conclusion’ is inconsistent with the truth of the ‘premise.’
The inconsistency that marks a valid deductive argument — the inconsistency, that is, between the truth of its premises and the falsity of its conclusion — cannot force us to accept the truth of any belief. But it can force us, if we want to avoid contradicting ourselves, to reexamine our beliefs, and to choose between the truth of some beliefs and the falsity of others — because the falsity of the conclusion of a valid argument is inconsistent with the truth of its premises. …
And this is just another way of saying that what we call a proof actually presents us with the choice between accepting its conclusion and rejecting its premises
Another Fisher quote from the Design of Experiments, from the linked set of Fisher quotes (“repeated reminders”):
In relation to any experiment we may speak of this hypothesis as the “null hypothesis,” and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.
It might be argued that if an experiment can disprove the hypothesis that the subject possesses no sensory discrimination between two different sorts of object, it must therefore be able to prove the opposite hypothesis, that she can make some such discrimination. But this last hypothesis, however reasonable or true it may be, is ineligible as a null hypothesis to be tested by experiment, because it is inexact. If it were asserted that the subject would never be wrong in her judgements we should again have an exact hypothesis, and it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation. [16]
Combining these two quotes, we can say the competing “theories”, which I slightly rephrase, are:
- Subject possesses no sensory discrimination ability between two different sorts of object
- Subject possesses sensory discrimination ability between two different sorts of object
Some of my own confusion comes from Notturno’s language where they mention choosing between “beliefs” in a vague sense but then (implicitly?) specify the two beliefs as those contained within a single argument rather than competing arguments or statements. Given @PeterM’s claim that this methodology can inform which of two theories is better, I will treat it as a two step process. First we establish an inconsistency within a single “valid deductive argument” then show how that choice sets up another choice.
Let’s also say that the two “theories” are mutually exclusive and exhaustive in “theory space”, thus we can derive statements such as:
- If theory #2 (discrimination) is false, then theory #1 (no discrimination) is true
- If theory #1 (no discrimination) is false, then theory #2 (discrimination) is true
We can think of this as a light switch where if the switch is not in the off position, it must be in the on position and vice versa. I will also commit the sin of equating the theoretical statements with the null hypothesis, but as many know, we should be clear about delineating tests of statistical hypotheses and tests of theories or theoretical statements. In this toy problem, we are assuming the statistical test is close enough to a test of the “theories”.
First step. #1 is used to formulate a null hypothesis. Without getting into the nuanced weeds, the generic verbal “conclusion” is that the subject cannot accurately discriminate between two different objects. We can then create a blueprint for the design and analysis of a study, or in this case, an experiment.
To rely on @PeterM’s framework, if we repeatedly find the subject guessing correctly with high accuracy (i.e. “statistically significant” results) in independent experiments, this supports the falsity of the conclusion and consequently creates an inconsistency with the premise of theory #1.
This forces us to choose between retaining the conclusion of theory #1 and rejecting its premise, and consequently theory #1 in its entirety. So let’s say we reject the premise of theory #1, which translates to rejecting theory #1 itself.
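To make this first step concrete, here is a minimal sketch in Python, assuming (purely for illustration) a simple binomial model of the guessing rather than Fisher’s exact permutation setup, and an arbitrary choice of 8 cups per experiment:

```python
from scipy.stats import binom

# Theory #1 (no discrimination) becomes the null hypothesis that every
# guess is correct with probability 0.5, independently across cups.
n_trials = 8     # cups presented in one experiment (illustrative choice)
n_correct = 8    # the subject identifies all of them correctly

# One-sided p-value: probability of at least this many correct guesses
# under the null of no discrimination.
p_value = binom.sf(n_correct - 1, n_trials, 0.5)
print(f"P(>= {n_correct} correct | no discrimination) = {p_value:.4f}")  # ~0.0039
```

An “extended record” in @PeterM’s sense would then be several independent experiments of this kind, each yielding a small p-value.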
Second step. To reconnect with the initial tweet which started this thread: “Significance testing is all about how the world isn’t and says nothing about how the world is”. I will consider both scenarios, where the experimenter believes in theory #1 or in theory #2.
In the scenario where the experimenter believes in theory #1, the inconsistency created in the first step and rejection of the premise of theory #1 allows the experimenter to select a “better” theory, namely theory #2. However, this is a consequence of the theory space which rarely if ever exists in practice. More specifically, we have a low dimensional theory space (d = 2) where theories are mutually exclusive. We can also add that our ability to investigate things is clear, which is the exception rather than the rule in much of research.
In the scenario where the experimenter believes in theory #2, one could say that the results re-affirm the experimenter’s belief, or maybe the experimenter even claims the results support it. One could say this allows the experimenter to retain what is the “better” theory. This scenario is overly simplistic and arguably the more controversial one, since it seems to have an air of “proving the opposite hypothesis”. I treat this scenario as an acceptable possibility because of the properties of the theory space, which are the next focus.
Properties of the Theory Space. Given the set-up above, it seems like the extended record has the ability to tell us something about the world. I use Fisher’s toy problem because it’s the backdrop of such statements. However, as brought up by @Sander and others, we hardly live in a world like this. Our theory space is of unknown dimensions, our theories are multidimensional, and informative “tests” can be challenging and sometimes impossible to construct.
These practical and frequently pressing issues give credence to and help explain sentiments such as, “Significance testing is all about how the world isn’t and says nothing about how the world is”. Or instead of significance testing, a falsification framework for that matter. For the next problem, I will frame it as a frequentist and NHST procedure.
Using the restaurant example from page 545 of @Sander’s “Induction versus Popper: substance versus semantics”, let’s imagine one is eating at a food truck which serves only one dish and regardless of customer feedback, the owner doesn’t change how the dish is made or what the dish is.
Let’s claim the following:
- The evaluation of the food truck is a mutually exclusive and exhaustive evaluation (good xor bad).
- Consequently, we will treat the problem as binomially distributed and assign corresponding probabilities to good and bad as one would with heads and tails.
- Many people have told you that the food is amazing or “near perfect”.
- Good = G, Bad = B
Verbal and Numerical Hypotheses. Some plausible probabilities (P(G), P(B)) for food considered near perfect could be (0.99, 0.01), (0.98, 0.02), (0.95, 0.05), etc. That shows that the numerical null hypotheses or initial beliefs have a higher resolution than the verbal ones. Therefore, we have the problem of many numerical hypotheses mapping on to one verbal one. A starting spectrum for verbal hypotheses worded as frequency of occurrence could be:
- The food is nearly always good (0.99, 0.01)
- The food is mostly good
- The food is equally good and equally bad (0.5, 0.5)
- The food is mostly bad
- The food is nearly always bad (0.01, 0.99)
The first issue is coming up with a way to map numerical hypotheses to a smaller spectrum of verbal hypotheses. People will have different demarcations. For example, when does “nearly always” become “mostly”? Or, from a linguistic and psychological perspective, can we consider “nearly always” as a subset of “mostly”, so that it may not even arise as a verbal hypothesis and we can subsume it under the “mostly” claim:
- The food is mostly good
- The food is equally good and equally bad (0.5, 0.5)
- The food is mostly bad
Obviously, it depends on what resolution matters. Nevertheless, let’s continue with the numerical null hypothesis (0.99, 0.01) because you’ve heard nothing but amazing things about the food truck. Let’s say you eat at the food truck twice and one of the meals is considered bad. Under the null numerical hypothesis, the probability of exactly one bad meal out of two is 2 × 0.99 × 0.01 = 0.0198. Under this condition, we may reject the null, but what are we left with?
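As a quick arithmetic check of the 0.0198 figure, a small sketch under the stated binomial assumption (two visits, the null P(B) = 0.01, counting exactly one bad meal):

```python
from scipy.stats import binom

p_bad_null = 0.01        # null numerical hypothesis: P(B) = 0.01
n_meals, n_bad = 2, 1    # two visits, one meal judged bad

# Probability of exactly one bad meal out of two under the null.
print(f"{binom.pmf(n_bad, n_meals, p_bad_null):.4f}")   # 0.0198

# The tail probability (one or more bad meals), closer to a p-value, is similar.
print(f"{binom.sf(0, n_meals, p_bad_null):.4f}")        # 0.0199
```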
Compatibility. At this point, it’s important to briefly mention an idea @Sander has pushed for and one which I like: re-framing confidence intervals as compatibility intervals to aid interpretation and fight off dichotomania. For now, we can analogously view the rejection based on a small p-value as a declaration of incompatibility between the data and the null numerical hypothesis.
With this idea in mind, we can ask which numerical hypotheses the data are more compatible with in comparison to our current one (0.99, 0.01). This perspective highlights the problem of working in a theory space which isn’t overly simplistic (d = 2, mutually exclusive theories). There is a range of alternative numerical hypotheses with varying compatibility with the data we obtained. Furthermore, that range of numerical hypotheses maps messily onto the verbal hypotheses we originally formulated.
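A minimal sketch of that compatibility view, assuming the same two-meal record and using the binomial likelihood over a grid of candidate values of P(B):

```python
import numpy as np
from scipy.stats import binom

n_meals, n_bad = 2, 1                       # observed record: one bad meal out of two

# Grid of candidate numerical hypotheses for P(B).
p_bad_grid = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(n_bad, n_meals, p_bad_grid)

# Which numerical hypothesis on the grid is most compatible with the data?
print(f"{p_bad_grid[np.argmax(likelihood)]:.2f}")   # 0.50

# A wide band of hypotheses ("mostly good" through "mostly bad") is far more
# compatible with the data than (0.99, 0.01), and that band maps messily onto
# the verbal hypotheses above.
for p in (0.01, 0.10, 0.50, 0.90):
    print(f"P(B) = {p:.2f}: likelihood = {binom.pmf(n_bad, n_meals, p):.3f}")
```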
Consequently, even if we could somewhat find our way out of the statistical testing mess, it’s ambiguous how that should inform our decisions or beliefs in regards to competing “theories” (i.e. which is the better one) which for now correspond to the verbal hypotheses.
This problem doesn’t go away with more repetitions of the experiment (i.e. more p-values), with statistical significance or the lack thereof, or with where you sit on the semantics of the philosophy of science. It simply comes with the territory of the theory space and conducting research. Notice that all I changed was the dimensionality of the theory space. As many here are aware, there are many other things we can discuss that create further difficulties.
Finally, it highlights a routine problem for researchers, which is reasoning about statistical procedures (of which NHST is one of many) and their relation to the theoretical aspects of one’s substantive inquiry.
Concluding Remarks. Focusing on a narrow conception of scientific practice, knowledge generation, and definitions will inevitably stonewall this discussion. While we have different knowledge backgrounds, ultimately the discussion should be about expanding things. This is especially true when bold claims, or as @PeterM describes them, “controversial theses”, regarding p-values and significance tests are made. These are entities which not only have vast literatures behind them, but also require discussion of mathematical/statistical components, philosophical components, and substantive components. A full dismissal of any of these components will prevent us from having a truly fruitful and nuanced discussion.
Brief Addendum. We can also find examples like Avogadro’s number and imagine mindless use of significance tests. For the original exposition of this idea see Meehl (1990) “Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It” pages 117-118.
3. Belief that @Sander is showing off, not trying to find out what @PeterM means, and provided heaps of texts that have nothing to do with the original post.
Showing off:
This forum is about sharing knowledge and hopefully finding more nuance than we could find elsewhere. Hence, threads like the paper collection on dispelling statistical myths.
@Sander has provided papers which articulate his own thoughts but also other people’s papers/books which he deemed relevant. We are obviously not the only individuals to have ever discussed these topics. Furthermore, as I stated previously, p-values and significance tests rely on mathematical/statistical components, philosophical components, and substantive components. We cannot ignore these components, and bringing them up in a thorough fashion is not only encouraged, but I’m thankful for it.
It is not showing off; it is simply sharing, and thoughtful sharing at that, especially given @Sander’s extensive background with the subjects at hand.
Not trying to find out what @PeterM means, nonsensical interpretation, and heaps of irrelevant text, not engaged in a single thing talked about:
Firstly, if one is to claim that the interpretations are nonsensical and the texts are irrelevant, it would be valuable to know why @PeterM thinks this. I’m not sure about anyone else, but I didn’t think what @Sander sent was irrelevant.
As for divining meaning and nonsensical interpretations, part of the journey through complicated topics is coming to a mutual understanding regarding the elements of the discussion. However, just because one person does not understand something doesn’t mean that the thing is nonsensical. What does @PeterM find nonsensical about Sander’s comments?
I’m asking these questions with the aim of learning something.
Furthermore, given that @PeterM did not explicitly specify how they view p-values and significance tests, I don’t think it’s unreasonable for @Sander to provide that context for the group to utilize. @PeterM likely considered this to be rude and may have wanted @Sander to first ask, “Well what do you mean by…” So let me be the one to ask @PeterM:
- How do you define a p-value?
- How do you define significance tests?
I think dismissing Sander’s contribution is rude and may mean @PeterM only cares about the “philosophicals”, specific quotes from Popper, Fisher, and others, and a discussion that only fits his idea of how it should go and what content it should contain. Additionally, Sander has most definitely attempted to address things @PeterM is talking about.
To provide an example based on a long-term project I’ve been working on regarding “statistical control” by regression, imagine if I put forth the “controversial thesis”:
Statistical control by regression doesn’t mean anything!
Then @Sander proceeds to say the statement is false if taken literally because if you look at the linear algebra and work out the algebraic derivations of the beta coefficients, you can see that some sort of adjustment is made in the form of residualization/orthogonalization. Sander could then bring up past literature which discusses the rich history of regression, partial correlation, and even semi-partial correlation.
Now let’s say my working definition of statistical control by regression comes from specific resources or maybe textbooks where it is dubbed “holding all other variables constant”. I may be tempted to view his statements as irrelevant, when in reality they help bring us back to the purest form of the statistical machinery without the human interpretation projected onto it, for better or worse. It would be absurd to say such information is nonsensical, irrelevant, and didn’t engage with my thesis.
Furthermore, the mathematical treatment would then allow us to discuss whether the provided interpretation is misleading. More clearly, when I see statements like “all other variables held constant”, I think of control by stratification. Then the question becomes how well residualization aligns with selecting some strata, putting aside the practical difficulties of obtaining such a sample.
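To make the residualization reading concrete, here is a small sketch with simulated data and hypothetical variable names (x, z, y are my own illustrative choices, not anything from the thread); it shows the Frisch-Waugh-Lovell result that the coefficient on x in the full multiple regression equals the coefficient obtained by regressing y on x after x has been residualized with respect to z:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
x = 0.6 * z + rng.normal(size=n)            # x is correlated with z
y = 1.5 * x + 2.0 * z + rng.normal(size=n)

# Full multiple regression of y on [1, x, z].
X_full = np.column_stack([np.ones(n), x, z])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Residualize x with respect to z (and an intercept), then regress y on that residual.
Z = np.column_stack([np.ones(n), z])
x_resid = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
beta_resid = np.linalg.lstsq(np.column_stack([np.ones(n), x_resid]), y, rcond=None)[0]

print(beta_full[1])    # coefficient on x in the full model
print(beta_resid[1])   # coefficient on residualized x alone: numerically identical
```

Whether that algebraic “adjustment” deserves the verbal gloss “holding all other variables constant” is exactly the kind of question the mathematical treatment lets us debate.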
Obviously such a conversation might seem pedantic to some, but I think it comes from a place of questioning methodology and the claims we make regarding what they supposedly do.
4. Supposed insults by @Sander in regards to questioning qualifications and Pope Popper
Qualifications. In my eyes, calling someone a show-off and saying their contribution is irrelevant without engaging with any of it suggests a) lack of background or b) one has the background but deems the content irrelevant.
To better understand where @PeterM is coming from, his blog states he studied Bioinformatics at the FU Berlin and earned a BSc. I looked at the program’s “study regulations” and, after imperfectly translating them using Google Translate, they state things such as:
- The core subject bioinformatics is divided into the study areas of computer science, mathematics and statistics, as well as biology/chemistry/biochemistry
- The mathematics and statistics study area provides basic knowledge and skills in analysis (differentiation, integration, ordinary differential equations), linear algebra (matrix calculation, eigenvalues, principal axis transformation), statistics (elementary probability theory, basic statistical terms, decision-making, testing and estimation theory, linear statistical methods), as well as computer-oriented mathematics (representation of numbers, stability and conditioning, concepts of efficiency and complexity, numerical linear algebra, numerical quadrature and integration)
Based on this information, @PeterM presumably has the proper background/capabilities in regards to Sander’s technical treatments, although @Sander did not know this at the time and was making an inference based on the interaction. This presumably eliminates option a) but then if @PeterM has the proper background or capabilities, why did he deem the content irrelevant and refuse to engage with it at all? I’m hoping he is willing to clarify this especially if there are to be continued interactions.
Pope Popper. @Sander criticizes claims within @PeterM’s third thesis and links a paper which directly discusses said claims (e.g. “induction doesn’t work”). Specifically, he employs the Latin phrase “ex cathedra”, which, per Merriam-Webster, is defined as:
Ex cathedra is a Latin phrase, meaning not “from the cathedral,” but “from the chair.” The phrase does have religious origins though: it was originally applied to decisions made by Popes from their thrones. According to Roman Catholic doctrine, a Pope speaking ex cathedra on issues of faith or morals is infallible. In general use, the phrase has come to be used with regard to statements made by people in positions of authority, and it is often used ironically to describe someone speaking with overbearing or unwarranted self-certainty.
From this, we can see a thematic connection to the Pope Popper statement. Statements such as “naive Popperians” and “Pope Popper” understandably come off as antagonistic, insulting, and/or a jab, so a “softer touch” may have been better. But I don’t know where people stand on such comments. While rough around the edges, there is a point to be made about zealousness which can impede discourse.
I can see how copy-paste statements delivered with “unwarranted self-certainty” in regards to induction evoked such a reaction, and the play on words serves to communicate and further Sander’s criticism of Peter’s claims. Furthermore, @Sander stated he has engaged in such conversations for decades, and rigid interactions can become tiresome. Finally, I don’t think @Sander had ill intent.
To provide a personal example: when TAing for an undergraduate psychology methods course, we used a textbook which generally defined a small p-value as meaning it was “unlikely the results were due to chance”. I talked with some postdocs about my issues with the statement, but the conversation halted at me needing to accept said definition of the p-value because “that’s what the p-value is”. I became increasingly frustrated because the interaction felt “dogmatic” and “zealous”, although it was unclear who the Pope was at that time.
Since then, I have come to think this language traces back to Fisher’s famous and influential lady tasting tea example. In such a context, the null model poses a data generating process driven by chance. To show this, I’m going to employ some terminology from thermodynamics, namely microstates and macrostates (Statistical Interpretation of Entropy and the Second Law of Thermodynamics: The Underlying Explanation | Physics).
The link provides the set up using coin tosses:
- For a 2 coin toss scenario, macrostates include descriptions such as “2 heads” and “1 head, 1 tail”. Microstates include HH, HT, TH, and TT.
- The insight is that permutations pertain to microstates and combinations pertain to macrostates, since “1 head, 1 tail” doesn’t care about the order of said elements.
When asking, what is the probability of “two heads”, the answer is:
- the number of times this macrostate happens (i.e. number of applicable microstates) / number of all possible microstates
Clearly we don’t care about individual microstates, but meaningful collections of them (i.e. macrostates). Our investigation doesn’t pertain to one particular microstate, especially when each one is equiprobable and forms a uniform distribution.
While words like chance can be vague, the macrostates we identify are constructed from equiprobable microstates thus a distribution representing the macrostates assumes a “chance” model or underlying data generating process. Therefore, it makes sense to word the rejection of the null hypothesis as “due to chance”. More thoroughly and using the common language, we can say “the low p-value suggests it’s unlikely that the subject guessing is due to chance or that the subject is ‘chance guessing’”.
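A minimal enumeration of the two-coin example makes the macrostate probabilities explicit (nothing here beyond standard-library combinatorics):

```python
from itertools import product
from collections import Counter

# All microstates for two coin tosses, each assumed equiprobable.
microstates = list(product("HT", repeat=2))   # ('H','H'), ('H','T'), ('T','H'), ('T','T')

# Macrostates only track the number of heads, not the order.
macrostates = Counter(state.count("H") for state in microstates)

# P(macrostate) = (number of applicable microstates) / (number of all microstates)
for n_heads, count in sorted(macrostates.items()):
    print(f"P({n_heads} heads) = {count}/{len(microstates)} = {count / len(microstates):.2f}")
```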
So now we have a plausible explanation for why this “due to chance” language has continued to persist in courses, textbooks, and scientist’s lexicon. Why is this problematic?
Let’s return to coin tossing and test against a binomial distribution where p(H) = 0.9 and p(T) = 0.1. Imagine we toss a “fair coin” and get results which lead us to reject the null hypothesis of an unfair coin. Obviously in this context it doesn’t make sense to say something like “unlikely due to chance”. Therefore, why would one include such a phrase in a proposed general definition of p-values? We can also consider a test of a regression coefficient. Is there any reason to believe that “chance” favors a null hypothesis of 0 or that “chance” underlies the null data generating process?
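As a sketch of that scenario, assuming (purely for illustration) 50 tosses of the fair coin yielding 26 heads and a two-sided binomial test against the null P(H) = 0.9:

```python
from scipy.stats import binomtest

# Null hypothesis: the coin is unfair with P(H) = 0.9.
# Data: 50 tosses of a fair coin happen to yield 26 heads.
result = binomtest(k=26, n=50, p=0.9, alternative="two-sided")
print(f"{result.pvalue:.1e}")   # tiny: we reject the null of an unfair coin

# Rejecting this null cannot sensibly be worded as "the results are unlikely
# to be due to chance"; the null model here is not a "chance" model in any useful sense.
```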
Unfair coin or unfair toss. Let’s now say:
- We have a binomial distribution where p(H) = 0.5 and p(T) = 0.5.
- The coin is tossed by one of Dr. Persi Diaconis’ (https://statweb.stanford.edu/~susan/papers/headswithJ.pdf) coin-tossing machines, but we currently don’t know how the machine was adjusted or anything about the coin itself.
- We get a large p-value and we retain the null hypothesis that the coin is fair.
Using the chance language definitional framework, does that mean an acceptable rewording is “the coin is ‘chancy’” or “the coin flips according to what we would expect by ‘pure chance’”?
After we reword it, let’s say Dr. Diaconis reveals that the coin is heavily weighted in favor of heads, but he made adjustments to the coin-tossing machine such that the coin toss would function like a “normal toss” of a “fair coin”. Can this data generating process be properly described as “chance”?
Furthermore, imagine we reject the null hypothesis. A typical alternative hypothesis is that the coin is unfair, but as shown in the Diaconis et al. paper, the coin-tossing machine can be adjusted such that the coin always lands on the side which was originally facing up. Imagine such a scenario but with a “fair” or an “unfair” coin. Given our lack of knowledge regarding the data generating process, even with a rejection of the null hypothesis, we are unable to say whether the results are due to an unfair toss of a fair coin or a fair toss of an unfair coin. It doesn’t matter how many times we conduct such an “experiment”, how many statistically significant results we get, etc.
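A tiny simulation of that indistinguishability, assuming (purely for illustration) that both mechanisms produce heads with probability 0.7 over 200 tosses:

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(42)
n = 200

# Scenario A: fair toss of an unfair coin (the coin itself is biased toward heads).
heads_fair_toss_unfair_coin = rng.binomial(n, 0.7)
# Scenario B: unfair toss of a fair coin (the machine is adjusted to favor heads).
heads_unfair_toss_fair_coin = rng.binomial(n, 0.7)

# Both mechanisms generate data from the same distribution, so a test of the
# null P(H) = 0.5 rejects in both cases and cannot tell the scenarios apart.
for label, k in (("fair toss, unfair coin", heads_fair_toss_unfair_coin),
                 ("unfair toss, fair coin", heads_unfair_toss_fair_coin)):
    print(f"{label}: {k}/{n} heads, p = {binomtest(k, n, 0.5).pvalue:.2e}")
```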
Once again, falsification of statistical hypotheses may have little bearing on our “theories”. Returning back to @PeterM’s remarks, it’s hard to see how in more realistic contexts an extended record of statistically significant results can allow us to make “logically valid choices”, let alone any choices.
Concluding and Summarizing Remarks:
Overall, I hope my attempt to summarize the creation and unfolding of the thread was satisfactory. I hope I succeeded in identifying the areas of conflict, confusion, and where individuals can provide clarification. Finally, I hope what I added in regards to the primary discussion proves to stimulate further conversation.
Ultimately we should strive for nuance and plurality rather than dismissal. If @PeterM feels what @Sander contributed is nonsensical interpretation, heaps of irrelevant text, etc., then he will need to support said claims and answer some of the questions I raised (e.g. how do you define p-values).
The show-off remark is obviously not in line with the goals of this forum or any forum for that matter. In terms of the “Pope Popper” remark, while one could see that as a targeted insult, and lighter wording could have been used, I do think @Sander’s word choice wasn’t malicious and was meant to criticize a kind of claim and argument form he has seen in the past from zealous, and what he considers naive, Popperians. In my eyes, it’s a statement of “this claim fits the modus operandi of said individuals” rather than an outright insult such as “show-off”. I think the statement is acceptable, especially when Sander directly follows it with why he thinks so and relevant sources. This shows a genuine desire to debate the claim and express his honest thoughts regarding said claim.