Assessment of @PeterM and @Sander:
While I agree with f2harrell that we should criticize ideas, not people, I would like to provide my assessment of both parties as an outside observer and discuss the points of claimed contention. I will proceed by summarizing posts and comments by each party and then providing my own thoughts.
@PeterM. He started this thread with three “controversial” theses to get fruitful discussion going. He also emphasized that it is an opportunity to learn. Finally, he talks about how he will expand in further comments and “answers to questions”. This final point will come in later.
@Sander. Proceeds to thoroughly respond to each thesis and the sub-components of said theses. He says things such as “it is unclear as to what is meant by ‘significance test’”, which implies a question and invites @PeterM to clarify. @Sander also takes the time to mention his definitions/stances, his writings, and the diversity of perspectives, generally establishing nuance for us to explore.
Given that the theses reference statistical/mathematical entities, Sander rightfully dives into more “technical details” which, assuming other individuals understand them, provide a necessary basis/context for the discourse to advance. I couldn’t imagine desiring a thoughtful discussion regarding p-values and statistical tests without comprehension and discussion of the mathematical machinery, among other things (e.g. the sociology of the science, science in practice).
@PeterM. “I’d like to reiterate that I’m interested in a discussion that is about possibly learning something (hence the explicit invitation to ask questions)—and that can’t really be achieved by a) simply finding one interpretation of what somebody else said that makes no sense and proceed to lambaste it and b) lengthily show off how clever we are without ever trying to find out what the other person actually means. Because then one ends up with heaps of text that have nothing to do with what that person is saying—an unfortunate state of affairs.”
We have a variety of statements from @PeterM which can be summarized as follows:
1. An expectation that the initial discussion would be people asking questions and trying to figure out what @PeterM means by his theses.
Personally, I don’t know why @PeterM didn’t initially put some, if not all, of this information in his initial post or, if there were space constraints, in subsequent comments. His initial post says he would in 1) separate comments and 2) answers to questions.
Furthermore, when choosing controversial topics like p-values and significance tests, it shouldn’t be surprising that people will include thoughts of their own. Finally, @Sander highlights the confusion regarding properly defining and interpreting p-values, and the lack of clarity regarding what is meant by significance tests.
2. Belief that @Sander’s response incorrectly interprets the original theses
As Sander correctly states, vagueness exists within the original “theses” (e.g. significance tests), which gives room for people to provide their thoughts instead of sticking to @PeterM’s latent definitional framework. Regardless, individuals, especially working statisticians who have written on said topics, should be able to share their own frameworks as well as what else exists in the large literature, historical and more recent.
Nevertheless, I can see why @PeterM feels misinterpreted. @PeterM draws from statements in Fisher’s “Design of Experiments”, which he quotes more fully on his blog. @PeterM focuses on some general aspects:
- Repetition or an extended record of data
- Data which come from a setting which rarely fails to provide “statistically significant results”. Obviously much has been said regarding the over-emphasis on chasing statistical significance. Maybe we can more charitably interpret this as designing a study which has the capacity to consistently tell us something about our phenomenon of interest. However, I think this is a vague and unhelpful notion since in practice things are challenging and complex, and we typically have less control than we would like. Also, we could simply crank up the sample size and consistently get statistically significant results, but I don’t want to be too unfair to Fisher and, by proxy, @PeterM, hence my more charitable reading.
- We can make logically valid choices (e.g. “which theory of two (or more) is the better/best one”). Per the linked Notturno quote, this is a modus tollens process (classified as deductive in logic) in which we identify an inconsistency between the falsity of one statement and the truth of other statements. To quote from the link:
The best that a logical argument can do is test the truth of a statement. But it can do this only by showing that its falsity is inconsistent with the truth of other statements that can only be tested and never proved. Our so-called ‘proof’ methods are really techniques for testing consistency. And the demonstration that a ‘conclusion’ follows from a ‘premise’ shows only that the falsity of the ‘conclusion’ is inconsistent with the truth of the ‘premise.’
The inconsistency that marks a valid deductive argument — the inconsistency, that is, between the truth of its premises and the falsity of its conclusion — cannot force us to accept the truth of any belief. But it can force us, if we want to avoid contradicting ourselves, to reexamine our beliefs, and to choose between the truth of some beliefs and the falsity of others — because the falsity of the conclusion of a valid argument is inconsistent with the truth of its premises. …
And this is just another way of saying that what we call a proof actually presents us with the choice between accepting its conclusion and rejecting its premises
Another Fisher quote from the Design of Experiments, from the linked set of Fisher quotes (“repeated reminders”):
In relation to any experiment we may speak of this hypothesis as the “null hypothesis,” and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.
It might be argued that if an experiment can disprove the hypothesis that the subject possesses no sensory discrimination between two different sorts of object, it must therefore be able to prove the opposite hypothesis, that she can make some such discrimination. But this last hypothesis, however reasonable or true it may be, is ineligible as a null hypothesis to be tested by experiment, because it is inexact. If it were asserted that the subject would never be wrong in her judgements we should again have an exact hypothesis, and it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation. [16]
Combining these two quotes, we can say the competing “theories”, which I slightly rephrase, are:
- Subject possesses no sensory discrimination ability between two different sorts of object
- Subject possesses sensory discrimination ability between two different sorts of object
Some of my own confusion comes from Notturno’s language where they mention choosing between “beliefs” in a vague sense but then (implicitly?) specify the two beliefs as those contained within a single argument rather than competing arguments or statements. Given @PeterM’s claim that this methodology can inform which of two theories is better, I will treat it as a two step process. First we establish an inconsistency within a single “valid deductive argument” then show how that choice sets up another choice.
Let’s also say that the two “theories” are mutually exclusive and exhaustive in “theory space”, thus we can derive statements such as:
- If theory #2 (discrimination) is false, then theory #1 (no discrimination) is true
- If theory #1 (no discrimination) is false, then theory #2 (discrimination) is true
We can think of this as a light switch where if the switch is not in the off position, it must be in the on position and vice versa. I will also commit the sin of equating the theoretical statements with the null hypothesis, but as many know, we should be clear about delineating tests of statistical hypotheses and tests of theories or theoretical statements. In this toy problem, we are assuming the statistical test is close enough to a test of the “theories”.
First step. #1 is used to formulate a null hypothesis. Without getting into the nuanced weeds, the generic verbal “conclusion” is that the subject cannot accurately discriminate between two different objects. We can then create a blueprint for the design and analysis of a study, or in this case, an experiment.
To rely on @PeterM’s framework, if we repeatedly find the subject guessing correctly with high accuracy (i.e. “statistically significant” results) in independent experiments, this supports the falsity of the conclusion and consequently creates an inconsistency with the premise of theory #1.
This forces us to choose between retaining the conclusion of theory #1 and rejecting its premise, and consequently theory #1 in its entirety. So let’s say we reject the premise of theory #1, which translates to rejecting theory #1 itself.
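To make this first step concrete, here is a minimal sketch in Python, assuming (purely for illustration) a simple binomial model of the guessing rather than Fisher’s exact permutation setup, and an arbitrary choice of 8 cups per experiment:

```python
from scipy.stats import binom

# Theory #1 (no discrimination) becomes the null hypothesis that every
# guess is correct with probability 0.5, independently across cups.
n_trials = 8     # cups presented in one experiment (illustrative choice)
n_correct = 8    # the subject identifies all of them correctly

# One-sided p-value: probability of at least this many correct guesses
# under the null of no discrimination.
p_value = binom.sf(n_correct - 1, n_trials, 0.5)
print(f"P(>= {n_correct} correct | no discrimination) = {p_value:.4f}")  # ~0.0039
```

An “extended record” in @PeterM’s sense would then be several independent experiments of this kind, each yielding a small p-value.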
Second step. To reconnect with the initial tweet which started this thread: “Significance testing is all about how the world isn’t and says nothing about how the world is”. I will consider both scenarios, where the experimenter believes in theory #1 or in theory #2.
In the scenario where the experimenter believes in theory #1, the inconsistency created in the first step and rejection of the premise of theory #1 allows the experimenter to select a “better” theory, namely theory #2. However, this is a consequence of the theory space which rarely if ever exists in practice. More specifically, we have a low dimensional theory space (d = 2) where theories are mutually exclusive. We can also add that our ability to investigate things is clear, which is the exception rather than the rule in much of research.
In the scenario where the experimenter believes in theory #2, one could say that the results re-affirm the experimenter’s belief, or maybe the experimenter even claims the results support it. One could say this allows the experimenter to retain what is the “better” theory. This scenario is overly simplistic and arguably the more controversial one, since it seems to have an air of “proving the opposite hypothesis”. I treat this scenario as an acceptable possibility because of the properties of the theory space, which are the next focus.
Properties of the Theory Space. Given the set-up above, it seems like the extended record has the ability to tell us something about the world. I use Fisher’s toy problem because it’s the backdrop of such statements. However, as brought up by @Sander and others, we hardly live in a world like this. Our theory space is of unknown dimensions, our theories are multidimensional, and informative “tests” can be challenging and sometimes impossible to construct.
These practical and frequently pressing issues give credence to and help explain sentiments such as, “Significance testing is all about how the world isn’t and says nothing about how the world is”. Or instead of significance testing, a falsification framework for that matter. For the next problem, I will frame it as a frequentist and NHST procedure.
Using the restaurant example from page 545 of @Sander’s “Induction versus Popper: substance versus semantics”, let’s imagine one is eating at a food truck which serves only one dish and regardless of customer feedback, the owner doesn’t change how the dish is made or what the dish is.
Let’s claim the following:
- The evaluation of the food truck is a mutually exclusive and exhaustive evaluation (good xor bad).
- Consequently, we will treat the problem as binomially distributed and assign corresponding probabilities to good and bad as one would with heads and tails.
- Many people have told you that the food is amazing or “near perfect”.
- Good = G, Bad = B
Verbal and Numerical Hypotheses. Some plausible probabilities (P(G), P(B)) for food considered near perfect could be (0.99, 0.01), (0.98, 0.02), (0.95, 0.05), etc. That shows that the numerical null hypotheses or initial beliefs have a higher resolution than the verbal ones. Therefore, we have the problem of many numerical hypotheses mapping on to one verbal one. A starting spectrum for verbal hypotheses worded as frequency of occurrence could be:
- The food is nearly always good (0.99, 0.01)
- The food is mostly good
- The food is equally good and equally bad (0.5, 0.5)
- The food is mostly bad
- The food is nearly always bad (0.01, 0.99)
The first issue is coming up with a way to map numerical hypotheses to a smaller spectrum of verbal hypotheses. People will have different demarcations. For example, when does “nearly always” become “mostly”? Or, from a linguistic and psychological perspective, can we consider “nearly always” as a subset of “mostly”, so that it may not even arise as a verbal hypothesis and we can subsume it under the “mostly” claim:
- The food is mostly good
- The food is equally good and equally bad (0.5, 0.5)
- The food is mostly bad
Obviously, it depends on what resolution matters. Nevertheless, let’s continue with the numerical null hypothesis (0.99, 0.01) because you’ve heard nothing but amazing things about the food truck. Let’s say you eat at the food truck twice and one of the meals is considered bad. Under the null numerical hypothesis, the probability of exactly one bad meal out of two is 2 × 0.99 × 0.01 = 0.0198. Under this condition, we may reject the null, but what are we left with?
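As a quick arithmetic check of the 0.0198 figure, a small sketch under the stated binomial assumption (two visits, the null P(B) = 0.01, counting exactly one bad meal):

```python
from scipy.stats import binom

p_bad_null = 0.01        # null numerical hypothesis: P(B) = 0.01
n_meals, n_bad = 2, 1    # two visits, one meal judged bad

# Probability of exactly one bad meal out of two under the null.
print(f"{binom.pmf(n_bad, n_meals, p_bad_null):.4f}")   # 0.0198

# The tail probability (one or more bad meals), closer to a p-value, is similar.
print(f"{binom.sf(0, n_meals, p_bad_null):.4f}")        # 0.0199
```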
Compatibility. At this point, it’s important to briefly mention an idea @Sander has pushed for and one which I like: re-framing confidence intervals as compatibility intervals to aid interpretation and fight off dichotomania. For now, we can analogously view the rejection based on a small p-value as a declaration of incompatibility between the data and the null numerical hypothesis.
With this idea in mind, we can ask which numerical hypotheses the data are more compatible with in comparison to our current one (0.99, 0.01). This perspective highlights the problem of working in a theory space which isn’t overly simplistic (d = 2, mutually exclusive theories). There is a range of alternative numerical hypotheses with varying compatibility with the data we obtained. Furthermore, that range of numerical hypotheses maps messily onto the verbal hypotheses we originally formulated.
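A minimal sketch of that compatibility view, assuming the same two-meal record and using the binomial likelihood over a grid of candidate values of P(B):

```python
import numpy as np
from scipy.stats import binom

n_meals, n_bad = 2, 1                       # observed record: one bad meal out of two

# Grid of candidate numerical hypotheses for P(B).
p_bad_grid = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(n_bad, n_meals, p_bad_grid)

# Which numerical hypothesis on the grid is most compatible with the data?
print(f"{p_bad_grid[np.argmax(likelihood)]:.2f}")   # 0.50

# A wide band of hypotheses ("mostly good" through "mostly bad") is far more
# compatible with the data than (0.99, 0.01), and that band maps messily onto
# the verbal hypotheses above.
for p in (0.01, 0.10, 0.50, 0.90):
    print(f"P(B) = {p:.2f}: likelihood = {binom.pmf(n_bad, n_meals, p):.3f}")
```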
Consequently, even if we could somewhat find our way out of the statistical testing mess, it’s ambiguous how that should inform our decisions or beliefs in regards to competing “theories” (i.e. which is the better one) which for now correspond to the verbal hypotheses.
This problem doesn’t go away with more repetitions of the experiment (i.e. more p-values), with statistical significance or the lack thereof, or with where you sit on the semantics of the philosophy of science. It simply comes with the territory of the theory space and conducting research. Notice that all I changed was the dimensionality of the theory space. As many here are aware, there are many other things we can discuss that create further difficulties.
Finally, it highlights a routine problem for researchers, which is reasoning about statistical procedures (of which NHST is one of many) and their relation to the theoretical aspects of one’s substantive inquiry.
Concluding Remarks. Focusing on a narrow conception of scientific practice, knowledge generation, and definitions will inevitably stonewall this discussion. While we have different knowledge backgrounds, ultimately the discussion should be about expanding things. This is especially true when bold claims, or as @PeterM describes them, “controversial theses”, regarding p-values and significance tests are made. These are entities which not only have vast literatures behind them, but also require discussion of mathematical/statistical components, philosophical components, and substantive components. A full dismissal of any of these components will prevent us from having a truly fruitful and nuanced discussion.
Brief Addendum. We can also find examples like Avogadro’s number and imagine mindless use of significance tests. For the original exposition of this idea see Meehl (1990) “Appraising and Amending Theories: The Strategy of Lakatosian Defense and Two Principles That Warrant It” pages 117-118.
3. Belief that @Sander is showing off, not trying to find out what @PeterM means, and provided heaps of texts that have nothing to do with the original post.
Showing off:
This forum is about sharing knowledge and hopefully finding more nuance than we could find elsewhere. Hence, threads like the paper collection on dispelling statistical myths.
@Sander has provided papers which articulate his own thoughts but also other people’s papers/books which he deemed relevant. We are obviously not the only individuals to have ever discussed these topics. Furthermore, as I stated previously, p-values and significance tests rely on mathematical/statistical components, philosophical components, and substantive components. We cannot ignore these components, and bringing them up in a thorough fashion is not only encouraged, but I’m thankful for it.
It is not showing off; it is simply sharing, and thoughtful sharing at that, especially given @Sander’s extensive background with the subjects at hand.
Not trying to find out what @PeterM means, nonsensical interpretation, and heaps of irrelevant text, not engaged in a single thing talked about:
Firstly, if one is to claim that the interpretations are nonsensical and the texts are irrelevant, it would be valuable to know why @PeterM thinks this. I’m not sure about anyone else, but I didn’t think what @Sander sent was irrelevant.
As for divining meaning and nonsensical interpretations, part of the journey through complicated topics is coming to a mutual understanding regarding the elements of the discussion. However, just because one person does not understand something doesn’t mean that the thing is nonsensical. What does @PeterM find nonsensical about Sander’s comments?
I’m asking these questions with the aim of learning something.
Furthermore, given that @PeterM did not explicitly specify how they view p-values and significance tests, I don’t think it’s unreasonable for @Sander to provide that context for the group to utilize. @PeterM likely considered this to be rude and may have wanted @Sander to first ask, “Well what do you mean by…” So let me be the one to ask @PeterM:
- How do you define a p-value?
- How do you define significance tests?
I think dismissing Sander’s contribution is rude and may mean @PeterM only cares about the “philosophicals”, specific quotes from Popper, Fisher, and others, and a discussion that only fits his idea of how it should go and what content it should contain. Additionally, Sander has most definitely attempted to address things @PeterM is talking about.
To provide an example based on a long-term project I’ve been working on regarding “statistical control” by regression, imagine if I put forth the “controversial thesis”:
Statistical control by regression doesn’t mean anything!
Then @Sander proceeds to say the statement is false if taken literally because if you look at the linear algebra and work out the algebraic derivations of the beta coefficients, you can see that some sort of adjustment is made in the form of residualization/orthogonalization. Sander could then bring up past literature which discusses the rich history of regression, partial correlation, and even semi-partial correlation.
Now let’s say my working definition of statistical control by regression comes from specific resources or maybe textbooks where it is dubbed “holding all other variables constant”. I may be tempted to view his statements as irrelevant, when in reality they help bring us back to the purest form of the statistical machinery without the human interpretation projected onto it, for better or worse. It would be absurd to say such information is nonsensical, irrelevant, and didn’t engage with my thesis.
Furthermore, the mathematical treatment would then allow us to discuss whether the provided interpretation is misleading. More clearly, when I see statements like “all other variables held constant”, I think of control by stratification. Then the question becomes how well residualization aligns with selecting some strata, putting aside the practical difficulties of obtaining such a sample.
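To make the residualization reading concrete, here is a small sketch with simulated data and hypothetical variable names (x, z, y are my own illustrative choices, not anything from the thread); it shows the Frisch-Waugh-Lovell result that the coefficient on x in the full multiple regression equals the coefficient obtained by regressing y on x after x has been residualized with respect to z:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
x = 0.6 * z + rng.normal(size=n)            # x is correlated with z
y = 1.5 * x + 2.0 * z + rng.normal(size=n)

# Full multiple regression of y on [1, x, z].
X_full = np.column_stack([np.ones(n), x, z])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Residualize x with respect to z (and an intercept), then regress y on that residual.
Z = np.column_stack([np.ones(n), z])
x_resid = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
beta_resid = np.linalg.lstsq(np.column_stack([np.ones(n), x_resid]), y, rcond=None)[0]

print(beta_full[1])    # coefficient on x in the full model
print(beta_resid[1])   # coefficient on residualized x alone: numerically identical
```

Whether that algebraic “adjustment” deserves the verbal gloss “holding all other variables constant” is exactly the kind of question the mathematical treatment lets us debate.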
Obviously such a conversation might seem pedantic to some, but I think it comes from a place of questioning methodology and the claims we make regarding what they supposedly do.
4. Supposed insults by @Sander in regards to questioning qualifications and Pope Popper
Qualifications. In my eyes, calling someone a show-off and saying their contribution is irrelevant without engaging with any of it suggests a) lack of background or b) one has the background but deems the content irrelevant.
To better understand where @PeterM is coming from, his blog states he studied Bioinformatics at the FU Berlin and earned a BSc. I looked at the program’s “study regulations” and, after imperfectly translating them using Google Translate, they state things such as:
- The core subject bioinformatics is divided into the study areas of computer science, mathematics and statistics, as well as biology/chemistry/biochemistry
- The mathematics and statistics study area provides basic knowledge and skills in analysis (differentiation, integration, ordinary differential equations), linear algebra (matrix calculation, eigenvalues, principal axis transformation), statistics (elementary probability theory, basic statistical terms, decision-making, testing and estimation theory, linear statistical methods), as well as computer-oriented mathematics (representation of numbers, stability and conditioning, concepts of efficiency and complexity, numerical linear algebra, numerical quadrature and integration)
Based on this information, @PeterM presumably has the proper background/capabilities in regards to Sander’s technical treatments, although @Sander did not know this at the time and was making an inference based on the interaction. This presumably eliminates option a) but then if @PeterM has the proper background or capabilities, why did he deem the content irrelevant and refuse to engage with it at all? I’m hoping he is willing to clarify this especially if there are to be continued interactions.
Pope Popper. @Sander criticizes claims within @PeterM’s third thesis and links a paper which directly discusses said claims (e.g. “induction doesn’t work”). Specifically, he employs the Latin phrase “ex cathedra”, which, per Merriam-Webster, is defined as:
Ex cathedra is a Latin phrase, meaning not “from the cathedral,” but “from the chair.” The phrase does have religious origins though: it was originally applied to decisions made by Popes from their thrones. According to Roman Catholic doctrine, a Pope speaking ex cathedra on issues of faith or morals is infallible. In general use, the phrase has come to be used with regard to statements made by people in positions of authority, and it is often used ironically to describe someone speaking with overbearing or unwarranted self-certainty.
From this, we can see a thematic connection to the Pope Popper statement. Statements such as “naive Popperians” and “Pope Popper” understandably come off as antagonistic, insulting, and/or a jab, so a “softer touch” may have been better. But I don’t know where people stand on such comments. While rough around the edges, there is a point to be made about zealousness which can impede discourse.
I can see how copy-paste statements delivered with “unwarranted self-certainty” in regards to induction evoked such a reaction, and the play on words serves to communicate and further Sander’s criticism of Peter’s claims. Furthermore, @Sander stated he has engaged in such conversations for decades, and rigid interactions can become tiresome. Finally, I don’t think @Sander had ill intent.
To provide a personal example: when TAing for an undergraduate psychology methods course, we used a textbook which generally defined a small p-value as meaning it was “unlikely the results were due to chance”. I talked with some postdocs about my issues with the statement, but the conversation halted at me needing to accept said definition of the p-value because “that’s what the p-value is”. I became increasingly frustrated because the interaction felt “dogmatic” and “zealous”, although it was unclear who the Pope was at that time.
Since then, I have come to think this language traces back to Fisher’s famous and influential lady tasting tea example. In such a context, the null model poses a data generating process driven by chance. To show this, I’m going to employ some terminology from thermodynamics, namely microstates and macrostates (Statistical Interpretation of Entropy and the Second Law of Thermodynamics: The Underlying Explanation | Physics).
The link provides the set up using coin tosses:
- For a 2 coin toss scenario, macrostates include descriptions such as “2 heads” and “1 head, 1 tail”. Microstates include HH, HT, TH, and TT.
- The insight is that permutations pertain to microstates and combinations pertain to macrostates, since “1 head, 1 tail” doesn’t care about the order of said elements.
When asking, what is the probability of “two heads”, the answer is:
- the number of times this macrostate happens (i.e. number of applicable microstates) / number of all possible microstates
Clearly we don’t care about individual microstates, but meaningful collections of them (i.e. macrostates). Our investigation doesn’t pertain to one particular microstate, especially when each one is equiprobable and forms a uniform distribution.
While words like chance can be vague, the macrostates we identify are constructed from equiprobable microstates thus a distribution representing the macrostates assumes a “chance” model or underlying data generating process. Therefore, it makes sense to word the rejection of the null hypothesis as “due to chance”. More thoroughly and using the common language, we can say “the low p-value suggests it’s unlikely that the subject guessing is due to chance or that the subject is ‘chance guessing’”.
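A minimal enumeration of the two-coin example makes the macrostate probabilities explicit (nothing here beyond standard-library combinatorics):

```python
from itertools import product
from collections import Counter

# All microstates for two coin tosses, each assumed equiprobable.
microstates = list(product("HT", repeat=2))   # ('H','H'), ('H','T'), ('T','H'), ('T','T')

# Macrostates only track the number of heads, not the order.
macrostates = Counter(state.count("H") for state in microstates)

# P(macrostate) = (number of applicable microstates) / (number of all microstates)
for n_heads, count in sorted(macrostates.items()):
    print(f"P({n_heads} heads) = {count}/{len(microstates)} = {count / len(microstates):.2f}")
```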
So now we have a plausible explanation for why this “due to chance” language has continued to persist in courses, textbooks, and scientist’s lexicon. Why is this problematic?
Let’s return to coin tossing and test against a binomial distribution where p(H) = 0.9 and p(T) = 0.1. Imagine we toss a “fair coin” and get results which lead us to reject the null hypothesis of an unfair coin. Obviously in this context it doesn’t make sense to say something like “unlikely due to chance”. Therefore, why would one include such a phrase in a proposed general definition of p-values? We can also consider a test of a regression coefficient. Is there any reason to believe that “chance” favors a null hypothesis of 0 or that “chance” underlies the null data generating process?
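As a sketch of that scenario, assuming (purely for illustration) 50 tosses of the fair coin yielding 26 heads and a two-sided binomial test against the null P(H) = 0.9:

```python
from scipy.stats import binomtest

# Null hypothesis: the coin is unfair with P(H) = 0.9.
# Data: 50 tosses of a fair coin happen to yield 26 heads.
result = binomtest(k=26, n=50, p=0.9, alternative="two-sided")
print(f"{result.pvalue:.1e}")   # tiny: we reject the null of an unfair coin

# Rejecting this null cannot sensibly be worded as "the results are unlikely
# to be due to chance"; the null model here is not a "chance" model in any useful sense.
```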
Unfair coin or unfair toss. Let’s now say:
- We have a binomial distribution where p(H) = 0.5 and p(T) = 0.5.
- The coin is tossed by one of Dr. Persi Diaconis’ (https://statweb.stanford.edu/~susan/papers/headswithJ.pdf) coin-tossing machines, but we currently don’t know how the machine was adjusted or anything about the coin itself.
- We get a large p-value and we retain the null hypothesis that the coin is fair.
Using the chance language definitional framework, does that mean an acceptable rewording is “the coin is ‘chancy’” or “the coin flips according to what we would expect by ‘pure chance’”?
After we reword it, let’s say Dr. Diaconis reveals that the coin is heavily weighted in favor of heads, but he made adjustments to the coin-tossing machine such that the coin toss would function like a “normal toss” of a “fair coin”. Can this data generating process be properly described as “chance”?
Furthermore, imagine we reject the null hypothesis. A typical alternative hypothesis is that the coin is unfair, but as shown in the Diaconis et al. paper, the coin-tossing machine can be adjusted such that the coin always lands on the side which was originally facing up. Imagine such a scenario but with a “fair” or an “unfair” coin. Given our lack of knowledge regarding the data generating process, even with a rejection of the null hypothesis, we are unable to say whether the results are due to an unfair toss of a fair coin or a fair toss of an unfair coin. It doesn’t matter how many times we conduct such an “experiment”, how many statistically significant results we get, etc.
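A tiny simulation of that indistinguishability, assuming (purely for illustration) that both mechanisms produce heads with probability 0.7 over 200 tosses:

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(42)
n = 200

# Scenario A: fair toss of an unfair coin (the coin itself is biased toward heads).
heads_fair_toss_unfair_coin = rng.binomial(n, 0.7)
# Scenario B: unfair toss of a fair coin (the machine is adjusted to favor heads).
heads_unfair_toss_fair_coin = rng.binomial(n, 0.7)

# Both mechanisms generate data from the same distribution, so a test of the
# null P(H) = 0.5 rejects in both cases and cannot tell the scenarios apart.
for label, k in (("fair toss, unfair coin", heads_fair_toss_unfair_coin),
                 ("unfair toss, fair coin", heads_unfair_toss_fair_coin)):
    print(f"{label}: {k}/{n} heads, p = {binomtest(k, n, 0.5).pvalue:.2e}")
```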
Once again, falsification of statistical hypotheses may have little bearing on our “theories”. Returning back to @PeterM’s remarks, it’s hard to see how in more realistic contexts an extended record of statistically significant results can allow us to make “logically valid choices”, let alone any choices.
Concluding and Summarizing Remarks:
Overall, I hope my attempt to summarize the creation and unfolding of the thread was satisfactory. I hope I succeeded in identifying the areas of conflict, confusion, and where individuals can provide clarification. Finally, I hope what I added in regards to the primary discussion proves to stimulate further conversation.
Ultimately we should strive for nuance and plurality rather than dismissal. If @PeterM feels what @Sander contributed is nonsensical interpretation, heaps of irrelevant text, etc., then he will need to support said claims and answer some of the questions I raised (e.g. how do you define p-values).
The show-off remark is obviously not in line with the goals of this forum or any forum for that matter. In terms of the “Pope Popper” remark, while one could see that as a targeted insult, and lighter wording could have been used, I do think @Sander’s word choice wasn’t malicious and was meant to criticize a kind of claim and argument form he has seen in the past from zealous, and what he considers naive, Popperians. In my eyes, it’s a statement of “this claim fits the modus operandi of said individuals” rather than an outright insult such as “show-off”. I think the statement is acceptable, especially when Sander directly follows it with why he thinks so and relevant sources. This shows a genuine desire to debate the claim and express his honest thoughts regarding said claim.