In summary: “There is nothing new under the sun but there are lots of old things we do not know”
Bierce A. The Devil’s Dictionary. USA: Oxford University Press; 1999.
But if the intra-annual seasonality is expected to have big jumps at the year boundary, as is the case with, say, sales data that is heavily concentrated in November and December but plummets in January once the holidays end, then cyclic splines are sub-optimal, right?
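To make the contrast concrete, here is a minimal sketch (assuming patsy's mgcv-style spline bases; the variable names and df are purely illustrative) of a cyclic basis, which forces the fit to agree at the two ends of the year, versus an ordinary cubic basis, which leaves the January boundary free:

```python
import numpy as np
from patsy import dmatrix  # patsy provides mgcv-style cr()/cc() spline bases

day = np.arange(1, 366)  # day of the year for one seasonal cycle

# Cyclic cubic basis: constrains f(day=1) == f(day=365),
# smoothing the fit across New Year.
X_cyclic = dmatrix("cc(day, df=6)", {"day": day})

# Ordinary (natural) cubic basis: no boundary constraint, so an abrupt
# December-to-January drop in sales can be captured.
X_open = dmatrix("cr(day, df=6)", {"day": day})
```

With the cyclic basis, any real December-to-January jump gets smeared across the boundary, which is exactly the sub-optimality being asked about.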
That seems to imply that the interpretation of a p-value for a given study depends on previous narratives and beliefs. For example, in an exploratory setting, obtaining a low p-value will only imply support for the rationale under construction, without any confirmatory element. In contrast, after a robust rationale and a rigorous design (e.g., an RCT), a low p-value will in general imply strong support for the hypothesis. Confirmation also requires auditing all aspects at play in the model (e.g., type of randomization, accuracy of data, assumptions, absence of bias or alternative explanations, etc.). The difference is subtle, but it seems to be there. It is expected but curious, given that subjectivity is the frequentists' main critique of Bayesians. The p-value is only a cognitive aid on which to build stronger rationales by the inductive route, or to try to reject beliefs by the hypothetico-deductive route, following Popper. Right?
Yes to what I think is your core sentiment, except for an important aspect of P-values: they are purely refutational, a point seen in Fisher and others going back to Pearson in 1906, warning against interpreting a large p as showing the tested hypothesis is true, or as “supporting” it. Likewise, a small P-value does not support or confirm the alternative except within very narrow contextual confines (e.g., the LHC experiments, in which the Standard Model along with extensive equipment checking confines the interpretation of the Higgs-detection results). Support is relative, but P-values are absolute measures; you need further structure (such as a likelihood function) to measure support. That caution fits well with your remarks about auditing and falsificationist approaches, but warns against the common practice of taking P-values as “inductive” or “confirmatory”.
Nice thoughts. I think you mean that you can use the likelihood function to estimate the data's support for H0 vs H1, as likelihoodists do, and that you can create a decision rule with this criterion. I guess you mean that the p-value does something subtly different: it does not measure the statistical support for a hypothesis, but the degree of surprise if H0 were true. Right?
In any case, the interpretation of the degree of surprise may depend on what we thought about H0 before the study. For example, at CERN they really had very good a priori reasons to reject H0 (in the early months they were declaring they would be very surprised if the Higgs boson did not exist). What would have happened if the CERN experiment had been done before Peter Higgs proposed his theory? Perhaps it would have been impossible, because physicists would not have known where to look, and this is one of the beauties of the hypothetico-deductive method. But let us perform the thought experiment of imagining that, for some reason, the experiment had been carried out anyway.
Since there would have been no rational process behind that hypothetical experiment, the interpretation of the p-value (or of the sigmas) would not have been the same as in the real experiment. That is, the same p-value would surely be interpreted differently depending on the research context. Therefore, the p-value always needs the theory, and it is not as objective as it is often claimed to be.
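As an aside, the “sigmas” quoted in particle physics and the p-value are interchangeable scales for the same tail probability, so the contextual point applies to both. A minimal sketch of the conversion (assuming a one-sided normal tail, as in the conventional 5-sigma discovery rule; scipy is the only dependency):

```python
from scipy.stats import norm

# One-sided tail probability implied by the conventional 5-sigma threshold.
p_5sigma = norm.sf(5)        # ~2.87e-7
print(p_5sigma)

# And back: the number of sigmas implied by a given p-value.
print(norm.isf(p_5sigma))    # 5.0
```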
My point here was to bring that idea to the articles of medicine, which are much more empirical than frontier physics, and where rationales sometimes seem solid but are not (e.g., publication bias, cherry-picking…), or where there is nothing but conjecture. The p-value would be interpreted in the light of the rationale, which is curious, because the rationale is often overshadowed once the p-value is reported.
And this leads again to the problem with nullism, and why the entire theoretical structure is needed.
Again, I find your basic sentiments agreeable, but there are technical details that need more attention. First, any statistic has an “objective” interpretation based on the pure math in its definition and derivation. In the case of frequentist testing, the S-value, not the P-value, is the surprisal measure; from the math we can interpret it as a transform of the original test statistic T to a form S that is exponentially distributed under the assumptions (model) M used for the derivation (unit exponential when using nats as the measurement unit), and which has a cognitive scaling advantage (e.g., equal-interval scaling) over P. In this sense of “objective” (the same for everyone who understands the math), S is one possible measure of the information provided by looking at the data along the T axis using only M as the decoder.
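To make that scaling concrete, a minimal sketch of the P-to-S transform in plain Python (base 2 gives bits; base e gives the nats mentioned above; the function name is illustrative):

```python
import math

def s_value(p: float, base: float = 2.0) -> float:
    """Surprisal (S-value) of a P-value: S = -log_base(p).
    base=2 gives bits; base=math.e gives nats (unit exponential under M)."""
    return -math.log(p, base)

print(s_value(0.05))           # ~4.3 bits: about as surprising as 4 heads in a row
print(s_value(0.05, math.e))   # ~3.0 nats
```

On this scale, halving p adds one bit, which is the equal-interval property that raw P-values lack.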
The contextual considerations you raise that modify our personal “subjective” surprise at its value are not built into M; if we want to formally account for them, we must build them into M. But in the physics example all that was completely unnecessary, as the test model M was simply a null reference point within the context (background assumption set) of the Standard Model A; the tested hypothesis H (= no Higgs) was not a previously “established” condition that they were trying to hold on to or that they preferred to be the case.
Contrast that with med research, where there are often parties with conflicting preferences. I have seen that usually each party (and the editors) will misinterpret the same statistics in opposing directions, to suit their preference, especially when the results are ambiguous and thus easy to spin in any preferred direction. Again, med journals are replete with examples. This situation suggests that the major problem with testing is subjective preference synergizing with dichotomania. Here, nullism represents one common (and often imposed) preference that distorts interpretations of ambiguous results; its opponent is the pressure to find something “significant”. These subjective forces are not part of the statistical math, although they can dramatically influence the choice of M and thus the reported results.
As far as I know, the Higgs boson had been part of the Standard Model since 1967, so the usual structure of hypothesis testing distinguishing H0 vs H1 must have been foggy. In any case, besides rejecting the model without the Higgs, the new particle had exactly the features predicted by the Standard Model. So it was not only a matter of rejecting H0, but of finding something really nice by induction.
What would have happened if there had been no previous theoretical framework? Maybe the experiment would not have been done. In any case, theoretical physicists would right now be pondering its deepest meaning, trying to arrive at the theory underlying the Higgs, while the engineers would still be checking that the wires were properly laid.
As far as medicine is concerned, everything seems even more ambiguous, as you say, because the evidence is usually much weaker, since on many occasions it is impossible to perform such rigorous experiments. In addition, there are many more incentives for disparate interpretations, since physicians must make urgent decisions about their patients at all times.
This is why I have the feeling that a p-value is just an attempt to build a useful cognitive aid, to guide us while we try to learn difficult things.
The manuscript summarizing the key points from this thread was just published. I believe this is the first paper to come out of the datamethods discussions. Thank you @f2harrell for hosting this forum, and @Sander for rigorously course-correcting us at every step.
Great to see this! Is there a preprint version of the paper? Our medical library does not have this article.
I’ll email you now the current pre-proof version and then the final version after it’s been proof-read.
Many thanks to everyone for the thread, and especially to Pavlos, without whose help this manuscript would not have been possible. As a corollary, one of the most interesting things I have learned from this article is how many ways there are to configure a categorical variable. For a 12-level factor, the number of possible groupings is the 12th Bell number minus one, i.e., more than 4 million possible Cox models for a survival analysis, all of them equally feasible.
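For anyone who wants to check the arithmetic, here is a minimal sketch computing Bell numbers via the Bell triangle (plain Python, no dependencies; the Bell number counts the ways to partition the factor levels into groups):

```python
def bell(n: int) -> int:
    """n-th Bell number (number of set partitions of n items), via the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]           # each row starts with the last entry of the previous row
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]                # B(n) is the last entry of row n

print(bell(12))      # 4213597 partitions of a 12-level factor
print(bell(12) - 1)  # 4213596, i.e. "more than 4 million" candidate groupings
```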
This has caught my attention because in oncology we use categorical variables all the time (e.g., location of the primary tumour, type of response to chemotherapy, location of metastases, etc.). It is not uncommon to see variable levels arbitrarily grouped in RCTs. One example I remember is the RADIANT-4 RCT, which arbitrarily clustered several tumour types: Everolimus for the treatment of advanced, non-functional neuroendocrine tumours of the lung or gastrointestinal tract (RADIANT-4): a randomised, placebo-controlled, phase 3 study - PubMed
In summary, my feeling is that this issue of categorical variables is rather more complex, and runs deeper, than I had originally thought.
This figure shows the number of individuals born on each specific day in a sequence of birth cohorts, just rolled around the calendar year. It is not difficult to guess why I call it “Happy Holiday”.
PS: please do not be angry with me for sidetracking this wonderful discussion with a light-hearted comment, but the weekend is approaching.