Predicting survival of cancer patients using the horoscope: astrology, causal inference, seasonality, frequentist & Bayesian approach

On first look, that seems fine except for a (very common) mistake:
A P-value is not “the probability of the observed data under H0”; that probability is called the likelihood under H0. A P-value is only the probability of the observed tail event under H0.

Also, as a numeric aside for practice: Remember that our real P-values are (like all statistics) discrete, and are not exactly uniform even if computed from exact tail areas; plus they are usually approximations to those tail areas. At best (as with Fisher-exact statistics and with many composite hypotheses) they are suprauniform, and thus are “conservative” (understate the exact tail probability and thus understate information). In apps like ours (testing a GLM regression coefficient in a single data set with adequate numbers for all approximations) this is a small error source (typically <1 bit), especially if we use score statistics instead of more crude Wald approximations. But in small-count or sparse-data examples like you are describing it can create some discomforting discrepancies from idealized limiting behavior.
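As a small toy illustration of that discreteness and supra-uniformity (my own sketch, not from the post; the binomial setup with n = 10 and theta = 0.5 is an arbitrary illustrative choice): for an exact one-sided binomial test the attainable p-values form a finite set, and under H0 the probability that P falls at or below a cutoff is at most, and usually below, that cutoff.

```python
# Exact one-sided binomial test, n = 10, H0: theta = 0.5 (illustrative choice).
# The attainable p-values are discrete and supra-uniform: Pr(P <= alpha | H0) <= alpha.
from scipy.stats import binom

n, theta0 = 10, 0.5
# Exact one-sided p-value for each possible count k: Pr(X >= k | H0)
p_of_k = {k: binom.sf(k - 1, n, theta0) for k in range(n + 1)}

for alpha in (0.05, 0.10, 0.25, 0.50):
    # Probability under H0 of observing a count whose p-value is <= alpha
    attained = sum(binom.pmf(k, n, theta0) for k, p in p_of_k.items() if p <= alpha)
    print(f"alpha = {alpha:.2f}   Pr(P <= alpha | H0) = {attained:.3f}")
```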

2 Likes

The classic game of Mastermind is how I learned to understand the concepts. There are connections to the design of experiments I’m just beginning to explore now.

An open source version of the game of Mastermind I played had a code maker that would create a sequence of 4 colors out of a set of 8.

The total number of possible codes is 8^4 = 2^{3*4} = 2^{12} = 4096 (assuming repetition is allowed).

Working out how much information you gain at various possible states and stages of the game is a good exercise.

  1. Color and position.
  2. Eliminating a color.
  3. Eliminating a position.
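As a rough sketch of that exercise (my own illustration, using the 8-colour, 4-position setup above and measuring information as the log2 reduction in the number of codes still consistent with what you know):

```python
from math import log2

colours, positions = 8, 4
total = colours ** positions                        # 8**4 == 2**12 == 4096 codes

def bits_gained(before: int, after: int) -> float:
    """Bits of information = log2(codes possible before / codes possible after)."""
    return log2(before / after)

# 1. Learning one colour-and-position exactly leaves 8**3 consistent codes:
print(bits_gained(total, colours ** (positions - 1)))           # 3.0 bits

# 2. Eliminating one colour entirely leaves 7**4 consistent codes:
print(bits_gained(total, (colours - 1) ** positions))           # ~0.77 bits

# 3. Ruling out one colour at one position removes 8**3 codes:
print(bits_gained(total, total - colours ** (positions - 1)))   # ~0.19 bits
```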

What do the various algorithms discovered to solve this game have to teach about experimental design?

Addendum: The distinction between “evidence” and “surprise” is critical, and was not at all obvious to me until I studied the misinterpretations of p-values in depth. The following paper is one of the few I’ve found that addresses the concept of surprise from a formal Bayesian POV.

1 Like

The translation of this concept into common language can be a minefield. What I understand by p-value is the probability of obtaining data at least as extreme as mine if H0 is true. I guess that’s what you mean by a tail event, right?
Therefore, a p-value of 1/4 implies that my data provide 2 bits of information against the null hypothesis.
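As a quick check of that arithmetic (a minimal sketch, not from the original post):

```python
from math import log2

p = 1 / 4
s = -log2(p)    # S-value (surprisal) in bits against the test model
print(s)        # 2.0 bits, i.e. no more surprising than 2 heads in a row from a fair coin
```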

Yes, a dense minefield and rarely traversed safely. A major hazard is misuse of “the data”: No, neither P nor S refers to “the data” (a microstate); rather, P refers to the ordinal position (quantile) at which the observed data falls according to some statistic T that maps the data into the real line.
T is supposed to “summarize the data” along a particular axis, encoding information about the discrepancy/distance between the observed state (the data) and what we’d expect under the model along a particular dimension of interest in the sample (observation) space. This is the sense in which p indicates a degree of surprise at the observed value t of T, under the model, which can be measured by the S-value (as well as by other, functionally related measures, e.g. sigmas as used in physics).

4 Likes

Is it possible that on some occasions the difference is merely a matter of semantics?
For example, you do a permutation test to compare the means of two groups, and for that you use as the statistic the difference in means computed for each permutation under H0. This statistic has a certain distribution onto which you map your actual result.
For example, I observe an actual mean difference of 3 units.
Then I do 100 random permutations under H0 and find that in 10 of them the difference is 3 or more units.
The p-value = 10/100 in that case actually represents the probability of an outcome at least as extreme as mine within the distribution of that statistic over the set of permutations under H0.
It is also true that I have placed the statistic on an ordered line and checked its distance from the expected values.
However, wouldn’t both definitions be true?
In this example I do not understand why the nuance is relevant in real terms.
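To make the example above concrete, here is a minimal sketch (my own, with made-up numbers and 10,000 rather than 100 permutations):

```python
# Minimal sketch of the permutation test described above (made-up data).
import numpy as np

rng = np.random.default_rng(0)
group_a = np.array([12.1, 14.3, 11.0, 13.2, 12.5, 14.8])
group_b = np.array([11.0, 12.2, 13.9, 12.8, 11.5, 12.0])

observed = group_a.mean() - group_b.mean()      # the actual mean difference
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

n_perm = 10_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)           # reshuffle group labels, as H0 allows
    perm_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# One-sided p-value: proportion of permuted differences at least as extreme as observed
p_value = np.mean(perm_diffs >= observed)
s_value = -np.log2(p_value)                      # the same information expressed in bits
print(observed, p_value, s_value)
```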

Can you summarize the difference between evidence and surprise in intuitive terms, please?

Others may disagree, but no, I would not say it is a matter of semantics; more a matter of getting sloppy with words and then risking serious error in missing a profound limitation of P-values and any single-number (scalar) statistic: It can only be carrying 1 dimension of information (the information in one data feature about a corresponding data-generator feature). The statistic from which p is computed is not “the data”, it is one small aspect of it among many (hopefully one chosen for good or at least clear reasons, but more often just a habit). In your example, why did you use the sample mean as the statistic? Why not the median? What if the underlying generator is Cauchy (which does arise in physics) with unknown location?
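As a toy illustration of that last point (my own sketch, nothing from the thread): with a Cauchy generator the sample mean is itself Cauchy-distributed at every sample size and never concentrates, while the sample median does concentrate around the true location.

```python
import numpy as np

rng = np.random.default_rng(42)
for n in (100, 10_000, 1_000_000):
    x = rng.standard_cauchy(n)        # standard Cauchy: location 0, scale 1
    # The mean remains unstable however large n gets; the median settles near 0.
    print(f"n = {n:>9}   mean = {x.mean():>10.3f}   median = {np.median(x):>7.3f}")
```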

Over on Twitter last week I got into a heated discussion over how you should never cut corners when discussing P-values (or probably any formal statistic). I was among those who argued that any attempt to simplify, omit details in the build-up, or get too loose or poetic will come back to haunt us with user misconceptions and application errors of the sort that dominate “soft science” research literature today. It’s a mistake I’ve made at times!

2 Likes

OK, after this explanation I completely understand and am fully convinced. It is surprising how far a discussion about the horoscope has come. I am thinking that perhaps this thread could be summarized and sent somewhere as a reflection on the meaning of significant p-values arising from hypothesis-free studies in medical registries with many variables.

Yes, but it should be on the meaning of P-values … significant or not!

Yes, better … the notes from this thread allow for a commentary that could even be funny. If anyone is interested in helping, please tell me. Thank you all for the thread.

Measures of surprise only require the specification of a single model. Surprise is an absolute concept in the sense that the S-value provides bits of refutation conditional on the model being true.

Aside from the log transform Sander has written about, p-values have a number of transformations that can be used for different purposes. Instead of the log transformation, the probit transform \Phi^{-1}(1-p) puts the information on the familiar z scale, and is closely related to the non-centrality parameter used in N-P power analysis.
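A minimal sketch of the two transformations mentioned above (my own example values; uses scipy for the normal quantile):

```python
from math import log2
from scipy.stats import norm

for p in (0.25, 0.05, 0.005):
    s = -log2(p)       # S-value: surprisal in bits
    z = norm.isf(p)    # probit transform Phi^{-1}(1 - p): one-sided z ("sigma") scale
    print(f"p = {p:<6}   S = {s:5.2f} bits   z = {z:5.2f}")
```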

Measures of (relative) evidence require the specification of 2 models, if we take seriously the arguments in Richard Royall’s instructive monograph Statistical Evidence or in A.W. Edwards’s Likelihood.

Likelihoods have all of the properties one would want in an evidence metric.

Consider the case of equivalence or non-inferiority studies. All of this is relatively easy in the likelihood or Bayesian case. It is “simply” the likelihood ratio of the null vs the alternative (this may not be easy to calculate!).
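As a hedged sketch of what that likelihood ratio could look like in the simplest case (my own illustration: a normal approximation for the effect estimate with known standard error, and made-up numbers for the estimate and the non-inferiority margin):

```python
from math import exp

est, se = -0.05, 0.10              # estimated effect and its standard error (assumed)
theta_alt, theta_null = 0.0, 0.30  # "no difference" vs the non-inferiority margin (assumed)

def log_lik(theta: float) -> float:
    """Normal-approximation log-likelihood of theta, up to an additive constant."""
    return -((est - theta) ** 2) / (2 * se ** 2)

# Likelihood ratio favouring "no difference" over the margin, given the data
lr = exp(log_lik(theta_alt) - log_lik(theta_null))
print(lr)
```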

You can hack methods based on p-values to do interval tests, but you are choosing a decision rule, not making any particular claim about the models conditional on the data.

From the abstract of that paper on the use of surprise in Bayesian analysis:

> Measures of surprise refer to quantifications of the degree of incompatibility of data with some hypothesized model H_0 without any reference to alternative models. Traditional measures of surprise have been the p-values, which are however known to grossly overestimate the evidence against H_0 [when incorrectly interpreted as evidential metrics – my emphasis]. Strict Bayesian analysis calls for an explicit specification of all possible alternatives to H_0 so Bayesians have not made routine use of measures of surprise.

1 Like

I think there is certainly a rationale and interest among oncologists for you all to write up key take-home messages from this thread. The original post is very similar to this example of a subgroup analysis of the STAMPEDE trial in prostate cancer, where the authors found p=0.0021 for the interaction between day of diagnosis and treatment. One difference may be that the authors there deliberately trawled through various implausible subgroups until they found one with a low p-value, which I do not think is what happened here.

Inspired by @Sander and I.J. Good we have been successfully using S-values for the past 2 years to help hematology/oncology fellows at MD Anderson intuit p-values as measures of surprise against the model (including all background assumptions). Among other things, they are pedagogically the best way I have found thus far to help us clinicians move away from arbitrary dichotomizations. The feedback has been enthusiastic and I can see the gradual improvement over time in the way multiple choice questions from clinical research scenarios are being answered. We subsequently used S-values as well in an elearning course on p-values by the American Society of Clinical Oncology with good feedback, again suggesting that this approach has practical/conceptual advantages that are embraced by clinicians and researchers. Claude Shannon’s “A Mathematical Theory of Communication” is one of my favorite papers in general and provides key contextual understanding of the concept of surprisal.

2 Likes

Perhaps you both (Pavlos and Alberto) should write up those key messages. I will for the moment leave you with the idea that there is no such thing as “hypothesis free”: The very fact that the database contains a variable indicates it was thought important enough for health research to measure and include. This is why you will find hematocrit but not fingernail length in medical records. The distinction is I think better seen as one of determining goals: exploratory (selecting hypotheses from an initial cross-classification of possibilities for further pursuit) vs confirmatory (pursuing pre-selected hypotheses). But there have been many articles and book sections about this issue - and just about everything else!

4 Likes

@albertoca, happy to assist you.

Thank you @Pavlos_Msaouel, I need to think about how to organize it taking into account how the thread began. I’ll think about it for a few days and write you.
With respect to hypothesis-free analysis, it is very common for researchers to analyze things, just because they can, without any underlying rationale.
A few years ago I made a nomogram predicting the patient’s age, to make fun of urologists, who had published more than 1000 nomograms on prostate cancer. Prediction is not causal inference, but it is common to build models from databases that were not even remotely designed for that purpose. One example is Khorana’s score for predicting thrombosis, which emerged from a study of febrile neutropenia in which the key variables for predicting thrombosis were missing, and the response variable was probably not very well collected.

1 Like

Ummm… I found the nomogram on my computer. I did not translate it.
I think it had good discrimination and was well calibrated.

This will reopen the debate. I think I agree with you, but these categories are not absolute either, and there is probably a continuous spectrum between one and the other. Within what would be framed as exploratory there may be analyses based on reasonable assumptions, or pure data torturing. Distinguishing one from the other is based on common sense, critical capacity, etc., rather than statistical expertise. For example, have you read that some astronomers claim to have discovered life in the clouds of Venus? The conjecture is based on an idea of Carl Sagan’s from the 70s; how skeptical should we be about it? The hypothetico-deductive method has to start from a solid rationale, evaluated critically. The test evaluates the consequences of this rationale, in the spirit of Popper. However, the rationale arises by induction. What mode of reasoning has been used for the detection of phosphine on Venus, inductive or hypothetico-deductive? Well, I don’t see it so clearly: on the one hand they aimed the telescope simply to see what was going on; on the other hand, they had a good idea of what they had to look at, and of what the essential biochemical processes are. The original idea of this thread was to reflect on completely unexpected results, after analyzing an instrumental variable of a registry (the date of birth).

1 Like

That article is interesting. It is curious how easy it is to publish some articles, and how complicated it is to publish others. One solution is simply to apply the Bradford Hill criteria to this type of analysis.

For such purposes, I am also a big fan of structural causal models for oncology, in part because we are blessed in many cases to have co-clinical in vitro and in vivo experimental information in addition to our clinical data. Judea Pearl’s Primer (freely available here) is a fantastic in-depth resource and the 1999 Epidemiology paper by @Sander, Pearl, and Robins is a classic.

No doubt there is a continuum, and multiple dimensions as well. My point was only that in the present context there is no such thing as “hypothesis free” testing, just as there is no such thing as “football-free” football. To be able to apply a hypothesis test means that some kind of hypothesis or model was thought up to test. In your starting example it was simply a hypothesis in which all our priors for the causal effect are concentrated on the null. I pointed out that it does not follow that our priors on the association (not the effect) should be as null-concentrated (because there could be confounding by things like season or cultural effects). But there was certainly a hypothesis there.
What you meant I think is that the data are approached without a few pre-specified, precise, contextually targeted causal hypotheses, and instead a lot of mostly doubtful associational hypotheses are tested in the hopes some will pop up as worth pursuing, as in GWAS. This situation is so common that there are whole books largely devoted to it, e.g., Efron & Hastie 2016 https://web.stanford.edu/~hastie/CASI/

2 Likes