Predicting survival of cancer patients using the horoscope: astrology, causal inference, seasonality, frequentist & Bayesian approach

By the way, a fellow oncologist wrote me to complain that we’ll have them testing outlandish hypotheses from small P-values, or from subgroup analyses, until the end of time.

I don’t think this result calls out for reproduction yet, because the data haven’t yet been analyzed using a model grounded in a theory that would predict reproducibility. I suggested one way to do that.

To elaborate, I’m going to give you my version of a neo-Fisherian/post-Popperian take which parallels the usage of P-values in the Higgs boson experiments. Those experiments were at the opposite extreme from the present case, insofar as almost everyone was sure the background theory was correct and yet they wanted to put it through a severe test - severe in the physical sense of demanding extremely expensive precision equipment and personnel (not “severe” as an absurd malapropism referring to a non-null P-value).

Ideally, the theory would be a causal mechanism on which to build up a statistical model form. Regardless, only if an analysis grounded in such a theory produced a lot of information (= tiny P-value = large S-value) against the null in the direction predicted by the theory (thus corroborating the theory) could one say further pursuit could be worth the effort. Otherwise, without a theoretical derivation of the statistical model used to derive the test (whether from the standard model of physics or some astrological tale), one has to ask: what is the likely yield or value of pursuing some randomly observed association?

7 Likes

As a counterpoint to physics, what now concerns many oncologists such as @RamonSalazarS and others is the multiplication of dubious results brought on by the rise of omics - results not generated by solid hypotheses and not validated. We don’t have billions of patients the way physicists have billions of events; we don’t look so much for absolute truths through severe tests, as Deborah Mayo explains, but for practical ways to improve the day-to-day lives of patients. That is why my feeling is that sometimes we get caught up in the scale of the philosophical discussion, applying concepts that would be very valid in the field of theoretical frontier physics to the mundane realm of daily clinical practice.

On the other hand, evaluating temporality as a confounding factor is interesting, but it can also be of interest in itself, even if only to conclude that its effect does not exist or is minuscule.

I see no “counterpoint to physics”, because (like physicists and everyone else) oncologists will only spend their constrained resources on associations whose pursuit they expect to pay off. This means they won’t be pursuing astrology (except perhaps as a parlor game) because the stars are much too far removed from tumor genesis, and won’t pursue gestational season either because the outcome (while not as far removed as from the stars) is still in their minds too far removed from that factor (as encoded in their accepted theory, as you described).

I feel like I have not been clear enough so let me recap how I see our discussion: The issue I think perhaps you meant to raise at the start is what to make of small P-values when there is no theory to explain such an observation (at least, no theory we aren’t 100% sure is false like astrology). My counterpoint was in part this: Why were you examining the association at all? It was in fact because you had some ill-formed notion of some theory to test that you were sure was false, namely astrological theory; and you thought that redividing the year into horoscope months would provide a test of that theory which you expected to be falsified by a high P-value (which is itself the reverse of Fisher’s logic for using P-values). I pointed out that (1) your P-value did not test astrological theory because you did not derive any prediction (statistical model) from astrology or for that matter from any theory at all, and that (2) an observed P-value could have a myriad of other explanations, which even if seemingly implausible were not absurd like effects of stars; I gave one example: season. So, all you tested was an unordered-categorical statistical model with category labels derived from astrology. I went on to describe how to test astrology if that was someone’s actual intent, allowing for unknown confounding by season.

If we now discard all that theory, we can return to the question: What to make of a P-value with no derivation from a background theory? Very little, almost nothing; I’d say just this much: According to a particular measure of distance called the Shannon or binary surprisal or S-value, and making no use of any structural information like time ordering of dates, the data you gave are about 6 or 7 bits away from what is predicted by the null model in which the underlying outcome means are uniform over birthday [bits are tiny units so let’s not fuss over decimals; each one represents the maximum information we could gain from one binary observation, e.g., the maximum information that we could gain from one toss about loading for heads in a coin-tossing mechanism]. Without more input, that’s the end of the story supplied by statistical theory. And anything more (besides more of the same sort of data) has to be supplied by the context, which is what determines whether this analysis is in retrospect a trifling diversion like a horoscope, or the lead into the greatest breakthrough of the decade in oncology.
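As a quick numerical aside (my own arithmetic, not part of the original reply), 6-7 bits corresponds to P-values of roughly 0.016 to 0.008 - about as surprising as seeing 6 or 7 heads in a row from a coin you were told is fair:

2^(-(6:7))               # P-values implied by 6 and 7 bits: about 0.0156 and 0.0078
-log2(c(0.016, 0.008))   # and back again: about 6 and 7 bits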

6 Likes

Your response conveys very stimulating ideas. I can’t help but dig a little deeper into the S-value. I have two questions. Shannon’s formula, relating information to entropy, is relatively easy to understand: microstates increase exponentially in relation to the minimum information needed to describe them. However, I do not understand the intuition behind extending that concept to p-values. Since some readers here may not be clear about it either, perhaps it would be nice if you could explain it in a simple way, as intuitively as possible. What makes the p-value somewhat similar to the number of microstates in Boltzmann’s equation? On the other hand, and as a result of @R_cubed’s previous comment, I was asking a colleague whether 6-7 bits were too much or too little, looking for analogies with worldly issues. We did not reach a clear conclusion. The idea is beautiful, but how do you interpret the number of bits to know if 6 are too little?

Answering that gets into a big topic area…

Various transforms of the observed P-value p such as 1−p, 1/p, and log(1/p) have been used as measures of evidence against statistical models or hypotheses (which are parts of models). Part of the interpretation problem is that those statistical models get confused with the theories that predict them. Correct understanding however requires keeping them distinct, because the measures refer only to the statistical model used to derive the p-value; purely logically the measures say nothing about any theory unless that model can be deduced from the theory. Thus (as Shannon pointed out) the information they measure is only information in the narrow syntactic sense of data deviation from statistical prediction; they connect to semantic information (contextual meaning) only to the extent that information gets encoded in the theory in a form that enters into the deduction of the statistical model from the theory. You may see my earlier comments as stemming from this distinction. [Parallel comments apply to other statistical measures in other systems, such as likelihood ratios and Bayes factors, which have huge literatures including math connections to P-values - but again those literatures often confuse the statistical model (which may now include an explicit prior distribution) with the causal network or theory under study.]

The idea of re-expressing P-values as surprisals via the S-value s = log(1/p) = −log(p) transform goes back (at least) to the 1950s, using various log bases (which only change the unit scale). The transform also arises in studies of test behavior under alternatives, with ln(1/p) as an example of a betting score and a “safe” test statistic.
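For concreteness, a few values of the base-2 transform (simple arithmetic, not taken from the papers listed below):

p <- c(0.5, 0.05, 0.01, 0.005)
round(rbind(p = p, s_bits = -log2(p)), 2)   # e.g. p = 0.05 gives about 4.32 bits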

I’ve authored or coauthored several articles trying to explain the S-value’s motivation and interpretation from a neo-Fisherian (refutational statistics) perspective. The following should be free downloads, and may most directly answer your question:
Greenland, S. (2019). Some misleading criticisms of P-values and their resolution with S-values. The American Statistician, 73, supplement 1, 106-114, open access at www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1529625
Rafi, Z., and Greenland, S. Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, in press. http://arxiv.org/abs/1909.08579
Greenland, S., and Rafi, Z. To aid scientific inference, emphasize unconditional descriptions of statistics. http://arxiv.org/abs/1909.08583
See also this background essay against nullism, dichotomania, and model reification:
Greenland, S. (2017). The need for cognitive science in methodology. American Journal of Epidemiology, 186, 639-645, https://academic.oup.com/aje/article/186/6/639/3886035.
More quick, basic treatments of key topics in the above articles are in:
Amrhein, V., Trafimow, D., Greenland, S. (2019). Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication. The American Statistician, 73, supplement 1, 262-270, open access at www.tandfonline.com/doi/pdf/10.1080/00031305.2018.1543137
Greenland, S. (2019). Are “confidence intervals” better termed “uncertainty intervals”? No: Call them compatibility intervals. British Medical Journal, 366:l5381, https://www.bmj.com/content/366/bmj.I5381.
Cole, S.R., Edwards, J., and Greenland, S. (2020). Surprise! American Journal of Epidemiology, in press. https://academic.oup.com/aje/advance-article-abstract/doi/10.1093/aje/kwaa136/5869593

8 Likes

P.S. Re: “how do you interpret the number of bits to know if 6 are too little?”
Without extensive context, that question is far too ill-posed to answer, yet conventional stat training and practice pretend otherwise. The problem is that, as used in such questions, qualitative terms like “too little”, “large”, “very small” etc. are contextual valuations that refer to conditions beyond the study results, and can go far beyond even the bare scientific context.
Consider: How do you interpret the number of grams to know if 6 are too little, or too much? You have to narrow the context quite a bit to answer that. Is the question about daily nutrient intake? Then 6 grams is far too little protein to maintain good health, but far too much salt.

With statistical tests, is 6 bits of information against a hypothesis too little or more than enough to pursue further study? Well how much would a further study cost? 1 hour and $100 as in replicating a very simple psych experiment in a lab already set up for it? Or 10 years and $100 million as in replicating certain long-term clinical trials?

The need for a loss or cost function for value declarations is well recognized in statistical decision theory but fatally neglected in most basic stat education and publication, where instead the magical “p < 0.05” = “s > 4.32 bits” is enforced as a universal rule. This practice is vigorously defended and adhered to by prestigious medical journals and opinion leaders, despite the fact that the founders of current conventional statistical testing theory and practice (Fisher and Neyman) warned against such nonsense - and despite plain demonstrations of how nonsensical it is.

This kind of convoluted defense of utter nonsense is exactly what one should expect when the defenders have built their careers and policy decisions using and enforcing the nonsense. Resistance will be especially fierce among those who face enormous liability if their practices and policies are revealed to be damaging (read about the resistance to antiseptic procedures in the 19th century, or to dietary origins of obesity and diabetes in the 20th century).

Although there were always objections to such universal decision rules (including again from Fisher and Neyman), of late there have been more objections, some even from defenders of fixed-alpha-level testing. See for example last year’s special issue of The American Statistician, introduced by Wasserstein, Schirm & Lazar at https://www.tandfonline.com/doi/full/10.1080/00031305.2019.1583913

9 Likes

I never thought a thread on the horoscope would be so interesting and productive. Thanks, let’s digest it.

2 Likes

In epidemiology and causality theory, horoscopes can be seen as an example of a negative control treatment - valuable precisely because everyone in the community will accept that they have no effect; hence an analysis making it look as if they do have an effect may be taken as a sign of a problem with the data or the analysis approach (or both).

4 Likes

This is true in so many aspects of statistics. To paint with only a slightly overly broad brush, much of mathematical statistics seems to be motivated by job preservation IMHO. I see so many one-off frequentist solutions to problems that could have been handled much more simply, but without getting a paper out of it. I’m thinking also of the field of large sample theory and asymptotics.

3 Likes

Yes indeed. I’ve heard similar comments from some top honored stat theory figures (one put it at about 80% of what is published in Theory & Methods). And a lot of it is reinventing wheels using new hubcaps. Of course top applied statisticians like you and Box have long complained about that. It seems to be a product of the perverse incentives of academic careers, where mathematical sophistication is mistaken for scientific sensibility and relevance. I think some of Sabine Hossenfelder’s complaints about theoretical physics are similar, e.g.,
On string theory: https://www.youtube.com/watch?v=6RQ6ugMWZ0c
On current cosmogony: https://www.youtube.com/watch?v=VHhUCav_Jrk
On many worlds: https://www.youtube.com/watch?v=kF6USB2I1iU

3 Likes

To be mentioned even on the same web site (not to mention the same page) as Box is quite an honor …

4 Likes

Well, I kept thinking about the problem.
I have calculated the day within the year in which the subject was born (Days_b, from 1 to 365), and I have divided it by 365 to obtain the fraction. I have learnt that the amplitude and phase of a sinusoidal function A * sin (x + phase) = a * sin (x) + b * sin (x), which helps to fit a regression equation (phase & amplitude may be solved from model coefficients). As a confounder I have added the year modeled in a non-linear way. The complete zodiac is problematic because it induces multicollinearity (VIFs >40).
Therefore, I followed @RonanConroy’s proposal and grouped them according to the four elements (water, fire, air, earth).
After that, no apparent seasonality, no prognostic effect associated with the zodiac (sorry!).

> fit<- coxph(Surv(time,event) ~ sin(2*pi*Day_S/365)+cos(2*pi*Day_S/365)+elements+rcs(year,3),data=data)
> fit
Call:
coxph(formula = Surv(time,event) ~ sin(2 * pi * Day_S/365) + cos(2 * 
    pi * Day_S/365) + elements + rcs(year, 3), data = data)

                          coef   exp(coef)  se(coef)      z     p
sin(2 * pi * Day_S/365)  0.005699  1.005715  0.032472  0.176 0.861
cos(2 * pi * Day_S/365) -0.025517  0.974806  0.033196 -0.769 0.442
elementsEarth           -0.037949  0.962762  0.071303 -0.532 0.595
elementsFire            -0.070279  0.932134  0.073149 -0.961 0.337
elementsWater           -0.058129  0.943528  0.071917 -0.808 0.419
rcs(year, 3)year        -0.006045  0.993973  0.004463 -1.355 0.176
rcs(year, 3)year'        0.007015  1.007040  0.005497  1.276 0.202

Likelihood ratio test=3.95  on 7 df, p=0.7859
n= 2473, number of events= 2057 
   (97 observations deleted due to missingness)
> vif(fit)
sin(2 * pi * Day_S/365) cos(2 * pi * Day_S/365)           elementsEarth            elementsFire           elementsWater 
               1.083581                1.126406                2.049739                2.387306                1.985137 
       rcs(year, 3)year       rcs(year, 3)year' 
               5.978774                5.963942 

What do you think? Mystery solved? Is this a good test, as requested by Prof. @Sander?

P.S. This may be overthinking the issue, since the four elements alone are not significant either!

P.P.S. Do I need to explore more than 1 cycle per year?
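One way to explore that (a sketch only, not something prescribed in the thread) is to add a second harmonic to the existing fit and compare the nested models:

# Add a second harmonic (2 cycles per year) and compare with a likelihood-ratio
# test; assumes 'fit' and 'data' from the coxph call above.
fit2h <- update(fit, . ~ . + sin(4*pi*Day_S/365) + cos(4*pi*Day_S/365))
anova(fit, fit2h)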

3 Likes

[Typo: your first b * sin (x) should be b * cos(x)]
Yes a * sin (x) + b * cos(x) is an easier way to fit seasonality. Were this a serious analysis of seasonality you would then have to transform the stats for a and b into stats for amplitude A and phase. But no need for that here and your approach is better than what I suggested.

I can’t quite understand your model code: I missed what rcs(year,3) is and why it is in the model. What happens if you remove it?
Assuming that’s not an issue (and I think it’s not), your results are exactly what we all expected, and so I think your analysis is now a teaching demonstration of how initially seemingly odd results from what is in fact an odd overparameterized model (unordered months, inappropriate for examining either season or astrology) can evaporate or be explained away once one uses actual prior structural expectations; i.e., a model based on what the background context tells us we ought to see IF there is any effect of these variables. With that, the model has now tried to be fair to both seasonal concerns and astrology. Hence I’d be most curious if anyone disagrees (even astrologers).
There is however one item I didn’t see that I’d want to see in any serious analysis of this topic: a test of model fit. If there is one available in your procedure, perhaps you can run it just to show there is no obvious problem there. Again, maybe this should be done without rcs (whatever that is) in the model.
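One readily available diagnostic for coxph fits in R (a suggestion only, not necessarily the fit test being asked for) is the proportional-hazards check based on scaled Schoenfeld residuals:

# Proportional-hazards diagnostic for the Cox fit above (survival package);
# small p-values would flag a problem with the proportionality assumption.
library(survival)
cox.zph(fit)        # per-term and global tests
plot(cox.zph(fit))  # smoothed scaled Schoenfeld residuals over time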

3 Likes

rcs is a restricted cubic spline, also known as a natural spline, using a truncated power basis. This is done by the R rms package. rcs(x, 3) means 3 knots (points of discontinuity of the 3rd derivative). With 3 knots rcs uses the default knot placement, at the 0.10, 0.50, and 0.90 quantiles of the predictor, which puts equal sample sizes between adjacent knots. This captures long-term time trends.
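A small illustration of the basis behind rcs(year, 3), using Hmisc::rcspline.eval with made-up years (not data from the thread):

# With 3 knots, a restricted cubic spline contributes the variable itself plus
# one nonlinear term; knots.only = TRUE shows where the default knots fall.
library(Hmisc)
year <- sample(1940:2000, 500, replace = TRUE)
rcspline.eval(year, nk = 3, knots.only = TRUE)    # default knot locations
head(rcspline.eval(year, nk = 3, inclx = TRUE))   # year plus one spline column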

Thanks Frank, that clears up what the terms are, but I’m still not following why rcs(year,3) is in the model.

First, why is it used as opposed to, say, rcs(secular,3), where “secular” is full calendar time running from 0 to the end of the study and increasing by 365 or 366 across each year (i.e., it’s not date within year)? If “year” is literally just the year, then the model is implicitly forcing jumps at year boundaries, constraining them to follow an rcs across years only; but if the concern is to capture evolution of treatment and case mix, I’d expect neither to jump at Jan. 1 and nowhere else. Second, I would have expected any secular trend over a restricted period to be monotone, which the rcs does not enforce. So if the goal is simply to control confounding by secular trends, I would have used up the two parameters (model df) for that quite differently. [Not that any of that matters in this example given the rcs coefficients are practically zero, so there appears to be no secular effect; but I’m treating it as a teaching example.]
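For concreteness, a sketch of such a secular-time variable built from the columns already used above (my construction, keeping the same spline form purely for illustration):

# Continuous calendar time of birth in years (year plus fraction of year),
# used in place of rcs(year, 3); assumes survival and rms are loaded as before.
library(survival); library(rms)
data$secular <- data$year + (data$Day_S - 1) / 365
fit_sec <- coxph(Surv(time, event) ~ sin(2*pi*Day_S/365) + cos(2*pi*Day_S/365) +
                   elements + rcs(secular, 3), data = data)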

Finally and most of all, I can’t see how “year” could be a confounder in this example, because I expect the distribution of dates within each year to be practically constant across years, naturally preventing confounding of “date” by “year” and thus eliminating the need for “year” in the model even if (unlike here) it had some notable effect. One could even look at the data to check this independence beforehand (that check being ancillary to the outcome regression, it does not invalidate the latter).
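A quick check of that independence could look like this (again only a sketch, using the column names from the fit above):

# Does the day-of-year distribution shift across birth years? Under the
# expected independence, neither the test nor the plot should show much.
kruskal.test(Day_S ~ factor(year), data = data)
boxplot(Day_S ~ factor(year), data = data, xlab = "year of birth", ylab = "day of year")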

Am I missing something here?

1 Like

I am going step by step with this wonderful exercise.
The amplitude is equal to sqrt(coef1^2 + coef2^2).
On the other hand, sin(phase) = coef2 / sqrt(coef1^2 + coef2^2).

Therefore, if I do it correctly it would be:

> sqrt(0.005699^2 + (-0.025517)^2) # amplitude
[1] 0.02614567
> asin(-0.025517/( sqrt(0.005699^2 + (-0.025517)^2)))*365/(2*pi) # phase
[1] -78.48525

That is, a maximum at about day 365/4 + 78.5 ≈ 170.
Seems right?

Although the phase has a simple interpretation, I am not sure how to interpret the amplitude. I suppose the tiny number indicates very little seasonality. However, we would have to compute a confidence interval for those quantities using the covariance matrix (maybe @RGNewcombe can help). I only know how to do this for sums of coefficients; I don’t know whether any trick in R allows it.
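One possibility (a sketch, assuming the msm package is available) is the delta method applied to the two seasonal coefficients of the fit above:

# Delta-method standard error and approximate 95% CI for the amplitude
# sqrt(a^2 + b^2); positions 1 and 2 are the sin and cos terms in 'fit'.
# (The Wald interval can dip below zero; this is only a rough sketch.)
library(msm)
b <- coef(fit)[1:2]
V <- vcov(fit)[1:2, 1:2]
amp    <- sqrt(sum(b^2))
se.amp <- deltamethod(~ sqrt(x1^2 + x2^2), b, V)
round(c(amplitude = amp, lower = amp - 1.96 * se.amp, upper = amp + 1.96 * se.amp), 4)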
With respect to rcs(year,3), it was to model the year of birth itself. In Spain some years were harder than others. In the years after the civil war there was a lot of misery; conditions slowly improved, but then there have been several crises, etc.

Anyway, I find it very strange how easy it is to find these trends. For example, I have used a second database, and have repeated the analysis with or without seasonality and secular trends. It comes out the same as in the previous case. I have the feeling that there may be some weird mathematical artifact in the regression with a categorical variable with 12 levels. Maybe a simulation is a good idea as @zad suggested.

Fit 1: no seasonality, no secular trend

Fit 2. With seasonality and secular trend.

The result is quite impressive. I am going for the Cox diagnostics in a moment…
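A rough sketch of the kind of null simulation suggested (my construction, not from the thread): simulate data with no birth-month effect at all, fit a Cox model with a 12-level unordered factor, and look at the distribution of the global P-values.

# Under the null the likelihood-ratio P-value should be roughly uniform, so
# about 5% of simulated datasets give p < 0.05 even with no month effect.
library(survival)
set.seed(1)
p_null <- replicate(500, {
  n     <- 2500
  month <- factor(sample(month.name, n, replace = TRUE))
  time  <- rexp(n)               # survival times unrelated to birth month
  event <- rbinom(n, 1, 0.8)
  summary(coxph(Surv(time, event) ~ month))$logtest["pvalue"]
})
mean(p_null < 0.05)   # should be close to 0.05
hist(p_null)          # roughly uniform on (0, 1)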

Great points Sander. Speaking generally this is my preferred approach:

  • When the data span multiple years include a long-term time trend using a spline function of year + fraction of a year
  • For cyclic within-year trends without assuming symmetry as sin and cos do, include an additional short-term trend in the model using cyclic splines
1 Like

How can I do that?
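For what it’s worth, here is a minimal sketch of one way to implement those two points, assuming mgcv is available and using the column names (time, event, year, Day_S) from the earlier fits:

# Cox model with a smooth long-term trend in continuous calendar time and a
# cyclic smooth over day of year (no symmetry assumption). In mgcv's cox.ph
# family the event indicator is passed through 'weights'.
library(mgcv)
data$caltime <- data$year + (data$Day_S - 1) / 365   # year + fraction of year
m <- gam(time ~ s(caltime, k = 5) + s(Day_S, bs = "cc", k = 6),
         family = cox.ph(), weights = event, data = data)
summary(m)
plot(m, pages = 1)   # the estimated long-term and within-year smooths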