To explore non-hypothesis-driven statistical analyses, I evaluated the prognostic effect of zodiac sign on overall survival in a real database of >2500 cancer patients:
> library(survival)
> library(DescTools)
> dat$zod<-Zodiac(dat$date_birth, lang = c("engl"), stringsAsFactors = TRUE)
> survdiff(Surv(OS,Die)~zod, data=dat)
Call:
survdiff(formula = Surv(OS, Die) ~ zod, data = dat)
n=2473, 97 observations deleted due to missingness.
                  N Observed Expected (O-E)^2/E (O-E)^2/V
zod=Capricorn   247      203  195.667  0.274836  0.304873
zod=Aquarius    214      187  169.646  1.775170  1.942353
zod=Pisces      215      183  176.597  0.232181  0.254755
zod=Aries       238      187  209.565  2.429634  2.714388
zod=Taurus      196      158  171.088  1.001254  1.096064
zod=Gemini      215      180  157.237  3.295255  3.583045
zod=Cancer      222      185  191.979  0.253737  0.282298
zod=Leo         167      139  114.379  5.300093  5.657195
zod=Virgo       226      190  179.263  0.643120  0.706881
zod=Libra       182      158  162.612  0.130789  0.142481
zod=Scorpio     178      145  148.864  0.100272  0.108424
zod=Sagittarius 173      142  180.104  8.061506  8.890433
Chisq= 23.8 on 11 degrees of freedom, p= 0.01343
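As a sanity check on this output, the reported p-value can be recovered from the test statistic and its degrees of freedom. This is a minimal sketch using the rounded statistic 23.8 from the printout, so it will only approximately match the reported 0.01343. The variable names are illustrative.

```r
# Values copied from the survdiff printout.
chisq    <- 23.8   # rounded overall test statistic
n_groups <- 12     # one group per zodiac sign
df       <- n_groups - 1

# Upper-tail probability of a chi-squared distribution with 11 df.
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)

# Contribution of a single row, here Sagittarius: (O - E)^2 / E.
obs_sag     <- 142
exp_sag     <- 180.104
contrib_sag <- (obs_sag - exp_sag)^2 / exp_sag
```

Note that the overall statistic is not simply the sum of the per-group columns, since it is computed from the full covariance matrix of the O - E differences.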
To my surprise, the log-rank test is significant (p = 0.013), and the likelihood ratio test gives a similar result.
I would like some insight into the type I error and the behavior of these tests as a function of the degrees of freedom, and into the possible causes of this result.
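One way to probe the type I error empirically is to simulate survival data in which group membership is pure noise and count how often the log-rank test rejects at the 5% level. This is a minimal sketch, assuming exponential event and censoring times with arbitrary rates and 12 equally likely groups. None of these settings come from the real database, and all variable names are illustrative.

```r
library(survival)

set.seed(1)
n_sim   <- 300   # number of simulated data sets (assumed)
n_pat   <- 500   # patients per data set (assumed)
n_group <- 12    # as many groups as zodiac signs

p_values <- replicate(n_sim, {
  # Event and censoring times are independent of the group label,
  # so the null hypothesis of equal survival holds by construction.
  event_time  <- rexp(n_pat, rate = 0.10)
  censor_time <- rexp(n_pat, rate = 0.03)
  os   <- pmin(event_time, censor_time)
  died <- as.integer(event_time <= censor_time)
  group <- factor(sample(n_group, n_pat, replace = TRUE))

  # Log-rank test across the 12 groups, p-value on 11 df.
  fit <- survdiff(Surv(os, died) ~ group)
  pchisq(fit$chisq, df = n_group - 1, lower.tail = FALSE)
})

# Empirical type I error at the nominal 5% level.
type1_rate <- mean(p_values < 0.05)
type1_rate
```

If the test is well calibrated, `type1_rate` should sit near 0.05. On the real data, the analogous check would be to permute `dat$zod` repeatedly and re-run `survdiff` on each permuted copy.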