Should one derive risk difference from the odds ratio?

My paper clearly states that I don't have an answer to the question of which effect measure to use in situations where the exposure variable doesn't have a "default" value (corresponding to "no treatment"). This was also noted in Sheps' original paper. But you don't have an answer to this question either. The odds ratio isn't any better.

In theory, it is possible to define a switch risk ratio using male as the default, or a switch risk ratio using female as the default. They will give different predictions. The odds ratio will also give a different prediction. These are three different models, with three different predictions. None of them has a biological justification. It is completely irrelevant that there is only one member of the class "odds ratio". The only thing that matters is whether, for the chosen model, you can convincingly argue for stability.

I understand. But the odds ratio does not have this problem, being symmetric (unless someone doesn’t like reciprocals).

Like I said, it is completely irrelevant that there is only one member of the class “the odds ratio”. This means that when we communicate about it, it is possible to say “odds ratio” and unambiguously point to the intended effect measure. But this is just a matter of convenient communication.

A modeller doesn't first choose from
(1) the odds ratio, or
(2) the class of parameters that contains two separate switch risk ratios.

Instead, they consider the set of all possible parameters and, for each one, ask themselves if there is any reason to expect stability. The fact that the odds ratio is unique within its class doesn't give it an advantage over either the switch risk ratio that uses male as the default, or the one that uses female as the default.

Further down in my comment I specifically excluded the identity-link Gaussian case because, as I said, it is no guide to the binomial case: there is far more information in almost any continuous Y than in a binomial Y, so in Gaussian-Y cases there can be far more precision in estimating everything, including product terms, than in the binomial logistic case that is being debated here.

“The number of replicable heterogeneity of treatment effect examples in the literature is extremely small.”
This is no evidence for absence of heterogeneity; if anything, it is further evidence that there is too little information to make any claim about OR heterogeneity other than that it is beyond the reach of our studies. The statement thus looks to me like yet another example of trying to spin a positive message (as in "there is usually no important heterogeneity") out of an experience which, like mine, really conveys a negative message (that there is usually not enough data information to determine even the direction of heterogeneity). Again, this is just second-order nullism, where first-order nullism is the usual NHST fallacy of claiming there is usually no important effect in a topic when in reality the studies had too little data to determine even the direction of the effect.

“Some of the best examples are heterogeneity in metabolism but even those have often not translated to differential effects on clinical outcomes.”
I would believe this if you could show me a table of interval estimates for heterogeneity (ratio-of-odds-ratios (ROR) intervals) from all the studies on this topic showing that those were usually confined to a narrow equivalence interval around the null (as done in equivalence testing for main effects). I would bet that instead such a table would only show the same as what I have seen: that almost all ROR intervals spread far into important heterogeneity in at least one direction, if not both.
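To make concrete how wide such intervals tend to be, here is a minimal sketch with hypothetical cell counts (made up for illustration, not taken from any actual study):

```python
import numpy as np

# Two hypothetical 2x2 tables (exposed cases, exposed noncases,
# unexposed cases, unexposed noncases), one per stratum; 200 patients each.
stratum1 = np.array([30, 70, 20, 80], dtype=float)
stratum2 = np.array([25, 75, 25, 75], dtype=float)

def log_or_and_var(cells):
    """Woolf log odds ratio and its large-sample variance for one 2x2 table."""
    a, b, c, d = cells
    return np.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

lor1, v1 = log_or_and_var(stratum1)
lor2, v2 = log_or_and_var(stratum2)

# Ratio of odds ratios (ROR) across the two strata: on the log scale it is a
# difference of log ORs, so the variances add.
log_ror = lor1 - lor2
se = np.sqrt(v1 + v2)
lo, hi = np.exp(log_ror + np.array([-1.96, 1.96]) * se)

print(f"OR1 = {np.exp(lor1):.2f}, OR2 = {np.exp(lor2):.2f}")
print(f"ROR = {np.exp(log_ror):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# With 200 patients per stratum the interval runs from about 0.7 to 4.3,
# i.e. it reaches well into important heterogeneity in both directions.
```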

The way out of this nullistic fly bottle is to accept that studies are rarely designed to estimate RORs precisely; instead they are designed around power for homogeneous ORs, which requires a quarter or less of the number of patients needed for RORs of the same size. This was a major point in Greenland 1983, Smith & Day 1984, and later studies, and it becomes obvious once you compare the variance formulas for the logs of these measures.
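For concreteness, here is a minimal sketch of that variance comparison under the usual large-sample (Woolf) approximation, assuming two strata of equal size (the notation is mine, not from the papers cited above):

$$\widehat{\operatorname{var}}\left(\log\widehat{OR}\right) \approx \frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}, \qquad \widehat{\operatorname{var}}\left(\log\widehat{ROR}\right) = \widehat{\operatorname{var}}\left(\log\widehat{OR}_1\right) + \widehat{\operatorname{var}}\left(\log\widehat{OR}_2\right).$$

Splitting a fixed total sample evenly into two strata roughly halves every cell count, so each stratum-specific variance roughly doubles and their sum is roughly four times the variance of the pooled log OR; detecting a log ROR of the same magnitude with the same power therefore takes roughly four times as many patients.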

In terms of power it looks even worse for most studies. The rare study claiming power to detect heterogeneity looks little better: the 80% standard for that claim sounds high until you realize it means a 20% type-II error rate, 4 times the 0.05 type-I error rate standard (and 40 times the 0.005 standard that some promote), so the "powerful" study is massively favored to maintain the null even if there is no shrinkage toward it!

The honest response to this data deficiency is to admit it and explain that we default to the homogeneous OR model (hORm) because we can say nothing useful about the ROR, so we may as well enjoy the advantages of the hORm, such as lower variance in estimating conditional ORs and risks from the model. That advantage is of course obtained in trade for an unknown increase in bias from model misspecification (a general idea addressed abstractly in BFH 1975, Leamer 1978, and various articles by IJ Good from that era which opened my eyes to the problem).


I don't see how this could be considered evidence in any form. In your preprint you take 1% of r0(not_Y), add that to r0(Y), and then express surprise that the switch RR is constant.

I’m tapping out of this discussion, as I am at risk of saying things that would violate the civility rules.

Perhaps the question that needs answering is this: when we see the OR of interest varying across strata defined by, say, gender, at what point do we consider this anything other than random and/or systematic error, and how do we determine this?

As a breather, I’ve posted the promised short article. I feel that it may be a way to resolve several of the differences in perspectives. This discussion has taught me this, among other things: when in doubt, don’t oversimplify. The article is here.


Please understand that, in parallel with main-effects analysis, there are volumes of literature bearing on "interaction" analysis in contingency tables dating back at least to the 1930s; see BFH 1975 for a review up to the 1970s, which shows how vast the literature had already become by then. As with most topics it exploded thereafter with computing advances, and there are now scores if not hundreds of answers on offer. I doubt that anyone knows them all.

Still, what most researchers actually do is just 0.05-level testing (which predates even my birth), despite the fact that most alternative methods are far better from most perspectives. Sadly, I think the idea that any one of the alternatives will prevail is absurd, especially based on the well-established solution to the problem “How many statisticians must be in a debate for the probability of disagreement to exceed 1/2?” (solution: 2).

Understand also that a good portion of the alternatives reject your question of “whether OR varies” as pointless: the OR always varies to some extent, as do all other measures. Hence your “whether” question should be replaced by “How much does it vary?” As I’ve stated, my current preferred answer would be to produce a contextually defensible estimate of the risk function, and from that provide a compatibility or posterior distribution for the size of any comparison or measure we ask of it. The technology to do this was in place decades ago (computationally it’s beautifully suited for posterior samplers) and by the early 21st century standard software could do it; but there is no incentive to adopt this alternative, and so (as with most alternatives) knowledge of it remains meager and use is rare.
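A minimal sketch of that workflow (my own illustration, not code from the post above; it uses a normal approximation to the coefficient distribution via statsmodels rather than a full posterior sampler, and all data and variable names are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: binary outcome y, binary treatment t, one covariate x.
n = 500
x = rng.normal(size=n)
t = rng.integers(0, 2, size=n)
p = 1 / (1 + np.exp(-(-1.0 + 0.8 * t + 0.5 * x)))
y = rng.binomial(1, p)

# Fit a logistic risk model with a treatment-by-covariate product term.
X = sm.add_constant(np.column_stack([t, x, t * x]))
fit = sm.Logit(y, X).fit(disp=0)

# Approximate posterior/compatibility distribution for the coefficients:
# multivariate normal centered at the MLE with the estimated covariance.
draws = rng.multivariate_normal(fit.params, fit.cov_params(), size=4000)

# For a chosen covariate pattern, turn each coefficient draw into risks
# under treatment and no treatment, then into any measure we care about.
x0 = 1.0                              # hypothetical patient covariate value
X1 = np.array([1.0, 1.0, x0, x0])     # treated
X0 = np.array([1.0, 0.0, x0, 0.0])    # untreated
r1 = 1 / (1 + np.exp(-draws @ X1))
r0 = 1 / (1 + np.exp(-draws @ X0))

rd = r1 - r0                              # risk difference
rr = r1 / r0                              # risk ratio
orr = (r1 / (1 - r1)) / (r0 / (1 - r0))   # odds ratio

for name, v in [("RD", rd), ("RR", rr), ("OR", orr)]:
    lo, hi = np.percentile(v, [2.5, 97.5])
    print(f"{name}: median {np.median(v):.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```

A full Bayesian sampler over the risk model would replace the normal-approximation step, but the point is the same: once you have draws of the risk function, every comparison or measure you ask of it comes with its own uncertainty distribution.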


This is how I’ve seen the issue.

I find your question about heterogeneity similar to the situation of traders in the financial markets looking for trading strategies. No one believes the financial markets are 100% efficient; if they did, they would go do something else.

But it is a competitive environment, and even if one does find some signals that "worked", other people are crunching the same price data, so you had better have some external information, or a very rigorous frequentist testing methodology, before acting on it. So there is a strong prior on at least some form of weak market efficiency, with most rules having a net $0 \pm \epsilon$ return before transaction costs.

After controlling for obvious sources of heterogeneity, say 5 covariates, how much can treatment x patient interaction explain?

Your post on ROR variance estimation got me thinking, as I don't have good intuition about how much a difference in variance would matter in the rehabilitation field. Our samples are generally small, the outcome measures are ordinal (often, unfortunately, analyzed with parametric methods), and for some conditions (e.g., CVA) the patients themselves are heterogeneous.

I'd be happy to have enough signal to determine the sign of the effect reliably. ROR estimation is a very distant dream.


Yes, that is essentially what I meant. The question is that some degree of variation across strata is inevitable for many reasons (though with fewer spurious reasons for some effect measures than for others). You have suggested that the methods put forward earlier in this thread are not sufficient to exclude error-related variation for decision making, so what, in your opinion, are sufficient criteria?
Addendum: By criteria I mean statistical tests like those Frank suggested earlier, rather than a series of analyses that the average clinical researcher cannot do.

Very interesting, and perhaps this would help researchers understand their data better, but I am not clear how it can help clinicians make better decisions in practice. Perhaps it's not aimed at them, and so we are back to discussing the best summaries for decision making in medicine.

It provides direct, covariate-adjusted estimates of the risks of treatment vs no treatment. It isn’t unusual to have some of the information in a prediction model, but not all of it.

I'm not criterion-oriented. As I wrote: if I were making decisions, I'd want a whole risk function that could be fed patient specifics to forecast the potential outcomes and expected costs under the available treatment options. AI researchers have been pursuing this idea for many decades now, and it is gradually showing some promise in real applications as databases and computing have entered the era of nanoprocessors. There, the criteria are embedded in the program and adapt as new data come in.

But back to the vacuum-tube-era statistics with which most of the medical world still operates: if criteria are used, they need contextual justification. For example, if one is using NP methods, then CI or P-functions are needed to visualize uncertainty, and any decision criterion needs some error-cost consideration, as demanded by the original NPW theory (which was completed by 1950). See, for example, the article "Justify Your Alpha" by Lakens et al. (over 80 authors) in Nature Human Behavior 2018, and my recent article on frequentist multiple testing, "Analysis goals, error-cost sensitivity, and analysis hacking: essential considerations in hypothesis testing and multiple comparisons", Pediatric and Perinatal Epidemiology 2021;35:8-23, https://doi.org/10.1111/ppe.12711
Those articles concern main effects, but the considerations carry over immediately to higher-order terms; in fact the multiplicity issues carry over even more strongly, because there are combinatorially more higher-order terms than main effects.

I won’t write more now because this is yet another book-length topic off of the present debate about noncollapsibility.

Yesterday, I said that I was leaving this discussion. After having slept on it, I changed my mind. My work is being publicly ridiculed by a Professor of Clinical Epidemiology who clearly has not made even a rudimentary attempt to understand the structure of our argument. I consider this a shocking violation of academic ethics. It calls for a public response, even if I may not be able to make one within the constraints of Frank's request for civil discourse.

The structure of our argument is that we consider four types of “switches” which determine whether treatment has an individual-level effect:

  1. If no switch is present, treatment has no effect
  2. If a switch of type B is present, treatment is a sufficient cause of the outcome
  3. If a switch of type C is present, treatment is a sufficient cause of not having the outcome
  4. If a switch of type D is present, absence of treatment is a sufficient cause of not having the outcome
  5. If a switch of type E is present, absence of treatment is a sufficient cause of the outcome

If the effect of treatment is determined only by one of these types of switch (in all members of the population), and if the prevalence of that switch is stable between groups, then a characteristic effect measure associated with the switch will be stable. The particular effect measure depends on the type of switch.

We provide a thought experiment to illustrate how this works for the special case of a switch of type B. This leads to stability of the survival ratio. We clearly state that if the effect was determined by a different type of switch, then another effect measure might be stable instead. Then we proceed to show the biological reasons why switches of type B are more plausible than switches of type E for most settings where an exposure increases risk of the outcome.
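In symbols, the type-B case works as follows (the notation here is mine, not necessarily the paper's): write $r_0$ and $r_1$ for the risk of the outcome without and with treatment, and $p_B$ for the prevalence of type-B switches. Then

$$r_1 = r_0 + (1-r_0)\,p_B \;\Longrightarrow\; 1-r_1 = (1-r_0)(1-p_B) \;\Longrightarrow\; \frac{1-r_1}{1-r_0} = 1-p_B,$$

so the survival ratio depends only on $p_B$: if the switch prevalence is the same in two groups, the survival ratio is the same in both, whatever their baseline risks.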

Nowhere do we express “surprise” that the survival ratio is stable. The purpose of this section is to illustrate how our causal model works, to give the reader an opportunity to evaluate whether it captures their intuition about how the biological system works.

All of this is very clearly explained in the manuscript. Having to repeat it on this forum is a waste of my time, and even worse, a waste of the time of readers. Methodologists will give me a limited “budget” for how much of their attention they are willing to give, and I am not happy about having to spend this on refuting hostile bad-faith criticism on matters that are very clearly explained in the paper.

It's not good practice to personalize the discussion or make threats on a blog; if you don't like the idea, call it nonsense, but that too will not stack up without a valid reason. If you feel that criticism of ideas amounts to ridicule, a violation of ethics, or other social problems, then why are you posting?

Coming to your idea: it does not stack up mathematically, regardless of the philosophical backdrop.
If $r_0(Y) = a$ and $r_1(Y) = b$, then you define $b$ as follows in your paper:
$$b = a + (1-a) \times 0.01 = 0.99a + 0.01,$$
therefore
$$1 - b = 1 - 0.99a - 0.01 = 0.99(1-a),$$
and so
$$RR(AH) = \frac{1-b}{1-a} = \frac{0.99(1-a)}{1-a} = 0.99.$$
Thus the math is clear on the point that you predefined $RR(AH) = 0.99$, so why are you surprised that it's stable at 0.99?


What threats did I make??

I did not "predefine" that the survival ratio is 0.99. Rather, I predefined a causal model, and showed that this model implies that the survival ratio is determined by the prevalence of B (it equals one minus that prevalence). The maths involved is simple, and you have summarized it here. The idea is to illustrate the logic of this mode of reasoning, in its simplest possible form. Later, I provide a way to extend this kind of reasoning to settings where there are multiple types of switches interacting with each other.

I also give you a different causal model, based on switches of type E. With that model, you can show (using similar maths) that the standard risk ratio is stable. Then I discuss at length why the model based on switches of type B reflects biology better than models based on switches of type E.

This, to me, is the least controversial part of the discussion, and I concur. I'd just add that it is a useful exercise to find out whether this process is made worse or better by allowing for a non-constant treatment effect on some "reasonable scale". When the sample size is not large, adding even "real" interaction terms to models may give worse predictions. A very simple example of this was given here. This is just the old bias-variance tradeoff, and I wanted to point out that the performance of personalization methods is sample-size dependent.
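To illustrate the point, here is a minimal simulation sketch of my own (not the example linked above; the coefficients and sample sizes are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)

def simulate(n):
    """Binary outcome with a modest but real treatment-by-covariate interaction."""
    x = rng.normal(size=n)
    t = rng.integers(0, 2, size=n)
    lp = -0.5 + 0.6 * t + 0.5 * x + 0.4 * t * x  # true linear predictor has a product term
    y = rng.binomial(1, 1 / (1 + np.exp(-lp)))
    return np.column_stack([t, x, t * x]), y

def mean_test_loss(n_train, n_test=10000, reps=100):
    """Average out-of-sample log loss, with and without the product term."""
    losses = {"main effects only": [], "with interaction": []}
    for _ in range(reps):
        X_tr, y_tr = simulate(n_train)
        X_te, y_te = simulate(n_test)
        for name, cols in [("main effects only", [0, 1]), ("with interaction", [0, 1, 2])]:
            m = LogisticRegression(C=1e6, max_iter=1000)  # essentially unpenalized ML
            m.fit(X_tr[:, cols], y_tr)
            losses[name].append(log_loss(y_te, m.predict_proba(X_te[:, cols])[:, 1]))
    return {k: round(float(np.mean(v)), 4) for k, v in losses.items()}

for n in (50, 200, 2000):
    print(n, mean_test_loss(n))
# With a modest true interaction, the simpler model can predict as well or better
# at small n; the advantage of the richer model tends to appear only as n grows.
```

The exact crossover point depends on the true interaction size and the noise level, which is exactly the sample-size dependence described above.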


I am converted; I will not look at product terms in GLMs the same way again. If you remember, my first post on this blog was about this, but at that time it was a bit fuzzy in my mind; it's clearer now.

Just to take an example: if we analyze the data example in Sander's part 2 paper in JCE, the correct OR for treatment is 4.5, and what we get per stratum or by adding a product term is just an artifact of the sample, right?


I am not sure I am looking at the same table as you… Are you seriously arguing that in Table 2, which is used to show heterogeneous treatment effects, the appearance of heterogeneity is just an artifact of small sample size, and moreover that you can somehow infer that the true conditional odds ratio is 4.5 in both strata? Where does the number 4.5 even come from?

This is not what Frank was arguing. He was arguing that even if there is true treatment effect heterogeneity, the bias due to omitting the interaction term may be less problematic than the loss of precision from including the interaction parameter. I am not fully sold on that argument, but this is a fairly standard view in statistics which a reasonable person might argue for. I am genuinely confused about how someone would think this means that any apparent differences between strata have to be an artifact…

My personal view is that trading off bias for precision is "cheating" and gives you uninterpretable inferential statistics. In my opinion, the correct move is to accept that the standard errors will be large unless we can increase the sample size. I would rather have an honest but imprecise prediction. But that is an argument for another day, and not really related to this thread.