R-cubed: “If one wishes to condition on the data (ie treat data as given), why wouldn’t he/she simply use relative measures of evidence (ie. likelihoods)?”

?: At the end did you mean “(e.g. likelihood ratios [LRs])”? There are other relative measures including posterior odds (POs) which are LRs only under near-flat priors.

That aside, I’m fine with relative measures as part of a toolkit that also includes absolute measures. As an editor I’d have no rigid requirements other than using reasonable tools reasonably for the task set out by the researchers (in practice in my fields that usually means a few P-values or S-values and CIs could suffice, but well-used likelihood or Bayesian alternatives could do fine instead). The full graphs behind such numbers are very important for teaching and classroom data analysis, however, because we need to instill in everyone the idea that the numbers are merely road signs indicating where things are located on a much more detailed map of results.

Everyone understands that saying b1 is half as far away as b2 isn’t enough to tell you whether you can walk to b1 or have enough gas to reach b2. Relative stat measures seem to be tricky though, because many writers treat them as if they were more straightforward and informative than P-values. I think that appearance is an illusion produced by forgetting about the goals and background of the application and all the assumptions needed to compute what are often misleading statistical summaries.

A problem with all these statistical measures is that their size is extremely sensitive to the embedding model (the set of background constraints or auxiliary assumptions), call it A, which is never known to be correct (and never is “correct” in my work, it’s only a “working model”). With two parameter values b1 and b2 we can’t be sure how much a higher likelihood of b1 over b2 is an artefact of A. All we get to say is b1&A diverges from the data less than b2&A according to a certain divergence measure (the likelihood or some transform like the deviance). But it tells us nothing at all about how that depends on A.
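To make the dependence on A concrete, here is a minimal sketch under assumptions that are mine, not part of the original argument: the data values are made up, and A is taken to be "independent normal observations with known standard deviation sigma." Changing only the assumed sigma changes the likelihood ratio of b1 to b2 dramatically, even though the data and the two parameter values stay fixed.

```python
import math

def log_likelihood(b, data, sigma):
    """Normal log-likelihood for mean b; the known sigma is part of the working model A."""
    n = len(data)
    ss = sum((x - b) ** 2 for x in data)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - ss / (2 * sigma ** 2)

def likelihood_ratio(b1, b2, data, sigma):
    """Likelihood of b1&A divided by likelihood of b2&A."""
    return math.exp(log_likelihood(b1, data, sigma) - log_likelihood(b2, data, sigma))

# Hypothetical data (sample mean is 1.0) and two parameter values to compare.
data = [0.8, 1.4, 0.3, 1.9, 1.1, 0.5]
b1, b2 = 1.0, 0.0

lr_sigma1 = likelihood_ratio(b1, b2, data, sigma=1.0)  # A assumes sigma = 1
lr_sigma2 = likelihood_ratio(b1, b2, data, sigma=2.0)  # A assumes sigma = 2

print(f"LR(b1 vs b2) under sigma=1: {lr_sigma1:.2f}")
print(f"LR(b1 vs b2) under sigma=2: {lr_sigma2:.2f}")
```

In this toy setup the ratio drops from roughly 20 to roughly 2 when the assumed sigma doubles, and nothing in either number alone signals that sensitivity.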

The P-values for b1 and b2 give something closely related but different: They are the percentiles where b1&A and b2&A fall along the spectrum of divergences from models of the form b&A. These divergences may come from LR tests comparing b1&A and b2&A to bmle&A (the latter model being at the 100th percentile = maximum). We then see immediately not only which model diverges less from the data (in the sense of having a higher likelihood) but also where the models fall relative to the “best” on this spectrum. Then the S-values convert these P-values into measures of the information against b1&A and b2&A supplied by the respective test statistics.
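As a rough illustration of the percentile idea, the sketch below again assumes (my assumption, for illustration only) that A is a normal model with known sigma and uses made-up data. The deviance is the LR statistic comparing b&A to bmle&A, the P-value is taken from the usual 1-degree-of-freedom chi-squared reference distribution, and the S-value is the standard transform -log2(P), in bits.

```python
import math

def deviance(b, data, sigma):
    """LR statistic 2*[loglik(b_mle) - loglik(b)] for a normal mean with known sigma."""
    n = len(data)
    b_mle = sum(data) / n  # the sample mean maximizes the likelihood under this A
    return n * (b_mle - b) ** 2 / sigma ** 2

def p_value(dev):
    """Upper tail of a chi-squared(1) distribution: P(X > dev) = erfc(sqrt(dev / 2))."""
    return math.erfc(math.sqrt(dev / 2))

def s_value(p):
    """Bits of information against the tested model supplied by the test statistic."""
    return -math.log2(p)

# Hypothetical data (sample mean is 1.0); sigma = 1 is assumed as part of A.
data = [0.8, 1.4, 0.3, 1.9, 1.1, 0.5]
for label, b in [("b1", 0.8), ("b2", 0.0), ("bmle", sum(data) / len(data))]:
    p = p_value(deviance(b, data, sigma=1.0))
    print(f"{label}: P = {p:.4f}, S = {s_value(p):.2f} bits")
```

Here bmle&A sits at P = 1 (zero bits against it), and the other two values show both their ordering and how far each falls from that maximum, which the ratio of their likelihoods alone does not convey.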

But again, all these statistics tell us nothing about how the measures depend on A, a caveat that burdens all of statistics (even in particle-physics experiments). Some dependencies we can address using traditional modeling tools (like residual plots, model updating, etc.). Other dependencies are beyond our data’s ability to deal with (like selection bias); tools for that have only come to wider attention in the present century (in the form of sensitivity and bias analysis, nonidentified models, nonignorability etc.).

As a footnote of sorts, here’s a belabored analogy:

LRs are like describing a pair of final exam results b1 and b2 in a math class (call that A) by saying b1 scored 25% higher than b2. Fine - but does that mean b1 scored at 5/40 and b2 scored 4/40, or that b1 scored 35/40 and b2 scored 28/40? Even if we knew the max score bmax was (say) 36, what would the scores relative to that tell us of their performance? We might feel less clueless if we saw where the scores fell in percentiles (distributions) for the whole class, which are analogous to P-values.
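The arithmetic of the analogy can be sketched in a few lines. The class score list below is entirely hypothetical (chosen only so that all four scores from the example appear in it), and `percentile_rank` is an illustrative helper, not a standard function.

```python
def percentile_rank(score, class_scores):
    """Percent of the class scoring at or below `score`."""
    return 100 * sum(s <= score for s in class_scores) / len(class_scores)

# Both pairs give the same relative measure: b1 scored 25% higher than b2.
ratio_low = 5 / 4     # b1 = 5/40,  b2 = 4/40
ratio_high = 35 / 28  # b1 = 35/40, b2 = 28/40
print(ratio_low, ratio_high)

# Hypothetical scores (out of 40) for a class of ten, with a maximum of 36.
class_scores = [4, 5, 14, 19, 23, 26, 28, 31, 35, 36]
for score in (4, 5, 28, 35):
    print(f"score {score}/40 -> percentile {percentile_rank(score, class_scores):.0f}")
```

The two ratios are identical, while the percentile ranks place the two pairs at opposite ends of the class.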

But then what if we wondered what class performance meant for their real-world skills in applying math as an RA on our project? What if the class A is totally abstract measure theory involving no computation or scientific word problems? Or what if instead it is numerical analysis? Or …? Without some context about how the content of the class connects to the real-world work assignment, we’d remain in the dark about what class-performance statistics imply for application questions (much as some math statisticians have been in the dark about the real-world meanings of P-values, LRs, posterior probabilities etc., even though they know the underlying math in exquisite detail, can provide perfect math definitions, and can prove theorems in abundance).