Principles and guidelines for applied statistics

Sander · September 7, 2018, 4:39am

Frank suggested that the posts on the topic “What are credible priors and what are skeptical priors?” are getting into a more general open-ended area of philosophy and practice that deserves its own topic page.
These posts would reflect personal statistical principles and guidelines - the hope is that others will share theirs and be open to discussing them.

As a start on what constitutes applied statistics (as opposed to high theory or math stat), here’s an incredible quote Frank has posted elsewhere:

“The statistician must be instinctively and primarily a logician and a scientist in the broader sense, and only secondarily a user of the specialized statistical techniques. In considering the refinements and modifications of the scientific method which particularly apply to the work of the statistician, the first point to be emphasized is that the statistician is always dealing with probabilities and degrees of uncertainty. He is, in effect, a Sherlock Holmes of figures, who must work mainly, or wholly, from circumstantial evidence.” - Malcolm C. Rorty: “Statistics and the Scientific Method,” JASA 26, 1-10, 1931.

– Note well: That’s from 1931 and thus written around 1930, just before intense fighting broke out among traditional “objective” Bayesians (then called “inverse probability”), Fisherians, and decision-theoretic (behavioral Neyman-Pearsonian) statisticians. As I understand the history, as of 1930 the most prominent split was still between such theoreticians and the old-guard descriptive statisticians. Based on the Rorty quote, one could argue that the triumph of theoreticians over data describers may have set principles of applied statistics back a good 80 years (at least principles for dealing with anything other than perfect surveys and perfect experiments).
Regarding the Sherlock Holmes aspect of inference, Edward Leamer gives a wonderful account in his classic 1978 book on modeling, Specification Searches, available as a free download at
http://www.anderson.ucla.edu/faculty/edward.leamer/books/specification_searches/SpecificationSearches.pdf
or
http://www.anderson.ucla.edu/faculty/edward.leamer/books/specification_searches/specification_searches.htm

Getting to an applied statistics principle: One which has been around for several generations is that dichotomization of results into “significant” or “not significant” bins is damaging to inference and judgment.
Yet, despite decades of laments about dichotomized significance testing of null hypotheses (NHST), the practice continues to dominate many teaching materials, reviews, and research articles.
Apparently, dichotomization appeals to some sort of innate cognitive tendency to simplify graded scales (regardless of damage) and reach firm conclusions, a tendency called “black-and-white thinking” in psychology. Stephen Senn labeled that tendency “dichotomania” in the context of covariate modeling; the pathology is even more intense in use of P-values: Important information is thrown away by dichotomization, and in the testing case the dichotomization further generates overconfident claims of having “shown an effect” because p<0.05 and claims of “showing no effect” because p>0.05.

That brings me back to Dan Scharfstein’s query on what to do about journals and coauthors obsessed with significance testing. What I’ve been doing and teaching for 40 years now is reporting the CI and precise P-value and never using the word “significant” in scientific reports. When I get a paper to edit I delete all occurrences of “significant” and replace all occurrences of inequalities like “(P<0.05)” with “(P=p)” where p is whatever the P-value is (e.g., 0.03), unless p is so small that it’s beyond the numeric precision of the approximation used to get it (which means we may end up with “P<0.0001”). And of course I include or request interval estimates for the measures under study.
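The editing rule described above can be sketched in a few lines. This is only an illustration: the function name `format_p` and its `floor` argument are hypothetical, and the 0.0001 threshold is assumed from the “P<0.0001” example in the text rather than taken from any stated numeric-precision rule.

```python
def format_p(p: float, floor: float = 1e-4) -> str:
    """Report the P-value itself; use an inequality only below `floor`.

    `floor` stands in for the numeric precision of the approximation
    used to compute p (an assumed value, not a general standard).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a P-value must lie in [0, 1]")
    if p < floor:
        return f"P<{floor:g}"
    # Two significant digits is an arbitrary display choice.
    return f"P={p:.2g}"


for p in (0.03, 0.0312, 0.00001):
    print(format_p(p))
```

Nothing here decides whether a result is “significant”; the function only controls how the number is displayed.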

Only once in 40 years and about 200 reports have I had to remove my name from a paper because the authors or editors would not go along with this type of editing. And in all those battles I did not even have the 2016 ASA Statement and its Supplement 1 to back me up! Although I did supply recalcitrant coauthors and editors copies of articles advising display and focus on precise P-values. One strategy I’ve since come up with to deal with those hooked on “the crack pipe of significance testing” (as Poole once put it) is to add alongside every p value for the null a p value for a relevant alternative, so that for example their “estimated OR=1.7 (p=0.06, indicating no significant effect)” would become “estimated OR=1.7 (p=0.06 for OR=1, p=0.20 for OR=2, indicating inconclusive results).” So far every time they cave to showing just the CI in parens instead, with no “significance” comment.
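Reporting a P-value for a relevant alternative next to the one for the null can be sketched as follows, assuming a Wald (normal) approximation on the log odds-ratio scale. The function name `wald_p` and the standard error of 0.28 are hypothetical, chosen only for illustration, so the output will not necessarily match the illustrative figures quoted above.

```python
from math import log
from statistics import NormalDist


def wald_p(estimate_or: float, se_log_or: float, test_or: float) -> float:
    """Two-sided Wald P-value for the hypothesis OR = test_or."""
    z = (log(estimate_or) - log(test_or)) / se_log_or
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


or_hat = 1.7      # estimated odds ratio
se_log_or = 0.28  # hypothetical standard error of log(OR)

for hypothesis in (1.0, 2.0):
    p = wald_p(or_hat, se_log_or, hypothesis)
    print(f"p={p:.2f} for OR={hypothesis:g}")
```

The same function works for any test value, which is the point: the null is just one of many hypotheses the data can be compared against.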

Sander · September 7, 2018, 4:52am

P.S. Repeating the post that led to starting this new topic, I offer the following view from the perspective of someone whose main tasks have been analyzing data and writing up the results for research on focused problems in health and medical science (as opposed to, say, a data miner at Google), which could be titled “On the need for valid contextual narratives and cognitive science in applied statistical inference”:

Contextual narratives I see in areas I can judge are often of very questionable validity. So are most frequentist and Bayesian analyses I see in those areas. Bayesian methods are often touted as a savior, but only because they have been used so infrequently in the past that their defects are not yet as glaring as the defects of the others (except to me and those who have seen horrific Bayesian inferences emanating from leading statisticians). Bayesian methodologies do provide useful analytic tools and valuable perspectives on statistics, but that’s all they do - they don’t prevent or cure the worst problems.

All the methods rely rather naively on the sterling objectivity, good intent, and skills of the producer. Hence none of these approaches have serious safeguards against the main threats to validity such as incompetence and cognitive biases such as overconfidence, confirmation bias, wish bias, bandwagon bias, and oversimplification of complexities - often fueled by conflicts of interest (which are often unstated and unrecognized, as in studies of drug side effects done by those who have prescribed the drug routinely).

To some extent each approach can reveal deficiencies in the other and that’s why I advocate doing them all in tasks with severe enough error consequences to warrant that much labor. I simply hold that it is unlikely one will produce a decent statistical analysis (whether frequentist, Bayesian, or whatever) without first having done a good narrative analysis for oneself - and that means having read the contextual literature for yourself, not just trusting the narrative or elicitations from experts. The latter are not only cognitively biased, but they are often based on taking at face value the conclusions of papers in which those conclusions are not in fact supported by the data. So one needs to get to the point that one could write up a credible introduction and background for a contextual paper (not just a methodologic demonstration, as in a stat journal).

Statistics textbooks I know of don’t cover any of this seriously (I’d like to know of any that do) but instead focus all serious effort on the math. I’m as guilty of that as anyone, and understand it happens because it’s way easier to write and teach about neat math than messy context. To remedy that problem without getting very context-specific, and what I think is most needed and neglected among general tools for a competent data analyst, is an explicit systematic approach to dealing with human biases at all stages of research (from planning to reviews and reporting), rather than relying on blind trust of “experts” and authors (the “choirboy” assumption). That’s an incredibly tough task however, which is only partially addressed in research audits - those need to include analysis audits.
It’s far harder than anything in math stat - in fact I hold that applied stat is far harder than math stat, and the dominant status afforded the latter in statistics is completely unjustified (especially in light of some of the contextually awful analyses in the health and med literature on which leading math statisticians appear). That’s hardly a new thought: Both Box and Cox expressed that view back in the last century, albeit in a restrained British way, e.g., see Box, Comment, Statistical Science 1990, 5, 448-449.

So as a consequence, I advocate that basic stat training should devote as much time to cognitive bias as to statistical formalisms, e.g., see my article from last year: “The need for cognitive science in methodology,” American Journal of Epidemiology, 186, 639–645, available as a free download at https://doi.org/10.1093/aje/kwx259.
That’s in addition to my previous advice to devote roughly equal time to frequentist and Bayesian perspectives on formal (computational) statistics.

f2harrell · September 7, 2018, 4:30pm

What I would like to contribute regarding principles and guidelines for the practice of statistics is nowhere as deep as Sander Greenland’s wonderful words, and he has given food for thought for various ideas I’d like to come back to in the future. For now I’d like to contribute these practical ideas.

First, I spend as much time as anything on refining measurements, especially the outcome variable Y. Researchers do not have a good understanding about the negative effect of poor precision/resolution in Y, often wanting to go so far as doing the dreaded “responder analysis.” In general, statisticians spend too little time with measurements. Stephen Senn wrote an excellent article about this. Stunningly, he was criticized by other statisticians who felt that measurement is not the domain of statistics. Far too often, statisticians, especially at pharmaceutical companies but also frequently in academia, have completely abdicated their responsibilities to maximize power and precision by taking for granted clinicians’ choices of measurements, not even pointing out the unnecessary tripling of needed sample size in some cases.
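The cost of a “responder analysis” can be illustrated with a standard large-sample result, under assumptions that go beyond the post: Y is normally distributed with a common SD in both groups, the treatment shifts the mean by a small amount, and the dichotomized analysis compares the proportions above a fixed cutpoint. The function name below is hypothetical.

```python
from statistics import NormalDist


def dichotomization_efficiency(cut_z: float) -> float:
    """Efficiency of comparing proportions above a cutpoint, relative to
    comparing means, for normal Y and a small mean shift.

    cut_z is the cutpoint expressed in SD units from the mean.
    """
    nd = NormalDist()
    p_below = nd.cdf(cut_z)
    return nd.pdf(cut_z) ** 2 / (p_below * (1.0 - p_below))


for cut in (0.0, 1.0, 1.3):
    eff = dichotomization_efficiency(cut)
    print(f"cut at {cut:+.1f} SD: efficiency {eff:.2f}, "
          f"sample size multiplier {1.0 / eff:.1f}")
```

Under these assumptions, even the best cutpoint (the median) retains only 2/π of the information, and a cutpoint about 1.3 SD from the mean roughly triples the sample size needed for the same power. Real outcomes are rarely exactly normal, so the numbers are indicative only.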

I’ve laid out my general principles of my statistical practice here. They are reproduced below.

1. Use methods grounded in theory or extensive simulation
2. Understand uncertainty, and realize that the most honest approach to inference is a Bayesian model that takes into account what you don’t know (e.g., Are variances equal? Is the distribution normal? Should an interaction term be in the model?)
3. Design experiments to maximize information
4. Understand the measurements you are analyzing and don’t hesitate to question how the underlying information was captured
5. Be more interested in questions than in null hypotheses, and be more interested in estimation than in answering narrow questions
6. Use all information in data during analysis
7. Use discovery and estimation procedures not likely to claim that noise is signal
8. Strive for optimal quantification of evidence about effects
9. Give decision makers the inputs (other than the utility function) that optimize decisions
10. Present information in ways that are intuitive, maximize information content, and are correctly perceived
11. Give the client what she needs, not what she wants
12. Teach the client to want what she needs

SameeraDaniels · September 7, 2018, 5:29pm

Great set of insights Sander. I concur.

Thank goodness for that Edit Feature Frank.

jafp · September 7, 2018, 6:03pm

Excellent insights so far.

Applied statistics is something I hold dear and with which I have a few years of experience, even though I don’t identify myself as a Statistician: my degree is in Medicine and I am mostly self-taught. I completely agree with Sander that it is probably much harder than Mathematical Statistics, at least judging from the number of medical papers I review with glaring mistakes in the methodology.

I am mostly Bayesian nowadays but I still use frequentist methods when the need arises or they are the most useful tool for the problem I am working on but I really don’t believe they are the saviours of the future of science.

Personally I think many mistakes arise from clinician researchers not having a fine command of epidemiologic principles such as causal inference, study design, and measurement of variables, or good support in proper applied statistical techniques. (In my country, most research is still done with t-tests or ANOVAs, which ends up leading to the horrible dichotomization that Frank comments on so well.) These gaps end up destroying a study even before it is carried out.

I have learned most about how to avoid those biases (and I still trip on them quite a lot) by reading Epidemiology books and articles in depth rather than Statistics books, so maybe the way forward is to reformulate the curriculum in medical schools regarding how Epidemiology and Biostatistics are taught.

Another scary divergence in applied Statistics I would like to comment about regards something I have seen quite a lot lately in my experience: Biostatistics vs Epidemiology, as in many Epidemiologists viewing Statisticians as mere technicians that they call when they need new methods used or developed and Statisticians seeing Epidemiologists as ignorant due to the lack of mathematical background and confidence in following proofs.

I don’t know, but from my short experience I cannot remember a time when proving the CLT or the Gauss-Markov theorem by hand helped me do better research, whereas there were many times when I would have liked to know more about bias, measurement and study design.

I think good applied clinical research can probably only be done with a good intersection between both modern Biostatistics and Epidemiology which should be better taught to physician scientists.

Stephen · September 8, 2018, 8:24pm

This is a very interesting post. Twenty years ago I had a read paper (it was one of a set of four) to the Royal Statistical Society arguing that one had to be careful not to surrender statistics to mathematics. I freely admitted that I wished I knew more mathematics. I think I would be a better statistician if I did. Yet, I also gave a number of examples where I felt that concentrating on the mathematics had got in the way of the science. The examples mainly involved the solution (often by some ingenuity) of formal problems that were irrelevant in that they neither captured the scientific essence of the problem being studied nor did they express what the modeller (or anyone else) believed.

Here is a quote from de Finetti that I particularly like and used in that paper:
“In the mathematical formulation of any problem it is necessary to base oneself on some appropriate idealizations and simplification. This is, however, a disadvantage; it is a distorting factor which one should always try to keep in check, and to approach circumspectly. It is unfortunate that the reverse often happens. One loses sight of the original nature of the problem, falls in love with the idealization, and then blames reality for not conforming to it.”

However, I think that Sander is a little too negative about statistics. There are some books that do stress good practice and give useful advice, but I agree that there are many that just do theory. In the mathematical statistics course I gave at university, I used to mix in a few examples where simple random sampling (the standard assumption) did not apply but there were various ‘data filtering’ processes at work, just to keep the students on their toes (in amongst all that stuff about sufficiency, consistency etc etc). The students hated it, of course, but it was good for them.

When I worked in the pharmaceutical industry as head of a statistics group I used to make it quite clear to the members of my group that they had to be able to answer “yes” to this question: “Am I really interested in finding out about the effects of drugs?”

I also heartily agree with point 11 of Frank’s. I remember attending a lecture given by a statistical consulting unit in which they stressed the importance of giving the client what they wanted. Being a nasty sort I then asked whether this was the statistical equivalent of what Arthur Andersen did for Enron.

Reference

Senn SJ. Mathematics: governess or handmaiden? Journal of the Royal Statistical Society Series D-The Statistician. 1998;47(2):251-259.

PaulBrownPhD · September 8, 2018, 11:04pm

Whatever happened to IDA (initial data analysis)? Chatfield wrote a book about it [maybe this is it?: Problem solving] and a paper in '85 [JRSS]. Are junior statisticians still taught this? Back in the day we’d take real care and pride in crafting data listings to extract maximum value out of them. Last time I produced a data listing my clinical colleagues wondered what it was for.
@f2harrell said “taking for granted clinicians’ choices of measurements, not even pointing out the unnecessary tripling of needed sample size in some cases.” What about the composites promoted by clinicians, e.g., a clinical composite that reduces a number of variables to ‘improved’, ‘no change’, ‘worsened’? I heard a rumour that @f2harrell and @Stephen wrote a blog post about composites but I have never found it.
@Sander spoke about p-values and Bayes. What about the new suggestions floating around in the literature, e.g., @DavidColquhoun’s false positive risk, 2nd generation p-values, or the fragility index?

Frankly it bothers me that we have left it to our non-statistical colleagues to propose the “fragility index” and composite endpoints with very little push back from statisticians. It seems we are hardly in the conversation.

edit: here is a letter we wrote trying to push back against the clinical composite: https://www.ahajournals.org/doi/pdf/10.1161/CIRCULATIONAHA.116.026767

Sander · September 11, 2018, 3:54pm

That was a great post Stephen, although I was not sure which aspect of my posts you thought too negative on statistics. Perhaps your reaction reflects our specialty difference: I think you’ve worked mainly in experiments, where randomization greatly simplifies causal inferences; in contrast, I’ve been almost entirely in nonexperimental research, where formal statistics misleads easily without an explicit causal model to guide it (yet such models remain absent from most statistics textbooks). That ties in to our earlier divergences, of which there are very few: For example, odds ratios become most problematic and propensity scores become pivotal under potential-outcome and related causal models.

ADAlthousePhD · September 11, 2018, 5:58pm

Thanks to the experts who have taken the time to post here and start this thread.

Sander:

That brings me back to Dan Scharfstein’s query on what to do about journals and coauthors obsessed with significance testing. What I’ve been doing and teaching for 40 years now is reporting the CI and precise P-value and never using the word “significant” in scientific reports. When I get a paper to edit I delete all occurrences of “significant” and replace all occurrences of inequalities like “(P<0.05)” with “(P=p)” where p is whatever the P-value is (e.g., 0.03), unless p is so small that it’s beyond the numeric precision of the approximation used to get it (which means we may end up with “P<0.0001”). And of course I include or request interval estimates for the measures under study.

Only once in 40 years and about 200 reports have I had to remove my name from a paper because the authors or editors would not go along with this type of editing.

I have tried a similar approach, with middling success, although that was at least partially a function of a very statistician-unfriendly environment with a few bad actors at the top of the pyramid. Some authors are willing to go along with this, albeit often with hesitation; others balk at presenting any p-value without a statement about significance, and simply re-insert their statements about the significance of effects after I have removed them. I lay the blame for this at least partially on the journal and editorial process; there’s a self-perpetuating cycle where reviewers demand statements about significance, journals pass on those review comments unchecked, authors then believe that they must make a statement about the “significance” (or even worse the dreaded “trending towards significance…”) and they return to me stating that they feel they must comment on “significance” of results because reviewers won’t like it, or it makes the paper seem less interesting, etc. If the editorial process did a better job letting authors know that it was okay, even preferred, to eschew statements about significance and simply present their effect size, CI, and a p-value if desired, I think this would be more widespread.

While I appreciate your skepticism about Bayesian analyses if not performed correctly, one of my perceived advantages of Bayesian reporting would be the natural replacement of NHST p-values with a posterior probability that seems more intuitively useful in context of the clinical question. Isn’t it more appealing to report results as “based on the observed difference between groups, there’s a 93% posterior probability that Drug > Placebo” as opposed to “there was a 7% chance of the observed difference between groups if Drug = Placebo” (numbers made up, of course)?

*I say this as a trained frequentist statistician who is attempting to learn some Bayesian on the fly, albeit only in theory as yet, not many chances to bring it to application just yet

**I know the above interpretations are a little oversimplified

This fascinates me, but I think I understand what you’re saying. Merely pumping out statistics to write papers is relatively easy (as I’ve learned early in my career, being in environments where production of large amounts of papers with whatever data could be got was prized over careful design and thoughtful analyses). Doing applied stats well is extremely hard.

I think statisticians are still taught this; IMO, it is usually people who don’t spend much time working with data who tend to gloss over the need for EDA, or think I’m wasting time by looking at descriptive statistics and distributions of all key variables in the dataset. They just want to know if the p-value for their main comparison is significant; why waste time looking at some boxplots and dotplots?

Two reasons for this:

they underestimate the likelihood that the data contains errors which can be turned up by EDA (I’ve seen people analyzing negative survival times that were clear data-entry errors; “9999” values that are supposed to indicate missing being treated as real data; and other equally ridiculous stories)
they underappreciate the need to understand underlying distributions of data before performing analyses to have an accurate model
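The kinds of errors mentioned in the first point can be screened for before any modeling. A minimal sketch follows; the function name, the set of sentinel codes, and the example values are all hypothetical, and in practice the codes would come from the study's data dictionary.

```python
MISSING_CODES = {9999, -9999}  # assumed sentinel codes for "missing"


def screen_survival_times(times):
    """Return the positions of values that need review before analysis."""
    flagged = {"negative": [], "sentinel": []}
    for i, t in enumerate(times):
        if t in MISSING_CODES:
            # A missing-value code that would otherwise be read as real data.
            flagged["sentinel"].append(i)
        elif t < 0:
            # A survival time cannot be negative; likely a data-entry error.
            flagged["negative"].append(i)
    return flagged


times = [12.5, 30.0, -4.0, 9999, 7.2]  # made-up values
print(screen_survival_times(times))
```

A screen like this finds only the errors someone thought to look for, which is why it complements plots and listings rather than replacing them.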

Anyone that’s chatted with me here or on Twitter knows how much I rant about this, but it’s true: many clinical investigators think that statistical analysis is a one size fits all recipe from a cookbook: here’s Table 1, here’s a multivariable regression, these are the p<0.05’s, have a nice day.

Sensible use of composites can be beneficial, especially if they’re analyzed as ordinal outcomes rather than just lumped together (sadly, in the cardiovascular literature, they’re often just lumped together as Death / MI / Stroke or something similar - you alluded to something similar in your linked piece with the “any versus none” composite).

Surely, our own backgrounds all inform the way we comment on such matters. I think this is a good point nonetheless.

pbstark · September 11, 2018, 6:43pm

At Sander’s suggestion, I’m posting my list of guidelines for young applied statisticians.

Cheers,
Philip

Consider the underlying science. The interesting scientific questions are not always questions statistics can answer.
Think about where the data come from and how they happened to become your sample.
Think before you calculate. Will the answer mean anything? What?
The data, the formula, and the algorithm all can be right, and the answer still can be wrong: Assumptions matter.
Enumerate the assumptions. Check those you can; flag those you can’t. Which are plausible? Which are plainly false? How much might it matter?
Why is what you did the right thing to have done?
A statistician’s most powerful tool is randomness—real, not supposed.
Beware hypothesis tests and confidence intervals in situations with no real randomness.
Errors never have a normal distribution. The consequence of pretending that they do depends on the situation, the science, and the goal.
Worry about systematic error. Systematically.
There’s always a bug, even after you find the last bug.
Association is not necessarily causation, even if it’s Really Strong association.
Significance is not importance. Insignificance is not unimportance. (Here’s a lay explanation of p-values.)
Life is full of Type III errors.
Order of operations: Get it right. Then get it published.
The most important work is often not the hardest nor the most interesting technically, but it requires the most patience: a technical tour-de-force is typically less useful than curiosity, skeptical persistence, and shoe leather.

Sander · September 11, 2018, 7:50pm

Thanks ADA for the perceptive comments!
One of your comments raised concerns for me:
“One of my perceived advantages of Bayesian reporting would be the natural replacement of NHST p-values with a posterior probability that seems more intuitively useful in context of the clinical question. Isn’t it more appealing to report results as “based on the observed difference between groups, there’s a 93% posterior probability that Drug>Placebo” as opposed to “there was a 7% chance of the observed difference between groups if Drug = Placebo” (numbers made up, of course)”
Yes that’s a perceived advantage of Bayes, but “perceived” does not mean genuine. “Numbers made up” is the matter of course for Bayes since one has to make up a prior. My biggest fear for Bayes is the tendency to take phrases like “there’s a 93% posterior probability …” as if 93% were a singular quantity: There’s an infinite number of posterior probabilities, depending on the prior. So at best you could report “Using our prior and data-generation model (DGM) there’s a 93% posterior probability”; but thanks to the prior that will be contestable (and should be contested) in even more ways than the P-value (which is also sensitive to the DGM).
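The dependence of the posterior probability on the prior can be made concrete with a small sketch. It assumes the simplest possible setting, a normal likelihood for the effect estimate and a normal prior, and every number in it (the estimate, its standard error, and the three priors) is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def prob_benefit(estimate, se, prior_mean, prior_sd):
    """Posterior P(effect > 0) under a normal likelihood and normal prior."""
    w_data, w_prior = 1.0 / se**2, 1.0 / prior_sd**2
    post_var = 1.0 / (w_data + w_prior)
    post_mean = post_var * (w_data * estimate + w_prior * prior_mean)
    return 1.0 - NormalDist(post_mean, sqrt(post_var)).cdf(0.0)


estimate, se = 1.5, 1.0  # hypothetical Drug - Placebo difference and its SE
priors = {
    "diffuse (mean 0, SD 100)": (0.0, 100.0),
    "skeptical (mean 0, SD 1)": (0.0, 1.0),
    "enthusiastic (mean 1, SD 1)": (1.0, 1.0),
}
for label, (m, s) in priors.items():
    print(f"{label}: P(Drug > Placebo) = {prob_benefit(estimate, se, m, s):.3f}")
```

The same data give a different “posterior probability that Drug > Placebo” under each prior, so a reported figure such as 93% is meaningful only alongside the prior and data-generation model that produced it.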
The prior opens up a whole new avenue for biasing and gaming the results. When told to switch from P-values to confidence intervals (CIs), many researchers mastered the art of hacking the intervals to include the hypotheses they liked and exclude the hypotheses they didn’t like; others simply learned to describe CIs to focus the reader toward desirable and away from undesirable results. With the prior in your control it’s even easier to hack posterior intervals than CIs, especially to include nulls (few question null-centered priors even when they abhor NHST); plus, all the biased-description skills learned for P-values and CIs transfer wholesale to posterior probabilities and intervals.
That said, I have long taught and promoted understanding of Bayesian methods, have published studies that used them, and think Bayesian perspectives on methods are as essential as frequentist perspectives. But while Bayesian methods can be an invaluable supplement to simpler methods, I doubt they are a safe substitute for those methods: As I see it, the problems of human bias that plague analyses and reporting are simply too deep to be addressed by a switch to Bayes or any statistical method.

Sander · September 11, 2018, 8:03pm

I nominate this list to be printed out and posted on every stat instructor’s office door. (I may be dating myself in not saying it should be texted to all stat instructors and students…Phil will hopefully not mind that I recall several of the items from my mercifully brief but deeply instructive stint as an intro-stat teaching assistant for David Freedman.)

PaulBrownPhD · September 11, 2018, 10:26pm

Frank Harrell has a neat tendency to produce concise maxims: https://andrewgelman.com/2017/01/14/frank-harrell-statistics-blog/

Statistics needs to be fully integrated into research; experimental design is all important

Don’t be afraid of using modern methods

Preserve all the information in the data; Avoid categorizing continuous variables and predicted values at all costs

Don’t assume that anything operates linearly

Account for model uncertainty and avoid it when possible by using subject matter knowledge

Use the bootstrap routinely

Make the sample size a random variable when possible

Use Bayesian methods whenever possible

Use excellent graphics, liberally

To be trustworthy research must be reproducible

All data manipulation and statistical analysis must be reproducible (one ramification being that I advise against the use of point and click software in most cases)
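As one illustration of the bootstrap maxim in the list above, here is a minimal percentile-bootstrap sketch for the median. It is a generic textbook version, not a method taken from the linked post; the function name, the data, the number of resamples, and the fixed seed (used so the result is reproducible) are all assumptions.

```python
import random
from statistics import median


def bootstrap_percentile_ci(data, stat=median, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `stat`, resampling with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the analysis reproducible
    n = len(data)
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = round((1.0 - level) / 2.0 * n_boot)
    hi_idx = round((1.0 + level) / 2.0 * n_boot) - 1
    return reps[lo_idx], reps[hi_idx]


# Made-up measurements with a long right tail.
data = [2.1, 2.4, 2.9, 3.0, 3.3, 3.8, 4.1, 4.4, 5.0, 5.7, 6.2, 7.9, 9.5, 12.0, 18.3]
low, high = bootstrap_percentile_ci(data)
print(f"median = {median(data)}, 95% percentile interval = ({low}, {high})")
```

The percentile interval is the simplest variant and can perform poorly in small samples; it is shown here only because it fits in a few lines.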

timdisher · September 12, 2018, 12:17pm

Curious to hear your thoughts on frequentist and Bayesian intervals converging (at least in my applications) under diffuse or large/improper uniform priors? Wouldn’t this imply that the p-value should come under the same prior criticism as Bayesian methods (i.e. the frequentist result is just one of the infinite results dependent on prior specification)? Gelman in particular seems to levy this criticism often in regards to priors that assign significant probability to impossible/highly unlikely values.
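The numerical convergence raised in this question can be checked in the simplest case. The sketch assumes a normal likelihood with known standard error and a normal prior centered at zero; the function names and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(estimate, se, level=0.95):
    """Frequentist interval: estimate +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se


def posterior_interval(estimate, se, prior_sd, prior_mean=0.0, level=0.95):
    """Central posterior interval under a normal prior (conjugate update)."""
    w_data, w_prior = 1.0 / se**2, 1.0 / prior_sd**2
    post_var = 1.0 / (w_data + w_prior)
    post_mean = post_var * (w_data * estimate + w_prior * prior_mean)
    return wald_interval(post_mean, sqrt(post_var), level)


estimate, se = 1.5, 1.0  # made-up estimate and standard error
print("confidence interval:", wald_interval(estimate, se))
for prior_sd in (1.0, 10.0, 1000.0):
    print(f"prior SD {prior_sd:g}:", posterior_interval(estimate, se, prior_sd))
```

As the prior SD grows, the posterior interval approaches the confidence interval numerically. Whether that numerical agreement makes the two comparable in meaning is a separate question that the code cannot settle.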

f2harrell · September 12, 2018, 2:11pm

That raises a lot of issues that I hope will be addressed in a dedicated topic comparing Bayesian and frequentist inference. The fact that p-values have less to contest is in my mind only deferring the question of how to translate the p-value to a decision, which requires IMHO more assumptions than Bayes. And Bayes makes its assumptions apparent. I wouldn’t say “numbers made up” unless the prior is a point mass. More to follow elsewhere.

Sander · September 12, 2018, 5:22pm

Thanks Tim for the questions. What follows also prefaces my response to Frank’s reply. Since it’s so long I’ll give you this short answer: Yes, all statistical methods share some points of criticism to the extent they share assumptions and derive inferences from them using the same deductive chains. But no, they diverge (often vastly) in the information content of their assumptions and deductions.

To understand where my answers are coming from, I want to belabor this point: A chronic problem in these frequentist-Bayesian debates is that they get framed as if it’s some either/or choice, or as if doing an analysis requires some religious commitment to some philosophy of inference. To repeat again the old comeback: A skilled carpenter knows how to join wood using nails, screws, and other means (dowels, glue, bolts, etc.), knows the strengths and weaknesses of each method, what needs each meets, and thus when and where to use each - and can use them in combination. A lot depends on the underlying material (softwood? hardwood? composite?), environment (outside? freezing? humid? dry? inside?), purpose (load-bearing? shearing? decorative?). Same for applied statistics: frequentist, Bayes, likelihoodist and other methods each have strengths and weaknesses. Their utility depends on the underlying material (survey? administrative data? data from a trial?), environment (narrow academic? engineering? politically sensitive?), purpose (speculative evidence? financial decision? quality control? litigation?). The endless branching of important contextual considerations makes sweeping academic and philosophical commitments and claims (including mine) look, well, academic. (And none of this is original to me; e.g., Box and Cox among others had clarified it all in the 1970s, before I finished my education.)

With that preface: A major flaw I see in these debates is exclusionary thinking, which results in users trying to squeeze answers out of methods that aren’t suited for the question (it’s like using only screws when you need shear strength or only nails when you need tensile strength). Considered in isolation, frequentist P-values (FP) measure only predictive errors and thus are only refutational measures (forming an extension of modus tollens in logic): They simply reflect how far the data fell from the model used to compute p; they are NOT support measures. Comparing them to posterior probabilities (PP) is dangerous because FP are not PP in concept and don’t even address the same question. PP can measure model support (forming an extension of modus ponens), but to do so they have to explicitly and severely restrict the space of models being considered (and so are only measuring relative support). That is where the prior comes in and is where the extra weakness of PP lies - the prior introduces extra assumptions and supplies extra researcher degrees of freedom, which can be used or misused; and unless constructed directly from frequency data as a frequency model (as in empirical Bayes) it also alters the meaning of the output. With that view, the frequentist-Bayes distinction might be better described as one of falsificationist vs. confirmationist goals.

It just so happens that FP can often be used to numerically bound PP. That connection is useful to understand, but that does not make FP comparable to PP in concept. What we can say universally is that any assumption used to compute any statistic is a potential weakness in any inference based on the statistic. So an assumption used to compute both FP and PP (e.g., that the analyst did not fish out the desired result) is a weakness of both. But that does not make them inferentially comparable. FP is inherently weaker in the precise way favored by falsificationists: It can only weigh against the assumptions used to compute it by being very small (and the information content of FP is better grasped by taking its negative base-2 log). A large FP says nothing except that this particular FP missed whatever defects are present in the assumptions. PP is more ambitious: A large PP asserts some static condition is probable given the assumptions. Following the NFLP (No-Free-Lunch-Principle), a user of PP buys these strong assertions by paying the price of asserting (rather than just testing) the assumptions used to compute PP.
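
To make the "negative base-2 log" remark concrete, here is a minimal Python sketch. The function name `s_value` is a label chosen for this illustration (the transform is sometimes called the surprisal or S-value), and the p-values in the loop are arbitrary examples, not results from any study.

```python
import math

def s_value(p):
    """Return -log2(p): the information, in bits, that a P-value p
    supplies against the model (all assumptions) used to compute it."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# Arbitrary example P-values, for illustration only.
for p in (0.5, 0.25, 0.05, 0.005):
    print(f"p = {p:<6} -> {s_value(p):.2f} bits against the model")
```

On this scale p = 0.05 carries only about 4.3 bits, roughly the surprise of seeing four heads in a row from a coin assumed to be fair, while a large p carries close to zero bits, which matches the point that a large FP says almost nothing.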

Now Bayesians rightfully object that frequentists do the same thing, and that’s true: frequentists do want to make positive assertions and so commit the inversion (prosecutor’s) fallacy informally. That’s easy to spot whenever someone claims data support the null because the null FP was large (not only benighted researchers do this - just check out statements of null-motivated statistics professors, e.g., as lamented in “Null misinterpretation in statistical testing and its impact on health risk assessment,” Preventive Medicine, 53, 225-228.).

With all that, I think I’ve earned the indulgence of a strident coda: I can already hear all the technical objections some will raise to the loose descriptions I just gave. They’ve all been addressed in excruciating detail many times in the past, and leave untouched the fact that applied statistics needs to wean itself off the distortive debates of academic statistics and philosophy. Some still write as if magic beliefs and formulas (like falsificationism or Bayes’ theorem) will cure the cancer of cognitive bias that can and often does afflict every stage of research. But we can never find anything nearing a foolproof or singular approach to either health or inference problems. We can only do our best to promote health with multiple nonexclusive preventive and treatment strategies (some of which, like extreme diets, will turn out to be harmful); and we can do our best to promote accurate reporting and inferences with multiple nonexclusive design and analysis strategies (some of which, like NHST, will turn out to be harmful). Only time will tell what works where and when, and the falsificationists have a major point in that it will do so mainly through trial and error.

Sander · September 12, 2018, 6:37pm

“The fact that p-values have less to contest is in my mind only deferring the question of how to translate the p-value to a decision, which requires IMHO more assumptions than Bayes.”

Sorry Frank, while I agree about “deferring,” IMHO the idea that frequentist decisions require more assumptions than Bayes decisions is just fallacious: The same assumptions will be needed for the same rational decisions with either approach when using the same criteria for “rational”. They just tend to default to different criteria (e.g., idealized Bayes uses one-step ahead coherency under perfect specification; frequentists use repeated-sampling criteria which may vary with the problem). The only question is how transparently the assumptions will be framed. Most transparent I think will be to show both derivations. If you can’t make them match, they are based on different assumptions or criteria, and those differences need to be brought out.

Bayes wraps many assumptions needed for decision into the prior (e.g., where did you get all those shape assumptions?). That prior is supposed to be an analysis-design or protocol function, determined independently of the analysis data (unless you do empirical Bayes). The resulting posterior uses all those assumptions plus all the data-generation assumptions used for the P-value. In addition Bayes decisions need a loss function. The usual oversimplification of behavioral (Neymanian) frequentism tries to compress all that into decision cutpoints; I’m no fan of that as it can hide and produce severe irrationalities (and can be totally irrational if one always defaults to the null to test; even Neyman warned against that), but that’s not the P-value’s fault. Rational behavioral (Wald) frequentism instead parallels idealized Bayes: It needs a loss function, and Wald’s theorem shows you won’t get “optimal” decisions out of that approach unless you use a prior. So a well-informed and rational decision-maker will end up facing the same problem and process: How to quantify in the prior and model the contextual information about what is going on, and how to quantify a loss function. Both are harder problems than how to crank the math once you have these ingredients specified.

If you agree with that, maybe the big split here is whether we need or should even try for a decision. Why do we have to force a decision out of an analysis? Why can’t you present a P-value function (aka “confidence distribution”) or at least several points on it and leave subsequent decisions alone? That is, you describe the study process and resulting data in detail, and then present your best judgment about what the data aren’t telling us that we want or need to know (and so “embrace uncertainty”). In my applications, the usual conclusion is “if you want as much certainty as you are asking for, you will have to get more and possibly better data (which may be unaffordable or otherwise infeasible)” - and anything more definitive like a decision about the null is just overconfidence (I’ve had exceptions but they are exceptional, with RRs of 50 or even infinity with p<10^-5). Bayes does not solve this core uncertainty problem, although if used as in nonidentified-bias analysis it can illustrate it.
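
As a rough illustration of presenting several points on a P-value function, here is a minimal Python sketch. It assumes a normal approximation for an estimated log risk ratio; the estimate, standard error, and the name `p_value_function` are all hypothetical and chosen only for illustration.

```python
import math

def p_value_function(estimate, se, theta):
    """Two-sided P-value for the hypothesis that the true parameter
    equals theta, under a normal approximation for the estimate."""
    z = (estimate - theta) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical inputs: estimated risk ratio of 2.0 with SE 0.35 on the log scale.
est, se = math.log(2.0), 0.35

# Tabulate the function over a grid of hypothesized risk ratios,
# not only at the null value RR = 1.
for rr in (0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0):
    p = p_value_function(est, se, math.log(rr))
    print(f"RR = {rr:<4} p = {p:.3f}")
```

Reading across the table shows which effect sizes the data are and are not in conflict with under the assumed model, without singling out the null or forcing a yes/no decision.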

None of that addresses requirements for FDA submissions and the like, but those issues go far beyond the present scope into politics and the like (BTW I have presented hierarchical Bayesian methods for postmarketing surveillance at the FDA).

SameeraDaniels · September 13, 2018, 4:36pm

We face some of these same issues across disciplines, particularly in discussing appropriate standards of proof to apply in different contexts.

ADAlthousePhD · September 13, 2018, 6:01pm

Oh, boy, this opens a whole 'nother discussion which I am extremely here for.

As the statistician, I would be only too happy to present the relevant probabilities and leave subsequent decisions alone. Unfortunately (as you well know) the end user often wants a decision, an answer, some sort of reasonably confident “yes” or “no” statement about the effect(s) of interest. Is atrial fibrillation a risk factor for readmission in heart failure patients? Are candidates for lung transplant with severe pulmonary hypertension more likely to experience a rejection if they get a transplant? In many cases, I would be perfectly happy to simply provide the best-fitting model for your study question / data and leave it alone from there. Is the end user going to be happy presenting the same? I doubt it.

This is a bit of an aside, but even we statisticians are not always good at grasping or estimating probabilities, never mind people without any quantitative training. I am currently working on a study where patients with atrial fibrillation were asked what they thought their risk of stroke in the next year was - the responses varied wildly, with a bunch of 0’s, a bunch of round numbers like “10 percent” and then some people tossing out “50 percent” or “75 percent” (the annual stroke risk for most patients in the study population is likely somewhere between 0 and 10 percent). So I’m not terribly confident that I can just hand some probabilities to the end user and expect them to interpret correctly.

Sander · September 13, 2018, 9:46pm

Good points ADA. In all regards my original comments must be circumscribed by the environment in which I’ve done almost all my work, where my charge is to explain the information that I see the data as (not) providing, e.g., where the data and context do not provide a sound basis for declaring absolute safety, and only limits on harms can be formulated (with those limits often uselessly high for the decision-makers). My end users almost never expected or demanded more because any decisions were reserved for them or the consumers of the report. In fact I have often seen some statisticians try to inappropriately force decisions on such consumers by declaring their statistics proved harm or safety.

Now if your charge really is to offer a discrete decision about a substantive issue - e.g., recall a product or not - then you are in a much more complex and hazardous setting, and I could only point to my discussion of how frequentist and Bayes decision procedures can converge via Wald’s theorem. I have only seen conflicts emerging from the often vastly different heuristics (incoherent or inconsistent but supposedly cost-beneficial shortcuts) that prevail in the two “schools”. A classic academic illustration of that conflict is the Jeffreys-Lindley paradox, where the Neyman-Pearson heuristic of alpha-level testing using the maximum-likelihood ratio will in large enough samples conflict with the Jeffreys Bayesian heuristic of using a prior point mass (spike) on the tested hypothesis. Though both heuristics have been called brilliant (by non-overlapping subsets of statisticians, of course) I find both distasteful due to the contextual absurdities they entail in soft-science research, and I think both are unnecessary for decision making in light of modern alternatives that instead lead to convergent decisions using explicit loss functions.
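
The Jeffreys-Lindley conflict can be seen numerically in a minimal Python sketch. The setup is a textbook one assumed here for illustration, not taken from the posts above: a normal mean with known sigma, a spike at theta = 0 with prior probability 1/2, and a N(0, tau^2) prior on theta under the alternative. The function names and the default values of `tau` and `sigma` are hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided P-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

def bf01(z, n, tau=1.0, sigma=1.0):
    """Bayes factor for H0: theta = 0 against H1: theta ~ N(0, tau^2),
    given a sample mean with z = sqrt(n) * xbar / sigma."""
    r = n * tau**2 / sigma**2  # prior variance relative to sampling variance
    return math.sqrt(1 + r) * math.exp(-0.5 * z**2 * r / (1 + r))

def posterior_h0(z, n, prior_h0=0.5, **kwargs):
    """Posterior probability of the spike at theta = 0."""
    odds = bf01(z, n, **kwargs) * prior_h0 / (1 - prior_h0)
    return odds / (1 + odds)

# Hold z fixed at 1.96 so the P-value stays near 0.05 while n grows.
z = 1.96
for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  p = {two_sided_p(z):.3f}  "
          f"BF01 = {bf01(z, n):.2f}  P(H0 | data) = {posterior_h0(z, n):.3f}")
```

With z held fixed, the alpha-level test rejects at every sample size, while the Bayes factor in favor of the spike grows roughly like the square root of n. The disagreement comes from the two heuristics (a fixed cutoff versus a prior point mass), not from the data.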