Necessary/recommended level of theory for developing statistical intuition

Superb, this discussion is taking a very productive turn!

According to the CME, SPAN is a Value at Risk (VaR) based system.

While I am not immediately aware of failures of SPAN in US options trading (though I will now start looking for them), the use of VaR was implicated in the financial crisis of 2008. For instance, the failure of Northern Rock supposedly shouldn’t have happened based on a VaR analysis (an example discussed by Kay and King, hereafter K^2). K^2 write that

Although value at risk models may be of some use in enabling banks to monitor their day-to-day risk exposures, they are incapable of dealing with the ‘off-model’ events which are the typical cause of financial crises.

Riccardo Rebonato (I thank @R_cubed for pointing me to this author) states that VaR analysis is an “interesting and important measure” of risk “when nothing extraordinary happens”. But of course in our lifetimes markets have experienced “extraordinary” events on multiple occasions…because financial markets are actually “large world” arenas. And as we’ve seen, regulators in 2008 took a very improvisational approach to handling such crises…as far as I know, they did not rigidly follow the prescription of some probability-based framework (and thus, their behavior could not be so modeled by other market participants).

It seems that K^2 and Rebonato are saying that VaR has its honored place in risk modeling, but cannot be depended upon to fully assess risk when markets are changing rapidly (non-stationarity), which is precisely a “large world” issue. VaR works best when the “large world” system is behaving (temporarily) in a “small world” fashion.
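To make this concrete, here is a minimal sketch of historical-simulation VaR on synthetic returns (my own illustrative construction, not the CME’s SPAN methodology). The point: the VaR cutoff says nothing about how bad losses *beyond* it can get, which is exactly where “off-model” events live.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical daily returns: mostly calm ("small world") with occasional
# fat-tailed shocks that a normal model would underweight.
calm = rng.normal(0.0, 0.01, 950)
shocks = rng.standard_t(df=2, size=50) * 0.03
returns = np.concatenate([calm, shocks])

def historical_var(returns, level=0.99):
    """One-day Value at Risk by historical simulation:
    the loss exceeded on only (1 - level) of past days."""
    return -np.quantile(returns, 1.0 - level)

var_99 = historical_var(returns, 0.99)

# VaR is silent about the severity of losses beyond the cutoff;
# expected shortfall (mean loss given a VaR breach) is one partial answer.
tail = returns[returns <= -var_99]
es_99 = -tail.mean()
print(f"99% one-day VaR: {var_99:.3f}, expected shortfall: {es_99:.3f}")
```

Even expected shortfall is computed from the same historical sample, so neither number can see an “off-model” event that has not yet happened.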

I don’t know of anyone who thinks financial markets are perfectly modeled by simple probability distributions, or that VaR is a complete methodology.

It is fundamental to me that markets are incomplete in the Arrow-Debreu sense: there exist more risks than there are securities to hedge them.

It is also an interesting question whether these financial “innovations” end up increasing risk, rather than simply transferring it (i.e., moral hazard).

A very thorough (open source) text on VaR methods (with utility to other areas of applied stats) can be found here:

My biggest objection (which I will elaborate on later) is their recommendation to compare and contrast different “narratives”. This entire notion of “narratives” has created a cargo cult of “experts” who advise elected representatives, who in turn do not communicate information to the people they are supposed to represent, but manipulate them in the fashion of Thaler’s “nudges.”

A detailed critique can be found here:

I essentially agree with this clip from an FT review

The authors of this book belong to the elite of British economic policymaking …Their alternative to probability models seems to be, roughly, experienced judgment informed by credible and consistent “narratives” in a collaborative process. They say little about how those exercising such judgment would be held to account.

This is a fascinating discussion. I would like for us to return more to the original topic.


Frank: I think the discussion between Tong and R-cubed is most relevant for the thread topic of the necessary level of theory. They are discussing how theory as conceived in statistics is a very narrow window on reality, constrained as it is by formal mathematical (probability) conceptual models. In health-med sci this theory is often (I’d say typically) oversimplified in a fashion that makes study results appear far more precise than they should, stoking grossly overconfident inferences in the form of understated error rates and overstated posterior probabilities. The question is then how to moderate overemphasis on mechanical mathematics with coverage of heuristics and judgment, an issue raised by those like Tukey and Hamming some 50-60 years ago.

Like misuse of NHST, this theory/reality tension remains largely unaddressed because so many of those tasked with stat teaching know little about how formal theory enters into heuristics and judgment, having themselves been forced to focus on formal theory to have a successful academic career in statistics. Aggravating that is time pressure: Formal theory being so algorithmic makes it far easier to implement in teaching and examinations; whereas good coverage of judgment and its pitfalls requires knowledge of real research areas and historical facts, as raised by Tong and R-cubed, and has far fewer instructor-friendly sources. The question is then: How do we go beyond toy examples to integrate real case studies into teaching, so that students can develop sound intuitions about the proper uses and shortcomings of formal statistics theory and its computer outputs?


At the end of this post, I will (attempt to) tie the following comments to the original topic.

Kay and King are skeptical of behavioral economists like Thaler (whom they specifically critique), though this is not obvious from Boissonnet’s book review, which I hadn’t seen until now. Despite this, I recommend Boissonnet’s piece as a better summary of K^2 than I have given on this thread (kudos to @R_cubed for sharing it here). Boissonnet’s principal criticism seems to be that K^2 ignore volumes of academic literature that purport to address the flaws in statistical and economic reasoning that K^2 critique. Boissonnet’s perspective would be more compelling if some of the academic theories he cites had been deployed successfully in the field. This is why I have been asking for concrete examples.

Note that no one expects models to be perfect: George Box said “all models are wrong, but some are useful”. Let’s find these useful examples in “large world” problems, as they may help us sharpen our statistical intuition when it’s our turn to cope with such problems. VaR is a compelling example, but has inherent limits, as I’ve argued above, and such limits may lead to serious trouble (e.g., unanticipated bank failure). Developing statistical intuition should include the ability to judge when or to what extent probability reasoning/theory is even applicable to a given problem. Even if probability applies to “most things” you may not always choose to use it, just as a physicist does not always choose a quantum description of a system, even though such a description would “always” be applicable in principle.


I’ve been asking for clarification of what “success” means in this context. What gets dismissed as failure or incompetence I’ve come to see as misdirection.

I can tell you (having lived through 3 boom and bust cycles since the mid '90s) that a significant number of people in the financial markets knew there was a tech bubble (circa 1998–2002) and a housing bubble (circa 2005-2008), despite media, Federal Reserve and government declarations to the contrary at the time.

If you looked at the options series on a number of financial stocks (circa 2006-2007), you saw a price-implied probability density that assigned greater weight to large negative price moves than anyone in positions of authority accepted as possible. This was specifically apparent in the regional banks and subprime lenders.

If you then looked at the financial statements, you saw earnings being propped up by a reduction in loss reserves on a geographically concentrated real estate portfolio, often of questionable credit quality.

Had you pointed this out to “professionals”, you got dismissive, defensive rebuttals. Yet all of those corporations went bankrupt.

Does that count as “success”?

Further Reading

Mizrach, Bruce (2010). “Estimating implied probabilities from option prices and the underlying.” Handbook of quantitative finance and risk management. Springer, Boston, MA, 515-529. (PDF)
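For readers unfamiliar with the device: the standard way to extract such a price-implied density (covered in the Mizrach chapter above) is the Breeden-Litzenberger relation, in which the density is the discounted second strike-derivative of the call price. A minimal sketch, using Black-Scholes prices as stand-ins for market quotes; the parameter values are illustrative assumptions, not market data:

```python
import numpy as np
from math import erf, exp, log, sqrt

# Breeden-Litzenberger (1978): the risk-neutral density is the discounted
# second derivative of the call price with respect to strike,
#   f(K) = exp(r*T) * d^2 C / dK^2.
# Illustrative setup: Black-Scholes call prices stand in for market quotes.
S, r, T, sigma = 100.0, 0.02, 0.25, 0.4

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(K):
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

strikes = np.arange(50.0, 160.0, 1.0)
calls = np.array([bs_call(K) for K in strikes])

# Central second difference across the strike grid approximates d^2C/dK^2.
dK = 1.0
density = exp(r * T) * (calls[2:] - 2 * calls[1:-1] + calls[:-2]) / dK**2
mid = strikes[1:-1]

# The implied probability of a large drop (below 80 here) is the
# left-tail mass of the recovered density.
prob_below_80 = density[mid < 80].sum() * dK
print(f"implied P(S_T < 80) ~ {prob_below_80:.3f}")
```

With real quotes one would interpolate/smooth the option chain first; the crash-fear I described shows up as far more left-tail mass than any lognormal would produce.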

At the end of the prologue of Michael Lewis’ The Big Short, he writes:

It was then late 2008. By then there was a long and growing list of pundits who claimed they predicted the catastrophe, but a far shorter list of people who actually did. Of those, even fewer had the nerve to bet on their vision. It’s not easy to stand apart from mass hysteria–to believe that most of what’s in the financial news is wrong, to believe that most important financial people are either lying or deluded–without being insane.

The success here is by those who may have been tipped off by a probability model (and others who didn’t need a model to raise their suspicion) but then, as @R_cubed notes, proceeded to gather additional information to understand and interpret what the model was claiming, and find out if it made sense (dare I say…a narrative?). K^2 call this process figuring out “what is going on here”. You don’t simply turn a mathematical/computational crank - you have to understand and interpret the data and the model output (either of which could be misleading), which often requires gathering more and better data, along multiple lines of evidence. The same process is described by David Freedman in his famous “shoe leather” paper, which is heavily excerpted in this longer and very insightful essay (in which many of the ‘good’ examples do not use probability or statistical inference at all):

As we’ve already discussed, there were other failures during the housing bubble crisis that were not evident from the probability models (the failure of Northern Rock despite VaR is an example given by K^2). The successful users of probability models should consider them simply as fallible tools to help them think of more questions and provide directions for seeking more information and data, not turn-the-crank machines that provide risk management recipes free of shoe leather and critical thinking. And in this process heuristics and narrative economics have their role. The discussion on this thread has convinced me that the large world/small world heuristic given by K^2 is indeed a valuable one for understanding the limits and hazards of probability reasoning, but there may be others.


I just realized I had recommended the Freedman paper back in my May 11 post above; another paper mentioned there that is relevant to the immediate points made above is

I will merely point out that this quote from Freedman (which I take as representative of your views)

For thirty years, I have found Bayesian statistics to be a rich source of mathematical questions. However, I no longer see it as the preferred way to do applied statistics, because I find that uncertainty can rarely be quantified as probability.

is different from this:

The successful users of probability models should consider them simply as fallible tools to help them think of more questions and provide directions for seeking more information and data, not turn-the-crank machines that provide risk management recipes free of shoe leather and critical thinking.

No one (especially not me) argued for the latter. To do so would be engaging in the practice of “persuasion and intimidation” that Stark properly condemns. Unfortunately, I find much of “evidence based” rhetoric accurately described by his language.

I believe I’ve shown, via the example of extracting implied probabilities from options prices, that even decision makers at the highest levels, in a regime of incomplete markets, find valuable information in a subjective, betting-odds interpretation of probability that has no physical meaning.

Indeed, I consider precise probability models merely as a point of reference to compute the width of the range of interval probabilities (aka covering priors) worth considering.
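A minimal sketch of what I mean by that, using a family of Beta priors of my own choosing (the family, data, and grid are illustrative assumptions, not a recommendation): sweep the family, and report the range of posterior answers rather than a single number.

```python
# Robust-Bayes sketch: instead of committing to one prior, sweep a family
# of Beta(a, b) priors over a success probability and report the *range*
# of posterior means -- an interval probability, with the single precise
# model serving only as a point of reference inside it.

k, n = 7, 20  # observed successes out of n trials (illustrative data)

def posterior_mean(a, b):
    # Beta(a, b) prior + binomial likelihood -> Beta(a + k, b + n - k),
    # whose mean is (a + k) / (a + b + n).
    return (a + k) / (a + b + n)

family = [(a, b) for a in (0.5, 1, 2, 5) for b in (0.5, 1, 2, 5)]
means = [posterior_mean(a, b) for a, b in family]
lo, hi = min(means), max(means)
print(f"posterior mean ranges over [{lo:.3f}, {hi:.3f}]; width {hi - lo:.3f}")
```

The width of that interval is itself informative: if conclusions change materially across reasonable priors, the data alone have not settled the question.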

Related Thread


The cited quotes are different but 100% compatible. Freedman’s statement is not simply a critique of Bayesian probability but of all probability. In Sir David Cox’s reply to Leo Breiman’s “Two Cultures” paper (2001), he wrote:

I think tentatively that the following quite commonly applies. Formal models are useful and often almost, if not quite, essential for incisive thinking. Descriptively appealing and transparent methods with a firm model base are the ideal. Notions of significance tests, confidence intervals, posterior intervals and all the formal apparatus of inference are valuable tools to be used as guides, but not in a mechanical way; they indicate the uncertainty that would apply under somewhat idealized, maybe very idealized, conditions and as such are often lower bounds to real uncertainty. Analyses and model development are at least partly exploratory.

Cox may be too optimistic in light of the issues I raised in my 2019 paper (which many others have raised elsewhere and earlier than me), as well as situations where one is attempting to use models that work in only “small world” scenarios when the non-stationary aspects of the “large world” arena start to manifest. These are examples when formal models can become very disconnected from reality and potentially more misleading than other approaches. On the other hand, numerous examples from the history of science show how data can lead to scientific discovery without calling on a probability model at all, e.g.,

See also the Freedman paper I linked above, and my discussion of Max Planck’s work on so-called blackbody radiation in my 2017 paper with B. Gunter, cited above (I repeat the link here)

I withhold judgment on the implied probabilities method until I’ve looked further into it, but thanks to @R_cubed for providing concrete examples such as this one.


I propose a test of the implied probability method. If such a thing as Nickel options exist, simulate the performance of the implied probability method on nickel options for the weeks leading up to March 8, 2022…and then the immediately subsequent weeks.

What exactly are you testing? The actual closure of the nickel market on the LME would at best be a credit default event on Hong Kong Exchange and Clearing derivatives. The CDS market is over-the-counter (OTC) and there is no place to obtain centralized quotes.

It is possible that there are options traded on the parent corp that might have indicated something. There are no options on any American exchanges AFAIK. The problem is the parent corp has a number of exchanges, so the closure of this market might not have been material enough to cause traders to sell the shares or debt.

According to this, there have been 3 clearing house failures in the past 50 years globally.

Can a clearing house go bust?
They can, but it is very rare given there are several safety nets to burn through before that happens.

Using options to predict rare events is a well studied topic in financial economics:

It appears congratulations are in order to @R_cubed : I am not aware of an empirical failure of the option-implied probability method. It may require another significant financial crisis to truly test its capabilities. I speculate that it must have a failure mode similar to that of VaR, though perhaps it is a bit more robust to non-stationarity; my criticism remains conjectural at this point. I would still urge users of this method to keep their eyes, ears, and minds open to information that does not arrive in the form of securities prices but might indicate that deviations from the method’s assumptions are starting to manifest. Such users should actively and constantly seek such information - wear out some shoe leather - and not just leave the model on auto-pilot.

It is notable that the Harry Crane quote I cited in my May 22 post above does acknowledge that financial applications fall within the very narrow scope of what he considers fair game for probability calculations. This concession is far more generous than K^2 would allow, but Crane is still awfully pessimistic more generally, as am I.

For now, thanks to @R_cubed for a stimulating discussion and excellent sportsmanship. Hopefully our collective statistical intuition has strengthened as a result of this discussion.


In the literature, the implied probability density is known as the “risk neutral density”. Changes in information on actual probabilities or changes in risk preference can affect the output of the model.

There has been much written about the limitations of textbook decision theory in the economics and operations research literature. Textbook decision theory assumes probability estimates are independent of utility. I alluded to it in this post:

I speculate that changes in risk preference (i.e., utility) are related to those factors K^2 call “evolutionary” and are hard to put into words. Regardless, I. J. Good would consider changes in utility as information.

Derived entirely from logical considerations, option pricing models (theoretical and empirical) are a beautiful achievement of the De Finetti–Savage radical subjective Bayesian school of thought (which is closely related to the Cox–Jaynes maximum entropy perspective). They deserve greater attention if we wish to place applied stats in an information-theoretic perspective.

The true value of implied probabilities is to get an individual to think: “What do I know that has not been factored into the price?” I’ve found the device of betting games and their valuation extremely useful in detecting when the data analysis practices in entire fields of inquiry are so bad as to be called “pathological.”

For Further Study
Harry Crane on Fundamental Principle of Probability

Glenn Shafer on Testing by Betting as a method of scientific communication. This was presented a few years before his more formal paper read before the Royal Statistical Society.

A betting interpretation is superfluous for many applications of probability. For instance, many probabilities calculated in atomic/molecular and condensed matter physics are simply objective, frequentist probabilities that can be reproducibly measured in the lab. One example:

Crane’s FPP proposal has (at least) two problems. (1) It commits the fallacy of “naive probabilism” that Crane wrote eloquently about, 2 years after the above-linked presentation. As Crane wrote in 2020,

Real world problems require more than just thinking in bets. Most common sense is qualitative, thinking about plausibility and possibility long before precisely quantifying probabilities of any kind. Even in cases where the probabilities can be quantified, they should rarely be interpreted literally.

Probability calculations are very precise, and for that reason alone they are also of very limited use. Outside of gambling, financial applications, and some physical and engineering problems–and even these are limited–mathematical probability is of little direct use for reasoning with …

I thank @R_cubed for making me aware of Crane’s paper:

(2) FPP smacks of a lack of familiarity with the nuts and bolts of doing actual science. For instance, it is not possible to prespecify an exposure limit (Crane proposes 2 years). The proton mass article I linked above shows that people have been trying to measure it for at least 60 years. Galileo’s equivalence principle that the gravitational acceleration of a body should not depend on its composition, from the late 16th century, has been continuously subject to experimental tests through the present time.

Apollo 15’s astronaut David Scott tested it on the moon!

But no one is willing to say once and for all that Galileo won the bet that Crane would have had him make, as evidenced by the significant investment of money into launching the MICROSCOPE satellite mission discussed here.

A survey of earlier experimental tests is included in the original paper.

(I wouldn’t agree that MICROSCOPE was the “first” space test…though I suppose Commander Scott’s demo wasn’t seriously intended to achieve a level of precision even remotely comparable to the MICROSCOPE mission’s.)


Physics is the model that other disciplines aspire to. Physical theories predict the observable world accurately enough that there is no market to be made. Bets can only be made when there is disagreement on value but agreement on price. In this sense, physics is an “efficient market”, unlike anything in the so-called “social” sciences.

But should someone challenge well established reproducible findings, who would you put money on?

FPP smacks of a lack of familiarity with the nuts and bolts of doing actual science.

If by “actual science” you mean “physics”, it does not have the problems that other disciplines wishing to borrow its credibility have – making claims that do not stand up to empirical examination. This is what the FPP is intended to correct. It follows from information theoretic considerations as I’ve mentioned in a number of other threads. In the areas where there is a divergence of opinion in physics, the challenge is that the economic cost of the procedure which would decide the issue is beyond the budget of interested parties.

Your point about the arbitrary exposure time limit is a reasonable criticism. But one experiment of short duration shouldn’t decide the issue. Nothing would stop others from betting on longer time frames. In this case, the information is unknown, not disputed.

In the proton radius* example I’ve been discussing, the CREMA group did challenge well established, reproducible findings made over 60 years. Previous experimental measurements of proton radius were based on two classes of methods: hydrogen spectroscopy and electron scattering. CREMA introduced a variation on spectroscopy by using muonic hydrogen and, almost by accident, decided at the last minute (and deviating from the prespecified plan) to also scan parameter regions that had been thought to have been ruled out by previous experiments. As a result they reported an estimate lower than the internationally accepted value. At this point, nobody should have been betting on anything. Instead, people did the obvious - make new attempts using the established methods…and lo and behold, their results started to move towards the lower value! Nonetheless the CREMA result was excluded from the internationally accepted “best fit” estimate for several cycles of updates. Except for this last nuance, which may reflect “groupthink” more than anything else, this is precisely how such discrepancies ought to be resolved. No betting needed.

*I mistakenly referred to this as “proton mass” in some of my earlier posts.

In experimental life sciences, an even more interesting example is provided by Lithgow, Driscoll, and Phillips, who represent competing labs working on roundworm lifetime studies. The paper is short and nontechnical, so I won’t summarize it here.

Yet another example is discussed in the 2017 article by Gunter and me, linked above, where we write:

Consider this: to get a better handle on the “reproducibility” of published research, the Reproducibility Project has set about trying to reproduce some prominent results in cancer biology. At the time of writing, they claim that four of seven reports were reproduced. However, Jocelyn Kaiser reports in Science that: “Some scientists are frustrated by the Reproducibility Project’s decision to stick to the rigid protocol registered with eLife for each replication, which left no room for troubleshooting.” In other words, pre-specification is good for inference, but bad for science.

FPP also requires the original authors to prespecify the protocol for any replication attempt. Thus a third problem with FPP is that it does not show how discrepancies can and should be resolved, while the examples I posted here do. FPP focuses on a very narrow vision of “replicability” driven by statistical inferences, which doesn’t seem to respect the lesson he himself presents as Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” Instead, scientists should focus on the nuts and bolts, as the proton radius and roundworm lifetime examples illustrate. More promising (to me) institutional procedures to stimulate/incentivize appropriate attitudes include the following:

(Incidentally, the Reproducibility Project is now complete, and the results are not particularly encouraging. Reproducibility Project: Cancer Biology | Collections | eLife )

FPP would not be applicable to much of the social sciences for a different reason - these sciences (e.g., much of empirical economics) depend on observational data. “Replication” in these settings will always be muddied by confounding - a new population would need to be sampled, differing in time, geography, market conditions, or other factors. Following exactly the same replication protocol could be expected to produce different results, and it won’t be clear whether the difference is due to confounding or to fundamental flaws of the original study. FPP is a non-starter for such research (though the same critique would likely hold for the Mogil/Macleod proposal I linked above).

Finally, it should be reiterated that Crane’s critique of Naive Probabilism (which he wrote subsequent to the FPP paper) is basically what I’ve been railing about on this thread from the beginning (though I didn’t know the term until @R_cubed helpfully made me aware of it; I’d been using Taleb’s related but more obscure term “the ludic fallacy”). I quote Crane’s first paragraph.

Naive probabilism is the (naive) view, held by many technocrats and academics, that all rational thought boils down to probability calculations. This viewpoint is behind the obsession with ‘data-driven methods’ that has overtaken the hard sciences, soft sciences, pseudosciences and non-sciences. It has infiltrated politics, society and business. It’s the workhorse of formal epistemology, decision theory and behavioral economics. Because it is mostly applied in low or no-stakes academic investigations and philosophical meandering, few have noticed its many flaws. Real world applications of naive probabilism, however, pose disproportionate risks which scale exponentially with the stakes, ranging from harmless (and also helpless) in many academic contexts to destructive in the most extreme events (war, pandemic).

Nothing would stop others from betting on longer time frames.

But the original authors would not be obligated to be counterparties to such bets? After all, the original authors are subject to an exposure limit.

While your digressions into the history of physics are interesting, I again see you falling into the pattern noted above:

I take this aversion to probability and decision theory as letting the perfect be the enemy of the good, much like the nihilist who argues that because there is the possibility of error, no knowledge is possible.

My question: how could a betting-markets conception of stats and probability be worse than what occurs in large areas of science now? Fisher [1] himself used game-theoretic arguments to justify procedures he invented. Betting arguments have been critical for probability theory.
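The canonical such argument is the Dutch book: quoted “probabilities” that violate the probability axioms expose the quoter to a guaranteed loss. A minimal sketch (the event and the prices are illustrative):

```python
# Dutch-book sketch: suppose someone posts prices for unit bets on the
# mutually exclusive, exhaustive outcomes of an event (a bet on outcome o
# costs quotes[o] and pays 1 if o occurs). If the prices do not sum to 1,
# the quoted "probabilities" are incoherent and a sure profit exists
# against them, whatever actually happens.

quotes = {"home": 0.50, "draw": 0.30, "away": 0.30}  # sum = 1.10 > 1

def sure_profit(quotes, outcome):
    """Net result of the arbitrage for any realized outcome."""
    total = sum(quotes.values())
    if total > 1.0:
        # Sell one unit bet on every outcome: collect all the prices,
        # pay out 1 on whichever single outcome occurs.
        return total - 1.0
    elif total < 1.0:
        # Buy one unit bet on every outcome: pay all prices, receive 1.
        return 1.0 - total
    return 0.0  # coherent book: no arbitrage

profits = {o: sure_profit(quotes, o) for o in quotes}
print(profits)  # a sure profit of ~0.10, whichever outcome occurs
```

This is the sense in which betting arguments ground probability: coherence (immunity to a Dutch book) is exactly equivalent to obeying the probability axioms.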

A primary researcher of this topic of information synthesis via betting markets is Robin Hanson from George Mason University. Prior to earning a doctorate in economics, his academic background includes a graduate level (M.S.) degree in physics and philosophy of science.

In discussing the design of information markets, he writes

Information markets are institutions for accomplishing this task via trading in speculative markets, at least on topics where truth can be determined later. Relative to simple conversation, it is harder to say any one thing by market trading, and there are fewer things one can say. When people do not know whom to believe, however, the fact that a trader must put his money where his mouth is means that the things people do say in information markets can be more easily believed. Those who know little are encouraged to be quiet, and those who think they know more than they do learn this fact by losing.
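Hanson’s own mechanism for such markets is the logarithmic market scoring rule (LMSR). A minimal sketch of how a trade moves the market probability and charges the trader for the move (the liquidity parameter and trade size are illustrative):

```python
import math

# Hanson's logarithmic market scoring rule (LMSR): the market maker's
# cost function is C(q) = b * log(sum_i exp(q_i / b)) over outstanding
# shares q, and a trader moving the shares from q to q' pays C(q') - C(q).
b = 10.0  # liquidity parameter: larger b means cheaper to move prices

def cost(q):
    return b * math.log(sum(math.exp(x / b) for x in q))

def prices(q):
    # Current market probabilities: the gradient of the cost function.
    z = sum(math.exp(x / b) for x in q)
    return [math.exp(x / b) / z for x in q]

q = [0.0, 0.0]  # shares outstanding on (yes, no); market starts at 50/50
print(prices(q))  # [0.5, 0.5]

# A trader who believes "yes" is likelier buys 5 "yes" shares, paying the
# cost difference -- putting money where their mouth is, as Hanson says.
q_new = [5.0, 0.0]
trade_cost = cost(q_new) - cost(q)
print(prices(q_new), round(trade_cost, 3))
```

A nice property is that the sponsor’s worst-case subsidy is bounded (b·log 2 for a binary market here), so “those who think they know more than they do learn this fact by losing” at bounded cost to the market’s operator.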

Getting back to stat intuition – I’ve found nothing clarifies my thought more than considering whether a proposition or procedure is worth betting a day’s or a week’s pay on.

It is worthwhile for me to figure out how to derive a proof sketch, based on betting-type arguments among rational traders, of why the practices in the following thread are wrong:

Further Reading

  1. Senn, S. (1994), Fisher’s game with the devil. Statist. Med., 13: 217-230.

I would rephrase it as the Successful is the enemy of the merely Conjectural. The examples I gave show how troubleshooting in the wake of an apparently “failed” replication can lead to the production of new knowledge and scientific advance. The Lithgow et al roundworm lifetime example (which is from biology, not physics) provides a template others can follow; so does the proton radius example (minus the unfortunate groupthink aspect). In other words, Freedman’s “shoe leather” works. If anyone has a concrete example/case study of FPP creating new knowledge from a situation of disputed or failed “replication”, bring it forth. I note that Harry Crane (who proposed FPP) and Ryan Martin’s Researchers.One platform provides no FPP mechanism to its authors and reviewers, although it handles other financial transactions (paper submission fee, donations).