Necessary/recommended level of theory for developing statistical intuition

There are several good options; how useful each one is will depend on your experience.

  • Mike X. Cohen: Linear Algebra: Theory, Intuition, Code (2021): I think this one would be pretty good for self-study but I found it a little basic (only in the sense that there isn’t much depth). I really liked the author’s explanations for matrix multiplication, though. He also includes Matlab and Python code for various problems. This might be best for an absolute beginner or if you just need to quickly re-learn things. It is a very quick read, and quite cheap; definitely worth the money in my opinion.
  • Gilbert Strang: Introduction to Linear Algebra, 5th Edition (2016): I haven’t read the whole thing but you probably can’t go wrong with Strang’s book. This one also goes along with the MIT OCW courses (18.06 is the standard undergrad course; 18.700 is the theory/proof-based one). He has a few different books so I don’t know which is the most recent, or what the differences are among them. This might be the one you want if you learn best in a classroom-like atmosphere, i.e., if you decide to complete the course through OCW/edX. All lectures are posted on YouTube.
  • David C. Lay et al.: Linear Algebra and its Applications, 5th Edition (2014): I also haven’t read this but I always see this listed on Reddit or StackExchange in threads asking for recommendations.
  • Jim Hefferon: Linear Algebra: this is a free online textbook that I haven’t read but I’ve heard good things about it. The link to Leanpub actually lets you download the whole book when you click on “Sample”, but the author also accepts money for it.
  • James Gentle: Matrix Algebra: Theory, Computations and Applications in Statistics, 2nd Edition (2017): this is also not a (standard) textbook, nor is it introductory. But it is quite good as an intermediate-level reference. The author also has a whole Part dealing with numerical methods. I really liked this one but I don’t think it’s good for beginners, and perhaps not the best for self study.
  • Boyd & Vandenberghe: Introduction to Applied Linear Algebra – Vectors, Matrices, and Least Squares (2018), which I think is more advanced (perhaps intermediate). It is freely available at the above link. Of interest, Gilbert Strang’s review says: “This book explains the least squares method and the linear algebra it depends on - and the authors do it right!”. But some reviews mention that it isn’t great for self study.

And if you’d like another intermediate text with excellent reviews, you could check out Axler’s Linear Algebra Done Right. It is very popular among mathematicians, but from what I understand it is not meant for someone interested primarily in applications (e.g., he doesn’t “do” matrix computations). I own it but haven’t gotten around to going through it yet. [Incidentally, MIT now uses it for their proof-based linear algebra course mentioned above.]

Finally, you could always search for threads on Reddit or StackExchange, such as this one which explicitly asks about statistics.

5 Likes

Thanks! The Mike X Cohen text seems perfect for my purposes, as does the Basilevsky text recommended above. (I took some linear algebra classes in undergrad years ago, but I am effectively starting from scratch.)

He has an accompanying free video course for the book too.

3 Likes

Thank you for responding, and I agree with you here. I suppose I should have clarified that I wasn’t implying a PhD-level theory background is necessary (unless measure theory is typically required for MS students as well; in that case perhaps it would be useful, but certainly not necessary for applied work).

I again agree completely with what you wrote. It is one of the reasons I enjoy the RMS text so much. In my view, the text seems to be at least as much of a “how to do X” text as it is a “when not to do A, B, and C (and why)” text (although @f2harrell can refute this view :slightly_smiling_face:). The latter – knowing, understanding, and being able to identify pitfalls/issues with certain approaches – is where I believe a strong foundation in (some of) the theory comes in. It is also, I think, at least as important as knowing which procedures/methods to use; e.g., in my earlier days as a reviewer, my critical appraisals of manuscripts’ Statistical Methods sections weren’t as good as they are now.

And thank you for sharing those articles. I again agree that reading about the history of science is important and I generally try to follow the path from, say, probability’s development from being used for gambling purposes to the current state of affairs.

A good retelling of probability’s development from the 17th century gambling settings to modern statistical analysis is given by Herbert Weisberg’s Willful Ignorance: The Mismeasure of Uncertainty (Wiley, 2014). Weisberg is a winner of the ASA Statistical Consulting award, but the title of his book makes clear that his perspective is very critical.

An even stronger case against what I call the Bernoulli-Fisher ideology (the belief/wish that empirical data can usefully be modeled using the mathematics of games of chance) is given by John Kay and Mervyn King, Radical Uncertainty: Decision Making Beyond the Numbers (W. W. Norton, 2020). Both Weisberg and K^2 remind us that 100 years ago, the economists Frank Knight and John Maynard Keynes emphasized the distinction between what K^2 call “resolvable” uncertainty and “radical” uncertainty. Borrowing terms from Jimmie Savage, they argue that in “small world” problems - those that can be usefully modeled using the stochastic methods of statisticians - most of the uncertainty is resolvable. These are problems where the probability space can be defined - in principle, there is a set of all possible outcomes, and a CDF can be imagined for them, even if its parameters or even its shape are unknown. In “large world” problems, the part of uncertainty captured by statistical model inferences is only a fraction (of unknown magnitude) of the total uncertainty, which cannot be quantified. And most of us live in (and analyze data from) the “large world”. I can’t make the argument as eloquently as they can, but the bottom line is that much of the time “we” (statisticians and users of their methods) are committing Nassim Nicholas Taleb’s “ludic fallacy”. For example, consider time series forecasting of the price of nickel on the London Metal Exchange. In March of this year, the LME halted nickel trading for over a week - and canceled the previous 8 hours of trading! Perhaps you could have forecast the halt (though no such halt appears in the training data - the LME has been criticized for lacking the guardrails that other financial markets have), but not the cancellation of $3.9 billion of uncoerced trading.

A wholly different line of attack on the Bernoulli-Fisher ideology is given by Sander Greenland’s lectures that I linked above, and I won’t try to recapitulate his ideas here. (Incidentally, my made-up term, “Bernoulli-Fisher ideology”, is deeply unfair to both Bernoulli and Fisher, who never foresaw the excesses that the profession that honors them has committed in their wake.)

These arguments apply equally to frequentist, Bayesian, & likelihoodist inferential frameworks. (Remember, the likelihood principle has an important caveat: “given the model”.) Thus the eternal disputes among the schools of inference are, in my view, a massive and pointless distraction from the real decision - is it even reasonable to entertain stochastic modeling for a given data set? In most cases my personal answer is that such models are useful as descriptive and exploratory tools, but the inferences derived from such models are delusional. And thus, to cycle back to the original question, statistical theory is not as useful to me as knowledge of the history of science, engineering, and medicine (especially the examples that did not use “statistics” - a few of which are discussed in the links below).

Some of my more polished views on several of these points appear elsewhere:
https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2017.01057.x

4 Likes

As I prepare to take one of the entry-level actuarial exams later in the year (i.e., Probability), I wonder where these professionals would fit in? A significant amount of effort is devoted to modelling those uncertainties via extreme value theory, for example. Because the organizations that employ them bear the cost of errors when setting prices, the silliness seen in typical social science applications is absent.

They would epitomize what Taleb and Harry Crane describe as having “skin in the game.” Crane in particular has proposed the radical notion that reported probabilities from models aren’t “real” unless someone who makes such a claim is willing to make a wager at those implied odds. When looked at from an information-theoretic standpoint, this is only taking J. L. Kelly’s 1956 paper “A New Interpretation of Information Rate” literally. I find substantial value in that point of view.
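To make the Kelly connection concrete, here is a minimal sketch (my own illustration; the function name and the numbers are invented, and none of this comes from Kelly’s paper or Crane’s work): it converts a reported probability into its implied fair odds, and asks how much of your bankroll Kelly’s criterion says you should be willing to stake at those odds, given what you actually believe.

```python
# Minimal sketch of the "a reported probability implies a wager" idea via Kelly's criterion.
def kelly_fraction(q: float, decimal_odds: float) -> float:
    """Optimal bankroll fraction for a bet paying (decimal_odds - 1) per unit staked,
    given a believed win probability q. A non-positive value means: don't take the bet."""
    b = decimal_odds - 1.0                      # net odds received on a win
    return (b * q - (1.0 - q)) / b

reported_p = 0.20                               # someone reports a 20% probability
implied_odds = 1.0 / reported_p                 # fair decimal odds of 5.0
print(kelly_fraction(q=0.20, decimal_odds=implied_odds))   # 0.0: no edge at your own fair odds
print(kelly_fraction(q=0.30, decimal_odds=implied_odds))   # 0.125: you'd stake 12.5% of bankroll
```

If the person reporting the probability is unwilling to stake anything at the odds their own number implies, that is exactly Crane’s point.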

Crane, Harry (2018) The Fundamental Principle of Probability (preprint).

Crane, Harry (2018) Is Statistics Meeting the Needs of Science? (preprint)

Crane, Harry (2020) Naive Probabilism (preprint)

Shafer, Glenn (2021) Testing by betting: A strategy for statistical and scientific communication (link)

1 Like

Thanks for replying and for suggesting Weisberg’s book; it looks quite interesting. Speaking of probability/statistics texts, might you (or any other readers, of course) have any thoughts on some of these:

  • Jaynes’ probability text, which I mentioned in the first post. From what I have seen it is quite a good book but some consider it controversial.
  • Savage: The Foundations of Statistics (1972) – your mention of Savage reminded me again to look into this book. It seems to be well regarded, and being a Dover book is quite affordable.
  • Clayton: Bernoulli’s Fallacy (2021) – I don’t recall where I learned of this book (perhaps The American Mathematical Monthly). If you haven’t already, you can read E. J. Wagenmakers’ review which, by itself, may have convinced me to purchase this text.

There are many other books that look interesting to me but I would end up spending too much time trying to find books instead of actually reading and studying probability & statistics!

P.S. I had previously read the paper you posted from The American Statistician but shall revisit it (along with other articles in the special issue). Very thought-provoking, and your list of references is impressive. There are many papers you cite that look quite worthwhile. Thanks for your helpful posts!

1 Like

Thanks for the suggestion (I meant to reply sooner); I have heard good things about Ross’s text. Are you referring to Introduction to Probability Models, specifically? He also has a textbook called A First Course in Probability (and Introduction to Probability and Statistics for Engineers and Scientists, although I assume you weren’t thinking of this one).

Knight (and possibly Keynes) viewed insurance as one area that may be manageable using a probability framework, according to Faulkner et al. (2021), whom I quote:

“Knight argues that the first type of probability [resolvable uncertainty in the K^2 parlance] ‘practically never’ (Knight, 1921, p. 225) arises in business but allows that there are situations in which statistical probabilities can sometimes be estimated. These are the cases in which it is then possible to achieve a kind of certainty by grouping cases: ‘an uncertainty which can by any method be reduced to an objective, quantitatively determinate probability, can be reduced to complete certainty by grouping cases’. The obvious example here is insurance, which works by ‘dealing with groups of cases instead of individual cases’ and allows risks to be transferred to the insurer in a way that leaves both the insured and the insurer better off.”

Faulkner et al. (2021). F. H. Knight’s Risk, Uncertainty, and Profit and J. M. Keynes’ Treatise on Probability after 100 years. Cambridge Journal of Economics, 45: 857-882. (This is the introduction to a special issue celebrating the centenary of the Knight and Keynes books discussed above.)

I have Savage’s book. K^2 borrow the “small world” and “large world” (Savage used “grand world”) terms from Savage, but put their own spin on them. At the same time, K^2 provide a blistering criticism of Savage’s subjective expected utility model in economics. (K^2 do not just attack statistics but much of economics as well.)

I am not familiar with Jaynes’ book, but as a physicist I have an awareness of (and perhaps some sympathy for) his maximum entropy framework, which seems worthy of consideration. However, I do not ultimately agree with him that probability should be “the” logic of science. At best, it is one of several frameworks that can be useful (and can also be misleading) in the scientific enterprise. There are simply too many successful examples from the history of physics itself where probability plays no role in scientific discovery and the interpretation of data.

I’ve read several reviews of Clayton but not the book itself. Criticisms of frequentism are entertaining and welcome (the Significance piece I coauthored with B. Gunter, linked above, is yet another anti-frequentist screed) but the solution is not Bayesianism or Likelihoodism. Both of the latter are founded on the Likelihood Principle, which requires (just as frequentism does) pre-specification of the statistical model (as Diaconis points out in his ‘magical thinking’ article). However, in exploratory work we need the ability to iteratively improve the fit of the model to the data, which invalidates the nominal probability properties of any resulting inference (whatever school of thought you use).

What’s common among all 3 books is that they do not attempt to move probability away from the heart of statistics. I’m trying to argue that most of the time we’re committing the ludic fallacy and we should take these stochastic models much less seriously. They are useful as tools to explore and summarize data, but not for drawing conclusions and making decisions, except in a very narrow set of conditions.

Thank you for starting this very stimulating thread.

3 Likes

I can attest to the greatness of the books by Jaynes, Ross (Probability Models), and Clayton.

1 Like

I enjoyed your American Statistician article much more than expected. I’d especially recommend reading this citation:

Nelder, J.A. (1986). Statistics, Science and Technology. Journal of the Royal Statistical Society: Series A (General), 149: 109-121. (JSTOR)

> There are simply too many successful examples from the history of physics itself where probability plays no role in scientific discovery and the interpretation of data.

I recall that in one of Richard Feynman’s Lectures on Physics, he pointed out that pushing the frontiers of physics is truly hard, because every approach that had succeeded historically had already been tried first. That suggests to me not to draw too many conclusions from the early history of physics, but I might be mistaken about that.

> …but the solution is not Bayesianism or Likelihoodism. Both of the latter are founded on the Likelihood Principle, which requires (just as frequentism does) pre-specification of the statistical model (as Diaconis points out in his ‘magical thinking’ article). However, in exploratory work we need the ability to iteratively improve the fit of the model to the data, which invalidates the nominal probability properties of any resulting inference (whatever school of thought you use).

This is a very subtle point that, despite having watched Feynman’s lectures on physics and read a number of his writings on scientific method, I can appreciate much more deeply having read your article.

I think this is the point (which physical scientists seem to take for granted) that Feynman was attempting to express in his famous speech, Cargo Cult Science.

The only other place where I saw it explicitly stated (but again missed out on the subtle implications) was in the classic Multiple Comparison Procedures by Hochberg and Tamhane. They emphasized that unless error probabilities are asserted before the data are seen, no error rates are defined; and by implication, no information is communicated.
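A quick Monte Carlo sketch of that point (my own illustration, not taken from Hochberg and Tamhane): if the reported “test” is chosen after seeing the data - say, the most significant of ten null comparisons - the nominal 5% error probability no longer describes anything.

```python
# Simulate m independent two-sample comparisons with no true differences,
# then cherry-pick the smallest p-value as if it were a single pre-specified test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, n, reps = 10, 30, 5_000
false_positives = 0
for _ in range(reps):
    x = rng.normal(size=(m, n))                       # all nulls true
    y = rng.normal(size=(m, n))
    pvals = stats.ttest_ind(x, y, axis=1).pvalue
    false_positives += pvals.min() < 0.05             # report only the "best" result

print(false_positives / reps)   # roughly 1 - 0.95**10 ~ 0.40, not the nominal 0.05
```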

There is a completely different way to look at this. Iterative model improvements should carry along uncertainties. If you, for example, forget that you allowed the data to be non-Gaussian at one point, the Bayesian uncertainty intervals will be too wide.
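One way to picture “carrying the uncertainty along” - a minimal sketch of my own, assuming a PyMC-style probabilistic programming workflow rather than any particular workflow endorsed above - is to put a prior on the degrees of freedom of a t error distribution instead of deciding Gaussian-versus-not after inspecting the data:

```python
# Sketch: let the posterior carry uncertainty about "how non-Gaussian" the errors are.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(1)
y = 10 + 2 * rng.standard_t(df=4, size=100)    # simulated data, heavier-tailed than Gaussian

with pm.Model():
    mu = pm.Normal("mu", 0, 100)
    sigma = pm.HalfNormal("sigma", 10)
    nu = pm.Exponential("nu", 1 / 30)          # uncertainty about the tail behaviour
    pm.StudentT("obs", nu=nu, mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)

# The interval for mu now reflects uncertainty about the error distribution itself,
# rather than a post-hoc either/or choice between a Gaussian and a t model.
print(az.summary(idata, var_names=["mu", "nu"]))
```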

1 Like

Good that you brought up Feynman - he explicitly dismissed what we now call HARKing (hypothesizing after results are known) in his book The Meaning of It All, as discussed in Sec. 8 of Gerd Gigerenzer’s classic paper “Mindless Statistics”.

Having said that, transparently HARKing is a hallmark of good science because it leads you to do new research: https://journals.sagepub.com/doi/full/10.1177/0149206316679487

Regarding the history of physics, see the article by Seife (2000) I cited in an earlier post on this thread for many examples in the recent history of physics where “statistically significant” results “crumbled to dust”, as the author described, due to the later discovery of systematic errors. That paper is 20 years old, and a number of additional examples of statistical methods leading researchers astray have arisen since then, such as the controversy over the proton mass, described very nicely in this piece from Physics World.

2 Likes

I might add that the recent imaging of black holes, which made the front pages of the newspapers, is an example where the physicists were explicitly anti-Bayesian - they went to extraordinary lengths not to let their prior expectations influence the analysis; this lent greater credibility to the final result.

1 Like

In retrospect, the last two articles I posted, the proton mass article and the black hole article, are not about statistics but about groupthink. In the proton mass case, groupthink led to unwise data exclusion. The black hole article shows that there are formal procedures one can take to manage and mitigate, if not eliminate, the influence of groupthink, in that case from prior expectations or knowledge. John Tukey (1972) observed “[T]he discovery of the irrelevance of past knowledge to the data before us can be one of the great triumphs of science.” (Q. of Applied Math, 30: 51-65). In the black hole case, the prior expectations turned out to be “right”, despite the efforts to prevent it from biasing the analysis; in the proton mass case, the prior knowledge is looking very shaky.

While Bayesian priors are more obviously subject to the effects of groupthink, even frequentist analysis can be undermined by it (e.g., researcher degrees of freedom, forking paths, and all that). Statisticians have tools to help manage groupthink - mainly on the data production side (e.g., randomization, blinding, and concurrent control all help prevent the investigators from fooling themselves) - but on the analysis side, perhaps our statistical intuition is underdeveloped? Wagenmakers et al.’s proposal earlier this week, to have many independent statistical analysis teams for the same data set, mimics one of the procedures used by the black hole team.

https://www.nature.com/articles/d41586-022-01332-8

1 Like

> The black hole article shows that there are formal procedures one can take to manage and mitigate, if not eliminate, the influence of groupthink, in that case from prior expectations or knowledge.

While I have to study those examples more closely, I think the fundamental issue is the appropriate language for model uncertainty (i.e., probabilities about models that themselves output probabilities).

For a Bayesian, it is probabilities all the way down, as Frank’s post shows. The philosopher Richard Jeffrey also held that view.

Others, like Arthur Dempster and Glenn Shafer, are not convinced that all uncertainty can be expressed by a single probability number. Their work led to Dempster-Shafer theory, also known as the “mathematical theory of evidence”, and is closely related to the notion of “imprecise” (aka interval) probabilities.
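For a flavor of what non-additive beliefs look like in practice, here is a toy Dempster-Shafer example (the combination rule is the standard textbook form; the two-element frame and the mass assignments below are invented purely for illustration):

```python
# Dempster's rule of combination on a tiny frame, yielding a belief/plausibility
# interval for a hypothesis rather than a single probability number.
from itertools import product

FRAME = frozenset({"rain", "dry"})

def combine(m1, m2):
    """Multiply masses of intersecting focal sets and renormalize by 1 - conflict."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

def belief(m, h):        # mass fully committed to subsets of the hypothesis
    return sum(w for s, w in m.items() if s <= h)

def plausibility(m, h):  # mass not committed against the hypothesis
    return sum(w for s, w in m.items() if s & h)

m1 = {frozenset({"rain"}): 0.6, FRAME: 0.4}                           # source 1 leaves 0.4 undecided
m2 = {frozenset({"rain"}): 0.3, frozenset({"dry"}): 0.2, FRAME: 0.5}  # source 2
m = combine(m1, m2)
rain = frozenset({"rain"})
print(belief(m, rain), plausibility(m, rain))   # ~0.68 and ~0.91: an interval, not one number
```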

The work in this area is more likely to be published in symbolic logic journals than in applied statistics journals, but the tools developed there are likely (IMO) to resolve these philosophical disputes productively and lead to rigorous “statistical thinking”, as opposed to the all-too-common statistical rituals now at epidemic proportions.

Here are some interesting papers for the philosophically inclined. I’d start with the first one, and then the others for more formal development and justification.

  1. Crane, H. (2018) Imprecise probabilities as a semantics for intuitive probabilistic reasoning (link)
  2. Crane, H; Isaac, W (2018) Logic of Typicality (link)
  3. Crane, H. (2018) Logic of Probability and Conjecture (link)

For applications, the following are interesting; number 2 is a mathematical formalization of the argument made by @Sander in numerous threads that Bayesian probabilities are too optimistic if we accept the idea that these models capture all uncertainty. Martin uses the absence of any truly “non-informative” prior to prove that any additive system of representing beliefs runs the risk of false confidence. Formal discussion at the meta-level (i.e., model criticism) can be productively done in the realm of non-additive beliefs.

  1. Martin, R (2021) An imprecise-probabilistic characterization of frequentist statistical inference (link)
  2. Martin, R (2019) False confidence, non-additive beliefs, and valid statistical inference (link)
  3. Martin, R (2021) Valid and efficient imprecise-probabilistic inference across a spectrum of partial prior information (link)
  4. Balch, M, Martin R. Scott, F. (2019) Satellite conjunction analysis and the false confidence theorem (link)

Christian P. Robert gives his opinion here:

1 Like

Thank you for these references, especially those to Crane (including from your earlier post) and the false confidence papers. I was not aware of this work, and I find I am sympathetic to much of Crane’s thinking (though I deviate sharply on a few points - I will need to start reading more of his corpus). Phenomena analogous to “probability dilution” are fairly routine in my work, and like Kay and King, and possibly Crane, I do not agree that it is the duty of statistics/computation/math to “solve” such problems by finding ways to quantify uncertainty, even if imprecisely. Neither my personal experience working with scientists for 20+ years, nor my reading of the history of science, is consistent with Martin’s assertion, “I contend that scientists seek to convert their data, posited statistical model, etc., into calibrated degrees of belief about quantities of interest.” No scientist I’ve ever worked with has even behaved in such a manner, much less directly asked for this from me.

Perhaps because I’ve worked on the applied side of science, my collaborators are instead interested in building an evidence base for making decisions, including experimentally evaluating alternatives, under limited resource constraints. My role has been to help them design studies, and to describe/summarize and interpret the resulting data. I rarely find it warranted, necessary, or useful to include “probability as uncertainty” claims (including statistical inferences of any flavor) in advancing that goal, except in the narrow situations I allude to in my 2019 paper, which form a minority of what crosses my desk.

1 Like

After studying the Carmichael and Williams exposition, and their references to Fieller’s theorem and the Gleser-Hwang theorem, I find the False Confidence result less surprising.

Liseo, Brunero. (2003). Bayesian and conditional frequentist analyses of the Fieller’s problem. A critical review. Metron - International Journal of Statistics. LXI. 133-150. (link)

After showing the limitations of frequentist and non-informative Bayesian methods, the author

> …adopts a robust Bayesian approach to show that it is nearly impossible to end up with a reasonable solution to the problem without introducing some prior information on the parameters
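To see why the Fieller/Gleser-Hwang setting makes the false confidence result plausible, here is a small numeric sketch (the numbers are illustrative only, not taken from Liseo or the false-confidence papers): the nominal 95% Fieller confidence set for a ratio of means can be a bounded interval, the complement of an interval, or the entire real line, depending on how close the denominator estimate is to zero.

```python
# Fieller's 95% confidence set for mu1/mu2 from estimates a, b with variances v11, v22
# and covariance v12: solve (a - theta*b)^2 <= z^2 * (v11 - 2*theta*v12 + theta^2*v22).
import numpy as np

def fieller_set(a, b, v11, v22, v12=0.0, z=1.959964):
    A = b**2 - z**2 * v22
    B = -2.0 * (a * b - z**2 * v12)
    C = a**2 - z**2 * v11
    disc = B**2 - 4.0 * A * C
    if disc < 0.0:
        return "the entire real line" if A < 0 else "empty set"
    r1, r2 = sorted(((-B - np.sqrt(disc)) / (2 * A), (-B + np.sqrt(disc)) / (2 * A)))
    if A > 0:
        return f"bounded interval ({r1:.2f}, {r2:.2f})"
    return f"complement of an interval: (-inf, {r1:.2f}] U [{r2:.2f}, inf)"

print(fieller_set(a=5.0, b=4.0, v11=1.0, v22=1.0))   # denominator well away from 0
print(fieller_set(a=5.0, b=0.5, v11=1.0, v22=1.0))   # denominator near 0: two infinite rays
print(fieller_set(a=0.5, b=0.5, v11=1.0, v22=1.0))   # both near 0: nominal 95%, zero information
```

The last case is the troubling one: a procedure with exact nominal coverage that, in the realizations where the data are least informative, reports a set carrying no information at all.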

Upon reflection, I think the philosophical claims are too strong. I see no reason why the notion of an imprecise probability cannot be embedded in a broader Bayesian framework. A decision-theoretic approach to the design of experiments implies uncertainty about the “true” probability distribution, and hence an imprecise probability, and imprecise probabilities are one way of conducting a robustness analysis of Bayesian methods. Still, the paper was very interesting.

To bring this discussion back to the original point about statistical intuition, Crane’s comment, in his “Naive Probabilism” article that you posted earlier, is crisp and emphatic:

“Probability calculations are very precise, and for that reason alone they are also of very limited use. Outside of gambling, financial applications, and some physical and engineering problems – and even these are limited – mathematical probability is of little direct use for reasoning with uncertainty.”

I think Kay & King (2020) are basically making the same point in a more long-winded (but more compelling) way. In my view, an important part of statistical intuition is the quality of judgment about whether a probability calculation could be both valid and useful. Statisticians tend to overdo it; frankly so do many physicists, perhaps because physics has so many cases where probability calculations are valid and useful. In his 1995 autobiography, Gen. Colin Powell wrote that “experts often have more data than judgment” and he was right.