Glossary of Statistical Terms

f2harrell · January 10, 2019, 5:17pm

A glossary of statistical terms, intended primarily for non-statisticians, may be found here. The purpose of this topic is to provide a place for suggested improvements to any of the existing definitions, or to add new definitions for terms not already covered in the glossary. Feel free to propose your own definitions. You will be credited for material that is added to the glossary.

SameeraDaniels · January 11, 2019, 5:01am

Thank you Frank. Very helpful for review and revision purposes.

RBrinks · January 11, 2019, 9:28am

Thanks Frank, this is great and helps me (and possibly my students) a lot.
Pls allow a comment on the item “rate”, which you write is unconstrained. As far as I see it, a rate is always non-negative.
As an aside, I appreciate that you see a rate as something “unit per time” and that it shouldn’t be confused with probabilities. In the definition of type I and type II errors, you use “false positive rate” and “false negative rate” as synonyms. I guess in this sense, you interpret type I and type II error “rates” as probs (sry for nitpicking).

Thanks and best wishes,
Ralph

f2harrell · January 11, 2019, 1:11pm

Great input. I’ll correct the phrase about the constraint and change rates to probabilities in type I and type II errors. I’m glad you caught that because I’m always fussing about type I error rates being the wrong term when I want to be nitpicky.

Peter_R_S · January 11, 2019, 2:19pm

@RBrinks agree to all your points, except rates and “unit per time”. In my opinion the denominator does not necessarily have to be time. I consider e.g. “falls per distance walked” or “bacteria per surface area” as rates too. Guess the point is that the numerator is not a subset of the denominator.
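To make the distinction concrete, here is a minimal Python sketch of a proportion (numerator is a subset of the denominator, so the result lies between 0 and 1) next to a rate (events per unit of exposure, non-negative with no upper bound). The function names and the numbers are hypothetical and chosen only for illustration; they do not come from the glossary.

```python
def proportion(events_in_group: int, group_size: int) -> float:
    """Share of a group that had the event.

    The numerator counts members of the denominator, so the result is in [0, 1].
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if not 0 <= events_in_group <= group_size:
        raise ValueError("events_in_group must be between 0 and group_size")
    return events_in_group / group_size


def rate(event_count: int, exposure: float) -> float:
    """Events per unit of exposure (time, distance walked, surface area, ...).

    The numerator is not a subset of the denominator, so the result is
    non-negative but can exceed 1.
    """
    if event_count < 0:
        raise ValueError("event_count must be non-negative")
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return event_count / exposure


# Hypothetical numbers, for illustration only.
falls_per_km = rate(event_count=3, exposure=1.5)            # falls per km walked
share_who_fell = proportion(events_in_group=12, group_size=40)

print(f"rate: {falls_per_km} falls per km")
print(f"proportion: {share_who_fell} of walkers fell at least once")
```

Note that the rate here is greater than 1, which a probability can never be.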

f2harrell · January 11, 2019, 2:20pm

I think you’re right. Does anyone else think that a rate needs to be a derivative? Or do others agree that falls per distance walked is truly a rate? Here are the Oxford Dictionary definitions for rate as a noun. Seems pretty inclusive.

I updated the glossary with @Peter_R_S’s suggestions.

zinkov · January 12, 2019, 4:25pm

I’d like a definition for randomized controlled trials (RCT).

f2harrell · January 12, 2019, 6:27pm

Good suggestion. I lean towards a definition extracted from this.

zinkov · January 12, 2019, 8:13pm

I was considering some definition like https://emj.bmj.com/content/20/2/164:

A randomized controlled trial is a type of experiment where we control for the effect of a confounding variable by randomly assigning subjects into groups which will or will not receive a treatment. It is the gold-standard for establishing causality.
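As a rough illustration of “randomly assigning subjects into groups”, here is a minimal Python sketch of simple balanced randomization. The function name `randomize`, the arm labels, and the seed are hypothetical; real trials typically use more elaborate schemes (blocking, stratification, concealed allocation) that this sketch does not attempt.

```python
import random


def randomize(subject_ids, arms, seed=None):
    """Randomly allocate subjects to arms with (near-)equal group sizes.

    Subjects are shuffled, then dealt to the arms in turn, so neither the
    investigator nor any subject characteristic decides who gets what.
    """
    rng = random.Random(seed)      # seeded only so the example is repeatable
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)

    allocation = {arm: [] for arm in arms}
    for position, subject in enumerate(shuffled):
        allocation[arms[position % len(arms)]].append(subject)
    return allocation


# Hypothetical example: 12 subjects, three arms (an RCT need not have just two).
allocation = randomize(range(1, 13), ["A", "B", "C"], seed=2019)
for arm, subjects in allocation.items():
    print(arm, sorted(subjects))
```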

f2harrell · January 13, 2019, 1:21pm

Not bad, but I hesitate to keep using the word “control” to mean standard treatment or no treatment, or, as in the BMJ definition, to use “will not receive a treatment”. An active-control trial or just a head-to-head comparison of two modern treatments is a perfectly good RCT. Also there are randomized crossover studies. I’ve never been clear on whether we should imply only a parallel-group design when using the term RCT.

Peter_R_S · January 13, 2019, 8:04pm

I feel there are good points in both sources. It should be clear that an RCT avoids bias, due to known and unknown confounders at the time of randomization.

The former is mentioned on Wikipedia, the latter in the BMJ paper.

RCTs are not restricted to two groups either.

f2harrell · January 14, 2019, 2:45pm

I added a somewhat comprehensive definition under clinical trials. Thanks both of you for input.

I’m adding Julia Rohrer’s definitions of reproducibility etc.

RonanConroy · January 15, 2019, 2:59pm

I guess the important question is the intended purpose of the glossary. The definitions are often “what is” definitions rather than “what’s the purpose of” definitions.

For example, interquartile range is simply defined as the range between the outer quartiles. Fair enough, but we’re not told it’s a measure of data spread.
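A small Python sketch may show why the “measure of spread” part matters. The two data sets below are made up and share the same median, yet their interquartile ranges differ a lot. The helper name `interquartile_range` is hypothetical, and quartile conventions vary between packages; this one uses the standard library’s `statistics.quantiles` with the “inclusive” method.

```python
from statistics import median, quantiles


def interquartile_range(values):
    """Distance between the outer quartiles (Q3 - Q1): a measure of spread."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Made-up data: same centre, very different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

print("medians:", median(tight), median(wide))
print("IQRs:   ", interquartile_range(tight), interquartile_range(wide))
```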

And observational studies are defined as “Study in which no experimental condition (e.g., treatment) is manipulated by the investigator, i.e., randomization is not used”, which defines them by the absence of a feature. It would help to mention that they are frequently used to measure the characteristics of a study population and to look for associations between these characteristics. And maybe that they are frequently used in health research.

The level of detail, both in terms of formulas and examples, varies widely – paired data, for example, is quite comprehensive, while parametric model simply says “A model based on a mathematical function having a few unknown parameters.”

My advice would be to define a readership and to do a think-aloud protocol. I am constantly surprised by the explanations that practising doctors do and do not understand. Markov models, for example, are easy, while degrees of freedom (found in every paper in a journal club) is one that I still haven’t found an easy explanation for.

I must welcome the project, though, and wish it well. When two peoples speak different languages, the person who makes a good dictionary deserves to be honoured.

As a footnote: I have a very ancient copy of Johnson’s dictionary, and it does, indeed, define a lexicographer as “A writer of dictionaries; a harmless drudge.”

cvmsteve · January 15, 2019, 3:06pm

I am still chuckling over the definition of “data science”. These definitions should be distributed Week 1 to any biostatistics course.

f2harrell · January 15, 2019, 5:04pm

Ronan I think these are great observations. I would like to not have to think so much, by avoiding the issue of the particular audience (other than emphasizing clinicians). So my inclination is to just try to make each definition better. I would like to incorporate the fantastic suggestions you listed here. Further contributions welcomed!

drjgauthier · August 8, 2019, 4:14am

Hi Frank! It would be very helpful to define the term “overfitting” in your glossary, in the context of regression modeling.

f2harrell · August 8, 2019, 2:20pm

Excellent suggestion. Just added it. See if the definition does the trick.

jumbanhounc · August 13, 2019, 1:23pm

This is really excellent. I think an entry for causal inference could be useful.

f2harrell · August 13, 2019, 1:47pm

I’ll need a volunteer for that one. I’m unqualified and would be too tempted to write “something that comes strictly from the study design and not the data, and for which further discussion is not very fruitful”.

fmaguire · August 13, 2019, 2:21pm

Very useful!

Suggest adding a “number needed to treat” definition.
There is a typo in the AI definition “AI is n procedure”.
Might be worth adding general and generalized linear model terms separately with a “see linear regression” cross-reference.
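For the “number needed to treat” suggestion, the usual textbook definition is the reciprocal of the absolute risk reduction. A minimal Python sketch, with a hypothetical function name and made-up risks (not a proposed glossary entry):

```python
def number_needed_to_treat(control_risk: float, treated_risk: float) -> float:
    """NNT = 1 / absolute risk reduction (control risk minus treated risk).

    Both arguments are event probabilities in [0, 1]. An NNT is only defined
    here when treatment lowers the risk.
    """
    for risk in (control_risk, treated_risk):
        if not 0.0 <= risk <= 1.0:
            raise ValueError("risks must be probabilities between 0 and 1")
    absolute_risk_reduction = control_risk - treated_risk
    if absolute_risk_reduction <= 0:
        raise ValueError("treated risk must be lower than control risk")
    return 1.0 / absolute_risk_reduction


# Made-up risks: 20% of controls and 15% of treated subjects have the event.
nnt = number_needed_to_treat(control_risk=0.20, treated_risk=0.15)
print(f"NNT is about {nnt:.0f}")  # roughly 20 treated to prevent one event
```

A point estimate like this says nothing about uncertainty; in practice it would be reported with an interval.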