Glossary of Statistical Terms

Good suggestion. I lean towards extracted from this.

I was considering some definition like

A randomized controlled trial is a type of experiment where we control for the effect of confounding variables by randomly assigning subjects to groups which will or will not receive a treatment. It is the gold standard for establishing causality.

Not bad, but I hesitate to keep using the word “control” to mean standard treatment or no treatment, or, as in the BMJ definition, to use “will not receive a treatment”. An active-control trial, or just a head-to-head comparison of two modern treatments, is a perfectly good RCT. There are also randomized crossover studies. I’ve never been clear on whether we should imply only a parallel-group design when using the term RCT.


I feel there are good points in both sources. It should be clear that an RCT avoids bias due to both known and unknown confounders at the time of randomization.

The former is mentioned on Wikipedia, the latter in the BMJ paper.

RCTs are not restricted to two groups either.
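
To make the random-assignment part of the definitions above concrete, here is a minimal sketch of allocating subjects to any number of arms. The function name `randomize`, the arm labels, and the fixed seed are all made up for illustration, and real trials typically use blocked or stratified schemes rather than this simple shuffle.

```python
import random

def randomize(subject_ids, arms, seed=None):
    """Randomly allocate subjects to arms (hypothetical helper, not a trial-grade scheme)."""
    rng = random.Random(seed)      # seeded only so the example is repeatable
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)          # chance, not the investigator, decides the order
    # Deal the shuffled subjects round-robin so arm sizes differ by at most one.
    return {arm: shuffled[i::len(arms)] for i, arm in enumerate(arms)}

# Three arms: two active treatments compared head-to-head plus standard care.
allocation = randomize(range(12), ["treatment A", "treatment B", "standard care"], seed=1)
for arm, subjects in allocation.items():
    print(arm, sorted(subjects))
```

Because the assignment depends only on the shuffle, no subject characteristic, measured or unmeasured, can systematically influence which arm a subject ends up in.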


I added a somewhat comprehensive definition under clinical trials. Thanks both of you for input.

:new: I’m adding Julia Rohrer’s definitions of reproducibility etc.

I guess the important question is the intended purpose of the glossary. The definitions are often “what is it” definitions rather than “what is it for” definitions.

For example, interquartile range is simply defined as the range between the outer quartiles. Fair enough, but we’re not told it’s a measure of data spread.
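
As a small illustration of the “measure of spread” point, here is a sketch using Python’s standard `statistics` module. The two data sets are invented, and the `"inclusive"` quantile method is just one of several conventions for computing quartiles.

```python
import statistics

def interquartile_range(values):
    """Return Q3 - Q1, the width of the middle half of the data."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Two made-up samples with the same median (5.5) but different spread.
tight = [4, 5, 5, 5, 6, 6, 6, 7]
wide = [1, 2, 3, 4, 7, 8, 9, 10]

print(statistics.median(tight), interquartile_range(tight))
print(statistics.median(wide), interquartile_range(wide))
```

The medians agree, but the second sample has the larger interquartile range, which is exactly the information a “what is it for” definition would convey.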

And observational studies are defined as “Study in which no experimental condition (e.g., treatment) is manipulated by the investigator, i.e., randomization is not used”, which defines them by the absence of a feature. It would help to mention that they are frequently used to measure the characteristics of a study population and to look for associations between these characteristics. And maybe that they are frequently used in health research.

The level of detail, both in terms of formulas and examples, varies widely – paired data, for example, is quite comprehensive, while parametric model simply says “A model based on a mathematical function having a few unknown parameters.”

My advice would be to define a readership and to do a think-aloud protocol. I am constantly surprised by the explanations that practising doctors do and do not understand. Markov models, for example, are easy, while degrees of freedom (found in every paper in a journal club) is one that I still haven’t found an easy explanation for.

I must welcome the project, though, and wish it well. When two peoples speak different languages, the person who makes a good dictionary deserves to be honoured.

As a footnote: I have a very ancient copy of Johnson’s dictionary, and it does, indeed, define a lexicographer as “A writer of dictionaries; a harmless drudge.”

I am still chuckling over the definition of “data science”. These definitions should be distributed Week 1 to any biostatistics course.


Ronan, I think these are great observations. I would like to not have to think so much, by avoiding the issue of the particular audience (other than emphasizing clinicians). So my inclination is to just try to make each definition better. I would like to incorporate the fantastic suggestions you listed here. Further contributions are welcome!

Hi Frank! It would be very helpful to define the term “overfitting” in your glossary, in the context of regression modeling.

Excellent suggestion. Just added it. See if the definition does the trick.

This is really excellent. I think an entry for causal inference could be useful.

I’ll need a volunteer for that one. I’m unqualified and would be too tempted to write “something that comes strictly from the study design and not the data, and for which further discussion is not very fruitful”. :slight_smile:

Very useful!

  • Suggest adding a “number needed to treat” definition.

  • There is a typo in the AI definition “AI is n procedure”.

  • Might be worth adding general and generalized linear model terms separately, each with a “see linear regression” cross-reference.

Great suggestions. Done. I did not add anything for “general linear model” but did for “generalized linear model”. Thanks for joining datamethods!

Thanks to my colleague @SpiekerStats we now also have a definition of causal inference.


Excellent addition. I think that the reference to Judea Pearl should be added to the references section at the end.

To the paper I link to, or is there a better reference from him?

That paper is very comprehensive, readable, and freely available so I think it serves as a great reference.

It would be great to add the definition of the likelihood or likelihood function. I am desperately looking for a good definition!

Great idea. Done. I also added definitions for probability density functions, and expanded the definition of maximum likelihood estimator. Any suggestions for improving these definitions are most welcome.
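
For readers who want something concrete to go with the likelihood and maximum likelihood entries, here is a minimal sketch for a binomial model. It is not taken from the glossary: the data (3 responders out of 10 patients) are invented, and the grid search is a deliberately crude stand-in for proper optimization.

```python
from math import comb

def binomial_likelihood(p, successes, trials):
    """Probability of the observed data, viewed as a function of the unknown p."""
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)

# Hypothetical data: 3 responders among 10 patients.
successes, trials = 3, 10

# Evaluate the likelihood over a grid of candidate values of p.
grid = [i / 100 for i in range(1, 100)]
mle = max(grid, key=lambda p: binomial_likelihood(p, successes, trials))

print(mle)  # the grid value that makes the observed data most probable
```

The data stay fixed while `p` varies, which is the distinction between a likelihood function and a probability distribution; on this grid the maximum falls at the observed proportion, 3/10.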