A glossary of statistical terms, intended primarily for non-statisticians, may be found here. The purpose of this topic is to provide a place for suggested improvements to any of the existing definitions, or to add new definitions for terms not already covered in the glossary. Feel free to propose your own definitions. You will be credited for material that is added to the glossary.

# Glossary of Statistical Terms

**RBrinks**#3

Thanks Frank, this is great and helps me (and possibly my students) a lot.

Please allow a comment on the item “rate”, which you write is unconstrained. As far as I can see, a rate is always non-negative.

As an aside, I appreciate that you see a rate as something “unit per time” and that it shouldn’t be confused with probabilities. In the definition of type I and type II errors, you use “false positive rate” and “false negative rate” as synonyms. I guess in this sense, you interpret type I and type II error “rates” as probabilities (sorry for nitpicking).

Thanks and best wishes,

Ralph

**f2harrell**#4

Great input. I’ll correct the phrase about the constraint and change rates to probabilities in type I and type II errors. I’m glad you caught that because I’m always fussing about type I error *rates* being the wrong term when I want to be nitpicky.

**Peter_R_S**#5

@RBrinks I agree with all your points, except rates and “unit per time”. In my opinion the denominator does not necessarily have to be time. I consider e.g. “falls per distance walked” or “bacteria per surface area” as rates too. I guess the point is that the numerator is not a subset of the denominator.
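To make the distinction concrete, here is a minimal Python sketch. The function names and the numbers are hypothetical illustrations, not taken from the glossary. It assumes a proportion counts events out of trials (the numerator is a subset of the denominator), while a rate divides a count by an exposure measured in other units such as time, distance, or area.

```python
def proportion(events, trials):
    """Events out of trials: the numerator is a subset of the denominator.

    The result always lies between 0 and 1.
    """
    if trials <= 0 or not 0 <= events <= trials:
        raise ValueError("need 0 <= events <= trials and trials > 0")
    return events / trials


def rate(events, exposure):
    """Events per unit of exposure (time, distance, area, ...).

    The result is non-negative but has no upper bound, because the
    numerator is not a subset of the denominator.
    """
    if events < 0 or exposure <= 0:
        raise ValueError("need events >= 0 and exposure > 0")
    return events / exposure


# Hypothetical numbers, for illustration only.
falls_per_km = rate(events=12, exposure=8.0)            # 12 falls over 8 km walked
false_positive_prob = proportion(events=5, trials=100)  # 5 of 100 true negatives flagged

print(falls_per_km, false_positive_prob)
```

With these made-up inputs the rate exceeds 1, which a proportion can never do. That is one practical way to tell the two apart.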

**f2harrell**#6

I think you’re right. Does anyone else think that a rate needs to be a derivative? Or do others agree that falls per distance walked is truly a rate? Here are the Oxford Dictionary definitions for *rate* as a noun. Seems pretty inclusive.

I updated the glossary with @Peter_R_S’s suggestions.

**zinkov**#9

I was considering some definition like https://emj.bmj.com/content/20/2/164:

A randomized controlled trial is a type of experiment where we control for the effect of a confounding variable by randomly assigning subjects into groups which will or will not receive a treatment. It is the gold standard for establishing causality.

**f2harrell**#10

Not bad, but I hesitate to keep using the word *control* to mean standard treatment or no treatment, or in the BMJ definition the use of “will not receive a treatment”. An active-control trial or just a head-to-head comparison of two modern treatments is a perfectly good RCT. Also there are randomized crossover studies. I’ve never been clear on whether we should imply only a parallel-group design when using the term RCT.

**Peter_R_S**#11

I feel there are good points in both sources. It should be clear that an RCT avoids bias due to *known and unknown* confounders *at the time of randomization*.

The former is mentioned on Wikipedia, the latter in the BMJ paper.

RCTs are not restricted to two groups either.

**f2harrell**#12

I added a somewhat comprehensive definition under *clinical trials*. Thanks both of you for input.

I’m adding Julia Rohrer’s definitions of reproducibility etc.

**RonanConroy**#13

I guess the important question is the intended purpose of the glossary. The definitions are often *what is* definitions rather than *what’s the purpose of* definitions.

For example, interquartile range is simply defined as the range between the outer quartiles. Fair enough, but we’re not told it’s a measure of data spread.
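As a small illustration of the "purpose" side, the sketch below computes the interquartile range for two made-up samples that share the same median but differ in spread. It is only a sketch: the helper name `interquartile_range` and the data are hypothetical, and the quartile convention (the standard library's "inclusive" method) is an assumption, since other conventions give slightly different values.

```python
import statistics


def interquartile_range(values):
    """Return Q3 - Q1 for a sample.

    Uses statistics.quantiles with method="inclusive"; this is one of
    several quartile conventions, chosen here only for illustration.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Two hypothetical samples with the same median (50) but different spread.
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

print("IQR of tight sample:", interquartile_range(tight))
print("IQR of wide sample:", interquartile_range(wide))
```

The second sample yields a much larger IQR than the first even though the medians agree, which is the sense in which the IQR measures spread.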

And observational studies are defined as *Study in which no experimental condition (e.g., treatment) is manipulated by the investigator, i.e., randomization is not used* which defines them by the absence of a feature. It would help to mention that they are frequently used to measure the characteristics of a study population and to look for associations between these characteristics. And maybe that they are frequently used in health research.

The level of detail, both in terms of formulas and examples, varies widely – *paired data*, for example, is quite comprehensive, while *parametric model* simply says *A model based on a mathematical function having a few unknown parameters*.

My advice would be to define a readership and to do a think-aloud protocol. I am constantly surprised by the explanations that practising doctors do and do not understand. Markov models, for example, are easy, while degrees of freedom (found in every paper in a journal club) is one that I still haven’t found an easy explanation for.

I must welcome the project, though, and wish it well. When two peoples speak different languages, the person who makes a good dictionary deserves to be honoured.

As a footnote: I have a very ancient copy of Johnson’s dictionary, and it does, indeed, define a lexicographer as *A writer of dictionaries; a harmless drudge*.

**cvmsteve**#14

I am still chuckling over the definition of “data science”. These definitions should be distributed in Week 1 of any biostatistics course.

**f2harrell**#15

Ronan, I think these are great observations. I would like to not have to think too much, by avoiding the issue of the particular audience (other than emphasizing clinicians). So my inclination is to just try to make each definition better. I would like to incorporate the fantastic suggestions you listed here. Further contributions welcomed!