Categorizing Continuous Variables

f2harrell · June 4, 2020, 11:24am

Problems Caused by Categorizing Continuous Variables

References | Key reference | Key Reference for Response Variables | More References | Myths About Risk Thresholds | Dichotomania

Optimum decisions are made by applying a utility function to a predicted value (e.g., predicted risk). At the decision point, one can solve for the personalized cutpoint for predicted risk that optimizes the decision. Dichotomization on independent variables is completely at odds with making optimal decisions. To make an optimal decision, the cutpoint for a predictor would necessarily be a function of the continuous values of all the other predictors, as shown here in Section 18.3.1.
Loss of power and loss of precision of estimated means, odds, hazards, etc. Dichotomization of a predictor requires the researcher to add a new predictor to the mix to make up for the lost information.
Categorization assumes that the relationship between the predictor and the response is flat within intervals; this assumption is far less reasonable than a linearity assumption in most cases
Researchers seldom agree on the choice of cutpoint, thus there is a severe interpretation problem. One study may provide an odds ratio for comparing BMI > 30 with BMI <= 30, another for comparing BMI > 28 with BMI <= 28. Neither of these has a good definition and they have different meanings.
Categorization of continuous variables using percentiles is particularly hazardous. The percentiles are usually estimated from the data at hand, are estimated with sampling error, and do not relate to percentiles of the same variable in a population. Percentiling a variable is declaring to readers that how similar a person is to other persons is as important as how the physical characteristics of the measurement predict outcomes. For example, it is common to group the continuous variable BMI into quantile intervals. BMI has a smooth relationship with every outcome studied, and relates to outcome according to anatomy and physiology and not according to how many subjects have a similar BMI.
To make a continuous predictor be more accurately modeled when categorization is used, multiple intervals are required. The needed dummy variables will spend more degrees of freedom than will fitting a smooth relationship, hence power and precision will suffer. And because of sample size limitations in the very low and very high range of the variable, the outer intervals (e.g., outer quintiles) will be wide, resulting in significant heterogeneity of subjects within those intervals, and residual confounding.
Categorization assumes that there is a discontinuity in response as interval boundaries are crossed
Categorization only seems to yield interpretable estimates such as odds ratios. For example, suppose one computes the odds ratio for stroke for persons with a systolic blood pressure > 160 mmHg compared to persons with a blood pressure <= 160 mmHg. The interpretation of the resulting odds ratio will depend on the exact distribution of blood pressures in the sample (the proportion of subjects > 170, > 180, etc.). On the other hand, if blood pressure is modeled as a continuous variable (e.g., using a regression spline, quadratic, or linear effect) one can estimate the ratio of odds for exact settings of the predictor, e.g., the odds ratio for 200 mmHg compared to 120 mmHg.
When the risk of stroke is being assessed for a new subject with a known blood pressure (say 162), the subject does not report to her physician “my blood pressure exceeds 160” but rather reports 162 mmHg. The risk for this subject will be much lower than that of a subject with a blood pressure of 200 mmHg.
If cutpoints are determined in a way that is not blinded to the response variable, calculation of P -values and confidence intervals requires special simulation techniques; ordinary inferential methods are completely invalid. For example, if cutpoints are chosen by trial and error in a way that utilizes the response, even informally, ordinary P -values will be too small and confidence intervals will not have the claimed coverage probabilities. The correct Monte-Carlo simulations must take into account both multiplicities and uncertainty in the choice of cutpoints. For example, if a cutpoint is chosen that minimizes the P -value and the resulting P -value is 0.05, the true type I error can easily be above 0.5; see here
Likewise, categorization that is not blinded to the response variable results in biased effect estimates (see this and this)
“Optimal” cutpoints do not replicate over studies. Hollander, Sauerbrei, and Schumacher (see here) state that “… the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the `optimal’ cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were in included with 12 different cutpoints … Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology.”
Cutpoints are arbitrary and manipulatable; cutpoints can be found that can result in both positive and negative associations (see this)
If a confounder is adjusted for by categorization, there will be residual confounding that can be explained away by inclusion of the continuous form of the predictor in the model in addition to the categories.
A better approach that maximizes power and that only assumes a smooth relationship is to use a restricted cubic spline (regression spline; piecewise cubic polynomial) function for predictors that are not known to predict linearly. Use of flexible parametric approaches such as this allows standard inference techniques (P -values, confidence limits) to be used
Considerable deterioration in performance in prediction models that dichotomise/categorise predictors. Prediction is already difficult - why make the task harder and the model less accurate - see here.

Papers and Links

Bad use of quantiles in epidemiology
Methods appraisal: dichotomania broken link
Screwup that helped covid kill
Cutpoints and contexts by Evan Busch

Interactive Demonstrations

Run under RStudio: require(Hmisc); getRs('catgNoise.r') or watch this video
Java applet demonstrating the information loss from dichotomizing a continuous variable
Median split demo
Hazards of categorizing continuous predictors video

Doug_Dame · June 4, 2020, 3:00pm

GOOD STUFF !!!

Re #7 - says: Categorization assumes that there is a discontinuity in response as interval boundaries are crossed.

You could say: Categorization CREATES an abrupt discontinuity in the modeled response as interval boundaries are crossed, which is unlikely to reflect reality.

pmbrown · June 5, 2020, 2:56pm

if you stratify randomisation by a continuous variable such as age, then you must make some categorisation of it (for practical purposes). When you do the primary analysis would you include age as categorical (as per the randomisation) or as continuous?

f2harrell · June 5, 2020, 3:04pm

This goes against @Stephen’s sage advice that the model drives the randomization, not the other way around. Depending on the goal for stratifying on age (if there is a good reason) one should use the whole age distribution in deciding which patients to admit to the RCT, and at any rate age should always be continuous in the analysis. Otherwise serious lack of model fit will result.

Stephen · June 8, 2020, 1:07pm

This is what I propose to say in the 3rd edition of Statistical Issues in Drug Development

Blockquote

Practical problems of stratification include that it requires turning continuous covariates into categorical ones. A simple approach might seem to be use quantiles of a distribution to do this, most simply the median, so that one could create two equal (in frequency) strata defined as being above or below the median. However, since patients are recruited sequentially, what the median of those recruited will be will not be known until recruitment is complete. Hence one would have to use predicted quantiles to do this. In practice, very often some rather arbitrary categories of a variable are used. Age provides a very good example. Sometimes I have encountered statisticians wrestling to reconcile two rather different randomisation philosophies. One is that of respecting the design so, for example, if the design has been stratified by five age groups including age as a factor with five levels in the model. The other is recognising that this is a stupid way to model such a variable anyway and wondering whether some more flexible technique such as splines(Harrell, F., 2015) or fractional polynomials(Royston, P. & Altman, D. G., 1994; Royston, P. & Sauerbrei, W., 2005) ought not to be used. Of course, at the expense of some degrees of freedom one could always do both, but many would consider that this was over-egging the pudding.Practical problems of stratification include that it requires turning continuous covariates into categorical ones. A simple approach might seem to be use quantiles of a distribution to do this, most simple the median, so that one could create two equal (in frequency) strata defined as being above or below the median. However, since patients are recruited sequentially, what the median of those recruited will be will not be known until recruitment is complete. Hence one would have to use predicted quantiles to do this. In practice, very often some rather arbitrary categories of a variable are used. Age provides a very good example. Sometimes I have encountered statisticians wrestling to reconcile two rather different randomisation philosophies. One is that of respecting the design so, for example, if the design has been stratified by five age groups including age as a factor with five levels in the model. The other is recognising that this is a stupid way to model such a variable anyway and wondering whether some more flexible technique such as splines(Harrell, F., 2015) or fractional polynomials(Royston, P. & Altman, D. G., 1994; Royston, P. & Sauerbrei, W., 2005) ought not to be used. Of course, at the expense of some degrees of freedom one could always do both, but many would consider that this was over-egging the pudding.

f2harrell · June 8, 2020, 3:09pm

Beautiful - just need to edit out some repeated text. You might consider adding to the 3rd edition that there is another option which is to have a target age distribution based on what we know/don’t know about age interacting with treatment. Then do probability sampling within the set of volunteers to achieve that continuous distribution. For example you might not want too many young people in the sample but still want to have sufficiently many to get a crude look at consistency of treatment effect there.

Pavlos_Msaouel · June 8, 2020, 3:24pm

The 3rd edition cannot come out soon enough. One minor typo here:

The other is recognising that this is a stupid way to model such a variable anyway and wondering whether some more flexible technique such as splines(Harrell, F., 2015) or fractional polynomials(Royston, P. & Altman, D. G., 1994; Royston, P. & Sauerbrei, W., 2005) ought not to be used.

I think it should be: ought to be used.

pmbrown · June 11, 2020, 1:47pm

but if you have age as a stratification factor, then very likely you will plan some subgroup analysis implying an interaction with the categorical age variable?

look forward to the 3rd edition. Would it be 2021?

f2harrell · June 11, 2020, 2:03pm

Subgroup analysis doesn’t cut it. Model based interacting-factor-specific treatment effects are recommended, with age as continuous.

pmbrown · June 11, 2020, 2:51pm

and if a plot is too “expensive” (too much space in the paper) you just quote the estimate for some pre-specified age=X? cheers

f2harrell · June 11, 2020, 3:41pm

Show a graph with compatibility bands: y=treatment difference x=age

Lynild · June 23, 2020, 1:21pm

This might be a stupid question, but if categorization of continuous variables gives cause to several problems (as mentioned here), doesn’t this somehow also gives rise to problems when doing classification ? Often that require you to set some kind of threshold before your model can “correctly” classify you into one category. So a model might even have predictors that have been categorized from continuous values, but you also have to “categorize” the end result into one or more categories.

R_cubed · June 23, 2020, 2:37pm

I remembered this being discussed awhile back. Here is a link to the thread.

I would also study the blog post Dr. Harrell mentioned in that thread. I post the link for your convenience.

I guess the key point is: in order to maximize the use of the information available, all inputs are kept in a continuous form until the actual time a decision needs to be made.

When you think about it as an engineer, physicist, or numerical analyst might, this makes sense in terms of the procedure maintaining numerical stability.

Every approximation of a continuous quantity introduces error. When more error is introduced during the modelling process (ie. in the terms of a regression model), the more difficult it is to place reliable bounds upon the output of the model. A small change in inputs could lead to a very large change in outputs.

If you think about this in a regression context, the least amount of error is introduced if we dichotomize at the point in time when an item needs to be classified (when element x is inserted into set categories Y_1,... Y_N).

So the simple answer is: “Don’t dichotomize until the end of the modelling process.”

A more complicated answer is: “Any approximation introduced in the modelling process must be examined for the error it introduces.” There are other theorems that can guide the modeller on when an approximation can be inserted for a continuous quantity.

Addendum: I think this link is a more general discussion of the issue of mapping a continuous function (outputs of some model) to a discrete one (ie. choice/decison function).

Lynild · June 23, 2020, 3:33pm

Ah, thank you.

This actually makes perfect sense. But to continue down this path, what is the best/better way to then actually choose your threshold ? I mean, let’s say your threshold is 10 (some value). You then use your model, and before the final classification step you get the number 9, or even 9.9 if it is not discrete values. Then this is classified into one group. The next data you use your model on, you get a value of 11 or 10.1, and this is then classified into another group. Only by a separation of 0.1 (or less). This is what bothers me the most. How do one actually choose the right threshold ?

R_cubed · June 23, 2020, 4:28pm

Blockquote
How do one actually choose the right threshold?

If I’m breaking this down in to the decision theoretic framework properly, you are asking about the procedure that maps the choice/decision function onto the function that estimates the “state of nature” or the set of “possible outcomes.”

That is context-sensitive. In a purely scientific context, there is no “right” threshold. I think it is agreed that simply reporting a smoothed function of the model outputs is best. This allows the decision maker to choose the threshold.

In an applied context, the modeller would need to elicit utilities and probabilities from the ultimate decision maker, and then do an analysis of the cost/loss for all possible actions, for all states of nature, conditional on the output of the model.

One criterion would be to maximize expected utility (ie. minimize expected cost) if there is confidence in the contextual information and probability assessments. This one has many attractive properties and is consistent with a Bayesian attitude towards probability.

It is also possible to minimize the worst case loss (mini-max), or regret (the loss experienced after the state of nature is revealed). Bayesian methods can also address this attitude towards risk. But most criticisms of Bayesian philosophy are concerned with the risk they introduce when wrong.

Here is a nice link on the formal tools of decision analysis:
https://home.ubalt.edu/ntsbarsh/business-stat/opre/partIX.htm#rwida

cryan · May 7, 2021, 6:14pm

The caution against categorizing continuous predictors makes a lot of sense and really resonates with me. The discussion in the RMS videos is mostly in a biological/medical context, which is understandable. Do you think there are bio-social variables that are worth dichotomizing, depending on the purpose and context of the study? I’m thinking age, dichotomized at 18, because in many states that turnover has substantive meaning about what people can and cannot do. Although maybe the proper thing to do is to create/use a proper dichotomous variable, like “less than legal age” vs “more than legal age”.

f2harrell · May 7, 2021, 6:29pm

The case you described involves a legal mandate, which creates a discontinuity). Legal mandates comprise most of the situations where I can imagine a threshold being appropriate.

RonanConroy · June 1, 2021, 11:04am

This is a good point. Many scales used in social, mental health and behavioural research have no definable unit, so the usual interpretation of the regression coefficient isn’t helpful.

One approach is to use quantiles of the predictor. How much information is lost when you convert a continuous variable to quantiles. One way of looking at it is the effect on the effective sample size. So I ran some numbers.

Converting a continuous variable to deciles seems attractive. It reduces your effective sample size by 1%. Quintiles reduces it by 4%, and quartiles 6·25%. Tertiles, however, reduces it by a little over 11% and dichotomisation at the median loses a whopping 25% of effective sample size.

Dichotomising, as you might expect, gets worse and worse as the splits become less equal. A split at the 60th or 40th percentile reduces effective sample size by 28%, and at the 70th or 30th reduces it by 37%. A 20/80 split reduces it by 52% and a 90/10 split really throws your sample away – effective sample size down by 73%.

From the interpretation point of view, expressing the effect of a predictor like burnout, worry, prosocial behaviour or religiosity is improved by using quantiles. Looking at the numbers, I guess that deciles and quintiles offer the best interpretability combined with the least loss of effective sample size.

f2harrell · June 1, 2021, 12:51pm

You are taking as a starting point that loss of effective sample size is OK. It’s not. The meaning of the predictors needs to be respected. Categorization does not do that. And be sure to look at integrated mean squared prediction error or mean absolute error on an independent dataset. It’s easy to simulate a 50,000 observation independent datasets for checking overall accuracy. Categorization assumes flat relationships within categories, which you have also not demonstrated.

RonanConroy · June 1, 2021, 3:47pm

Absolutely not! No – loss of effective sample size is not well enough appreciated. Researchers will look for associations between binary variables and an outcome variable, unaware of the relationship between the prevalence of the predictor variable and the power. One way of expressing this is by taking the ideal case – a 50% prevalence – and looking at the effective sample size for different prevalences compared with this.