Categorizing Continuous Variables

The 3rd edition cannot come out soon enough. One minor typo here:

The other is recognising that this is a stupid way to model such a variable anyway and wondering whether some more flexible technique such as splines(Harrell, F., 2015) or fractional polynomials(Royston, P. & Altman, D. G., 1994; Royston, P. & Sauerbrei, W., 2005) ought not to be used.

I think it should be: ought to be used.

But if you have age as a stratification factor, then very likely you will plan some subgroup analysis, implying an interaction with the categorical age variable?

I look forward to the 3rd edition. Will it be out in 2021?

Subgroup analysis doesn’t cut it. Model-based, interacting-factor-specific treatment effects are recommended, with age as continuous.

2 Likes

And if a plot is too “expensive” (too much space in the paper), you just quote the estimate for some pre-specified age = X? Cheers

Show a graph with compatibility bands: y = treatment difference, x = age.
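
For concreteness, here is a minimal sketch of that kind of display (Python with simulated data; the statsmodels/patsy spline-by-treatment setup and all variable names are my own choices, not anything prescribed in this thread): fit treatment interacted with a flexible function of age, then plot the estimated treatment difference against age with a compatibility band.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from patsy import build_design_matrices

rng = np.random.default_rng(1)
n = 600
age = rng.uniform(30, 80, n)
tx = rng.integers(0, 2, n)                       # 0 = control, 1 = treatment
y = 0.03 * age + tx * 0.02 * (age - 30) + rng.normal(0, 1, n)
d = pd.DataFrame({"y": y, "age": age, "tx": tx})

# treatment interacted with a flexible (cubic regression spline) function of age
fit = smf.ols("y ~ tx * cr(age, df=4)", data=d).fit()

# estimated treatment difference (tx = 1 minus tx = 0) over a grid of ages
grid = pd.DataFrame({"age": np.linspace(35, 75, 100)})
di = fit.model.data.design_info
X1 = np.asarray(build_design_matrices([di], grid.assign(tx=1))[0])
X0 = np.asarray(build_design_matrices([di], grid.assign(tx=0))[0])
C = X1 - X0                                      # contrast matrix for the difference
est = C @ fit.params.to_numpy()
se = np.sqrt(np.einsum("ij,jk,ik->i", C, fit.cov_params().to_numpy(), C))

# pointwise 0.95 band; a simultaneous band would use a larger multiplier
plt.plot(grid["age"], est, color="black")
plt.fill_between(grid["age"], est - 1.96 * se, est + 1.96 * se, alpha=0.3)
plt.axhline(0, linestyle="--", linewidth=0.8)
plt.xlabel("Age (years)")
plt.ylabel("Estimated treatment difference")
plt.show()
```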

2 Likes

This might be a stupid question, but if categorization of continuous variables causes several problems (as mentioned here), doesn’t this somehow also give rise to problems when doing classification? Classification often requires you to set some kind of threshold before the model can “correctly” assign you to one category. So a model might even have predictors that have been categorized from continuous values, but you also have to “categorize” the end result into one or more categories.

I remember this being discussed a while back. Here is a link to the thread.

I would also study the blog post Dr. Harrell mentioned in that thread. I post the link for your convenience.

I guess the key point is: in order to maximize the use of the information available, all inputs are kept in a continuous form until the actual time a decision needs to be made.

When you think about it as an engineer, physicist, or numerical analyst might, this makes sense in terms of the procedure maintaining numerical stability.

Every approximation of a continuous quantity introduces error. The more error is introduced during the modelling process (i.e. in the terms of a regression model), the more difficult it is to place reliable bounds upon the output of the model. A small change in inputs could lead to a very large change in outputs.

If you think about this in a regression context, the least amount of error is introduced if we dichotomize at the point in time when an item needs to be classified (when element x is assigned to one of the categories Y_1, …, Y_N).

So the simple answer is: “Don’t dichotomize until the end of the modelling process.”

A more complicated answer is: “Any approximation introduced in the modelling process must be examined for the error it introduces.” There are other theorems that can guide the modeller on when an approximation can be substituted for a continuous quantity.
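
To make the “don’t dichotomize until the end” advice concrete, here is a small simulation sketch (my own toy example, using scikit-learn purely for brevity). One model throws information away by splitting the predictor at its median before fitting; the other keeps the predictor continuous and would only apply a threshold to its predicted probabilities at decision time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x)))        # outcome depends smoothly on x

xtr, xte = x[:1000].reshape(-1, 1), x[1000:].reshape(-1, 1)
ytr, yte = y[:1000], y[1000:]
cut = np.median(xtr)

# (a) dichotomize the predictor up front: information is discarded before modelling
early = LogisticRegression().fit((xtr > cut).astype(float), ytr)
p_early = early.predict_proba((xte > cut).astype(float))[:, 1]

# (b) keep the predictor continuous; a threshold would be applied only to the
#     final predicted probabilities, at the moment a decision is actually needed
late = LogisticRegression().fit(xtr, ytr)
p_late = late.predict_proba(xte)[:, 1]

# Brier score (mean squared error of the predicted probabilities) on the test half
print("early dichotomization:", np.mean((p_early - yte) ** 2))
print("continuous predictor :", np.mean((p_late - yte) ** 2))
```

Under a data-generating mechanism like this, the early-dichotomized model typically shows the worse (larger) Brier score, illustrating the error introduced by approximating early.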

Addendum: I think this link is a more general discussion of the issue of mapping a continuous function (the outputs of some model) to a discrete one (i.e. a choice/decision function).


2 Likes

Ah, thank you.

This actually makes perfect sense. But to continue down this path, what is the best (or at least a better) way to actually choose your threshold? Let’s say your threshold is 10 (some value). You use your model, and before the final classification step you get the value 9, or even 9.9 if the values are not discrete; this is classified into one group. On the next data you apply your model to, you get 11, or 10.1, and this is classified into another group, even though the two values are separated by only 0.1 (or less). This is what bothers me the most. How does one actually choose the right threshold?

1 Like

“How does one actually choose the right threshold?”

If I’m breaking this down into the decision-theoretic framework properly, you are asking about the procedure that maps the choice/decision function onto the function that estimates the “state of nature” or the set of “possible outcomes.”

That is context-sensitive. In a purely scientific context, there is no “right” threshold. I think it is agreed that simply reporting a smoothed function of the model outputs is best. This allows the decision maker to choose the threshold.

In an applied context, the modeller would need to elicit utilities and probabilities from the ultimate decision maker, and then do an analysis of the cost/loss for all possible actions, for all states of nature, conditional on the output of the model.

One criterion would be to maximize expected utility (i.e. minimize expected cost) if there is confidence in the contextual information and probability assessments. This criterion has many attractive properties and is consistent with a Bayesian attitude towards probability.

It is also possible to minimize the worst-case loss (minimax), or the regret (the loss experienced after the state of nature is revealed). Bayesian methods can also address these attitudes towards risk. But most criticisms of Bayesian philosophy are concerned with the risk they introduce when they are wrong.
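
As a toy illustration of the maximize-expected-utility criterion (my own numbers, purely illustrative): if acting unnecessarily costs c_fp and failing to act when the event occurs costs c_fn, the predicted-probability threshold that minimizes expected loss works out to c_fp / (c_fp + c_fn).

```python
import numpy as np

c_fp = 1.0   # assumed cost of acting when the event would not have occurred
c_fn = 4.0   # assumed cost of not acting when the event does occur

p = np.linspace(0, 1, 1001)          # predicted probability of the event
loss_act = (1 - p) * c_fp            # expected loss if we act
loss_wait = p * c_fn                 # expected loss if we do not act

# act whenever acting has the lower expected loss
threshold = p[np.argmax(loss_act < loss_wait)]
print(round(threshold, 3))           # ~0.2 (first grid point past the crossing)
print(c_fp / (c_fp + c_fn))          # closed form: 0.2
```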

Here is a nice link on the formal tools of decision analysis:
https://home.ubalt.edu/ntsbarsh/business-stat/opre/partIX.htm#rwida

3 Likes

The caution against categorizing continuous predictors makes a lot of sense and really resonates with me. The discussion in the RMS videos is mostly in a biological/medical context, which is understandable. Do you think there are bio-social variables that are worth dichotomizing, depending on the purpose and context of the study? I’m thinking of age, dichotomized at 18, because in many states that cutoff has substantive meaning about what people can and cannot do. Although maybe the proper thing to do is to create/use a proper dichotomous variable, like “less than legal age” vs “more than legal age”.

1 Like

The case you described involves a legal mandate, which creates a discontinuity. Legal mandates comprise most of the situations where I can imagine a threshold being appropriate.

This is a good point. Many scales used in social, mental health and behavioural research have no definable unit, so the usual interpretation of the regression coefficient isn’t helpful.

One approach is to use quantiles of the predictor. How much information is lost when you convert a continuous variable to quantiles? One way of looking at it is the effect on the effective sample size, so I ran some numbers.

Converting a continuous variable to deciles seems attractive: it reduces your effective sample size by 1%. Quintiles reduce it by 4%, and quartiles by 6.25%. Tertiles, however, reduce it by a little over 11%, and dichotomisation at the median loses a whopping 25% of effective sample size.

Dichotomising, as you might expect, gets worse and worse as the splits become less equal. A split at the 60th or 40th percentile reduces effective sample size by 28%, and at the 70th or 30th reduces it by 37%. A 20/80 split reduces it by 52% and a 90/10 split really throws your sample away – effective sample size down by 73%.

From the interpretation point of view, expressing the effect of a predictor like burnout, worry, prosocial behaviour or religiosity is improved by using quantiles. Looking at the numbers, I guess that deciles and quintiles offer the best interpretability combined with the least loss of effective sample size.
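
For what it’s worth, numbers like these can be checked by asking how much of the variance of the original variable its quantile-grouped version retains. The sketch below is my reconstruction (assuming a uniformly distributed predictor scored by group means, which may not be exactly the calculation used above); it reproduces the 1%, 4%, 6.25%, 11% and 25% figures approximately.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(size=1_000_000)      # assuming a uniformly distributed predictor

def retained(x, k):
    """Fraction of the predictor's variance retained after grouping into k quantiles."""
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    g = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, k - 1)
    group_means = np.array([x[g == j].mean() for j in range(k)])
    return np.var(group_means[g]) / np.var(x)

for k, label in [(10, "deciles"), (5, "quintiles"), (4, "quartiles"),
                 (3, "tertiles"), (2, "median split")]:
    print(f"{label:>12}: lose {100 * (1 - retained(x, k)):.1f}% of effective sample size")
```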

You are taking as a starting point that loss of effective sample size is OK. It’s not. The meaning of the predictors needs to be respected, and categorization does not do that. And be sure to look at integrated mean squared prediction error or mean absolute error on an independent dataset. It’s easy to simulate a 50,000-observation independent dataset for checking overall accuracy. Categorization assumes flat relationships within categories, which you have also not demonstrated.

1 Like

Absolutely not! Loss of effective sample size is not well enough appreciated. Researchers will look for associations between binary variables and an outcome variable, unaware of the relationship between the prevalence of the predictor variable and the power. One way of expressing this is to take the ideal case (a 50% prevalence) and look at the effective sample size for other prevalences compared with this.
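
A rough sketch of that comparison (my own illustration, taking the variance p(1 − p) of a binary predictor as the information measure, so the effective sample size relative to a 50% prevalence scales as 4p(1 − p)):

```python
import numpy as np

# Information carried by a binary predictor is proportional to its variance p(1 - p);
# relative to the ideal 50% prevalence, the effective sample size scales as 4*p*(1 - p).
for p in [0.5, 0.4, 0.3, 0.2, 0.1, 0.05]:
    rel = 4 * p * (1 - p)
    print(f"prevalence {p:4.2f}: about {100 * rel:.0f}% of the effective sample size of a 50% split")
```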

1 Like

In this paper, which predicts the risk of cruciate ligament rupture (CR) in Labrador Retrievers based on the age at neutering, the following result is presented:

“Risk of CR was increased in dogs neutered before 12 months of age (OR = 11.38; P = .01). Neutering before 6 months of age was not a significant factor (P = .17), nor was neutering between 6 and 12 months of age (OR = 3.11; P = .23). Overall neutering was also not a risk factor (OR = 1.8; P = .27).”

It makes no sense that risk is increased in dogs neutered before 12 months of age while it is not increased in dogs neutered before 6 months or between 6 and 12 months. Is this an issue with dichotomizing the age at neutering? Unfortunately the paper provides very little data to evaluate.
cruciaterupture.pdf (648.8 KB)

1 Like

This is a good example of misleading subgroup statistics after arbitrary categorization. Every analysis should start with (1) a high-resolution histogram of the data (here, age at neutering) to check regions of support, and (2) a smooth, non-overfitted estimate of the relationship, with uncertainty bands (using splines, nonparametric smoothers, fractional polynomials, etc.). To safeguard interpretations, the uncertainty bands should be simultaneous compatibility (confidence) intervals.
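
A minimal sketch of those two displays (simulated data standing in for the paper’s, since those data aren’t available; the cubic regression spline via patsy’s cr() and the pointwise rather than simultaneous bands are my simplifications):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1000
age_mo = rng.uniform(3, 24, n)                   # hypothetical age at neutering (months)
p_true = 1 / (1 + np.exp(-(-3 + 0.004 * (24 - age_mo) ** 2)))
y = rng.binomial(1, p_true)
d = pd.DataFrame({"y": y, "age_mo": age_mo})

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)

# (1) high-resolution histogram: where is there actual support in the data?
ax1.hist(d["age_mo"], bins=60)
ax1.set_ylabel("Count")

# (2) smooth, non-overfitted fit with uncertainty bands
#     (pointwise 0.95 bands here; simultaneous bands would be wider)
fit = smf.glm("y ~ cr(age_mo, df=4)", data=d, family=sm.families.Binomial()).fit()
grid = pd.DataFrame({"age_mo": np.linspace(3, 24, 200)})
pred = fit.get_prediction(grid).summary_frame()
ax2.plot(grid["age_mo"], pred["mean"], color="black")
ax2.fill_between(grid["age_mo"], pred["mean_ci_lower"], pred["mean_ci_upper"], alpha=0.3)
ax2.set_xlabel("Age at neutering (months)")
ax2.set_ylabel("Estimated risk of rupture")
plt.show()
```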

1 Like

Can categorizing a continuous variable create Simpson’s Paradox?

Not sure but I think it’s possible that a form of the paradox could happen. The issue with the so-called paradox is the failure to condition on other relevant variables, and forcing something to be linear is similar to having omitted variables.

2 Likes

That is pretty weird. I don’t see a Data Availability statement in the paper, but what’s the ethos in the veterinary research community? Can you reach out to them to request data?

I think I might, but it doesn’t seem to be routine.