In the blog post “Statistical Errors in the Medical Literature”, prof. Harrell criticized an application of **cluster analysis** on patients with type 2 diabetes.

I interpreted it this way: that finding clusters with no apparent clinical meaning is often no more than an exercise, and that regression methods are superior to looking for clusters.

I was recently approached by a researcher who is an expert in cluster analysis and who proposed applying such a method to a data set my group has created. Our case differs slightly from the one in the blog post above: the data set comprises patients who fall into several diagnostic categories (i.e. different underlying conditions/management pathways) rather than a single one (diabetes), but who present acutely to the ED with similar symptoms.

My questions are the following:

- Is cluster analysis *inherently* bad for medical research where both continuous and discrete variables are used as the endpoint?
- What would you ask this expert in order to demonstrate the value (or lack thereof) of their method over regression modeling? In the post, prof. Harrell mentioned Gini’s mean difference.

Thank you very much for helping me with this

G

Here are some pertinent excerpts from a statistical analysis plan I recently developed.

Consider whether variable clustering or patient clustering better meets the aims. Some of the considerations are:

- variable clustering requires variable grouping decisions but the scoring of clusters respects the continuous nature of the variables
- variable clustering does not have a dimensionality curse as severe as observation clustering
- observation (patient) clustering tends to produce a more arbitrary number of clusters, and also tends to oversimplify the patterns
- observation clustering may require a careful choice of cluster centers; variable clustering does not have centers

Even when observation clustering is the ultimate goal (why?), it is usually best to first perform variable clustering so that observation clustering is not challenged by collinearities.

State the justification for choosing one or the other clustering approach. State the particular algorithm that will be used and how that algorithm has had its performance and reproducibility checked in general.

### Validation of Clusters

Use the chosen resampling strategy to document stability of found clusters. When the number of clusters was not completely pre-specified (before analyzing the data), the number of clusters should be allowed to “float” across resamples, and the frequency distribution of the number of found clusters provided.
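As a rough illustration of letting the number of clusters "float", the sketch below bootstraps the patients and tabulates how often each number of clusters is found. It assumes the bootstrap as the resampling strategy. The function `choose_k` is a hypothetical placeholder for whatever data-driven procedure the SAP uses to pick the number of clusters, and `toy_choose_k` is only a stand-in so that the example runs.

```python
import random
from collections import Counter


def cluster_count_distribution(rows, choose_k, n_resamples=200, seed=0):
    """Bootstrap the rows and tabulate the number of clusters found each time.

    `choose_k` is a hypothetical callable: it receives a resampled list of
    rows and returns the number of clusters selected on that resample.
    """
    rng = random.Random(seed)
    n = len(rows)
    counts = Counter()
    for _ in range(n_resamples):
        # Sample n rows with replacement (one bootstrap resample)
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        counts[choose_k(sample)] += 1
    # Relative frequency of each number of clusters across resamples
    return {k: counts[k] / n_resamples for k in sorted(counts)}


def toy_choose_k(sample, gap=1.0):
    """Stand-in selector for 1-D data: one cluster per run of nearby values."""
    xs = sorted(sample)
    return 1 + sum(1 for a, b in zip(xs, xs[1:]) if b - a > gap)


data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1]
dist = cluster_count_distribution(data, toy_choose_k, n_resamples=100, seed=1)
print(dist)
```

A distribution concentrated on a single value suggests a stable number of clusters; a spread-out distribution is itself a finding worth reporting.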

Observation clustering requires special validation steps to avoid arbitrariness, loss of information, and duplication of information already present in simpler representations of the data. For these reasons, cluster structure and information content need to be validated in a variety of ways that need to be included in the SAP. Some of these steps are as follows.

- Demonstrate that the clusters are indeed clusters (compact sets) by computing the distances between each cluster center and all the patients assigned to that cluster. The distributions of these distances should be narrow. To put “narrow” in context, consider computing the median distance between all possible pairs of cluster centers, and show that the individual patient distances from their own cluster centers are below, say, 1/5th of the median distance between cluster centers more than 4/5 of the time. Also verify that it is very rare that two patients assigned to different clusters are closer to each other than they are to their own cluster centers.
- Demonstrate that the cluster identifiers are sufficient for conveying the information (e.g., phenotypes) the clusters are purported to contain, when there is an outcome or response variable that the clusters are supposed to predict.
  - Define A as the set of k-1 indicator variables for membership in k clusters
  - Define B as the set of k distances a patient has from each of the cluster centers
  - Fit response models containing both sets A and B, and models containing A and B separately
  - Compute likelihood ratio $\chi^2$ tests to assess the prognostic information due to each set
  - Compute the proportion of overall likelihood ratio $\chi^2$ for A & B combined that is due to each of the sets
  - Verify that the proportion of predictive information provided by B, after adjusting for A, is small. See this and this for more information.
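
One way to write these proportions down, as a sketch rather than the exact formulas intended above: let $\mathrm{LR}_{A}$, $\mathrm{LR}_{B}$, and $\mathrm{LR}_{AB}$ denote the likelihood ratio $\chi^2$ statistics of the models containing A alone, B alone, and A and B together (this notation is introduced here for illustration).

$$
\text{fraction of information in A alone} = \frac{\mathrm{LR}_{A}}{\mathrm{LR}_{AB}},
\qquad
\text{fraction added by B after adjusting for A} = \frac{\mathrm{LR}_{AB} - \mathrm{LR}_{A}}{\mathrm{LR}_{AB}} = 1 - \frac{\mathrm{LR}_{A}}{\mathrm{LR}_{AB}}
$$

If the cluster indicators are sufficient, the first fraction is close to 1 and the second is close to 0.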

- Demonstrate that the clusters provide new prognostic information after accounting for previously known prognostic variables. In a similar fashion to the previous demonstration, replace set B with known prognostic variables and compute the fraction of new prognostic information that is provided by the k-1 cluster indicators.
- Demonstrate that cluster assignments cannot be easily predicted from simple features, using for example polytomous (multinomial) logistic regression.
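
The compactness check in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance, cluster centers supplied by the clustering algorithm, at least two clusters, and one reading of the last condition (a cross-cluster pair counts as a violation when the two patients are closer to each other than either is to its own center). The function name `compactness_report` and the toy data are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def compactness_report(points, labels, centers, ratio=0.2):
    """Summarize how compact the clusters are.

    points  : list of coordinate tuples, one per patient
    labels  : cluster index (into `centers`) for each patient
    centers : list of coordinate tuples, one per cluster (at least two)
    ratio   : threshold as a fraction of the median between-center distance
    """
    # Median distance over all pairs of cluster centers
    between = median(math.dist(a, b) for a, b in combinations(centers, 2))
    # Distance of each patient from the center of their own cluster
    own = [math.dist(p, centers[k]) for p, k in zip(points, labels)]
    frac_within = sum(d < ratio * between for d in own) / len(own)

    # Cross-cluster pairs that are closer to each other than to own centers
    cross = violations = 0
    for i, j in combinations(range(len(points)), 2):
        if labels[i] != labels[j]:
            cross += 1
            if math.dist(points[i], points[j]) < min(own[i], own[j]):
                violations += 1

    return {
        "median_center_distance": between,
        "frac_within": frac_within,  # target from the text: more than 4/5
        "cross_pair_violation_rate": violations / cross if cross else 0.0,
    }


# Toy data: two well-separated clusters on a line
points = [(0, 0), (0.5, 0), (-0.5, 0), (10, 0), (10.5, 0), (9.5, 0)]
labels = [0, 0, 0, 1, 1, 1]
centers = [(0, 0), (10, 0)]
report = compactness_report(points, labels, centers)
print(report)
```

With mixed continuous and discrete variables, the Euclidean distance used here would need to be replaced by whatever dissimilarity the clustering algorithm itself uses.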
