Unfortunately, I have no formal statistical training; my background consists of a few online courses on R and constant googling for answers on Stack Exchange. So I have a bit of a mess of statistical concepts in my head that I sometimes don't know how to apply correctly. But this interest of mine has earned me a “local” reputation for answering statistical questions and doing basic statistical analyses from time to time for colleagues in my country of residence. I understand that my whole question will look like baby's babble on Datamethods, but I would be grateful if someone could help me put everything in its right place.

So I've been given a retrospective dataset on 293 hospitalised patients with COVID-19, with around 40 categorical (mostly binary) variables and 20 numerical variables, along with information on the day of symptom onset, day of admission and day of outcome. Unfortunately, the dataset was collected with no prior thought given to a statistical or clinical hypothesis, and the question posed to me is rather vague. In short, I am asked to find out whether there are any particular “clinical phenotypes” in this dataset. The only paper on COVID-19 close to the subject that I managed to find is this one, on cluster analysis using PCA on quantitative variables.

I am aware that clustering algorithms carry the danger of finding clusters that are clinically irrelevant. One example discussed in this context is a paper on the subtypes of adult-onset diabetes, published not long ago in The Lancet Diabetes & Endocrinology:

https://www.thelancet.com/journals/landia/article/PIIS2213-8587(18)30051-2/fulltext

I learnt about the paper from Professor Harrell's blog post on the subject.

Still, I decided to go ahead with this little endeavor of mine to find the mythical “phenotypes” I've been asked for. I found that an appropriate (if one can say so) approach for mixed datasets with both categorical and numerical variables is to compute a distance matrix using Gower distance. I learnt from the article below (which includes all the R code) that this matrix can then be used for clustering with partitioning around medoids (PAM):

https://towardsdatascience.com/clustering-on-mixed-type-data-8bbd0a2569c3
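To make the idea concrete, here is a minimal sketch of how a Gower distance is computed. It is not the code from the article (in R, `cluster::daisy(metric = "gower")` is a common choice); it is plain Python, the three patients and their variable names are made up, and it assumes no missing values and equal weights for all variables.

```python
def gower_distance(a, b, ranges):
    """Gower distance between two records (dicts with the same keys).

    ranges maps each numeric variable to its observed range (max - min);
    every variable not listed in ranges is treated as categorical.
    """
    total = 0.0
    for name in a:
        if name in ranges:
            # Numeric: absolute difference scaled to [0, 1] by the range.
            span = ranges[name]
            total += abs(a[name] - b[name]) / span if span > 0 else 0.0
        else:
            # Categorical: simple mismatch, 0 if equal and 1 otherwise.
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)  # unweighted mean over all variables


# Three made-up patients (hypothetical values, not taken from the dataset).
patients = [
    {"age": 40, "crp": 10, "diabetes": False, "sex": "F"},
    {"age": 60, "crp": 50, "diabetes": True, "sex": "F"},
    {"age": 80, "crp": 110, "diabetes": True, "sex": "M"},
]
numeric = ["age", "crp"]
ranges = {
    v: max(p[v] for p in patients) - min(p[v] for p in patients)
    for v in numeric
}

# Full pairwise distance matrix; this is what a medoid-based method consumes.
dist = [[gower_distance(p, q, ranges) for q in patients] for p in patients]
for row in dist:
    print([round(d, 3) for d in row])
```

Every variable contributes a value between 0 and 1, so with about 40 binary variables against 20 numerical ones the binary variables can dominate the distance unless weights are applied.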

Using the average silhouette width, I decided to go for 2 clusters in my dataset.

This is what the t-SNE plot showed:
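For reference, this is roughly what the silhouette width measures, as a small sketch in plain Python. The four points on a line and both labelings are invented for illustration; the function needs at least two clusters and takes any precomputed distance matrix, such as one built from Gower distances.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of each observation from a full distance matrix."""
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # convention: a singleton cluster gets 0
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Four hypothetical points on a line, forming two obvious groups.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]

good = silhouette_widths(dist, [0, 0, 1, 1])  # matches the real grouping
bad = silhouette_widths(dist, [0, 1, 0, 1])   # deliberately wrong grouping

avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)
print(round(avg_good, 3), round(avg_bad, 3))
```

Choosing the number of clusters by the largest average width always returns some answer of 2 or more, so on its own it cannot show that the data contain distinct groups.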

I thought that the practical significance of such clustering might be judged with survival analysis, so I drew survival curves using the Kaplan-Meier method and fitted a Cox proportional hazards model with cluster membership as the only covariate.

This is the Kaplan-Meier curve I got, with the log-rank test result:
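As a reminder of what the Kaplan-Meier estimate computes, here is a minimal sketch in plain Python. The five follow-up times and event flags are made up, and in practice one would use a survival library (for example `survival::survfit` in R) rather than this hand-rolled version.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time for each patient
    events: 1 if the event (death) was observed at that time, 0 if censored
    Returns a list of (event_time, survival_probability) steps.
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Everyone whose follow-up reaches t is still at risk at t.
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up in days; 0 marks a censored observation.
times = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients leave the risk set without lowering the curve, which is why the estimate depends on how the day of outcome is coded for patients who did not die.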

And the proportional hazards model output:

```
  n= 293, number of events= 11

              coef exp(coef) se(coef)     z Pr(>|z|)
cluster_gow2 2.312    10.094    1.058 2.185   0.0289 *

              exp(coef) exp(-coef) lower.95 upper.95
cluster_gow22     10.09    0.09906    1.269    80.29

Concordance= 0.701  (se = 0.078 )
Likelihood ratio test= 8.23  on 1 df,   p=0.004
Wald test            = 4.78  on 1 df,   p=0.03
Score (logrank) test = 7.08  on 1 df,   p=0.008
```
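To see how the numbers in this output relate to each other, here is a short arithmetic check in plain Python. It uses only the coefficient and standard error printed above and assumes the usual normal approximation (1.96 for a 95% interval).

```python
import math

coef = 2.312  # log hazard ratio for the cluster variable (from the output)
se = 1.058    # its standard error (from the output)

z = coef / se                        # Wald z statistic
hazard_ratio = math.exp(coef)        # exp(coef)
lower = math.exp(coef - 1.96 * se)   # lower .95 limit
upper = math.exp(coef + 1.96 * se)   # upper .95 limit
wald_chisq = z ** 2                  # Wald test statistic on 1 df
# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 3), round(hazard_ratio, 2))
print(round(lower, 3), round(upper, 2))
print(round(wald_chisq, 2), round(p_value, 4))
```

These reproduce the printed values up to rounding. The interval runs from about 1.3 to about 80, which is what one would expect with only 11 events: the estimate of 10 is very imprecise.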

I'm merely trying to get into the head of a statistician. Is it sensible to dichotomize patients based on a plethora of variables? Does the apparent difference in survival mean that the clusters are meaningful? Or is it the result of some bias of the algorithm? I understand now that, although there is an apparent difference in survival, I can say nothing definite about which variables influenced the outcome the most. To find out which variables drive cluster membership, I would now have to run a logistic regression with the artificial clusters as my binary outcome variable. All of this now seems like added complexity, as I could simply have tried to fit a multivariable Cox proportional hazards model, adjusting for the covariates at the same time. What would have been a more professional approach from the beginning?

My question's scope is very general, but I'd nonetheless like to discuss further in which cases cluster analysis would be clinically relevant, ideally with published examples.