Phenotyping Clinical Syndromes

In this study, Seymour and colleagues published an extensive analysis looking to characterize clinical phenotypes for patients with sepsis. They used several analytic techniques and several datasets of patients to achieve this objective, and as such there is a lot to unpack. I’ll start with a few questions and thoughts I had.

  1. Philosophically, do different phenotypes for the sepsis syndrome make sense?
    I’d say yes. Sepsis is known to be a heterogeneous disease, with patients presenting with a wide range of symptoms and severity. The most recent consensus task force defined sepsis as infection causing a dysregulated host response, a dysregulation they described in terms of measurable organ dysfunction. However, within each of these components you can see where the heterogeneity arises. For infection, there are numerous types of organisms (bacteria, fungi, possibly viruses), species of organism, and locations in the body (e.g. urinary tract, lungs, skin) where they may infect, causing different pathophysiologic changes. For organ dysfunction, which organ is affected can have a very different physiologic impact that may aggravate the disease (e.g. gastrointestinal disruption may increase endotoxin release) or modify treatment effectiveness (e.g. increased capillary permeability shifts Starling forces, impacting hemodynamics during resuscitation). My bottom line: phenotypes reflecting different sources of infection, different manifestations of organ dysfunction, and different patient responses should be expected. I think the uncertainty inherent to these patients is well explored in this editorial.

  2. Can phenotypes be identified statistically?
    If we accept that phenotypes should exist, then the question is how to identify them. I think the goal would be to identify what physiologic differences are occurring in each phenotype, as this may help clinicians better understand the disease process and therefore design targeted interventions. But here is where I need help from statisticians / machine learning experts - how do the approaches these authors use identify phenotypes, and what assumptions are inherent to these approaches? The word “classification” is something I am wary of, as breaking patients into groups based on characteristics common to that group may not translate to actual underlying physiologic differences in their disease that are more likely to predict treatment response.

  3. How should phenotypes be incorporated into future studies?
    Assuming phenotypes do exist and have a biologic basis, I see two problems for studies. The first is to accurately identify which phenotype a patient belongs to using available information. I wouldn’t advocate for specific criteria - as we know, anytime a threshold is used to separate abnormal from normal in a continuous measure, information is inevitably lost. But in order to use these phenotypes in future studies, we need some sort of approach to “classify” patients without adopting this harmful practice. Perhaps there is a probabilistic approach, appropriately weighting the characteristics the authors described for each phenotype? The second problem is how these classifications should be incorporated into future statistical models. Should cohorts be restricted to only patients of one phenotype? This will reduce generalizability to marginal patients between phenotypes. Should phenotypes be entered as covariates to adjust for their baseline risk for the outcome? There is a risk of incorporation bias with this approach, as clinical variables are likely necessary to define the phenotypes.
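A probabilistic assignment along those lines can be sketched in a few lines of code. The phenotype names, variables, and numbers below are invented for illustration (they are not the profiles from the paper); the point is just to return membership probabilities rather than forcing a hard label:

```python
import math

# Hypothetical phenotype "profiles": per-variable mean and SD. These
# names and numbers are invented for the sketch, not taken from the
# Seymour et al. paper.
PROFILES = {
    "alpha": {"lactate": (1.5, 0.8), "creatinine": (1.0, 0.4)},
    "delta": {"lactate": (4.5, 2.0), "creatinine": (2.5, 1.0)},
}

def gauss_pdf(x, mu, sd):
    """Normal density, used as a simple per-variable likelihood."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def membership_probs(patient, profiles=PROFILES):
    """Posterior phenotype probabilities assuming equal priors and
    (naively) independent variables: a soft assignment, not a cutoff."""
    likes = {
        name: math.prod(gauss_pdf(patient[v], mu, sd)
                        for v, (mu, sd) in prof.items())
        for name, prof in profiles.items()
    }
    total = sum(likes.values())
    return {name: like / total for name, like in likes.items()}

probs = membership_probs({"lactate": 3.0, "creatinine": 1.8})
```

A patient near a boundary then carries explicit uncertainty (e.g. a 70/30 split) into downstream models instead of being dropped into one bucket by a threshold.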

Looking for comments on any of the above!


i’m glad you brought this paper for discussion on datamethods. My initial feeling is that identifying phenotypes has value if it informs handling of the patient, but i see little evidence of that in the paper(?), nor for the claimed differential treatment effect, and i’m not persuaded by the attempt at reproducibility. But i need to look carefully at all their supplemental docs to see exactly what was done…


This is a really interesting paper, but as Paul has commented, validation is key. Just last month we saw a great paper from the Exeter diabetes consortium showing that a previous machine learning algorithm was outperformed by simple clinical features.

I’m not sure this is surprising. In my clinical practice (Infectious Disease/Microbiology) I can describe cohorts quite easily:

  1. old and likely going to die - regardless of insult
  2. young and profoundly septic
  3. young with marked infection but not septic

I sort of get the impression (this is my experience), that there is a difference between gram negative sepsis and gram positive sepsis, but the cohorts that get these infections are fundamentally different (young, intravenous drug users get much of the MRSA etc I see).

There is clearly some value in defining these phenotypes (in fact, we did, prior to everything being rebadged as sepsis about ten years ago), so this is the classic split or lump problem.

Is this going to help clinically?
We already give antimicrobials quickly (which probably help in groups 2 and 3, but don’t help in group 1).
I’m not sure this will add much. We can already predict sepsis mortality reasonably well (age + comorbidity + physiological illness).

The real issue is the entry criteria. It seems slightly bizarre to condense a huge amount of information about a patient to sepsis/not-sepsis (using qSOFA or whatever), and then try to split these into separate phenotypes. Why are we using the sepsis definition at all?

Why not use pneumonia, UTI, cellulitis, meningitis, etc.? The differences between a patient with pneumonia and one with a UTI are profound; it seems bizarre to lump them both as sepsis, then split again.


thanks for linking to the lancet paper, i hadn’t seen it. It makes the sepsis paper feel all the more tentative. The jama editors say the usual things eg: “future investigation will be needed to determine whether these four phenotypes will be useful in clinical research and practice”, and that should be emphasised.

when the lancet paper concludes “however, simple clinical features outperformed clusters to select therapy for individual patients” i can’t help but think of something Frank Harrell said which is always in the back of my mind: “In the quest for publishing new ideas, measures of added value are constantly being invented by statisticians, without asking whether older methods already solve the problem at hand”

actually, the van Smeden, Harrell & Dahly letter in response to the original lancet paper (Ahlqvist et al.) seems relevant here eg “cluster-based approaches to describing dependency should draw substantially on existing theory to support the claim for having identified true groups”

the difference with the sepsis jama paper is the almost excessive self-scrutiny regarding the clusters they derive (the paper has eg Scott Berry on it, thus i’m inclined to trust the stats). Yet the editors refer to 4 ‘distinct’ phenotypes and say they represent ‘unique clinical groupings’. How is ‘distinct’ defined? I think i’ve heard Frank Harrell talk about the ‘distance’ between clusters and that this is the important thing
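For what it’s worth, one crude way to put a number on ‘distinct’ is to compare the distance between cluster centroids with the spread within each cluster (silhouette widths and gap statistics are more principled versions of the same idea). A toy sketch on synthetic 2-d blobs, nothing to do with the paper’s actual clusters:

```python
import math
import random

# A crude "separation index": distance between centroids divided by the
# mean within-cluster spread. The blobs below are synthetic, purely to
# make 'distance between clusters' concrete.
random.seed(2)

def centroid(points):
    return [sum(coord) / len(points) for coord in zip(*points)]

def spread(points, centre):
    """Mean distance from each point to the cluster centre."""
    return sum(math.dist(p, centre) for p in points) / len(points)

def separation(cluster_a, cluster_b):
    """Ratio well above 1: clusters further apart than they are wide.
    Near or below 1: heavy overlap, 'distinct' is a stretch."""
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    within = (spread(cluster_a, ca) + spread(cluster_b, cb)) / 2
    return math.dist(ca, cb) / within

a = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
b = [(random.gauss(1, 1), random.gauss(1, 1)) for _ in range(200)]  # overlaps a
c = [(random.gauss(8, 1), random.gauss(8, 1)) for _ in range(200)]  # far from a
near = separation(a, b)
far = separation(a, c)
```

Reporting something like this alongside the cluster labels would let readers judge how much daylight there actually is between the four groups.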

the jama authors used several techniques including latent class analysis (LCA) and RCT data simulations. I’m not persuaded by the data simulations; as with most of the paper it feels like if the results were actually compelling then you wouldn’t need all this sophisticated stuff to back it up. I once asked Frank (at the course he presents) about LCA, he said something like “i don’t know much about it, but i tend not to believe something described as latent”, fair point

personally i wonder, always, about practical implementation. They reiterate that data are from “routine clinical info at hospital presentation”, which is a good start. But the four groups are based on 29 variables (see eg Fig 2). You need to input data for 29 variables to derive the group for a patient?

are they entering 29 values by hand into a black box? maybe i misunderstand, but if not, then what are the consequences of misclassification due to poor data quality, data entry error, or coding error? ie if this affects patient handling. The diabetes lancet paper (mentioned above) that also used clustering methods used just 6 variables
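That misclassification worry is easy to probe by simulation: corrupt one of the 29 inputs the way a transcription error might, and count how often the assigned group flips. The sketch below uses a made-up nearest-centroid rule over random centroids, not the paper’s actual model:

```python
import random

# Toy nearest-centroid assigner over 29 standardized variables (the
# centroids here are random, not the published phenotype profiles).
# Question: how often does corrupting ONE input, as a data-entry or
# coding error might, flip the assigned group?
random.seed(0)
N_VARS = 29
CENTROIDS = [[random.gauss(0, 1) for _ in range(N_VARS)] for _ in range(4)]

def assign(x):
    """Index of the nearest centroid by squared Euclidean distance."""
    dists = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in CENTROIDS]
    return dists.index(min(dists))

trials, flips = 1000, 0
for _ in range(trials):
    x = [random.gauss(0, 1) for _ in range(N_VARS)]
    corrupted = list(x)
    corrupted[random.randrange(N_VARS)] += random.gauss(0, 2)  # one bad value
    flips += assign(corrupted) != assign(x)

flip_rate = flips / trials
```

If a single bad value flips the group a non-trivial fraction of the time, a 29-variable classifier feeding patient handling deserves a proper error-sensitivity analysis.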

i think statisticians are at their best when resources are limited and we need to make the most of a small amount of data; it leads to eg novel study designs in pediatric populations. But when we have massive datasets of questionable quality and there’s a similar desire to do cool things and publish, it leads to this kind of thing… it feels too contrived and speculative to me. i especially don’t like the ‘unsupervised’ data-driven methods; the enormity of the data and the sophistication of the methods may lead to unjustified confidence

but i’m still reading the detail of the paper, and ought to comment further when i know it fully…


Agree with “contrived” but would like to suggest that what this type of research lacks is precisely the “speculative” (aka conjectural) quality that @Gus_Hamilton’s rich and interesting clinical theorizing/speculation (above) offers us. What ruins the credibility of research of this kind for me is its disregard for the theorizing (‘priors’, if you like) of intensivists who have grappled longitudinally with hundreds of such cases. Instead, a kind of magical thinking is indulged, wherein pure technique is supposed to automagically deliver knowledge straight out of our data.

Here, I see a similar prejudice. We believe in all kinds of ‘latent’ entities, of course, that do not present themselves directly to our senses. (Atoms, for example.) The clinical sciences fall short of more mature fields, however, in failing to subject theorized latent entities to repeated, severe probing. (In the psychological sciences, for example, we find that ‘construct validation’ has come to be understood as a one-off check that definitively validates a construct—instead of the never-ending testing of living ‘nomological networks’ envisioned e.g. by Cronbach & Meehl (1955).)

thanks for joining the discussion… i’d like to respond with something meaningful when i have time; for now i would just note that the feeling elicited by the lancet paper Gus linked to is same-old-same-old. Hype and hope and hypotheses are quashed daily by statisticians who are forced to play bad cop, so you must forgive us… :slight_smile:


I should have made clear that I meant Gus’s own conjectures, especially:

I’d hope statisticians would prick up their ears when they hear clinicians talking like that!

This point identifies the flaw in a system wherein clinicians and statisticians and now ML operators are siloed and often know little about the others’ domains. Here we see clinicians seeking phenotypes as a function of a specific technique.

Yet the problem is more fundamental. Of course there are phenotypes. That sepsis was even a unified condition was guessed in the 1980s. A set of thresholds was guessed for the first large RCT in 1989. These were later codified to become the sepsis criteria: a set of guessed thresholds, modified by consensus every decade. In 2015 the SIRS-based criteria were replaced with the SOFA-based criteria. This decision was based on the Delphi method. The SOFA threshold set was also guessed, in the 1990s, but is radically different from the original SIRS thresholds.

Living in another silo, and lacking knowledge of the validity and origin of the criteria for sepsis, statisticians have dutifully applied math to a contrived (guessed) cognitive bucket of different things. The hidden mix of phenotypes, combined into a single condition as a function of cobbled-up thresholds, changes with each study, and with it the heterogeneity of treatment effect (HTE). Therefore, as one could predict, sepsis RCTs run in this way have not produced a single reproducible positive result for 30 years.

You could identify the problem as naivete in the clinical camp or lack of domain knowledge in the stat camp, but this is a type of “blind handoff” error, where incorrect assumptions of fundamental validity abound, causing pathological science.

Yes, the fundamental pathology of the science was to use guessed thresholds as RCT gold standards, but the error is propagated when statisticians fail to ask the source and validity of the standard before applying math.

This is a 30 yr RCT catastrophe on a grand scale. It will be repeated unless studied.

If anyone would like, I can discuss some of the strengths and weaknesses of the phenotypes and their derivation.

Look forward to thoughts.

Somebody needs to write a paper about the need to bridge siloes.


Here you see the differences in mortality of controls in sepsis RCTs. One explanation is that the mix of phenotypes is different across the trials. Another is that some of the subjects don’t have sepsis (whatever that is) at all, but rather other clinical conditions (like mucopurulent bronchitis with exacerbation of COPD) which met the threshold set of the “sepsis criteria” and so were entered into the RCT.

Indeed, the reason the SIRS-based sepsis criteria were abandoned in 2015 was that they had poor specificity for sepsis. Let’s say that again. The gold standard of the RCT was found to be poorly specific for the very condition which everyone thought was under study in the RCT.

Now, the statisticians didn’t know any of this. But why not? Had they been less trusting, a few decades of research could have been saved. That they are also responsible is unequivocal, as they were not blind to the changing of the guessed threshold sets and were tolerant of well-intentioned but nevertheless capricious and arbitrary dichotomization.

Each time a threshold is added to a set (which generates a positive result as a function of the sum), it adds heterogeneity. This dependency of the heterogeneity on the parameters defining the standard criteria is what I call the overt heterogeneity function of the RCT “gold standard”.

If statisticians routinely estimated the overt heterogeneity function, then they would have a mechanism to automatically assess the validity of the standard.
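As a toy illustration of the dependency being described (not a formal definition of the proposed function), one can simulate how moving an enrolment threshold changes the control-arm mortality of the cohort it admits, echoing the spread in control mortality across sepsis trials noted above. All the numbers here are invented:

```python
import math
import random

# Toy model: each patient has a latent severity; mortality risk rises
# with severity; a trial enrols anyone whose severity clears a cutoff.
# All numbers are invented. The point is only that the observed
# control-arm mortality is a function of where the threshold sits.
random.seed(1)

def control_mortality(cutoff, n=50_000):
    """Simulated control-arm mortality when enrolment requires
    severity >= cutoff."""
    deaths = enrolled = 0
    for _ in range(n):
        severity = random.gauss(0, 1)
        if severity < cutoff:
            continue
        enrolled += 1
        risk = 1 / (1 + math.exp(-(severity - 1)))  # logistic in severity
        deaths += random.random() < risk
    return deaths / enrolled

low = control_mortality(0.0)   # lenient entry criteria
high = control_mortality(1.5)  # strict entry criteria
```

Shifting the cutoff alone moves the control mortality substantially, before any treatment effect enters the picture; plotting mortality (or effect heterogeneity) against the threshold parameters is one way to make the dependency explicit.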

This is pivotal because no one seems to have the courage to address this issue of pathology in sepsis science straight on. Statisticians have argued (and effectively) that this is not their lane. They are not empowered to question the clinical standard.

So I am suggesting a math tool for you (the overt heterogeneity function) to provide a non-clinical (math) basis to formally question and assess the validity of clinical standards applied in RCTs. That way you do not challenge their domain, but rather teach them how their domain must be defined mathematically to render RCTs valid.


Here is the link to the article with the above figure. The recognition that massive sets of RCTs in sepsis and sleep apnea are based on guessed threshold sets and are non-reproducible is probably stunning to all of you who are used to titrating the nuances of the trials to optimize them. And no, adaptive trials using guessed thresholds as standards will not solve this problem.

For some in the field there will be a sense of panic: what do I do, given this is a politically sensitive issue?

All I can say is that the statisticians or AI are the only hope, and we are going to do it if you do not.

They can’t, because the clinical sepsis trialists cannot abandon the 1990s SOFA threshold set they Delphied out of mothballs and anointed only a few years ago. “Though they may lose faith, they will not abandon the dogma which led them to crisis” (near quote, Kuhn).


it is obviously not a lack of courage, since it requires no courage whatsoever. i’d consider alternative explanations for the statisticians’ reticence. Personally i leap from paragraph to paragraph expecting to stumble upon the seminal, salient point, but then i reach the end and wonder what i’ve just read. There is no silo; i’m sat next to a doctor. I hope another statistician can assist you in your mission


Well, it is hard to grasp. Unbelievable, really. 30 years of non-reproducible RCTs in sepsis. That you don’t grasp that is understandable. I couldn’t believe it myself at first.

The salient point is clear. The gold standards are guessed sets of thresholds, so the RCTs are non-reproducible. If that’s not clear and you have interest, then you should research the domain. It will open your eyes.


I will leave your forum now. Just had to make things real. I feel sorry for those who thought, like I did, that science is always real. I learned it sometimes is pathological (Langmuir), and propagated pathological science is fake science.

That was a sad point in my life when I realized that scientists would do science they knew was flawed out of expediency, and stay silent out of fear.

None of the typically vocal members of this group will rise to challenge me, nor will they bring a sepsis thought leader to publicly rebut my points.

That’s because what I said was true, but they can’t admit that.

Someday when you are older you will learn that sometimes science is fake, and you will remember these words.

Goodbye all.
