Identifying patients through phenotype definition

Hello everyone,

I was reading about how electronic and ML-based phenotyping are used to identify patients in an EHR database. In doing so, I came across the T2DM phenotype algorithm on the PheKB website (click here) and have a few questions.

I am trying to learn what phenotypes are and how they work. I am not from a healthcare background, so I would really appreciate it if domain jargon is kept to a minimum.

Can you help me with the below questions?

Let’s break down the flowchart (pasted below) into individual paths.

path 1) EMR --> T1DM DX --no--> T2DM DX --no--> RX T2DM MED --yes--> Abnormal Lab --yes--> CASE

Q1) How can the patient be prescribed T2DM meds when he has not been diagnosed with T2DM?

path 2) EMR --> T1DM DX --no--> T2DM DX --yes--> RX T1DM MED --no--> RX T2DM MED --no--> Abnormal Lab --yes--> CASE

Q1) Based on path 1 and path 2, what I infer is that if the patient has an abnormal lab, he is a T2DM patient irrespective of whether he was prescribed T2DM meds.

path 3) EMR --> T1DM DX --no--> T2DM DX --yes--> RX T1DM MED --yes--> RX T2DM MED --yes--> T2DM RX precedes T1DM RX --yes--> CASE

Q1) Again, how can a patient be prescribed T1DM meds when he is not diagnosed with T1DM?

Q2) It makes sense to see T2DM meds since the patient was diagnosed with T2DM, but why does the flowchart check that T2DM RX precedes T1DM RX? Shouldn’t it be T1DM RX precedes T2DM RX, based on the workflow that we have? I did a Google search and found that T1DM usually occurs before one enters adulthood. So am I misinterpreting this?

path 4) EMR --> T1DM DX --no--> T2DM DX --yes--> RX T1DM MED --yes--> RX T2DM MED --no--> T2DM DX by Physician >=2 --yes--> CASE

Q1) In this path, a patient is diagnosed with T2DM but has been prescribed T1DM meds. Is this possible, and when can it happen?

Q2) Similarly, though the patient is diagnosed with T2DM, he isn’t prescribed T2DM meds. Is it that patients are prescribed meds only after multiple visits or something?

Can you help me with the above questions please? It will really be helpful.

In an AHIMA “Perspectives in Health Information Management” article, Verchinina et al (2018) say this:

A phenotype is something that can be observed. The suffix -type suggests a typology, or classification scheme. Hence, a phenotype is a way of classifying organisms, including people, based on characteristics they have that are observable and measurable.

Building on the definition of phenotype above, the term computable phenotype refers to a shareable and reproducible algorithm precisely defining a condition, disease, complex patient characteristic, or clinical event using only data processed by a computer, principally EHR data. As a shareable and reproducible algorithm, a CP is a tool that can be used to identify patient cohorts with a specific medical condition of interest associated with a specific set of observable and measurable traits. Further, because they are shareable and provide formal specifications of diseases and events, CPs can serve as standards, allowing patient cohorts to be compared and combined more easily than they are today.

Geisinger says:

Defining traits of patients (disease, treatment, response, etc.) in an iterative and collaborative process between programmers, investigators, and/or clinicians.

And further adds that the computable phenotype should be validated by chart review.

So a phenotype is a definition of some cohort of patients of interest to someone (a clinician, a researcher/P.I., a clinical process improvement team, a pop health group, your boss, you) and the computable phenotype is basically the heuristic rules, or recipe, for identifying those patients of interest, in the murky swamp that is your data environment. (Where “patient” is a human with a longitudinal history, rather than a single encounter.)
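To make the "recipe" idea concrete, here is a minimal sketch that encodes only the four CASE paths quoted in the question as yes/no rules. It is not the full PheKB algorithm: the class and field names are hypothetical, and any branch that was not quoted in this thread is simply treated as "not a case" here, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PatientFlags:
    """Yes/no evidence found for one patient (all field names are hypothetical)."""
    t1dm_dx: bool
    t2dm_dx: bool
    t1dm_rx: bool
    t2dm_rx: bool
    abnormal_lab: bool
    t2dm_rx_before_t1dm_rx: bool = False
    t2dm_dx_physician_count: int = 0


def is_t2dm_case(p: PatientFlags) -> bool:
    """Return True if the patient follows one of the four quoted CASE paths."""
    if p.t1dm_dx:
        # Every quoted path starts with "T1DM DX: no".
        return False
    if not p.t2dm_dx:
        # Path 1: no T2DM diagnosis, but T2DM medication and an abnormal lab.
        return p.t2dm_rx and p.abnormal_lab
    if not p.t1dm_rx:
        # Path 2: T2DM diagnosis, neither medication type, abnormal lab.
        # Assumption: other branches at this point were not quoted, so they
        # fall through to "not a case" in this sketch.
        return (not p.t2dm_rx) and p.abnormal_lab
    if p.t2dm_rx:
        # Path 3: both medication types, T2DM medication came first.
        return p.t2dm_rx_before_t1dm_rx
    # Path 4: T1DM medication only, T2DM diagnosed by a physician at least twice.
    return p.t2dm_dx_physician_count >= 2


# Example: a patient on path 1 (medication plus abnormal lab, no diagnosis).
patient = PatientFlags(t1dm_dx=False, t2dm_dx=False, t1dm_rx=False,
                       t2dm_rx=True, abnormal_lab=True)
print(is_t2dm_case(patient))
```

Writing the rules as one ordered chain of checks mirrors how a flowchart is read: each patient follows exactly one branch, so the order of the tests matters.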

The iterative part of the workup is because usually some trial and error is needed to find the Goldilocks happy spot of Type I and Type II errors … i.e., not missing (many) patients the instigators think should be counted, and not counting (many) patients that the instigators think are not truly of interest … given that some of the “evidence for membership” (my term) may be fuzzy, missing, ambiguous, and/or conflicting.
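One way to track that balance between iterations is to compare the rule output against a chart-reviewed sample. The sketch below is only an illustration: the function name and the example flags are made up, and chart review is treated as the reference standard.

```python
def validation_metrics(algorithm_flags, chart_review_flags):
    """Compare phenotype output with chart review (treated as the reference)."""
    pairs = list(zip(algorithm_flags, chart_review_flags))
    tp = sum(1 for a, c in pairs if a and c)          # counted, and truly of interest
    fp = sum(1 for a, c in pairs if a and not c)      # counted, but should not be
    fn = sum(1 for a, c in pairs if not a and c)      # missed
    tn = sum(1 for a, c in pairs if not a and not c)  # correctly left out
    return {
        "false_positives": fp,
        "false_negatives": fn,
        "true_negatives": tn,
        # Share of true cases the rules found.
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        # Share of flagged patients who are true cases.
        "ppv": tp / (tp + fp) if (tp + fp) else None,
    }


# Made-up flags for eight chart-reviewed patients.
algorithm = [True, True, True, False, False, False, True, False]
chart = [True, True, False, True, False, False, True, False]
print(validation_metrics(algorithm, chart))
```

Re-running this after each change to the rules shows whether a tweak that recovers missed patients does so at the cost of counting more patients who should not be in the cohort.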

With medical stuff, no matter the source of your data, you are at best only going to see some slice of the patient’s true condition. If you are working with provider-side data, you probably see only the encounter records for visits to your system’s affiliated physicians and facilities. (For example, I’m in Florida … we have a significant number of patients who are snowbirds from northern states; half the year they get their medical care at other providers, 1,000 miles or more away.) And you may or may not be harvesting 100% of the information stored in the EHR … can you guarantee your EHR text-parsing is correctly recognizing and categorizing every word, phrase, abbreviation, and “rule out”? On the insurance side, you only see what comes through on claims. If no claim is submitted … e.g., it was cheaper for the patient to pay cash out of pocket than their copay would have been … the encounter is invisible to you.

I think the key insights into understanding the rules are (1) that the rules have to work with the data you have, which is NOT the complete set of all the information you’d logically expect or like to have, and (2) that the variable or missing part can vary from patient to patient.

So … onto the DM rules you’ve quoted. I’m not a diabetes expert by any means, but we can reverse-infer some of the logic.

Rule 1 says that a DM med confirmed by an abnormal lab result counts, even in the (apparent) absence of a diagnosis. Why didn’t we find the Dx for DM in the EHR? I can’t say offhand; maybe you don’t have records from a bunch of the primary care physicians in the time period you’re reviewing. It’s also possible this is a failsafe rule that only applies to 0.1% of the cases we should find.

Rule 2 is another double-conditions rule, this time that a Dx + abnormal lab result should be counted as qualifying. This could be patients who have not been previously identified as having diabetes, patients who have been identified as having DM but aren’t “under management” for some reason, or patients who are under DM management by a PCP or other providers whose routine records aren’t in your data universe. Or other data holes.

Rule 3 says that, for some reason that is not made explicit in the rule itself, violations of the expected chronology of DM1 -> DM2 don’t matter, in the presence of a history of meds for both. There may be some other rules, not shown, that toss out some candidate patients with slightly different factors. Or perhaps some drugs are used for both DM1 and DM2 and thus don’t clearly differentiate the DM type.

[Edit: I’m not clinical. My thinking when I wrote the above is that DM Type 1 is relatively uncommon compared to DM2, but that most patients with DM1 progressed to DM2 in their later years. That may not be true.]

Rule 4 is a nuanced version of the other rules. Perhaps the intent of this rule is only clear if all the “Not a Case” rules are also presented, in hierarchical order of execution.

Note that it doesn’t appear from what we see in these rules that the phenotype definition is trying to separate DM1 from DM2. Perhaps at some point that was attempted, but was deemed not successful enough to be usable. If that was the history, the rule stack could have some “relict rules” that were re-purposed late in the development process of the rules for this phenotype.

Does that help? Sorry about the length.

If you are adopting phenotype rules invented elsewhere, you’ll probably want to root out provenance documents explaining exactly what those rules are doing, expressed in clinician-to-clinician language. Otherwise you’ll have to work with your clinical audience/users to validate the rules against your own data. (And even with doco, you may still have to do that.)



Hi Doug Dame,

Thanks very much for your detailed explanation. This really helps.

I’m currently involved in a similar project where we identify (incident) T2D cases in EHR records. To start, you really have to keep in mind what @Doug_Dame highlights: EHR data are error prone/incomplete for numerous reasons (people move or live in 2 places, practitioners don’t always write everything down or code events/diagnoses/visits correctly, and so on). You will therefore always have individuals for whom the diagnosis is not completely certain, but in the absence of other data it is the best you can do. For your analyses however (I’m assuming you are going to perform some analyses with these data), you can decide to label certain and uncertain cases differently and perform your analyses including/excluding uncertain cases.
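A minimal sketch of that include/exclude comparison is shown here. The records, the "certainty" label, and the numeric "outcome" field are all made up for illustration; how a case earns the label "certain" is a project decision and is not defined by this code.

```python
from statistics import mean

# Made-up identified cases; "outcome" stands in for whatever you analyse.
cases = [
    {"patient_id": 1, "certainty": "certain", "outcome": 8.0},
    {"patient_id": 2, "certainty": "certain", "outcome": 7.0},
    {"patient_id": 3, "certainty": "certain", "outcome": 9.0},
    {"patient_id": 4, "certainty": "uncertain", "outcome": 6.0},
    {"patient_id": 5, "certainty": "uncertain", "outcome": 6.0},
]


def summarize(records, include_uncertain):
    """Summarise the outcome with or without the uncertain cases."""
    selected = [
        r for r in records
        if include_uncertain or r["certainty"] == "certain"
    ]
    return {"n": len(selected), "mean_outcome": mean(r["outcome"] for r in selected)}


# Run the same analysis twice and compare the results.
print("certain only:     ", summarize(cases, include_uncertain=False))
print("certain+uncertain:", summarize(cases, include_uncertain=True))
```

If the two summaries lead to the same conclusion, the uncertain cases are not driving the result; if they differ noticeably, that difference is worth reporting.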

For your practical questions on T2D identification, I think it would be good to read up a bit about T2D diagnosis and treatment if you are going to pursue this project further. The Wikipedia page on T2D could give you some of the answers you are looking for. Let’s start with the formal diagnostic criteria. In most countries, the WHO criteria are used:

  • A repeat measurement of elevated fasting glucose concentration (2 measurements ≥ 7.0 mmol/L)
  • Oral glucose tolerance test with a glucose of ≥ 11.1 mmol/L after two hours
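The two criteria as listed here can be written as a small check. This is only a sketch of those two bullets, not the complete WHO definition and not necessarily what the PheKB flowchart means by "Abnormal Lab". The function and argument names are hypothetical, and treating either criterion as sufficient on its own is an assumption.

```python
FASTING_THRESHOLD_MMOL_L = 7.0
OGTT_2H_THRESHOLD_MMOL_L = 11.1


def meets_listed_criteria(fasting_glucose, ogtt_2h=None):
    """Check lab values (in mmol/L) against the two criteria listed above.

    fasting_glucose: fasting glucose measurements for one patient
    ogtt_2h: two-hour oral glucose tolerance test results, if any
    """
    elevated_fasting = [
        value for value in fasting_glucose if value >= FASTING_THRESHOLD_MMOL_L
    ]
    if len(elevated_fasting) >= 2:
        # A repeat (second) elevated fasting measurement.
        return True
    # Otherwise, any two-hour OGTT value at or above the threshold.
    return any(value >= OGTT_2H_THRESHOLD_MMOL_L for value in (ogtt_2h or []))


print(meets_listed_criteria([7.2, 6.5, 7.0]))     # two elevated fasting values
print(meets_listed_criteria([7.2, 6.5]))          # only one elevated value
print(meets_listed_criteria([6.0], ogtt_2h=[11.1]))
```

In real EHR data you would also need to confirm the units (mmol/L versus mg/dL) and whether a glucose value was actually taken fasting, which is often not recorded.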

As for medication: there is not really such a thing as T1D-specific medication. T1D is treated with some form of insulin (more often with an insulin pump), but insulin is also given to individuals with T2D. The treatment path for (almost all) type 2 diabetics is generally as follows:

  • Lifestyle (a period of lifestyle intervention with diet and/or exercise is usually initiated first in an attempt to combat overweight/obesity and improve insulin sensitivity, but in severe cases medication treatment can be initiated at diagnosis)
  • Non-insulin glucose lowering medication (this used to be mostly oral glucose lowering medication such as metformin, but in recent years injectable medications such as GLP-1 receptor agonists, as well as newer oral classes such as SGLT-2 inhibitors, have become available; these are not always included in all treatment regimes and sometimes follow insulin treatment, but generally you receive non-insulin glucose lowering medication before initiating insulin)
  • Insulin and/or insulin analogues

On to your questions:

Path 1, Q1: In general, you can’t, and as @Doug_Dame points out, this could simply be caused by missing records or somewhat error-prone record keeping. However, some medications (such as corticosteroids) can cause elevated blood sugar, and individuals receiving long-term treatment with such medications are sometimes also prescribed glucose lowering medication.

Path 2, Q1: An abnormal lab is pretty much the definition of T2D, and as you can see from the treatment path, not all individuals with T2D receive medication straight away.

Path 3: As discussed, T1D-specific medication does not really exist. I’m going to take a guess here that they define T1D medication as insulin or insulin analogues, with T2D medication being all the others. If so, path 3 basically says: T1D: no, T2D: yes, T2D medication: yes, insulin: yes, and this person got non-insulin glucose lowering medication before insulin --> then this person is a case. This is actually what we expect, because this is the regular treatment path for T2D (lifestyle -> non-insulin medication -> insulin).
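A minimal sketch of that ordering check follows, under the same guess that "T1D medication" means insulin and "T2D medication" means everything else. The function name, the two drug-class labels, and the example dates are hypothetical.

```python
from datetime import date


def non_insulin_precedes_insulin(prescriptions):
    """Return True if the first non-insulin prescription predates the first insulin one.

    prescriptions: list of (prescription_date, drug_class) tuples, where
    drug_class is "insulin" or "non_insulin" (labels assumed for this sketch).
    Returns None when either class is missing, so the order cannot be judged.
    """
    insulin_dates = [d for d, drug_class in prescriptions if drug_class == "insulin"]
    other_dates = [d for d, drug_class in prescriptions if drug_class == "non_insulin"]
    if not insulin_dates or not other_dates:
        return None
    return min(other_dates) < min(insulin_dates)


# Made-up history: non-insulin medication first, insulin added years later.
history = [
    (date(2012, 3, 1), "non_insulin"),
    (date(2013, 6, 15), "non_insulin"),
    (date(2017, 9, 30), "insulin"),
]
print(non_insulin_precedes_insulin(history))
```

Comparing only the earliest date in each class keeps the check robust to refills, but it still depends on the earliest prescriptions actually being present in your records.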

Path 4: Multiple answers are possible. Again, you might be missing prescription records, so individuals may get both types of medication without you knowing it. Secondly, again, there are no T1D-specific medications, so these individuals probably just receive insulin. If the records are correct, a person with T2D might not be on non-insulin medication for several reasons: people can stop using it because of side effects or allergies, for example, or because it can be harmful in individuals with reduced kidney function (which frequently occurs in T2D).


Hello @scboone - Thanks. It helps a lot. Much appreciated

I have just one question. Can I confirm my understanding, based on online reading, that it’s not possible for a person to have both T1 and T2 DM? T1 usually occurs early whereas T2 occurs later in life, at least after 30-40. I understand the medication overlaps. But a person can’t be suffering from both at any point in time? Am I right to understand it this way?

It’s rare, but possible. To understand how, you should return to the underlying pathophysiology of T1 and T2 diabetes:

  • T1D is caused by an auto-immune destruction of beta-cells in the pancreas that produce insulin. The body can’t produce (enough) insulin which is why treatment consists of insulin injections as the physiological response is intact. Indeed it often occurs at a (relatively) young age.
  • T2D is caused by cells in your body becoming resistant to the effects of insulin. This is influenced by genetics but also by lifestyle/environment factors such as overweight and lack of physical activity. Hence the sharp increase in T2D incidence in Westernized societies, parallel to the increase in obesity (and sedentary lifestyle). The risk also increases with age, and because of this T2D was known as ‘diabetes of the elderly’, as it often manifested at later ages (from mid-age onward). However, because of the aforementioned lifestyle changes in Westernized countries over the past decades, we often see T2D occur in much younger individuals as well. As mentioned in the previous post, we often try lifestyle changes first, which sometimes suffice to improve insulin sensitivity (at least temporarily). If necessary, we start drugs that decrease blood sugar concentrations through several mechanisms (metformin, for example, has several effects including a positive effect on insulin sensitivity). If these effects no longer suffice, we start adding insulin.

So in essence they are two different diseases with different causes which end up affecting the same pathways. Because of this, individuals with type 1 diabetes can simply develop type 2 diabetes as well if they have a genetic predisposition and/or a predisposing lifestyle. However, given the incidence of T1D varies between around 8-17 per 100 000 per year, the chance of having a substantial number of cases with both diagnoses is comparatively slim.


The root of your problem here doesn’t sound like it’s related to statistics. If there’s a clinical problem/question you’re trying to address using EMR data, I’d suggest that first steps should involve sitting down with at least one MD with relevant experience to address the following questions:

  1. Is the question you’re trying to answer likely to have a meaningful real-world impact?
  2. Is the question you’re trying to answer likely to be answerable using EMR data? The idea that any clinical question can be answered with a big enough database is just not true.
  3. Is the EMR likely to contain accurate enough information for your purposes? As an MD, I cringe every time I see researchers making dramatic claims based on studies of EMR databases alone, since I know how little thought most clinicians put into their coding at the end of a long workday and therefore how inaccurate and incomplete these codes can be. You won’t understand the extent of this problem until you talk to people who actually enter the EMR codes (i.e., MDs).
  4. Finally, if the answers to questions 1 through 3 above are “YES,” then it would also be important to engage an MD (and others with relevant experience) in developing the best approach to get the information you need from the EMR database. Clinical knowledge would be needed for this step and the depth of knowledge you’d need to do this step well can’t be obtained from online searches of medical terms.

Here’s a good reference (already posted on another thread) that you might find useful.

Hope this is helpful.