Practical challenges of individualizing prognosis/risk

Hi Everyone,

First-time poster here, and I look forward to engaging with the community. This post is an extension of a Twitter discussion related to individualizing risk estimation. By way of introduction, I am a final-year trainee in emergency medicine and recently completed an MSc in evidence-based health care, performing a systematic review/meta-analysis of the HEART score for my dissertation. Through this work, I have gained an interest in prognostic research. I have come to really like the idea of prognosis (i.e. what is likely to happen) being a guide for clinical practice in many scenarios, with a particular emphasis on patient-centred and system-centred outcomes. It is my hope that with a shift from diagnostic to prognostic thinking for certain clinical scenarios (e.g. where the diagnostic reference standard is vague), we can begin to address issues related to over-testing (i.e. exposing a patient to the harms of a test when it is unlikely the result of the test will benefit the patient) and over-diagnosis (i.e. labeling a patient with a “disease” when this label does not benefit the patient or perhaps even could cause harm) in clinical medicine.

One of the things I struggled with throughout my dissertation was how the clinician can practically individualize a risk estimate to the patient in front of him or her to guide a clinical decision. There are many issues related to critical appraisal (i.e. external validity and internal validity, particularly spectrum, incorporation and verification biases) and the specifics of a clinical scenario that can make having confidence in a precise, individualized risk estimate challenging. As a result, in practice, I will typically present prognostic evidence to the patient as “population-level” evidence (i.e. “The best available evidence we have suggests patients just like you or similar to you have an approximately X risk of outcome of interest”).

I will use the HEART score as a potential springboard for discussion. The HEART score is interesting in many ways. Firstly, it was “derived” by a group of physicians in the Netherlands who believed the chosen prognostic factors and their chosen weights, based on clinical experience, would predict the risk of “major adverse cardiac events” or MACE, defined as death, myocardial infarction or need for coronary revascularization by cardiac catheterization or open-heart surgery, within 6 weeks. These authors did not “derive” the prognostic factors in a way I think many in this community would find robust, but the score has nonetheless caught on with researchers, with over 30 external validation studies as well as a regression analysis suggesting the chosen factors and their weights were appropriate.

The HEART score was initially conceptualized as a “diagnostic tool”. The intent was that it might help the clinician more confidently rule out acute coronary syndrome in the patient presenting to the ED with chest pain. Acute coronary syndrome is an interesting diagnosis in medicine. It lacks a clear diagnostic reference standard, with the most common reference standards being future MACE (i.e. 30 days to 6 weeks) or cardiologist-adjudicated chart review, both of which are problematic and introduce important biases, namely spectrum, incorporation and verification. I think as a result of this, some authors of those 30 external validation studies have looked at the HEART score through more of a prognostic lens, i.e. once the clinician has “ruled out” acute coronary syndrome from his or her assessment, how might he or she then estimate the risk of a bad outcome for the patient and use this risk estimate to guide certain clinical decisions (e.g. does this patient need a stress test? does this patient need a cardiology follow-up? should I start this patient on aspirin pending further testing?).

I became somewhat fascinated with the HEART score due to how much it has been studied and talked about in the blogosphere (i.e. “FOAMed”). Despite the shortcomings (i.e. not “properly” derived, foggy diagnostic reference standard with important biases to consider), clinicians and researchers alike have embraced it. I am unaware of any other diagnostic/prognostic model that has been studied to this extent in emergency medicine. I think clinicians have embraced the HEART score because it is accessible and reflects the information we typically gather and analyze in our decision making in the absence of a formal risk model.

A patient can score anywhere from 0 through 10. For reasons that are not entirely clear, over the years, patients are typically grouped into the following strata: HEART score 0-3 (low risk, approx 2% MACE at 6 weeks), 4-6 (intermediate risk, approx 15% MACE at 6 weeks) and 7-10 (high risk, approx 50% MACE at 6 weeks). The idea was that patients in the low risk group would be appropriate for discharge from the ED, potentially without any additional follow-up aside from primary care provider, whereas the others may warrant an admission and, if not, urgent stress test and cardiology follow-up.

A few issues with the HEART score that I hope generate some discussion:

1 - Some HEART score studies organize their data in 11 strata (i.e. HEART score 0, HEART score 1, HEART score 2 … all the way to HEART score 10), whereas the vast majority organize their data in the original 3 low, intermediate and high risk strata described above. What are the pros and cons of each approach in guiding decision-making at both the individual patient assessment and system levels (e.g. locally, we are going to implement a policy where all patients with HEART score 4 or greater will be prioritized for stress test and cardiology follow-up within 72 hours)? It seems very unlikely to me that a patient with a HEART score of 0 and a patient with a HEART score of 3 have the same risk, but unfortunately most studies do not stratify the data finely enough to know the risk difference between 0 and 3. But, in the grand scheme of things, does it matter?

2 - Unfortunately, HEART score studies do not provide time-to-outcome data. As an emergency clinician, I am more concerned about a patient I send home who has a myocardial infarction that evening than about one who has a cardiac catheterization at 6 weeks, which was likely the result of an abnormal stress test that I arranged for the patient in the first place. As a clinician, how does one attempt to cope with a lack of time-to-outcome data? Knowing when the events occur is important, no?

3 - How does one summarize the prognostic data, both in research and when talking to the patient? Interestingly, a recent HEART score systematic review/meta-analysis opted to summarize “prognostic accuracy” in terms of sensitivity and specificity. What is the advantage of thinking about prognosis in terms of sensitivity and specificity? What is wrong with absolute and relative risks? I think this review reflects some of the confusion among the 30+ external validation studies, where there is variability in how data are summarized.

It seems, in emergency medicine literature, there is a tendency to use sensitivity and specificity for prognostic clinical questions. I think it comes from the notion that, at least mathematically, sensitivity and specificity should not change with the prevalence of the outcome, and can be used to generate likelihood ratios, which are then applied to a patient’s pre-test probability to generate a post-test probability. But, isn’t the HEART score attempting to estimate a patient’s pre-test probability prior to stress test or coronary angiogram? Is it right to conceptualize the HEART score as a “test” that has false negatives and false positives? How do you explain the concept of a “false negative” HEART score to a patient?
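For readers less familiar with the likelihood-ratio workflow described above, here is a minimal sketch of the arithmetic. The sensitivity, specificity and pre-test probabilities are invented for illustration and are not taken from any HEART study; the function names are also just illustrative.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) for a dichotomized test."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, multiply by the LR, convert back."""
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


# Hypothetical accuracy figures, assumed only for this example.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.95, specificity=0.50)

# The same "negative" result leaves very different residual risks
# depending on the pre-test probability the clinician starts from.
for pre_test in (0.05, 0.15, 0.30):
    residual = post_test_probability(pre_test, lr_neg)
    print(f"pre-test {pre_test:.2f} -> post-test after negative {residual:.3f}")
```

Note that the calculation still needs a pre-test probability from somewhere, which is part of the puzzle raised above: if the score itself is meant to supply that probability, treating it as a "test" with its own likelihood ratios becomes circular.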

4 - In prognostic accuracy systematic reviews/meta-analyses, it is my view that authors should be challenged to perform subgroup analyses to confirm the prognostic tool performs similarly in diverse populations with diverse baseline risks of major adverse cardiac events. In the absence of this analysis, how does one even begin to attempt to individualize the risk assessment of the patient in front of him or her? This is one way a review can attempt to address the external validity question. But practically speaking, this also raises the question of how a physician attempts to cope without knowing what his or her local event rate is. This is another challenge in individualizing risk estimation.

5 - How does one cope with biases in prognostic research? Can one “adjust” for these biases? For example, a patient with a higher HEART score is more likely to have multiple cardiac troponin levels and/or a stress test performed during a hospital observation period, whereas a patient with a lower HEART score is more likely to have just one troponin measured without any stress testing. As a result, the former patient is at increased risk of a MACE being detected solely because more testing occurs, and the latter patient is at increased risk of a MACE being missed because less testing occurs. Does this matter when a clinician attempts to precisely estimate the risk of the patient sitting in front of him or her?

A lengthy first post, but I look forward to any discussion that ensues. Though I have used the HEART score as a way to illustrate certain concepts, I believe these concepts apply to diagnostic/prognostic models in general. These issues are why I always present risk estimates to a patient as “population-level” evidence, rightly or wrongly.

  1. I think the biggest issue limiting the predictive value of the HEART score is the categorization of variables such as age, troponin, etc. Age in particular usually carries one of the largest weights for prediction purposes, and it should be kept as a continuous variable.

  2. How we present the scoring system depends on use. If we want a quick cognitive aid, simple categories seem attractive. However, I think it would be better to present data as a probability or probability range for a specific outcome or event. Traditional risk scores that rely on mental calculation can't do this, but computers are ubiquitous and the possibility exists. As an aside, it's odd for me as an EHR specialist to program a risk score calculation that categorizes age, blood pressure values, etc. from EHR data, when instead I could directly incorporate a formula that makes appropriate use of all the continuous data.

  3. There are issues with assessing outcomes for chest pain. As you pointed out, the time frame is not always consistent. Certain outcomes like revascularization or stress testing may be poor surrogates for actual events/disease, and may instead be a signal of over-use or over-testing. I imagine that if individual-level data are available from the original studies, one could pre-specify multiple different outcomes and provide the probability of events for each. But I am unsure and would like to hear more from others.

  4. Presenting data to patients is both simple and impossibly complex. I think our ethical duty of informed consent means providing the best post-test probability with appropriate context and in consideration of patient values. I personally like to use frequencies in addition to percentiles, along with a general range of uncertainty. I usually don't explain how I arrived at the calculations (to be honest, I don't really know half the time myself), but usually state the risk factors that lead to the post-test probability. I also emphasize how I or the prediction could be wrong and the degree of uncertainty or confidence. I rarely ever use any technical terms.

  5. This is perhaps the most interesting part of the discussion, and worth focusing on. We should clarify that for any intervention, there is a difference between heterogeneity of relative risk reduction and heterogeneity of absolute risk reduction. I think you are asking about heterogeneity of relative risk reduction; that is, all other measured factors being equal, could one patient have another variable (subgroup) that affects the benefit/risk reduction from the intervention? I think this is a valid question. My understanding is that heterogeneity in relative risk reduction is somewhat uncommon, and subgroup analysis is prone to error and should be done cautiously. There has been a great discussion on Twitter on this topic. I think there are also some other threads on datamethods on subgroups, but it may be worth creating another.
    In general, most of us are more interested in heterogeneity of absolute risk reduction from an intervention, which will vary for each individual based on their baseline risk (or local event rate). We more or less assume the relative risk reduction is constant. How do we assess baseline risk prior to an intervention? Also a good question, and I would love to hear from others.

  6. I would say that before discussing biases, we should focus on providing the best possible prediction. Rather than the HEART score (which has room for improvement as noted above), I would suggest looking at the ASCVD risk calculator as an example. It is a good tool, validated, and now standard of care for providing individualized predictions for each patient to help with informed decision making on statin therapy. If more prediction models were like the ASCVD calculator rather than the HEART score, I think we could feel better about having less bias and better predictions.

Chris, this is very well set up, and thanks also for the great background information. Thanks, Raj, for your excellent contributions.

Before adding my own $.02 worth I want to give a little background on diagnostic risk/prognostic estimation. There are at least four sources of information for making these assessments:

  • Physicians can make subjective estimates based on their total understanding of the patient’s condition and experience of other patients. This works best when their estimates are collected and patients are followed to determine their ultimate outcome status. Then the relationship between the physician’s prognostic estimates and the outcomes can be smoothly estimated to create a calibration function that can guide the physician in improving her estimates.
  • Prospective cohort studies including patient registries, with well-defined data elements and minimal missing data.
  • RCTs, if patient entry criteria are not too narrow. When using the same RCT data to estimate treatment effects as well as to estimate baseline risk (usually this is risk in the control group) one has a distinct advantage—the same covariates can be used for prognostic estimation as were used to adjust the treatment effect for outcome heterogeneity.
  • EHRs, in the somewhat uncommon situation where the needed variables were pre-specified then independently checked for existence and near completeness in the EHR, a cohort can be well-defined, relevant outcomes are assessed on everyone in the cohort, and treatments used are well documented and understood.

Now some particular comments.

Regarding 1, the common practice of collapsing the score into three strata is particularly concerning and represents de facto a declaration of war against the original HEART score, as if it were somehow defective. To even consider further binning of an already binned scale, researchers would have to show that patient outcomes within each of the 3 strata are the same regardless of the original HEART score. But if that were true, the HEART score should not have been constructed the way it was. I cannot fathom why clinical researchers cannot deal with the original HEART score with only 11 distinct values. After all, they handle blood pressure, heart rate, respiratory rate, etc. without difficulty. I would be shocked if “dumbing down” HEART does not worsen decision making. I would also be shocked if the further binning were driven by anything other than ad hoc thinking, devoid of data.
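To make the information loss from grouping concrete, here is a small sketch that tabulates event rates at all 11 score levels and then within the three conventional strata. The counts are entirely hypothetical, invented for illustration only; they are not from any HEART validation study.

```python
# HYPOTHETICAL counts per HEART score 0..10 (illustration only).
patients = [200, 300, 400, 500, 450, 400, 300, 200, 120, 60, 30]
events = [1, 3, 8, 20, 36, 60, 66, 70, 60, 39, 24]


def stratum(score):
    """Map a 0-10 score to the conventional three strata."""
    if score <= 3:
        return "low (0-3)"
    if score <= 6:
        return "intermediate (4-6)"
    return "high (7-10)"


def risk_by_score(patients, events):
    """Observed event proportion at each individual score level."""
    return {s: events[s] / patients[s] for s in range(len(patients))}


def risk_by_stratum(patients, events):
    """Observed event proportion after pooling levels into strata."""
    totals = {}
    for score, (n, e) in enumerate(zip(patients, events)):
        key = stratum(score)
        pooled_n, pooled_e = totals.get(key, (0, 0))
        totals[key] = (pooled_n + n, pooled_e + e)
    return {key: e / n for key, (n, e) in totals.items()}


by_score = risk_by_score(patients, events)
by_stratum = risk_by_stratum(patients, events)

for score, risk in by_score.items():
    pooled = by_stratum[stratum(score)]
    print(f"score {score:2d}: level risk {risk:.3f}, stratum risk {pooled:.3f}")
```

With these made-up counts, a patient scoring 0 and a patient scoring 3 are both assigned the same pooled "low" risk even though their level-specific proportions differ several-fold. Real per-level estimates would need enough patients at each level, possibly with some smoothing.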

Regarding 3, this harks back to Raj’s comment that a risk estimate would be more valuable than a unitless 0-10 score. Without that, perhaps you can, among other things, tell a patient that they are at level 3 on a scale that ranges from 0 to 10. And sensitivity and specificity are only relevant for a retrospective study, because otherwise they condition on the future to predict the past. I have more than one blog article about that. Also note that sensitivity and specificity are functions of patient characteristics, a problem seldom recognized. Thus they fail at the only advantage they would have had: to simplify the model using two universal constants. Also, I’d avoid concepts such as “false positives” and “false negatives”; these are not needed if all levels of the score are retained (and even less if you use risk estimators). A side issue is that crudely grouping the score, just as with grouping continuous predictor values into bins, results in loss of information that is equivalent to throwing away some of the data that were carefully collected in developing HEART.

On point 4, when validating an integer score such as HEART, one mainly needs to relate the 11 levels to the average outcomes of a large number of patients at each of the 11 levels. Some smoothing would be OK. On the broader issue of risk model validation, there are three primary quantities to consider:

  1. Calibration in the large: how close the mean outcome of a large group of patients is to the overall crude mean of the predicted risks.
  2. Calibration in the small: use a smooth nonlinear function to relate the individual risk predictions to the individual patient outcomes. This results in a smooth calibration curve, and no binning should be attempted.
  3. Calibration in the tiny: same as 2. but stratify by levels of individual patient characteristics. For example when we predict a risk of 0.3 in a male are we right, and when we predict a risk of 0.3 in a female are we right? Calibration-in-the-small ignores the sex stratification.

The vast majority of the time we are content with calibration-in-the-small. Once you demonstrate this calibration curve to be close to the line of identity (e.g., compute the mean absolute calibration error and the 0.9 quantile of the absolute error), it is fairly safe to use the risk predictions in a somewhat diverse patient population, assuming the validation sample was almost as diverse.
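As a rough illustration of the first two quantities and the error summaries just mentioned, here is a sketch on simulated data. It uses a simple Gaussian kernel smoother as a stand-in for whatever smooth nonlinear estimator one prefers (loess is a common choice); the bandwidth, sample size and simulated risks are all assumptions made for this example.

```python
import math
import random


def calibration_in_the_large(predicted, outcomes):
    """Observed event rate minus mean predicted risk (0 is ideal)."""
    return sum(outcomes) / len(outcomes) - sum(predicted) / len(predicted)


def smooth_calibration(predicted, outcomes, bandwidth=0.05):
    """Kernel-smoothed observed event rate at each predicted risk.

    No binning: every patient contributes, weighted by how close
    their predicted risk is to the point being evaluated.
    """
    smoothed = []
    for p in predicted:
        weights = [math.exp(-0.5 * ((p - q) / bandwidth) ** 2) for q in predicted]
        weighted_events = sum(w * y for w, y in zip(weights, outcomes))
        smoothed.append(weighted_events / sum(weights))
    return smoothed


def calibration_error_summary(predicted, outcomes, bandwidth=0.05):
    """Mean and 0.9 quantile of |smoothed observed - predicted|."""
    smoothed = smooth_calibration(predicted, outcomes, bandwidth)
    abs_errors = sorted(abs(s - p) for s, p in zip(smoothed, predicted))
    mean_abs_error = sum(abs_errors) / len(abs_errors)
    q90 = abs_errors[int(0.9 * (len(abs_errors) - 1))]
    return mean_abs_error, q90


# Simulated, perfectly calibrated predictions (assumption for the demo):
# each outcome is drawn with probability equal to its predicted risk.
rng = random.Random(0)
predicted = [rng.uniform(0.02, 0.60) for _ in range(800)]
outcomes = [1 if rng.random() < p else 0 for p in predicted]

mean_abs, q90 = calibration_error_summary(predicted, outcomes)
print("calibration in the large:", round(calibration_in_the_large(predicted, outcomes), 3))
print("mean absolute calibration error:", round(mean_abs, 3))
print("0.9 quantile of absolute error:", round(q90, 3))
```

Calibration in the tiny would repeat the same calculation within levels of a patient characteristic such as sex. A kernel smoother is biased near the ends of the risk range, so this sketch should not be read as a recommended estimator.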

When is it OK to use a risk model that is not extremely well calibrated? When using the model would still improve decision making over prevailing practice. Without studying decision making in detail, I’d argue that a risk model can be useful when the prognostication alternatives are all worse. In other words, a statistical tool does not have to be perfect; it only has to be not easily beaten by a competitor.

There is a separate area I’ve written a couple of blog articles about: combining relative efficacy estimates with risk estimates derived as described above. Unlike Raj’s implication, the literature has found remarkable constancy of odds and hazard ratios over patient characteristics. Heterogeneity of treatment effects has not been demonstrated (in a validated way) very often, when the right scale is considered. The constancy of relative efficacy allows one to insert the effect ratio into simple equations to estimate absolute treatment benefit once the baseline risk model is settled upon (and validated).
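One way to read the "simple equations" mentioned here, sketched under the assumption that the treatment effect is a constant odds ratio: convert baseline risk to odds, apply the ratio, and convert back. The odds ratio of 0.5 and the baseline risks below are made up for illustration.

```python
def treated_risk(baseline_risk, odds_ratio):
    """Risk under treatment, assuming a constant odds ratio."""
    treated_odds = baseline_risk / (1.0 - baseline_risk) * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def absolute_risk_reduction(baseline_risk, odds_ratio):
    """Baseline risk minus treated risk for one patient."""
    return baseline_risk - treated_risk(baseline_risk, odds_ratio)


# Same hypothetical relative effect, very different absolute benefit.
for baseline in (0.02, 0.20, 0.50):
    arr = absolute_risk_reduction(baseline, odds_ratio=0.5)
    print(f"baseline risk {baseline:.2f} -> absolute risk reduction {arr:.3f}")
```

With a constant relative effect, essentially all of the variation in absolute benefit comes from variation in baseline risk, which is why a validated baseline risk model matters so much.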

I don’t find the appeal to a population to be a particularly helpful distinction in this context, when in effect the risk model has anything other than an intercept in it. I’d like to separate the reliability of the risk model from its underlying meaning, if possible.

In the SUPPORT study we found lots of anecdotal evidence that mentioning groups made decision making worse. The most common example was of this form: In the end of life decision making process, a common prognostic estimate of these critically ill adults with 6 month life expectancy was an estimate of a probability of 0.2 of surviving two months. We had the experience of patients saying “I just know I’m that one out of five who will make it.” This is really a misinterpretation of probability (all “five” such patients to the best of our knowledge had an equal 0.2 chance of surviving; put five patients in a room together and see if one still wants to step forward as the favored one), and we should have found another way to suggest that physicians discuss this with their patients (including the use of pictorial representation of risk or making risk comparisons, e.g. what is your risk of dying were you to take a 100,000 mile trip, or what is the risk of dying within two months for someone of your age drawn from the general population). By leaving it open for the patient to consider some imaginary group of like patients, we left things open to misinterpretation. In this particular example, the typical median life length was in days. Had we given such time estimates instead of survival probability estimates, I feel that study endpoints would have shown an effect of the randomized intervention, on endpoints such as not waiting so long for DNR discussions.
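To illustrate how a survival probability can be restated as a time estimate, here is a back-of-envelope sketch. It assumes a constant hazard (exponential survival) and treats two months as 60 days; both are simplifying assumptions made for this example, not the method used in SUPPORT.

```python
import math


def median_survival_days(survival_probability, horizon_days):
    """Median survival implied by S(horizon) under a constant hazard.

    S(t) = exp(-hazard * t), so hazard = -ln(S) / t and the median
    is ln(2) / hazard.
    """
    hazard = -math.log(survival_probability) / horizon_days
    return math.log(2.0) / hazard


# A 0.2 probability of surviving 60 days, under the assumptions above.
print(round(median_survival_days(0.2, 60), 1))
```

Under these assumptions, a 0.2 chance of surviving two months corresponds to a median of roughly 26 days, which may be a less ambiguous thing to hear than "one in five".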

I look forward to more discussions.