What is a fake measurement tool, and how is it used in RCTs?

I have linked an article below, published this month, that provides more evidence that the measurement tools used in critical care RCTs, especially for sepsis and mortality, are often fake. Those who have studied this issue did not need more evidence, but perhaps if enough of these studies are published the perceived bright star of SOFA will fade (as SIRS has faded). This study is no surprise, given that there has not been a single reproducibly positive sepsis RCT in 30 years.

Now, fake is a harsh word and I don’t use it lightly. We would not hesitate to use that word if a homeopathic study were quoting some measurement standard that was guessed in the remote past.

So what makes a measurement used in an RCT “fake”? One answer is that a measurement is fake if it is based on guessed thresholds. A single guess is bad enough, but it might be right. Guessed thresholds… I will let the statisticians tell us the probability that those are valid measurements.

Guessed thresholds cannot be used as a tool to measure other things, or as a gold-standard surrogate for a dependent variable, in science. It does not matter how old the guess is or how many experts believe and promulgate it.

A guess does not become a measurement in science because it has “survived the test of time”. The history of science has shown that time is not a test.

I have heard that statisticians must use the tools they are given because it is mandated by the PI or the reviewer. This argument casts the statistician as a powerless worker who must use a potentially fake measurement, if the powerful mandate it, to do her work. Anyone reading this datamethods forum, and seeing the intellectual power here, knows that is nonsense. Statisticians are at the top of the RCT decision-making hierarchy, not the bottom, and for good reason.

Another argument has been that it is not the statistician’s job to determine the origin or validity of the clinical measurement tool used. This “head in the sand” argument ignores the fact that, in 2020, it is very easy to check a tool’s origin, history, and validity, and doing so is well within the capability of statisticians. It further ignores the role of the statistician as the gatekeeper of the RCT math, which, as I have pointed out before, is not a set of siloed measurements disconnected from statistics; all of the RCT math is an interdependent continuum.

So I have “enjoyed” (with some head wagging) all of the writings here discussing the meaning and the statistical math of RCTs using SIRS and SOFA (guessed in 1989 and 1996, respectively). I’m reminded of the “deep” math of Claudius Ptolemy. Were those fake measurements because the model was wrong? No. Good science is often wrong! Rather, they were fake because the model was guessed.

Now, the paper I cite below won’t be the end. SOFA will live on as an RCT measurement or endpoint because it has social momentum. The SIRS measurement (guessed in 1989) is still used in research (as Sepsis-2), but it is fading because it has lost social momentum. SOFA (guessed in 1996) is still going strong and is used as a surrogate for sepsis as Sepsis-3. It has social momentum. Social momentum is the armor which deflects counterinstances.

Yet the next time someone asks you to do a complex statistical analysis of data using SIRS or SOFA, you will have to decide whether you have the courage to ask for evidence that it is a valid, reproducible surrogate. The answer, and the studies (if any) they give you to support the measure, will be more enlightening than the math, if you still choose to do it.


I have shown in this forum that the issue of sepsis detection has become highly politicized and dominated by personalities and expert opinion, not science. Some (for example, the ACCP) favor SIRS (from the US, about 1989), others SOFA (from Europe, 1996), and still others qSOFA (US, about 2015). I could see intellectual camps favoring different scores if the scores were close (like two methods of measuring temperature), but note in the prior paper how different they are.

The fact that each of these highly disparate scores has its own zealous followers shows how cult-like this portion of sepsis “science” is.

A place where the experts are comfortable simply making up their own gold standards is not a place for those seeking acceptance of real discovery based on science. It’s not a place for real statisticians who are trying to make a difference with their lives, as they can easily end up embellishing empty pseudoscientific rumination.

In this environment there has also developed a “low AUROC tolerance”. Note how they say an AUROC of 0.70-0.79 is “adequate” and 0.80-0.89 “sufficient”. I’m not sure there is a difference between “adequate” and “sufficient”, but I suspect they could not bring themselves to say an AUROC of 0.80 was “good”.

An environment wherein an AUROC of 0.70 is adequate, and qSOFA (comprised simply of three thresholds: GCS, respiratory rate, and blood pressure) is perceived as excellent, is not a place for serious research.
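For readers who want the number made concrete: AUROC is simply the probability that a randomly chosen case scores higher than a randomly chosen non-case, with ties counted as one half (the Mann-Whitney identity). A minimal sketch on invented data (all scores and labels hypothetical):

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: P(score of a random case > score of a random non-case),
    with ties counted as 1/2 (the Mann-Whitney identity)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Pairwise comparison; O(n_pos * n_neg), fine for illustration
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical severity scores; label 1 = non-survivor, 0 = survivor
scores = [2, 4, 5, 7, 9, 3, 6, 8]
labels = [0, 0, 1, 0, 1, 0, 0, 1]
print(round(auroc(scores, labels), 2))   # prints 0.87
```

Whether a value like 0.87 deserves the label “adequate”, “sufficient”, or “good” is exactly the verbal-labeling question raised above.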

Sure, all these scores correlate with mortality, but that is largely meaningless, as most threshold sets of ICU signals one can imagine (e.g., high bands, high neutrophils, low platelets, low albumin) correlate with mortality.

There is a pandemic. Sepsis kills COVID patients. This is no place for expediency over discovery. We can’t expect individual statisticians to hold the line if the statistical leadership does not take the lead.

As always I’m here to be proven wrong in public but, if you cannot, then I hope some of you will take the next step.

Most respectfully,


Your clarity and honesty will be interpreted as cynicism. But I’m susceptible to this explanation of things; I made a similar argument here: The promise and pitfalls of composite endpoints in sepsis and COVID‐19 clinical trials

I.e., composite endpoints in sepsis will grow in popularity, as they have elsewhere, due to happenstance and repetition, and, dare I say it, “Ioannidis envy”? That is, a clinician’s desire to expand their influence into the realm of statistics, which would be reasonable enough if statisticians were inclined to make their case, but, as you say, “the statistician is a powerless worker”. Statisticians are human; they want to please. E.g., the very best thing the Bayesians did to convert statisticians was to make it fashionable; now they talk about Bayes with gusto, and frequentists are regarded as uncouth, not open-minded, an anachronism.

Anyway … researchers are now explicitly trying to push composites in sepsis without consideration for their measurement properties, etc.: Designing phase 3 sepsis trials: application of learned experiences from critical care trials in acute heart failure
Why not? It improves one’s h-index to publish things that people are publishing, and I’m sure they will succeed, as researchers want to appear up to date and thoughtful.


I’d be interested in comments on the longitudinal ordinal outcome approach we’re developing for COVID-19.


I quite like it for a phase II trial. But to what extent are researchers familiar with the stubborn disconnect between phase II and phase III results? Maybe they only recite the hierarchy of evidence: the RCT is definitive. It’s dispiriting in industry to see drug after drug show promise in phase II and fail in phase III when we move to the definitive outcome. I’ve developed a blind, hardened scepticism that I’m stuck with for life. Thus I’d want to see such an outcome with a warning label. Or it needs a little stats paper describing the concordance probability, its interpretation, how to power on it, data simulations, etc., and I think that has to happen before the RCT results land in a medical journal. Although I have written such a paper and I doubt anyone reads it: How Do We Measure the Effect Size?. I think the problems will come when you show a difference on such an ordinal, composite outcome: everyone will want to understand it, but if you analyse the individual components there is no power, and this will generate much ambivalence, speculation, and hesitation.
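On the “how to power on it” piece: power for an ordinal comparison is easy to estimate by simulation with a Wilcoxon/Mann-Whitney test. A minimal sketch (the outcome distributions, sample size, and number of simulations below are invented for illustration, not taken from any trial):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical 4-level ordinal outcome (higher level = better outcome)
rng = np.random.default_rng(1)
p_ctrl = [0.40, 0.30, 0.20, 0.10]   # control arm: P(level 0..3)
p_trt  = [0.20, 0.30, 0.30, 0.20]   # treated arm, shifted toward better levels
n_per_arm, n_sim, alpha = 100, 500, 0.05

rejections = 0
for _ in range(n_sim):
    a = rng.choice(4, size=n_per_arm, p=p_ctrl)
    b = rng.choice(4, size=n_per_arm, p=p_trt)
    # Mann-Whitney handles the heavy ties of an ordinal scale
    if mannwhitneyu(b, a, alternative="two-sided").pvalue < alpha:
        rejections += 1
print(f"estimated power: {rejections / n_sim:.2f}")
```

The same simulated arms also yield the concordance probability directly, so interpretation and power can be reported on the same scale.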


In my field of rehabilitation, where our outcomes are overwhelmingly true ordinal variables improperly treated with parametric approaches, I’d say your method opens up a large realm of positive possibilities:

  1. Retrospective and prospective meta-analyses are now permissible. I don’t think many people have thought about how this uncritical acceptance of linear methods on ordinal data by primary researchers hampers the attempts of meta-analysts to synthesize results.

I think I can prove formally that meta-analyses of ordinal outcomes using parametric approaches (e.g., meta-analyses of treatments for low back pain using average pain scores as an outcome) provide essentially zero information; we can’t even trust the reported sign of the pooled estimate. But that is for another post.

  2. For fields that will be constrained by small samples, the ability of proportional odds models to adjust for covariates might permit the use of a prospective synthesis of N-of-1 studies to simulate a more traditional parallel-group study.
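To sketch that second point, here is a minimal hand-rolled proportional-odds (cumulative logit) fit with one adjusting covariate, on simulated data; in practice one would of course use an established implementation (e.g., `MASS::polr` or `rms::orm` in R) rather than this illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Simulated example: 4-level ordinal outcome driven by one covariate
# through a latent logistic variable (true slope = 1.0).
rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
latent = 1.0 * x + rng.logistic(size=n)
y = np.digitize(latent, [-1.0, 0.5, 2.0])          # categories 0..3

def negloglik(params):
    # Cumulative logit: P(Y <= k | x) = expit(alpha_k - beta * x),
    # ordered cutpoints enforced via a cumulative sum of exponentials.
    alphas = np.cumsum([params[0], np.exp(params[1]), np.exp(params[2])])
    beta = params[3]
    cum = expit(alphas[None, :] - beta * x[:, None])          # P(Y <= k), k = 0..2
    probs = np.diff(np.hstack([np.zeros((n, 1)), cum, np.ones((n, 1))]), axis=1)
    return -np.log(probs[np.arange(n), y] + 1e-12).sum()

fit = minimize(negloglik, x0=np.zeros(4), method="BFGS")
beta_hat = fit.x[3]
print(f"estimated slope: {beta_hat:.2f}")   # should land near the true value 1.0
```

The single log odds ratio `beta_hat` is what makes small-sample covariate adjustment feasible: no per-category effect needs to be estimated.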

Please correct me if I’ve made an error in any of the above assertions.


Hi Lawrence

Non-expert question: do you think that the most fundamental reason why sepsis trials don’t replicate is that methods to assess sepsis “severity” are problematic, or that definitions of sepsis itself (as a syndrome) are problematic? I checked UpToDate for the most current definition of sepsis, and it seemed clear as mud.

Is it reasonable to expect reliable tools to be developed for measuring the severity, or predicting the prognosis, of a condition for which there might not even be consensus around disease definition and identification? Don’t issues around disease/syndrome construct validity (not sure if this is the right term) have to be solved before therapies for that condition/syndrome can be studied in any sort of reliable way?

Other (devil’s advocate) questions:

- what proportion of sepsis trials involve statisticians (i.e., people who do statistics exclusively as a career) at the design phase?

- is it reasonable to expect statisticians to resolve epistemic questions around sepsis definitions that seem to have plagued critical care for decades?


Thank you Dr. Brown

Actually, I am not cynical but optimistic about all of this. My posts have a clear mission: to use social media to induce debate and to help the young break from their mentors’ dogma and seek new paths. To your point, there is an understanding of the need among the thought leaders, as shown by the 2016 article you cite, but that is not mitigating. Reviewers still expect the standard measures, and thought leaders pontificate about them as if they were as solid as the formula for torque. We recently noted in a grant submission that we needed to discover the sepsis vectors and that there was no agreement on the criteria for sepsis. The reviewer snapped back: there is agreement, and it’s Sepsis-3! (SOFA). That would have been funny if the reviewer had not been serious.

Pushing all research through the funnel set up by past leaders is a known pitfall, yet we allow it; that’s our fault.

There is always a call from leaders to consider other approaches (as in the 2016 article you cite), but don’t expect that to be given due consideration. As evidence, I would note that only one year earlier (2015) they introduced SOFA as the new diagnostic standard for sepsis RCTs. They are still mandating a standard RCT measure guessed 24 years ago, having only recently abandoned the standard guessed 31 years ago (SIRS).

You can’t make this stuff up. I respect everyone in the process. These datasets are very complex. They overwhelm most statisticians because many are time-series data with rapid relational time patterns. Reading the longitudinal ordinal outcome approach posted by Dr. Harrell is very exciting.

The problem with failed intubation is not that one cannot get the ET tube into the trachea, but rather that one thinks they got the tube in and they are wrong.

We have a failed intubation state. It’s time to teach the young that it’s their job to discover new sepsis RCT measures and not to simply deploy those which were guessed when their mentors were young.



Thank you for the question. It seems likely that the studies are not reproducible because of the changing mix of cases in each new RCT study group. The compromises made to gain a larger cohort produce a fragile state: the small baseline subgroup of diseases for which the protocol is not the right treatment will not be stable. This baseline (and the associated protocol failure rate) can markedly increase if the disease population mix changes. Therefore one RCT will show benefit for the group and another will not, because it’s a different mix.

No, it is not reasonable. My point is that the hard work of determining the signals which define the different diseases (and the mortalities of those diseases) of which sepsis is comprised has to be done, but this is not possible if we a priori define “sepsis” for RCTs by a set of unifying thresholds.

Very, very good question. We are designing one now, and we don’t have a statistician in the mix yet. Given your question, I will fix that next week.

No. I am simply saying that the math is a continuum from measurement to statistics to output.

Suppose a statistician simply asked: “What was the origin of this measurement (e.g., the SOFA score)? How reproducible is it? Where did these threshold cutoffs come from? Why were these signals chosen?”

These are the questions required to make sure the statistician is not wasting her time embellishing a fake measurement. If nothing else, these questions would result in an enhanced discussion of the trial’s limitations, which would help future researchers avoid repeating the design error. Without the questions, we get what we have already seen in this forum: siloed discussions about the statistics used with the SOFA measurement, with no discussion of the math or limitations of SOFA itself. That fools young researchers and statisticians into thinking SIRS or SOFA are valid and reproducible measurement tools, so they simply repeat the mistake, propagating it for decades.


The fundamental point is that trialists do not understand this. To them, SOFA is a tried-and-true measurement. It’s a straightforward sum of ascending thresholds of six signals of organ dysfunction, so it must work. They can’t grasp that without weighting this cannot be valid, and that even with weighting it’s a fragile, very limited set of static signals for a massive collection of diseases called a “syndrome”, one wider than any in medical science.

That’s why the statistician has to explain this in ways they can understand, because they are not going to change unless someone they respect shows them why, mathematically, they need to move on from this (respectfully) oversimplified thinking.


This is very interesting. It’s beyond my pay grade, but I will show it to the team.

We are applying ML in supervised-learning mode, using logistic regression to generate, for example, AUC as a function of time. What are the limitations of AUC over time? Are there better time-based measures we should be making than AUC(t) (and its derivatives)?
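For readers unfamiliar with the idea, a sketch of what AUC as a function of time can look like, on entirely synthetic data (the signal, the drift in non-survivors, and the sample size are all invented):

```python
import numpy as np

rng = np.random.default_rng(2)

def auroc(scores, labels):
    # Rank-based AUROC with ties counted as 1/2 (Mann-Whitney identity)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Synthetic cohort: one severity signal measured at 6 time points;
# separation between survivors (0) and non-survivors (1) grows over time.
n, times = 200, 6
labels = rng.integers(0, 2, size=n)
auc_t = []
for t in range(times):
    signal = rng.normal(size=n) + 0.3 * t * labels   # drift in non-survivors
    auc_t.append(auroc(signal, labels))
print(np.round(auc_t, 2))   # AUC(t) should trend upward with t
```

One known limitation: AUC(t) is rank-based, so at each time point it reflects only the ordering of patients, not calibration or the magnitude of separation.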

Thanks for the thoughtful comments, Paul. The one part I don’t understand: if there is clinical expert consensus that the elements of the ordinal scale are in the right order, why would you need to look at treatment effects on individual scale components?

This is a key statement and reminds me of a component that should appear in a summary-score methods paper that I hope someone writes: if you flexibly model the effects of all continuous variables (using, e.g., regression splines), how many such variables do you need to provide the same predictive discrimination as is contained in the entire summary score? If you can easily beat the score, you know the score’s thresholding was information-losing.

On a related issue, ages ago I did an analysis of the APACHE III database showing that if you splined all the physiologic variables, the added information equaled the total information in all the diagnostic indicators used in APACHE III. In other words, if you extracted maximum information from the continuous predictors, you could ignore the ICU admission diagnosis.


It has become convention to analyse components separately; I’m pretty sure EMA guidelines demand it. But see, e.g., the chapter “Composite endpoints in clinical trials” in Methods and Applications of Statistics in Clinical Trials.

Also, from experience, I’d anticipate that clinicians will become very interested in any “result”, and it will incite secondary analyses in the hope of understanding it precisely; when this fails it will invite discussion and promote scepticism.

Regarding why we would analyse components separately, you made a point in the quality-of-remdesivir-trials thread: “we know from the international survey of thousands of patients that they place great weight on shortness of breath”.

I appreciate this point; however, I think Sun et al., for example, showed in their paper that dyspnea can dominate a composite: Evaluating Treatment Efficacy by Multiple End Points in Phase II Acute Heart Failure Clinical Trials

You can see in Table 4 that the VAS AUC is superior to the clinical composite on power. And that is fine; I would just want it made explicit what is driving the result, rather than having it hidden in the mechanics of the composite’s calculation.

I tried to show, by defining “Influence”, that patient-reported outcomes can drive the difference observed on the composite measure, and how this Influence is sensitive to the arbitrary cut-offs used: Examining the Influence of Component Outcomes on the Composite at the Design Stage

I’m not sure to what extent any of this is relevant to the COVID ordinal outcome, but one common issue with these rankings is what to do at the lowest rank. Some have suggested using, e.g., a biomarker to discriminate between these individuals rather than leave some large percentage of the sample with a tied rank (when clinical events are low). But then the biomarker overwhelms the composite (I tried to illustrate that here: CJC, see global rank in Fig. 2).

I just don’t think many will be satisfied with a conclusion of the kind: the intervention is superior on some conglomerate of outcomes. It invites scepticism, and people like to practice scepticism.

Paul, as with your “biomarker overwhelms the composite” statement, I’m having trouble with the logic. If clinical consensus pre-specifies the severity ordering, why second-guess it after the data are in? If patients give a lot of weight to dyspnea, why not count it that way?

Some background needs to be added to my question: what we really need is careful patient utility elicitation (using time trade-offs, willingness to pay to avoid complication x, etc.) and then to make inference on the utility scale. Since we are unable or unwilling to go to the extra trouble of running a utility study before the clinical trial is completed, we can use ordinal outcomes to approximate the utility analysis. And if someone later specifies utilities, the same ordinal model can be used to estimate expected utilities by treatment.
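That last step is a one-line computation once the ordinal model supplies per-arm outcome probabilities; a sketch with invented probabilities and utilities:

```python
import numpy as np

# Hypothetical model-based outcome probabilities, levels 0 (death) .. 3 (recovered)
p_control = np.array([0.20, 0.30, 0.30, 0.20])
p_treated = np.array([0.10, 0.25, 0.35, 0.30])

# Hypothetical patient-elicited utilities for the same levels (0 = worst, 1 = best)
utilities = np.array([0.0, 0.4, 0.7, 1.0])

# Expected utility per arm is the dot product of probabilities and utilities
eu_control = p_control @ utilities
eu_treated = p_treated @ utilities
print(f"expected utility: control {eu_control:.3f}, treated {eu_treated:.3f}")
# prints: expected utility: control 0.530, treated 0.645
```

Because the utilities enter only at this final step, they can be elicited (or revised) after the trial without refitting the ordinal model.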


One reason: I don’t think we’re willing to forgo a precise definition of “better”. In the remdesivir thread you give the interpretation for the PI: “The estimated probability that a randomly chosen patient given treatment B has a better clinical outcome than a randomly chosen patient on treatment A is 0.7”

I can’t imagine many researchers will be satisfied with this as the conclusion of a pivotal RCT. I worry, also, that it’s a sleight of hand; much scepticism has been sold to us re surrogate endpoints (Piantadosi’s book comes to mind). Smuggling in surrogate endpoints seems to me quite cheeky.

With undue cynicism I always ask myself: could I rig a clinical trial with this method (in the spirit of Marcia Angell)? We are seeing composites now appearing in diabetes: composites of HbA1c, hypos, and weight gain, maybe. Okay: if I know the competitor’s drug is not as good on weight gain (consider Novo Nordisk vs. anyone), they treat to target so they may match on HbA1c, and I would define the weight gain as minuscule, and then I’m going to win.

There’s a lot I can do when no one can see the inner workings of a composite that is quietly trading off events; it’s too murky.

Edit: I should also have said that agreement on ordering/ranking is not easily achieved, and researchers not involved will likely contest it. There are classic examples where patients and clinicians order differently, e.g., patients give stroke a higher rank, and the paramount thing is to produce a result that will persuade.


Why couldn’t a procedure like Saaty’s Analytic Hierarchy Process (and its extension, the Analytic Network Process) be used to determine the rankings? That way, the outcome system can formally balance the needs and preferences of patients with the scientific needs of the investigators.
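For reference, the core AHP step reduces to taking the principal eigenvector of a pairwise-comparison matrix; a minimal sketch with an invented comparison of three outcomes:

```python
import numpy as np

# Hypothetical pairwise comparisons among three outcomes (Saaty 1-9 scale):
# A[i, j] = how much more important outcome i is than outcome j.
A = np.array([
    [1.0, 3.0, 5.0],   # e.g. mortality vs. stroke vs. dyspnea
    [1/3, 1.0, 3.0],
    [1/5, 1/3, 1.0],
])

# AHP weights: principal eigenvector of A, normalized to sum to 1
eigvals, eigvecs = np.linalg.eig(A)
principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
weights = principal / principal.sum()
print(np.round(weights, 3))   # weights descend with the stated priorities
```

A full AHP application would also check the matrix’s consistency ratio before trusting the weights, and could elicit separate matrices from patients and investigators.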


I do prefer pre-specified weights to the ordinal/ranked outcome, but I’d just make one point: the approach has been around for a while and no one is using it. You have the biggest names in cardiology pushing such weights and still no one is using them. When weights are made explicit they look flimsy and untenable; that is the advantage of the ordinal outcome: you obviate that problem. With all the benefits of Bayes, it remains on our bookshelf; why? Likewise, I think people will have a visceral reaction when we try to infuse weights into an analysis, and with all this talk of reproducibility, do we really want to start pre-specifying weights and using bespoke endpoints?

Edit: I should say I was reacting to this comment on the wiki page: “It represents an accurate approach for quantifying the weights of decision criteria.”

To keep a good discussion going: why do you refer to the lesser components of an ordinal outcome scale, e.g., dyspnea, as surrogate outcomes? From a patient-oriented perspective, aren’t they full outcomes?


Maybe I’m interpreting this wrongly; I understood the quote above to mean that others think these ordinal scales are examined sequentially, in descending priority, for the “significance” of benefit.

If I understand you correctly, you just want to estimate the OR at all points of the scale simultaneously, and leave the decision to the audience.

I think your method leaves open the possibility of predicting benefit from covariates.

I’m looking forward to clarification.
