What is a fake measurement tool and how is it used in RCTs?

[quote=“f2harrell, post:27, topic:3955, full:true”]
Short of that, IMHO we should put our energy into deriving and checking the outcome measure, then stick with it, and definitely not by concentrating on simple statistics for low-power component comparisons.

Agreed. In my view the key words of the quote are “deriving and checking”, not “guessing and promulgating”. That is all I am asking for: no guessed measurements from the 1990s in 2020 RCTs.


I wish I had more time to think about this; I will have to come back to it. I would just note that identifying sepsis in retrospective analysis seems just as challenging as it is in prospective analysis. I saw this recently: “In two recent papers, the same authors offered contradictory findings on the global incidence of sepsis, outcomes, and temporal changes over the last few decades.” (from: Sepsis: the importance of an accurate final diagnosis). I guess this is a good place for statistician and clinician to work together.


Yes, the “sepsis” experiment of human science, which probably began in earnest in the 1980s, may be THE most illustrative example of failed interaction between statisticians and clinicians.

The 1980s was a simple time: the era of threshold science, the era of the perceived near infallibility of ROC.

It was a time of heady early exuberance for the creation of “syndromes” with simple threshold sets for RCTs (with little or no knowledge of HTE).

In the 1990s the threshold sums (scores) became popular. Guessed by the experts of the time, most persist as standards today and are widely used in RCTs. When a new pandemic emerges they are there, ready for the RCTs of 2020.

The promise was not fulfilled. Yet despite substantially no positive RCT reproducibility, no general call for reform or reform movement has emerged. The operation, statistics and ruminations of sepsis science have been basically unchanged for 30 years, as is evident from the discussions in this forum (e.g. CITRIS-ALI).

Sepsis is a leading killer, so the lack of a reform movement is one of the great tragedies of the 21st century.

Perhaps ML/AI will come to the rescue, unaffected by the need to comply with the collective, a weakness shared by human statisticians and scientists. However, if supervised learning is applied with the mandate to use the present standards (SOFA) for the training sets, that will render another cycle of failure.

Some courageous statistician has an opportunity now for an amazing story, an amazing, heart-stopping review which will make a real difference. I hope she rises soon to be counted as the reformer the world needs.


Is that published anywhere?

No, I wish it were published.

I had to go back to this quote. It is so well stated. Furthermore, I would like to generate discussion of the aspects required of a valid measure that might result from that challenge.

I’ll suggest 3 aspects which I think are required.

  1. Includes ALL nonduplicative features (independent variables) which are significantly predictive of mortality. (SOFA has no measure of myeloid signal perturbation, so it will fail in cases where that independent variable is dominant.)
  2. Does not include any independent variables which do not rise to that level of statistical significance.
  3. Must be optimized using a formal method. (SOFA was guessed, like an input vector which was simply promulgated, and never optimized.)

Here are a few questions for the group.

  1. Are these correct?
  2. What are the other aspects?
  3. Are there standard mathematical tests statisticians can apply to determine the validity/reproducibility of a measure used in an RCT?
  4. How might those tests be applied to SOFA or an alternative?

Maybe a blog post? It would be really great to read this.

The issues raised here and in related topics seem very pervasive and crop up all over the place. There is a plethora of these summary scores that are “clinically accepted” (a phrase that has become a red flag for me) as outcome measures but only questionably measure the true outcomes of interest, and have weird mathematical properties that are rarely understood. They are usually just treated as if they were continuous outcomes, which is rarely justified. For a start, it’s common for some scores to be impossible or rare.
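To make the point about impossible or rare scores concrete, here is a minimal sketch. The 3-item scale and its point values (0, 2 or 3 per item) are hypothetical and chosen only for illustration; they do not come from any real instrument. The function `total_score_distribution` is likewise just an illustrative helper that enumerates every response pattern and counts how many patterns land on each total.

```python
from collections import Counter
from itertools import product


def total_score_distribution(item_levels):
    """Count how many item-response patterns produce each total score.

    item_levels: one list per item, holding the point values that item
    can award (hypothetical values, not taken from any real scale).
    """
    return Counter(sum(pattern) for pattern in product(*item_levels))


# Hypothetical scale: three items, each awarding 0, 2 or 3 points.
weighted_items = [[0, 2, 3]] * 3
dist = total_score_distribution(weighted_items)

attainable = sorted(dist)
unattainable = [s for s in range(max(attainable) + 1) if s not in dist]

print("attainable totals:  ", attainable)
print("unattainable totals:", unattainable)
print("patterns per total: ", dict(sorted(dist.items())))
```

In this toy scale a total of 1 can never occur, and the totals are reached by unequal numbers of patterns, so treating the total as an evenly spaced continuous variable is already questionable before any patient data are involved.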

Maybe we should try to insist on actual measurements of variables of interest, use “primary outcomes” made up of a set of (probably) inter-related physiological variables, and ditch the summary scores. Maybe?

Stephen Senn had a good journal article on this issue of measurement.

In rehabilitation science, I’m seeing an uncritical acceptance of psychometric procedures to “validate” clinical assessments. I’m having questions about the actual predictive validity of these in the typical clinic. Joel Michell has been making an argument on the ordinal nature of psychological constructs for a long time, with this paper being the best introduction to his outlook.

Similar, independent observations were noted by Norman Cliff, another psychometrician.

Cliff, N (1992) Article Commentary: Abstract Measurement Theory and the Revolution that Never Happened. Psychological Science.


Finally, I’ve always appreciated re-reading this paper by David Hand on the intersection of measurement theory with statistics:

Hand, D. (1996) Statistics and the Theory of Measurement Journal of the Royal Statistical Society

@llynn The problem of rating scales as outcome variables for RCTs was discussed in the context of neurology, with a good introduction to the complexities of using these scales in studies.

Hobart, J; et al (2007) Rating scales as outcome measures for clinical trials in neurology: problems, solutions, and recommendations The Lancet Neurology, Volume 7, Issue 1, January 2008


Because of low power to look at individual components in some cases, and because with frequentist inference there are multiplicity problems, we are stuck with needing summary scores. Let’s construct them wisely.


The articles linked below might be useful to include in this thread:

This one highlights historical problems with trials studying treatments for another complex “syndromal” illness, major depression. The overlap with the history of sepsis research seems striking: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6078483/. Have sepsis researchers ever collaborated with researchers studying other conditions that involve similar challenges (e.g., depression, chronic pain), with a view to sharing approaches/lessons learned?

These articles from a few years ago summarize the many devilish problems with sepsis research:

The “Phenotyping Clinical Syndromes” thread links to this article:
https://jamanetwork.com/journals/jama/fullarticle/2733996. The ensuing discussion highlights problems with “data-driven” approaches to identifying different sepsis phenotypes. The main criticism seems to be that humans can always identify patterns in chaos (especially now that we have computers to help). After identifying patterns, we then create categories for things. Unfortunately, the chance that these categories reflect underlying “truths” about the essential ways in which groups of things differ from each other, and that they can then be used to design reproducible clinical trials, seems low. To this end, maybe articles related to “cluster analysis” and “latent class analysis” (and their inherent problems) are also relevant (?):


Very intriguing. The intersection of math disciplines appears to be a problem when one discipline is poorly defined (or poorly definable) with the extant math. Imagine if that was true at the intersection of the physical sciences and engineering science.

This is why the burden falls on the statisticians because clinicians are deriving VERY simple functions. Very little of clinical medicine is math.

With a little due diligence, these simple functions are very easy for a statistician to understand and factor. I could teach them to a math undergrad student in 15 minutes. The reverse is not true for the statisticians, who are using complex math, substantially the entire discipline being math.

So if a clinician researcher says she wants to use SOFA, chances are she does not know the function SOFA is intended (but fails) to render.

A statistician should not connect a perceived function as a measure in an RCT to statistical math without understanding the function’s derivation, behavior and reproducibility.

There was an improper joke I saw an internist play on a surgeon when I was a student.

After taking a history and examining the patient he drew a line on a patient’s abdomen over the gall bladder and next to it he wrote “Cut Here”. Now, of course, no diligent surgeon would comply without her own investigation.

When applying statistics to a measurement like SOFA, statisticians are, without due diligence, dutifully “cutting here”.

Yes, thank you. These are relevant articles. But note that the leaders have progressed to this level of understanding yet still default to SOFA (a composite measurement guessed in 1996, when most of this heterogeneity and its effect on RCTs was not appreciated).

SOFA (delta 2 points) was chosen as the new 2016 measurement defining sepsis using the Delphi method.

Now SOFA itself had been shown in a large data trial to be nonspecific for sepsis in 1998. In fact SOFA was so nonspecific, they changed the word defining “S” of the SOFA mnemonic from “sepsis-related” to “sequential” (See 1998 article not just abstract.)

However, the new 2016 sepsis definition required “suspicion of infection” (i.e. a prior) plus delta 2 SOFA.

That meant that after 2016 SOFA was the input, and SOFA could still be used as an independent variable and SOFA could be the primary or secondary output. Like turtles, by 2016 it became SOFA all the way down for RCTs of mankind’s greatest historical killer.

The issue of phenotypes and cluster analysis is pivotal. As I said before, sepsis is a 1989 cognitive bucket filled with different diseases (can be called phenotypes) which look similar clinically but induce profound HTE.

Reading these articles it’s easy to say this is too complex, “I’m going with the consensus”. However, these articles do not lead one to believe SOFA will work. In fact, to the contrary, for the knowledgeable they teach away from that approach.

While none of these articles conclude SOFA would work, they are silent on what measure(s) to use. Rather there is a conclusion that the matter is complex AND maybe the “Delphi Method” can sort it out. Which is what was done and where SOFA was selected.

However complex all of that is, the SOFA itself is comprised of simple math and easy for the statistician to factor and understand.

That’s what we need: statisticians who will rise to the challenge and say, “I may dutifully use the consensus measure but I’m not going to do that without learning, and explaining to the PI, its limitations.”




Given the pandemic, there is an urgency to consider another myth. See this article in Chest, a prestigious journal.


Comparison of Hospitalized Patients With ARDS Caused by COVID-19 and H1N1 -…

There were many differences in clinical presentations between patients with ARDS infected with either COVID-19 or H1N1. Compared with H1N1 patients, patients with COVID-19-induced ARDS had lower severity of illness scores at presentation and lower…

I would add to the myths the common view that the output of a threshold-based function (for example, a composite sum of scores) can be used as a measurement and incorporated into the statistical math of an observational study or RCT to render a valid, reproducible result without due consideration of the origin, derivation and reproducibility of the function itself, as well as its applicability to the specific population under test.

(In the example of SOFA, which I discussed in another thread, the measurement is derived from 30 threshold levels of 6 measurements rendering 6 values which are added to produce the score upon which statistical math is applied to render a statistical output.)
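The structure described above (threshold levels per measurement, one value per measurement, values summed into a total) can be sketched in a few lines. This is only an illustration of the form of such a function: the cut points and the arbitrary "severity" scale below are hypothetical and are NOT the published SOFA thresholds, and `component_points` and `composite_score` are names I made up for the sketch. Six components with five levels each (0 to 4 points) gives the 30 levels mentioned.

```python
from bisect import bisect_right

# HYPOTHETICAL cut points on an arbitrary severity scale (higher = worse).
# These are NOT the published SOFA thresholds; they only mimic the structure.
CUT_POINTS = [1.0, 2.0, 3.0, 4.0]
COMPONENTS = ["respiratory", "coagulation", "liver",
              "cardiovascular", "neurologic", "renal"]


def component_points(value, cuts=CUT_POINTS):
    """Map one measurement to 0-4 points by counting the cut points reached."""
    return bisect_right(cuts, value)


def composite_score(measurements):
    """Sum the per-component points into a single total."""
    return sum(component_points(measurements[name]) for name in COMPONENTS)


normal = dict.fromkeys(COMPONENTS, 0.0)

# One severely deranged system versus four mildly deranged systems.
patient_a = {**normal, "renal": 4.5}
patient_b = {**normal, "respiratory": 1.2, "coagulation": 1.2,
             "liver": 1.2, "renal": 1.2}

print(composite_score(patient_a), composite_score(patient_b))

# Values just either side of a cut point differ by a full point,
# while values almost a whole unit apart can score the same.
print(component_points(1.99), component_points(2.0), component_points(2.99))
```

Under these made-up thresholds both patients total 4 although their physiology is quite different, and a tiny change across a cut point moves the score as much as a large change within an interval does not. Any statistical math applied to the total inherits both properties.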

Now watch how statisticians incorporate this “measurement” into a study.

“Patients with H1N1 had higher Sequential Organ Failure Assessment (SOFA) scores than patients with COVID-19 (P < .05). … The in-hospital mortality of patients with COVID-19 was 28.8%, whereas that of patients with H1N1 was 34.7% (P = .483). SOFA score-adjusted mortality of H1N1 patients was significantly higher than that of COVID-19 patients, with a rate ratio of 2.009 (95% CI, 1.563-2.583; P < .001).”

The “SOFA score-adjusted mortality” proves the saying “You can adjust for a baseline ham sandwich.”

Probably some trusted the adjusted stats here.
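For a sense of scale, the arithmetic on the quoted numbers is worth doing. The sketch below only uses the three figures quoted above; the crude ratio it computes is a simple ratio of in-hospital mortality proportions, which is my own back-of-envelope quantity and not necessarily the same estimand as the paper's adjusted "rate ratio".

```python
# Figures as quoted from the paper above.
mortality_h1n1 = 0.347           # in-hospital mortality, H1N1
mortality_covid = 0.288          # in-hospital mortality, COVID-19
reported_adjusted_ratio = 2.009  # SOFA score-adjusted rate ratio, as quoted

# Crude (unadjusted) ratio of the two mortality proportions.
crude_ratio = mortality_h1n1 / mortality_covid

print(f"crude H1N1/COVID-19 mortality ratio: {crude_ratio:.2f}")
print(f"adjusted ratio / crude ratio:        {reported_adjusted_ratio / crude_ratio:.2f}")
```

The crude ratio is about 1.20 (with P = .483 reported for the unadjusted comparison), so essentially all of the reported doubling comes from the adjustment for the composite score.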

Joking aside. Andrew, I would welcome your respected input in the measurement thread about composite scores and SOFA…


Now perhaps, after that last paper’s adjustment for SOFA as a covariate in a Pandemic trial, the reason for my prolonged, merciless provocations in this forum is becoming clear. (As a reminder, to my knowledge, there has never been a single reproducibly positive RCT of sepsis.)

As I have shown SOFA is:
1. The standard trial entry criteria for sepsis
2. An independent variable in a trial
3. A trial outcome (primary or secondary)
4. A baseline covariate of a trial
5. Adjusted for, as a covariate, in a trial

That is a lot, given SOFA was guessed in 1996. However, guessed origin aside, the question now (for those in this forum who have never been shy about advocating adjustment for covariates) is:

Is it appropriate to adjust for SOFA as a covariate?


Given that the possible adjustment variables are measured at the moment of randomization or earlier, the question becomes a relative one in my view. What is the best bang for the buck? For a given sample size (so that we keep overfitting in mind) is there another index or set of variables that can be pre-specified and that will explain more outcome variation than SOFA? We might pause a moment to look at competitors. As I was on the APACHE II and APACHE III development team, I’m not in an unbiased position, but I think that the APACHE III acute physiology score takes more physiologic variables into account and for each physiologic variable has more resolution by using more (only slightly arbitrary) intervals for that variable. Of course APACHE III has been used a lot in sepsis clinical trials as an adjustment variable and as an effect modifier, with nonzero success.
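The remark about resolution can be illustrated with a toy calculation. This is a sketch under strong assumptions that are mine, not the post's: a single variable spread evenly over [0, 1], cut into k equal-width intervals, with "information retained" measured as the squared correlation between the original value and its interval index. Real physiologic variables, real score intervals and real outcome relationships are messier, so read it only as showing the direction of the effect.

```python
def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def interval_index(x, k):
    """Index of the equal-width interval on [0, 1] that contains x."""
    return min(int(x * k), k - 1)


# Idealized variable: 1000 evenly spaced values on [0, 1].
xs = [i / 999 for i in range(1000)]

# Share of the variable's variation that survives categorization.
retained = {k: pearson_r2(xs, [interval_index(x, k) for x in xs])
            for k in (2, 5, 20)}

for k, r2 in retained.items():
    print(f"{k:>2} intervals: r^2 = {r2:.3f}")
```

In this idealized setup the retained share is roughly 1 - 1/k², so two intervals keep about three quarters of the variation and five keep about 96%. More intervals per variable lose less, which is the sense in which a finer-grained score has more resolution.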



I don’t think there was anyone in the past who did not think that was the right thing to do in these critical care trials.

It seems likely now, but only in full retrospect, that these scores were effectively encapsulating hidden heterogeneity and variability into the studies.

Did you ever use “delta APACHE”? Here the authors conclude that “delta SOFA” is superior to “fixed day SOFA”.

I have seen discussion here about the problems with using change from baseline. Does anyone have thoughts on a “delta score” as an endpoint?

Here, in 2017, the authors discuss why a derivative of SOFA as an RCT endpoint may be needed.

"There are several unresolved issues around the validity of the SOFA score as an endpoint. First, the responsiveness of the SOFA score to intervention-induced change in mortality risk has not been quantified. It is unclear how the SOFA score changes in response to a treatment that changes the mortality risk within a specific timeframe. Second, the consistency of the SOFA score to reflect changes in underlying mortality risk has not been quantified. Even if true mortality-modifying treatments effects are reflected in the SOFA score on average, the validity of the SOFA score as an endpoint is doubtful if this relationship is inconsistent. Third, it is unclear which derivative of the SOFA score is the most appropriate endpoint.

Therefore, the aim of the present study was to quantify the responsiveness and the consistency of different SOFA derivatives to reflect treatment effects on mortality. The results from this study may aid clinical decision makers in the interpretation of trials that use SOFA as an endpoint, and may help investigators choose the most appropriate SOFA derivative in the design of future RCTs."


A superset of all this is daily acute physiology scores analyzed with a longitudinal model. I expect this to have maximum power. It remains to be seen whether it qualifies as a surrogate outcome in the Prentice sense.


For anyone who is unaware of the new sister thread for this discussion, it is linked here.

The link below is to an approximately 19 min summary talk.

If you are a statistician, buckle up, because you will not believe this 35+ year story could be true. The story is truly incredible and shows the effect of decades of good statistical math using standardized but invalid (guessed) clinical measurements.

Please forward. We are trying to garner awareness of the massive waste of resources, opportunity for discovery, and talent…



If you read this thread you will not be surprised that the leaders have NOW suggested that the SOFA score needs to be updated (after 25 years). The “need for update” of SOFA is discussed here.

Changing SOFA would presumably change the criteria for RCTs of the “Synthetic Syndrome” of sepsis, which is based on the old 1996 SOFA. A discussion of RCTs of synthetic syndromes is provided here: The End of the "Syndrome" in Critical Care

SIRS, the old threshold set for sepsis, was guessed in 1989 and was the core of the sepsis criteria for RCTs (with updates every decade until 2015, when it was abandoned and replaced by SOFA). As expected, given the use of guessed thresholds as measurements to define the synthetic syndrome, no positive sepsis RCT has been reproducible for 30 years.

There is a common theme here, discussed in the sister thread: guessing and updating RCT measurements has replaced discovery. There is no need for discovery of measurements when you can guess and update them and do RCTs.

So an update of SOFA was predictable, like the updates of the SIRS-based sepsis criteria, which were guessed every decade. These updates are widely cited at the apex of another decade of RCTs…

This is “science” by administration not discovery… Statistician beware…