A Longitudinal Renal Health Outcome for Clinical Trials in Acute Kidney Injury?

JohnProwle · May 19, 2026, 4:02pm

Thanks to Frank Harrell for suggesting I bring this topic to the forum

As a clinical academic, intensivist and nephrologist, my research is focused on Acute Kidney Injury (AKI), particularly in the intensive care unit.

Readers of this forum will be aware that there are many concerns (1) with the design and formulation of endpoints in AKI clinical trials, concerns shared by myself and many researchers in the field. In particular: arbitrary dichotomisation of continuous data related to kidney function; implicit inclusion of change from baseline in AKI definitions without inclusion of the acute level of kidney function at baseline and during disease; and lack of incorporation of temporal elements (duration and trajectory of kidney dysfunction, which will best correlate with long term outcomes).

In addition, there are several biological and practical issues with AKI definition and severity scaling. These relate to the kinetics of serum creatinine with alteration of kidney function and muscle mass, as well as data availability on pre-morbid kidney health and the effect of interventions such as dialysis.

These deficits impede the performance of suitably powered early phase studies and may account for previous failures to demonstrate benefit from interventions. There are also valid concerns that AKI (as defined by biochemical fluctuations and alteration in urine output) itself is not a suitably patient-centred endpoint for pivotal studies.

Our ambition is to develop a fit-for-purpose longitudinal ordinal endpoint for ‘acute kidney health’ to allow us to capture the burden of organ dysfunction over time. We believe such an outcome could capture the severity and fluctuating time-course of the disease episode and biologically best correlate with long term kidney health in recovery. It would also permit us to apply the Bayesian transition models described by Dr Harrell and colleagues in their 2024 Statistics in Medicine paper (2), especially by allowing the assessment of clinically interpretable estimands such as days of organ dysfunction over a clinically meaningful threshold.
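
To make the "days of organ dysfunction over a threshold" estimand concrete, here is a minimal Python sketch under strong simplifying assumptions: a hypothetical 4-level scale, a fixed first-order transition matrix with made-up probabilities, and no covariates. The models in the cited paper estimate transition probabilities from data; this only illustrates how daily state occupancy probabilities add up to expected days at or above a given level. All names and numbers below are illustrative, not part of our draft scale.

```python
from typing import List


def state_occupancy(initial: List[float],
                    transition: List[List[float]],
                    days: int) -> List[List[float]]:
    """Occupancy probabilities for days 1..days under a first-order Markov chain.

    initial[i] is the probability of starting in state i (day 0);
    transition[i][j] is the probability of moving from state i to state j in one day.
    """
    n = len(initial)
    current = list(initial)
    occupancy = []
    for _ in range(days):
        current = [sum(current[i] * transition[i][j] for i in range(n))
                   for j in range(n)]
        occupancy.append(current)
    return occupancy


def expected_days_at_or_above(occupancy: List[List[float]], threshold: int) -> float:
    """Expected number of days spent in states >= threshold (death included if it is the top state)."""
    return sum(sum(day[threshold:]) for day in occupancy)


# Hypothetical scale: 0 = normal, 1 = mild dysfunction, 2 = severe/dialysis, 3 = death (absorbing).
# These probabilities are invented purely for illustration.
P_EXAMPLE = [
    [0.85, 0.10, 0.04, 0.01],
    [0.20, 0.60, 0.17, 0.03],
    [0.05, 0.20, 0.65, 0.10],
    [0.00, 0.00, 0.00, 1.00],
]
START = [0.5, 0.3, 0.2, 0.0]  # assumed day-0 distribution

occ = state_occupancy(START, P_EXAMPLE, days=14)
print("Expected days at level >= 2 over 14 days:",
      round(expected_days_at_or_above(occ, 2), 2))
```

Running the same calculation with a second, treatment-arm matrix and taking the difference would give a between-arm contrast in expected days of dysfunction, which is the kind of estimand we are hoping to target.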

We also believe AKI could be a great exemplar for these techniques given the depth of relevant longitudinal clinical data available, and the failure of existing approaches to deliver meaningful clinical results.

There are however many potential facets of kidney health which we might consider, allowing for several different domains to feed into the ordinal scale (ie biochemical values and urine output). Thus different points of the scale might be defined by absolute or relative changes in serum creatinine or other measures of kidney function, severity of low urine output, other biochemical parameters such as serum urea/BUN and use of kidney replacement therapy and death as an absorbing state.

We are working on a paper which outlines a proof-of-concept ordinal kidney health definition, however we plan to more robustly define the clinical elements by expert consensus and will be seeking patient input (as to relevance and importance) and pharma industry input regarding incorporation into studies and regulatory aspects.

While, to retain interpretability within our existing clinical concepts, our ideas for a kidney health definition do include existing ‘problematic’ aspects of AKI - including dichotomised severity levels and relative changes - we feel these may be offset by the longitudinal assessment of a more granular ordinal scale over time, as well as the incorporation of several paths to achieving a given scale level (absolute levels of kidney function tests as well as acute changes). However, this is something we will have to carefully consider in our formal development process.

The strength of our group is a network of experts with a lot of clinical experience and physiological understanding of acute kidney health, and also access to data from existing clinical studies for sense-testing and preliminary analyses prior to prospective use. However, while we have been able to implement some test analyses with our proof-of-concept scale, to take this forward as a group of clinician-researchers we obviously need formal expert statistical support and collaboration.

As a preliminary, we are seeking some advice on whether this community feels that ‘acute kidney health’ is a suitable setting for designing and applying a longitudinal ordinal outcome, and specifically:

Is the more complex formulation of several routes to achieve the same ordinal scale level going to be problematic analytically, as compared to a simpler scale of level of organ support (as has been implemented in respiratory disease)?
To what extent do we need to worry about abrupt transitions between states, which might inform the scale design?
What are the problems in carrying forward observations or other imputation in case of missing data in one component on a given day?

Finally, we are wondering if there are any potential collaborators or expert commentators that might be interested in helping us with the work going forward.

Many thanks for your interest!

John Prowle for the LOKI (Longitudinal Outcome in acute Kidney Injury) Investigators

Some illustrations

Our draft ordinal scale of 7 levels is based on recent use of dialysis/RRT, absolute and acute relative changes in kidney function tests, urine output on a given day, blood urea, albuminuria and other diagnostic biomarkers, location of care (in ICU), and death as an absorbing state.

Based on the published code we have been able to implement our draft ordinal outcome definition in an observational dataset (16609 adults with ICU length of stay ≥ 5 days without end stage kidney disease from the MIMIC IV database) as well as in data from an ICU clinical trial of a potentially kidney-injurious intervention (protein supplementation). I attach these as examples. The effect in the randomised study is very small (in keeping with our knowledge of the study findings more generally) and this is presented only as a demonstration of implementation of the transition model, not at all as a finalised result. (Apologies for size but only one upload allowed)

davidcnorrismd · May 19, 2026, 5:13pm

Given how you’ve already branded your effort, it may be too late to advise against setting out presumptively to define any One Outcome to Rule Them All. Still, I think it so much more fruitful scientifically to start from scientific questions, and to construct outcomes specifically adapted to them.

If you wish to advance a unifying perspective, seek this in a general formalism rather than a standard outcome. For such a formalism in this problem setting, I would reach for state-space modeling and particle filtering methods [1,2]. Of all the organs, kidney physiology surely has the richest theoretical basis; the modeling opportunities here should be marvelous. (Also, how could nephrologists not love filtering?)

[1] Künsch HR. Particle filters. Bernoulli. 2013;19(4):1391-1403. doi:10.3150/12-BEJSP07
[2] Kantas N, Doucet A, Singh SS, Maciejowski J, Chopin N. On Particle Methods for Parameter Estimation in State-Space Models. Statist Sci. 2015;30(3):328-351. doi:10.1214/14-STS511

Louis_Martin · May 19, 2026, 7:51pm

I’m curious about the logic of your categories given that you’re transforming several continuous and categorical variables into a 7-point scale. How can you be confident that a given level of urine output represents a worse degree of kidney injury than a certain change in a biomarker?

Another approach could be treating the best objective measure (or proxy) of AKI as a continuous variable analyzed ordinally. Let’s take urine output as an example. You could say that anything above a certain threshold (like 0.5 ml/kg/hr) is normal and anything lower than that would be worse, ending up with ordered categories like: ≥ 0.5 ml/kg/hr, 0.49 ml/kg/hr, ..., 0 ml/kg/hr. Various interventions or death could be categories that are worse than no urine output alone.

I don’t know how doable such a thing is computationally, but I’m wondering if something like this might better address the categorization problem.
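
For what it is worth, the coding side of this seems straightforward. A minimal Python sketch, assuming the 0.5 ml/kg/hr threshold above and assuming dialysis ranks worse than anuria and death worse than dialysis (the function names and the tuple encoding are hypothetical choices for illustration):

```python
from typing import List, Optional, Tuple


def outcome_key(urine_ml_kg_hr: Optional[float],
                on_dialysis: bool = False,
                dead: bool = False,
                normal_threshold: float = 0.5) -> Tuple[int, float]:
    """Return a sortable key for one patient-day; a larger key means a worse state.

    Tier 0: alive without dialysis, ordered by shortfall below the normal threshold.
    Tier 1: on dialysis. Tier 2: dead. Urine output is ignored in tiers 1 and 2.
    """
    if dead:
        return (2, 0.0)
    if on_dialysis:
        return (1, 0.0)
    if urine_ml_kg_hr is None:
        raise ValueError("urine output is required for patients alive and off dialysis")
    # Everything at or above the threshold collapses to a single 'normal' category.
    capped = min(max(urine_ml_kg_hr, 0.0), normal_threshold)
    return (0, normal_threshold - capped)


def to_ordinal_levels(keys: List[Tuple[int, float]]) -> List[int]:
    """Map keys to integer ranks 0..k-1; only the ordering carries information."""
    index = {key: rank for rank, key in enumerate(sorted(set(keys)))}
    return [index[key] for key in keys]


days = [
    outcome_key(0.8),                      # normal
    outcome_key(0.2),                      # oliguric
    outcome_key(None, dead=True),          # death
    outcome_key(0.0, on_dialysis=True),    # dialysis
]
print(to_ordinal_levels(days))
```

The resulting integer levels could be passed to an ordinal regression routine; with a continuous measure there may be many distinct levels, which is the part I am unsure about computationally.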

JohnProwle · May 19, 2026, 8:56pm

Thanks, these are very natural questions.

Essentially AKI has several potential indices which are indirect reflections of true underlying kidney health - there is already an implicit equivalence in the KDIGO AKI definition, which defines ‘equivalent AKI severity levels’ (1-3) based on either urine output or serum creatinine changes. This is arbitrary and imperfect but has the advantage of capturing different temporal aspects of kidney health.

The truth is that there is no ‘best objective measure of AKI’ - at different timings and at different stages different indices will reflect underlying pathology best.

Our intention is to adjudicate these differing metrics (which collectively are very difficult to handle) into a scale based on expert consensus and epidemiological evidence. This will be intrinsically imperfect but potentially a useful and implementable tool to interrogate clinical outcomes longitudinally.

The approach suggested to urine output is essentially what we would implement but across several complementary domains and not one alone.

We could of course merge these metrics into a semi-continuous ‘score’ based on some form of regression model, but I think that this would be more opaque in design and interpretation and not amenable to some of the methodology on state transition developed for longitudinal ordinal outcomes, which is very attractive.

It’s also important to remember that interventions such as dialysis are clinical decisions based on clinical information, thus a zero urine output (anuria) might be taken as equivalent to dialysis. Some clinicians would wait until other indications were met knowing that dialysis was inevitable, while others would initiate dialysis immediately in anticipation of complications, so if we decided to assign the same level to both of these we might remove some clinical practice variation in the endpoint…

f2harrell · May 19, 2026, 9:34pm

I applaud your efforts John.

I’d like to see the definition of the new ordinal scale.

Your background information mentions changes in SCr. There are fundamental problems with this:

SCr is not a variable that is properly transformed to be used in a subtraction
Any outcome variable needs to have a definition that is independent of the time gap between SCr measurements and should be based on renal function at the time the outcome is assessed
If only SCr were needed, it could be the outcome scale, with no binning, and dialysis or death added to the top of the scale
A few random ideas related to combining multiple measurements into a single outcome scale are here

JohnProwle · May 19, 2026, 9:54pm

Thanks some of this is beyond me I must confess…

Much of the intention of this post is because we lack the statistical grounding to understand all the available methodologies. However, the scientific question is very much a pragmatic one: how do we understand if a randomised intervention effectively modifies kidney injury/kidney health in early phase studies when existing metrics fail to capture temporal elements of kidney health and are falsely dichotomised? Unfortunately we have no gold standard training data to calibrate such a metric - I don’t know the actual location to understand how to filter the GPS data. I have done some extensive mathematical modelling of glomerular filtration and creatinine kinetics (described in the methodology of this paper); however, GFR itself is still just a physiological variable indicative of organ function - not directly paralleling structural injury or prognosis. We certainly do not have renal biopsy data in relation to real world ICU examples. So we are forced to design a metric that incorporates our gestalt of kidney health and then establish its utility and external meaning. What I really want to do is find something that is 1) feasible to collect and calculate, 2) interpretable and meaningful to clinicians and patients, and 3) amenable to methodology targeting clinically meaningful estimands. This will be an inherently imperfect tool for clinical investigation but much better than what we currently have.

The definition of the states in our model will certainly incorporate our understanding of underlying physiology. The methodology we might adapt is (I understand) based on a Bayesian implementation of Markov Chain Monte Carlo Modelling. What I must confess I don’t understand is how we can use the modelling to determine the state definitions…

llynn · May 19, 2026, 10:01pm

Excellent. We did a fair amount of unpublished work studying creatinine trajectories in MIMIC in the past using what we called time series reciprocations.

In our work, a reciprocation is defined as an annotated temporal pattern composed of four linked elements:

potential causal force,
perturbation,
potential recovery force,
recovery.

This framework generated a structured set of relational intervals and slopes:

Pf → P (perturbation-force to perturbation interval),
Rf → R (recovery-force to recovery interval),
perturbation slopes (P),
and recovery slopes (R).

We stopped the work because in those days AKI was a synthetic syndrome with a standardized consensus definition and there was little interest in TS modeling.

AKI itself is a synthetic data generating process (SDGP). This is a synthetic threshold set which generates data but like sepsis and ARDS is not a biological entity but rather the data generation is derived from a heuristic grouping of different diseases (causal pathways) generated by the consensus threshold set. As such, a cause agnostic RCT of AKI will generate the third estimand as I have described previously.

Happy to show you how we approached the study in a Zoom review of the TS patterns but we were in the discovery mode when we stopped.

As far as an ordinal of multiple signal thresholds (a renal SOFA), I understand the desire for this but this is problematic. SOFA like SIRS is a SDGP and to my knowledge, neither has ever been useful in RCT since they were created in 1996 and 1987 respectively.

The key is to identify and link the perturbation forces to the extent possible with the perturbations associated with renal dysfunction and the recovery forces and the recoveries of the perturbed signals.

There is a need to study the relational time series patterns (RTP) of these reciprocations for each signal and the signal together. One problem with TS patterns is the application of statistical analysis. Our view was that given the complexity the TS patterns should be objectified and we accomplished this by 6 scales.

The most important first step is to avoid creating and then studying a new synthetic data generating process (which is what AKI as defined by consensus thresholds, was).

JohnProwle · May 19, 2026, 10:44pm

Many thanks for your positive comments

I think the crux of the issue is to design a combined scale purely by consensus and then test it, or - as you suggest - derive a weighted combination of the components by some form of modelling. The problem here is that the various components are not directly in parallel ie dialysis is a yes/no near the top with creatinine sitting below this.

I completely appreciate the issues with creatinine changes - however to some extent one does need to bring the clinical community along and at least consider the pros and cons of the existing definitions within the scheme. In my defence I would say

We do propose an absolute estimate of kidney function (imperfect though it is outside of steady state) in addition to a ‘rate of rise’, and score based on the worst level. Creatinine does need transformation here as it is reciprocally related to kidney function and is dependent on underlying muscle mass, so just a creatinine scale would not work. This also allows other methods of GFR estimation to be used. There is a big problem with decreased creatinine generation in critical illness - to some extent up-grading the scale based on ICU location is an attempt to offset this and incorporate global illness severity.
For AKI creatinine changes are fold changes x1.5, x2 or x3 (not subtractive) within a seven day period and indicate the slope of rise, and thus the eventual plateau in creatinine reflecting the actual underlying kidney function outside of steady state. We would also discount any state based on creatinine change when the latest creatinine measurement was >72h ago.
The 0.3mg/dl rise (which is subtractive) has to occur within a 48h window and thus represents a swift uptick of creatinine. This has a lot of sensitivity at the lower end of the ordinal scale for worsening kidney health; if creatinine does not continue to rise then the ordinal scale will revert to the lower level after 2 days.
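
To show how the creatinine-change rules in the list above behave over time, here is a rough Python sketch. It is not our working formulation: the 0-3 level numbering, the use of the lowest prior value in each window as the reference, and anchoring the windows at the latest measurement are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

EPS = 1e-9  # tolerance so that a rise of exactly 0.3 is not lost to float rounding


def creatinine_change_level(measurements: List[Tuple[float, float]],
                            now_h: float) -> Optional[int]:
    """Creatinine-change level at time now_h from (time_in_hours, scr_mg_dl) pairs.

    Returns None when there is no measurement or the latest one is > 72h old.
    Level 3: >= 3x, level 2: >= 2x, level 1: >= 1.5x the lowest value in the
    prior 7 days, or a rise >= 0.3 mg/dl from the lowest value in the prior 48h.
    """
    past = [(t, v) for t, v in measurements if t <= now_h]
    if not past:
        return None
    t_last, last = max(past)  # latest measurement by time
    if now_h - t_last > 72:
        return None  # too stale to support a change-based state
    prior = [(t, v) for t, v in past if t < t_last]
    week = [v for t, v in prior if t_last - t <= 168]
    two_days = [v for t, v in prior if t_last - t <= 48]
    ratio = last / min(week) if week else 1.0
    rise_48h = last - min(two_days) if two_days else 0.0
    if ratio >= 3.0 - EPS:
        return 3
    if ratio >= 2.0 - EPS:
        return 2
    if ratio >= 1.5 - EPS or rise_48h >= 0.3 - EPS:
        return 1
    return 0


series = [(0, 1.0), (24, 1.3), (48, 1.3), (80, 1.3)]
print(creatinine_change_level(series, now_h=30))  # small rise inside 48h
print(creatinine_change_level(series, now_h=80))  # plateau: the 48h criterion lapses
```

In the full scale this change-based level would sit alongside the absolute kidney function estimate and the other domains, with the worst level taken on each day.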

Actually, from scoping real world data, the estimated absolute kidney function is likely to be the driver of the ordinal state during most of the ICU stay, with the AKI thresholds potentially just getting there a little quicker, offsetting the delay in creatinine accumulation after an abrupt change in GFR; thus the rationale for a combination.

The draft formulation is a bit complex and is designed as a starting point for a design process not a finalised version - so I didn’t want to over emphasise it. However here’s the working formulation that I applied to the examples. I have some extensive discussion/justification for these choices for our publication.

many thanks again

John

llynn · May 19, 2026, 11:13pm

IMHO you should be commended for the hard work to create this consensus scoring system, and this scoring system might be very useful for a general broad indication of the severity of renal dysfunction across an ICU population. However, this is a synthetic data generating process (SDGP), so it is not useful in trials. I would recommend that the team try to find just one of the many hundreds of (perhaps a thousand) trials in the past where equivalent SDGP of SIRS or SOFA were useful in trials. Essentially all such trials failed or were reversed (sometimes for harm) for structural reasons. I haven’t examined the AKI literature but since it is also a SDGP, I suspect it has failed as well.

The result of the study of an SDGP is discussed here in this preprint which is in review.

davidcnorrismd · May 20, 2026, 10:09am

This is, in my view, the very opposite of a scientific question: it is a question rather about how to industrialise the AKI research enterprise. The core principle seems to be to deliver a generic outcome definition that liberates researchers from the burden of formulating sharply defined scientific theories specific to the particular intervention being ‘studied’.

This is indeed precisely the problem addressed by state-space modeling: there is some latent physiologic state X_t which — we theorise — evolves over time according to some ‘equations of motion’, and this state gets reflected in noisy observations Y_t (or measurements) which we may ‘filter’ to recover estimates of the underlying state variables. A basic application of these ideas can be found in this conference poster on tacrolimus dosing [1].

[1] Norris DC, Gohh RY, Akhlaghi F, Morrissey PE. Kalman filtering for tacrolimus dose titration in the early hospital course after kidney transplant. F1000Research. 2017;6. doi:10.7490/f1000research.1113595.1
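
To give a flavour of the filtering idea, here is a minimal Python sketch of the simplest possible case: a scalar Kalman filter for a "local level" model, in which the latent state X_t follows a random walk and each observation Y_t is X_t plus Gaussian noise. This is not the model from the poster, and a realistic kidney model would need physiologically motivated equations of motion (and likely particle methods); the variances and example values are arbitrary assumptions.

```python
from typing import List, Optional, Tuple


def kalman_filter_local_level(observations: List[Optional[float]],
                              process_var: float,
                              obs_var: float,
                              init_mean: float,
                              init_var: float) -> List[Tuple[float, float]]:
    """Filtered (mean, variance) of the latent state after each time step.

    Model: X_t = X_{t-1} + w_t, w_t ~ N(0, process_var)
           Y_t = X_t + v_t,     v_t ~ N(0, obs_var)
    A missing observation (None) triggers the prediction step only.
    """
    mean, var = init_mean, init_var
    filtered = []
    for y in observations:
        # Predict: the state drifts, so uncertainty grows.
        var = var + process_var
        if y is not None:
            # Update: weight the new observation by the Kalman gain.
            gain = var / (var + obs_var)
            mean = mean + gain * (y - mean)
            var = (1.0 - gain) * var
        filtered.append((mean, var))
    return filtered


# Hypothetical noisy daily measurements of some function index, with one day missing.
ys = [60.0, 55.0, None, 41.0, 38.0, 40.0]
for day, (m, v) in enumerate(kalman_filter_local_level(ys, 4.0, 25.0, 60.0, 100.0), 1):
    print(day, round(m, 1), round(v, 1))
```

Note how a missing day is handled without imputation: the estimate is carried forward but its variance increases, which speaks directly to the missing-data question raised at the top of the thread.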

f2harrell · May 20, 2026, 11:15am

SCr does not work for either subtraction or for fold-change. I’ve learned a lot from our analysis of critically ill patients in the SUPPORT study where we have data on days 1, 3, 7, 14, 25 but unfortunately I forgot to include dialysis status in creating the easier-to-use public datasets in Datasets. Some of what I learned is in 14 Transformations, Measuring Change, and Regression to the Mean – Biostatistics for Biomedical Research. SCr is more prognostic than eGFR and does not require sex or age interactions as eGFR does (I know the opposite is supposed to be true). You can look at patterns of SCr over time to predict ultimate outcome - either death or a later SCr. The prognostic value comes mostly from the last measured SCr and it has a very nonlinear, indeed nonmonotonic effect (but the latter could be caused by rescue dialysis).

I think we need more intensive analysis of serially collected physiologic measures to derive a sensitive and prognostic function measure.

Ryan_Haines · May 21, 2026, 8:59am

Thank you Frank, I am working with John on this and come from a clinical background too. Based on your comments, could a step forward be to test a longitudinal ordinal SCr / dialysis / death outcome against our proposed scale (urea thresholds, urine output, eGFR, K/BE thresholds, AKI etc)?

I would have to research how to assess the performance of these models against each other.

f2harrell · May 21, 2026, 11:48am

I think so, but a first step could be what I did using SUPPORT data, where you do landmark analyses in which you keep restarting follow-up at later times, and use the SCr and other variables’ history up until the landmark to predict post-landmark outcomes. This gives you a prognostic scoring of the input variables you are trying to summarize into a self-contained and powerful current status measure. You do have to have a suitable outcome to anchor the analysis, e.g., the union of death and dialysis. Or you could predict ultimate deterioration in function by predicting SCr or just urine output as your anchor outcome.
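
As a rough illustration of the data layout such a landmark analysis needs, here is a small Python sketch. It only builds the stacked landmark dataset (one row per patient per landmark at which they are still at risk); the field names, the choice of history summaries, and the example records are hypothetical, and the prognostic model fitted to these rows is not shown.

```python
from typing import Dict, List


def build_landmark_rows(patients: List[Dict], landmark_days: List[int]) -> List[Dict]:
    """Stack one row per (patient, landmark) using only history up to the landmark.

    Each patient dict needs: "id", "scr" as a list of (day, value) pairs,
    "event_day" (day of death or dialysis, or None) and "last_followup_day".
    """
    rows = []
    for p in patients:
        event_day = p["event_day"]
        end_day = event_day if event_day is not None else p["last_followup_day"]
        for lm in landmark_days:
            if end_day <= lm:
                continue  # event or censoring already occurred: not at risk at this landmark
            history = [(d, v) for d, v in p["scr"] if d <= lm]
            if not history:
                continue  # nothing measured yet, so no predictors available
            last_day, last_scr = max(history)
            rows.append({
                "id": p["id"],
                "landmark_day": lm,
                "last_scr": last_scr,
                "days_since_last_scr": lm - last_day,
                "max_scr_to_date": max(v for _, v in history),
                "event_after_landmark": event_day is not None,
                "days_to_event_or_censoring": end_day - lm,
            })
    return rows


example = [
    {"id": "A", "scr": [(1, 1.0), (3, 2.0), (7, 1.5)], "event_day": 10, "last_followup_day": 10},
    {"id": "B", "scr": [(1, 0.8), (7, 0.9)], "event_day": None, "last_followup_day": 25},
]
for row in build_landmark_rows(example, [3, 7, 14]):
    print(row)
```

Because the same patient contributes several rows, any model fitted to the stacked data would need to account for that within-patient correlation.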

llynn · May 21, 2026, 11:27pm

Respectfully I agree with David here. 2027 is the 40 year dark anniversary of the generation of non-disease specific ordinal scores (synthetic data generating processes) as outcomes for randomized trial use in critical care.

This popular article tells the sad story of how a pathological science based enterprise emerged in critical care to displace Fisher/Hill methodology. 30 such renal scores could be created each compared to mortality. You could add lactate, bili, hypotension, the list goes on. Then the scores could be compared to each other in terms of mortality or dialysis prediction. There is no limit to the study of synthetic data generating processes but reform is coming and these SDGP will be removed from the RCT tool chest.

davidcnorrismd · May 22, 2026, 10:12am

It’s interesting that you seem to identify ordinal scoring per se as a problem here, Lawrence. I had the opportunity to attend Frank’s latest online RMS course, and was deeply impressed by how ordinal cumulative probability models achieve a unification of so much antecedent work, as laid out in RMS §25.3. Any time something like this happens, one has to sit up and take notice. I even remarked to my friend @Drew_Levy that this felt “very ‘category theory’”.

Taking that quip seriously, I would return to the Preface to Emily Riehl’s Category Theory in Context:

The category-theoretic perspective can function as a simplifying abstraction,
isolating propositions that hold for formal reasons from those whose proofs
require techniques particular to a given mathematical discipline.

(emphasis is mine)

We are sometimes insufficiently reflective about the nature and degree of the abstractions we are using. Much discussion of DAGs on Datamethods treats them as having an ontological status, rather than (properly) as high abstractions. I like Whitehead’s term fallacy of misplaced concreteness, discussed in this passage from Whitehead’s Process and Reality (Ch I, §III):

… the chief error in philosophy is overstatement. The aim at generalization is sound,
but the estimate of success is exaggerated. There are two main forms of such overstatement.
One form is what I have termed elsewhere, the ‘fallacy of misplaced concreteness’.
This fallacy consists in neglecting the degree of abstraction involved when an actual
entity is considered merely so far as it exemplifies certain categories of thought.
There are aspects of actualities which are simply ignored so long as we restrict thought
to these categories. Thus the success of a philosophy is to be measured by its comparative
avoidance of this fallacy, when thought is restricted to its categories.

The other form of overstatement consists in a false estimate of logical procedure in respect to
certainty, and in respect of premises. Philosophy has been haunted by the unfortunate notion
that its method is dogmatically to indicate premises which are severally clear, distinct, and
certain; and to erect upon those premises a deductive system of thought.

(emphasis again is mine)

Paraphrasing Riehl, I am inclined to think that we should take care not to look past “techniques particular to a given clinical discipline.” This holds especially true for Critical Care, whose underlying scientific basis — as distinct from its current research enterprise — has matured to a point that ought to open up marvelous opportunities. Consider that an ICU is a veritable human physiology laboratory, with an abundance of high-frequency, multivariate time-series.

llynn · May 22, 2026, 12:52pm

David, if critical care scientists thought as deeply as you, we would not be in this state. I should have been more clear that the primary pathological step is the reification of simplistic ordinal composites as the presented renal score.

S=1 ->Delta SOFA 2 ->{“sepsis”}

The score threshold is transformed into a presumed biological entity perceived as capable of generating a disease-level estimand, E2. This is the symbol–mechanism substitution fallacy. If a single courageous consensus-panel–dwelling critical care scientist fully understood the implications of that sentence, the entire epistemological foundation of modern syndrome-based critical care research would begin to unravel.

Of course, once the gate is reified (which they have been for ~ 40 years) then:

the synthetic category is treated as biologically coherent,
the RCT estimand is treated as disease-specific,
and E3 is mistaken for E2.

Thus, the symbol–mechanism substitution fallacy is essentially a modern causal-inference manifestation of the deeper philosophical error you describe: a symbolic abstraction, in this case, a synthetic data-generating process (SDGP) is mistaken for a coherent causal mechanism.

Critical care medicine, despite possessing perhaps the richest physiologic information density in medicine, has for ~4 decades repeatedly collapsed relational time-series into simplistic ordinal composites and then treated those abstractions as disease equivalents in the Bradford Hill sense. The result is the repeated generation of unstable third-layer estimands masquerading as biologic knowledge.

The conflation of ARDS criteria with severe COVID pneumonia was perhaps the clearest recent example. An ordinal syndrome gate was substituted for an actual disease mechanism, and the resulting methodological error contributed to global harmful guideline extrapolations.

If the warnings and mathematical presentations made here in Datamethods had been understood and heeded that global error could have been prevented.

Sometimes I wonder if statisticians, trialists and guideline committees realize how close to the bedside they are. One never sees the type of deep, sometimes painful, intellectual introspection typical of a physician after treatment failure of a patient. Yet failure at the guideline level is profoundly more harmful but simply generates the “heterogeneous syndrome alibi” and no substantive methodological investigation. The cycle begins again. No one seems deeply saddened and intellectually stimulated and introspective as a physician would be.

Unfortunately, the synthetic data-generating process enterprise is not easily subjected to any deep philosophical critique because entire research ecosystems, careers, and institutions are built upon it. As Upton Sinclair observed:

“It is difficult to get a man to understand something, when his salary depends on his not understanding it.”

The offending ARDS guidelines were abandoned but nothing else of intellectual substance changed, so there is no reason why the same global error will not occur again, because it is not acceptable to publicly debate the source of the error.

When critical care science, which has the worst record of RCT based guideline reversals for harm of any field, is ready to openly debate E3, then we can wax philosophical. To me the creation of a new real ordinal composite was just an extension of Bone’s SIRS from 1987.