I am trying to fold in some of the tenets espoused on the website, but I feel I am missing the forest for the trees on a basic fact. I am reading Dr. Harrell’s text (currently Chapter 2).
I am a recovering data torturer and serial data categorizer. I am trying to do better for an upcoming project, but I am a bit confused about fitting splines for a given continuous predictor. I work in drug development and my data are always very noisy. Specifically, I am confused about how to choose and test splines while not looking at the Y variable.
Based on my skimming of the notes and book so far, it seems I would be cheating if I plotted X_1 vs. Y and started tweaking. My understanding is that all of this EDA work should be blinded to Y, but how does one pick splines under those conditions?
With splines, the benefit is that you are not pre-specifying a particular “shape” for the nonlinear relationship (only that it can be fit by a spline). The only things you need to set are the number of knots and their locations. The locations are placed at fixed quantiles of the predictor, and the number is typically set by how the analyst wants to spend the degrees of freedom afforded by the sample size. Neither of these requires looking at Y.
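To make that concrete, here is a minimal sketch (hypothetical data; Hmisc is the helper package that accompanies rms) showing that default knot placement is a function of X alone:

```r
## Minimal sketch, hypothetical data: default knot placement for a
## restricted cubic spline depends only on quantiles of the predictor.
library(Hmisc)

set.seed(1)
x <- rexp(200)   # a made-up skewed continuous predictor; no Y in sight

## For nk = 4 knots, the default locations are the 0.05, 0.35, 0.65,
## and 0.95 quantiles of x (see RMS, Chapter 2)
knots <- rcspline.eval(x, nk = 4, knots.only = TRUE)
knots
```

This is what `rcs(x, 4)` does behind the scenes in a model formula.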
Probably a dumb clarification question, but I would need to plot it against something, right? If it is not Y, what is the second variable?
So say I have serum creatinine levels that I want a spline for… I would need to plot that creatinine in relation to something before I can go figure out the appropriate number of knots. Since I am in the PK world, I would reflexively select the response variable, which would be drug concentration. Also, since I want this predictor in a model to describe Y, isn’t the goal to have a spline that represents that relationship?
I feel I am screwing up something fundamental but basic in this process.
With regression splines it’s mainly about specifying the degrees of freedom (which are limited by the effective sample size and, once in a career, by known linearity). The confidence bands will reflect the degree of flexibility you allow, e.g., the number of knots. In the absence of knowledge about linearity, the RMS book philosophy invites you to use a number of knots $k = \min(l, m)$, where $l$ is the maximum number of knots you’ve needed in situations like this and $m$ is the number of knots that the effective sample size allows you to use. Bottom line: if that doesn’t fit, there’s nothing you can do about it anyway.
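A hedged illustration of that workflow (simulated data with an invented signal; rms package): the knot count is fixed up front, and the confidence bands display the price of the flexibility you bought:

```r
## Sketch with simulated data: pre-specify 4 knots without tuning on y,
## fit, and look at how the confidence bands reflect that flexibility.
library(rms)

set.seed(2)
n <- 300
x <- runif(n)
y <- exp(-2 * x) + rnorm(n, sd = 0.25)  # invented nonlinear, noisy signal

dd <- datadist(data.frame(x)); options(datadist = "dd")

fit <- ols(y ~ rcs(x, 4))   # 4 knots chosen in advance, not tuned on y
plot(Predict(fit, x))       # wider bands are the cost of more knots
```

Had the effective sample size been smaller, the honest move would be to drop to 3 knots (or a linear term) in advance, not to try several and keep the best-looking fit.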
If your Y is drug concentration, then the discipline of PK gives you quantitative theories (PK models) that predict this already. Why would you substitute phenomenological models like splines? In dealing with noisy measures especially, you would want to make fullest use of whatever theoretical knowledge you have.
I just want to make sure we are referring to the same thing. When you say PK theory, are you referring to clearance, volumes, compartments, etc.? If so, yes, that is always my starting point. I was drawn to the RMS package in trying to improve the sophistication with which I layer the statistical model on top of the underlying structural model.
All the theory I currently possess leads me to NLME models in my work (though I will start looking at Bayesian approaches per Prof. Harrell’s suggestion), but I felt some of the tenets behind the RMS package could help. I can see a direct benefit for some of the E-R analyses, looking at clinical readouts or AE risks, but I get lost trying to have it augment PK modeling. Maybe the answer is that it doesn’t.
This question has largely been motivated by comments on stepwise searches, which I do all the time. I was hoping to use some of the techniques in RMS to improve that process, because it does seem like a dice roll. Maybe applying a spline is not the way to improve it, but I will keep reading and trying to improve my skills.
Yes, definitely — the theoretical content supporting compartmental modeling, etc.
One way to clarify this layering would be to describe it formally, perhaps in the mathematical language of function composition (a rough sketch follows). I gather that for you this composition is accomplished in sequential stages of analysis. But there are techniques that might help couple these parts of your model more tightly into a unified Bayesian model. I’m thinking particularly of particle filtering. The R package pomp comes with well-written tutorial material; the compartmental applications come from the author’s field of ecology, but the principles translate readily to PKPD.
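As a rough sketch of that composition (notation mine, not from any particular text): for a concentration $C_{ij}$ observed in subject $i$ at time $t_{ij}$,

$$
C_{ij} = f_{\mathrm{PK}}\!\left(t_{ij};\, \theta_i\right) e^{\varepsilon_{ij}},
\qquad
\log \theta_i = g(x_i) + \eta_i,
$$

where $f_{\mathrm{PK}}$ is the structural (compartmental) model and $g$ is the covariate model, which is where spline terms in a continuous covariate such as creatinine would live. Fitting the stages sequentially ignores how uncertainty in $g$ should propagate through $f_{\mathrm{PK}}$; a unified Bayesian treatment fits the composition as a whole.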
As for NLME, I’d add that my own persistent failure to ‘get’ NLME was what finally drove me into the arms of the Bayesians, as I describe in a comment to this CrossValidated answer.