RMS General Regression

The deeper question is whether having a better fit is worth it, i.e., you need to consider the bias-variance tradeoff. Thinking more shallowly, you can compare a spline fit with a nonparametric smoother as a feel-good check, especially if there are no other covariates.

We need to think more about the impact of lack of fit and of fitting more parameters, rather than just thinking about goodness of fit. For more see this.

A higher p-value for non-linearity indicates that the restricted cubic spline is closer to a straight line; however, the graph sometimes disagrees with the p-value for non-linearity. For example, in the case of C, I perceive the curve as not being a straight line, yet its p-value for non-linearity is 0.44.
Also, sometimes the p-value is less than 0.05, but the curve appears to be a straight line.

Associations of serum vitamin C concentrations with risk of all-cause and cause-specific mortality among individuals with and without type 2 diabetes

As for me, I do not trust the p-value, so I prefer to judge linearity from the graphics. Different sample sizes may explain the inconsistency. In the case of C, I don’t think it represents a straight line.

I would like to know your insights when the restricted cubic spline and p-value for non-linearity display contrasting results regarding linearity.

You are right to place little trust in p-values. Here a large p-value means that you did not assemble enough evidence to contradict a supposition of linearity. This is far from saying that there is evidence of linearity, especially since large p-values can result from small sample sizes. The confidence bands tell you more.

It’s better to assume that everything is nonlinear and then to just quantify how much of the estimated relationship is nonlinear. So perhaps more valuable quantities to put on the plots would be partial pseudo R^2 for the likelihood ratio chunk test of association for each variable, and what proportion of the likelihood ratio \chi^2 is due to estimated nonlinear effects. For more about R^2 see here.
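As a back-of-the-envelope illustration of the second quantity, with made-up χ² values (not from any real fit): the proportion of a variable's likelihood ratio χ² that is due to its nonlinear terms is just the ratio of the nonlinear chunk test to the total association test.

```python
# Hypothetical chi-square values, e.g. as reported in an anova-style table:
chisq_total = 30.0      # LR chi-square for the whole variable (all spline terms)
chisq_nonlinear = 9.0   # chunk test for the nonlinear spline terms only

# Proportion of the variable's explanatory information that is nonlinear
prop_nonlinear = chisq_nonlinear / chisq_total
print(round(prop_nonlinear, 2))  # 0.3
```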

2 Likes

Thank you very much for the clarification, Professor Harrell.

Assuming

My outcome (y) is a continuous variable, for example: children’s IQ scores.
I am fitting a simple linear model to predict y, and one of the terms in my model is an interaction between lead exposure (continuous) and child sex (1 = Male, 2 = Female); the reference level is 1 = Male.

If these are my estimates from fitting this model

Terms                        Estimate    Std. Error   p-value
Intercept                    2012.12     1234.56      0.1234
Lead(Exposure)                 -6.5432      4.321     0.1122
Sex(Female)                   -30.98      112.12      0.0005
Lead(Exposure)*Sex(Female)     21.12        4.321     0.013

How do I correctly interpret the estimates from the interaction term, including their CIs (confidence intervals)? I have some ideas about this, but I have been looking at data for far too long and have become a bit rusty on a few things. Please pardon my ignorance. Thanks.

It’s hard to explain the interpretation of the estimate for the interaction term without first looking at the other estimates. As the reference group for sex is set to “Male” and since you haven’t centered the lead exposure, the interpretation of the coefficients is as follows:

  • Intercept: The estimated IQ-score for a male with a lead exposure of 0. I.e. the estimated IQ-score for such a man is 2012.1!! This is hardly realistic. If you want an estimate that makes more sense you could mean-center the lead exposure variable and re-run the analysis using this variable instead of the actual lead exposure. This will have the effect of changing the interpretation of the new intercept to be the estimated IQ-score for a man with an average/mean exposure to lead.

  • Sex(Female): The estimated difference in IQ-score between a male and a female that both have a lead exposure of zero. I.e., the estimated IQ-score is 31.0 points lower for the woman than the man. If, as suggested above, you mean-center the lead exposure and rerun the analysis, the interpretation of the new estimate becomes the difference between a man and a woman that both have an average/mean lead exposure.

  • Lead(Exposure): The slope of the regression line for men. I.e., it is the difference in IQ-score between two men where one has a lead exposure that is 1 unit higher than the other. The person with the higher lead exposure will have an estimated IQ-score that is 6.5 points lower than the person with the lower lead exposure.

  • Lead(Exposure)*Sex(Female): The difference in slopes between the regression lines for men and women. I.e., the slope of the regression line for women is 21.1 higher than the slope for men, which means that the slope for women is 14.6 (= -6.5 + 21.1). An approximate 2-sided 95% CI for the difference in slopes is given by 21.12 +/- 1.96 x 4.321 = [12.7 ; 29.6].
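The slope and CI arithmetic above can be sketched in a few lines, using the estimates from the table (1.96 is the usual normal-approximation multiplier for a 2-sided 95% CI):

```python
# Estimates from the fitted model (see the table above)
beta_lead = -6.5432        # slope for males (the reference sex)
beta_interaction = 21.12   # difference in slopes, females minus males
se_interaction = 4.321     # standard error of the interaction term

# Slope of the regression line for females
slope_female = beta_lead + beta_interaction
print(round(slope_female, 1))  # 14.6

# Approximate 2-sided 95% CI for the difference in slopes
z = 1.96
lo = beta_interaction - z * se_interaction
hi = beta_interaction + z * se_interaction
print(round(lo, 1), round(hi, 1))  # 12.7 29.6
```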

For how to correctly interpret the p-value and 95% CI I suggest you take a look at this thread: https://discourse.datamethods.org/t/language-for-communicating-frequentist-results-about-treatment-effects/934

4 Likes

Dear @f2harrell ,

after using a restricted cubic spline model in a manuscript, I was asked why I did not use smoothing splines or LOESS, and was told that placing the knots at the 0.1, 0.5, 0.9 quantiles is ad hoc.

I am arguing that

  • RCS is better for inference for a general audience
  • it is standard practice to place the knots at those quantiles
  • using three knots is based on the sample size I have
  • it is easy to compare the model with a linear one via a likelihood-ratio test to derive p-values.

Can you recommend any further points or literature on this issue?

Thanks a lot in advance!

1 Like

I’d say that

  • based on your effective sample size you wanted to spend two degrees of freedom (parameters) in estimating the extent and shape of association
  • there was no subject matter knowledge to guide the specification of the shape of the relationship
  • placing knots at quantiles as you did puts an equal sample size between any two consecutive knots, following the idea that you allow for shape changes where the data are rich enough to estimate the relationship
  • regression splines provide equations and simple inference (unlike smoothing splines)
  • regression splines provide estimated curves that are usually in great agreement with estimates from nonparametric regression methods or smoothing splines
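The knots-at-quantiles idea can be sketched numerically. This is an illustration with a toy predictor, not the actual rms code; it just shows why quantile-based knots put equal sample sizes between consecutive knots:

```python
import numpy as np

# A hypothetical continuous predictor with 100 observations
x = np.arange(1.0, 101.0)

# 3-knot placement in the spirit of rcs(): knots at the 0.10, 0.50 and
# 0.90 quantiles, so shape changes are allowed where the data are dense
knots = np.quantile(x, [0.10, 0.50, 0.90])
print(knots)  # [10.9 50.5 90.1]

# Equal sample sizes fall between consecutive knots
inner1 = int(((x > knots[0]) & (x < knots[1])).sum())
inner2 = int(((x > knots[1]) & (x < knots[2])).sum())
print(inner1, inner2)  # 40 40
```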
3 Likes

I am fitting a linear mixed effects model; the outcome is continuous and I have 5 covariates. One covariate is a discrete variable with 5 levels.

The model converges and I get meaningful estimates for all parameters, but one level of the discrete covariate (x5, level 4) has NA for its standard error:

factors       estimates   SE     statistic   df       p-value
Intercept     10.123      1.45   2.56        100.45   0.67
x1             0.4567     1.46   2.45         97.56   0.34
...
x5-level2      0.1234     1.23   3.45         96.12   0.12
x5-level3      0.1114     1.19   1.45         92.12   0.11
x5-level4      0.0223     NA     NA           NA      NA

I am wondering what could be the cause for NAs in just one sublevel of a factor variable.

Thanks.

What are the frequencies of all the levels of x5?

Hi Frank, good question.
L1, L2, …, L5 are the levels of my covariate x5.
A, B, C, D are the centers, which form the random intercept in my model.

[image: frequency table of x5 levels by center]

L3 seems to be used almost exclusively by one center, so maybe that degree of collinearity between L3 and a random effect is a problem?

1 Like

Thanks Frank, that is in line with reality; some centers strictly use one approach (one level of x5). So your suggestion makes sense.

Hi Frank, you said in your lecture that AIC is useful for evaluating model complexity (linear vs. nonlinear, or interactions). What about the likelihood ratio test; is that not useful when comparing simple vs. complex models?

There are three approaches: LR test (F test in ordinary regression), AIC, and R^{2}_\text{adj}. LR is more compatible with R^{2}_\text{adj} but still different. They ask whether the bigger model is worth the extra parameters, in terms of statistical information. This is an in-sample assessment, i.e., they assess predictive discrimination for the current matrix of all predictor values. R^{2}_\text{adj} penalizes for p, the number of predictor parameters. AIC penalizes for 2p, which contains an additional p penalty for asking about out-of-sample predictive discrimination. Minimizing AIC is an attempt to achieve the best log-likelihood in future samples when \hat{\beta} is frozen from the original sample.
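A toy numeric sketch of the two penalties, with hypothetical log-likelihoods for two nested models (none of these numbers come from a real fit):

```python
# Hypothetical log-likelihoods for a smaller and a larger nested model
ll_small, p_small = -520.0, 3   # 3 predictor parameters
ll_big,   p_big   = -514.0, 6   # 3 extra parameters (e.g. nonlinear terms)

# LR chi-square on p_big - p_small degrees of freedom (in-sample question:
# is the bigger model worth the extra parameters?)
lr_chisq = 2 * (ll_big - ll_small)
print(lr_chisq)  # 12.0

# AIC = -2 log L + 2p; lower is better (out-of-sample question)
aic_small = -2 * ll_small + 2 * p_small
aic_big   = -2 * ll_big + 2 * p_big
print(aic_small, aic_big)  # 1046.0 1040.0 -> the bigger model wins here
```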

2 Likes

This question is about centering. I have worked on studies where I was asked to center some continuous variables, and on a few other studies where we did not center the variables. Also, if I remember correctly, LASSO models require centering continuous variables. What is the correct approach here? When should we center variables before analysis, and when is it okay to skip centering? Thanks.

No methods I know of require centering predictors unless you are using very old linear algebra software in fitting. lasso depends only on careful scaling.
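A minimal sketch of what careful scaling means in practice before a lasso fit (toy data with made-up column scales; the point is that the penalty on |beta_j| is scale-dependent, so predictors are usually put on a common SD first):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three hypothetical predictors on wildly different scales
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])

# Standardize the columns so the lasso penalty treats them comparably
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.std(axis=0))  # each column now has SD 1
```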

1 Like

Thanks Frank, maybe this has more utility in interpretation?

No, except possibly in interpreting interaction terms. But I don’t interpret them alone anyway.

1 Like

Dear prof. @f2harrell, I’m trying to reproduce the results obtained with the summary() function. Specifically, I want to manually estimate the standard error of the interquartile-range effect of a continuous variable included in a model fitted with cph.
I understand how to do that starting from vcov(model), in order to get the variance associated with my predictor of interest.
My question is how to do that when the continuous variable is fitted using rcs(). In that case, vcov(model) generates a matrix in which every spline term is included. Starting from this, how can I estimate the standard error for the continuous variable?

require(rms)
require(survival)
fit <- cph(Surv(dtime, death) ~ rcs(age, 4), data = rotterdam, x = TRUE, y = TRUE)
vcov(fit)

Thanks
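For reference, the underlying computation is the standard one for a linear contrast: if k is the vector of differences in the spline basis values between the two predictor settings being compared, the standard error is sqrt(kᵀ V k), where V = vcov(fit). A numeric sketch with a made-up covariance matrix and contrast vector (not values from the rotterdam fit):

```python
import numpy as np

# Hypothetical covariance matrix for three spline coefficients,
# playing the role of vcov(fit)
V = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])

# Hypothetical contrast vector: differences in the spline basis functions
# evaluated at the two settings being compared (e.g. Q3 vs Q1 of age)
k = np.array([1.0, 0.5, 0.2])

# Standard error of the contrast: sqrt(k' V k)
se = np.sqrt(k @ V @ k)
print(round(se, 4))  # 0.2879
```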