# Reduction of highly correlated variables in radiation oncology

I don't know whether anyone has worked with data from radiation oncology before, but in this field we often wish to build models that predict the risk of a patient developing some kind of toxicity after radiation. Briefly, what happens is this: the patient has a CT scan, internal structures/organs are delineated on that scan, and a treatment/radiation plan is then constructed based on the scan and the delineated structures. In that way we have, even before the patient is treated, a simulated plan of how the radiation dose will be distributed inside the patient. And hopefully this is also how it will be when the patient is actually treated (although there are many uncertainties attached to this, but that is another matter).

So we can calculate (in advance) different metrics based on this simulated treatment plan. For example, if we treat a patient with prostate cancer, the plan will deliver a conformal dose to the prostate, and due to the nature of radiation we cannot fully avoid nearby organs like the rectum or bladder. They (and other normal tissues) will receive some amount of dose, and it is this dose to healthy tissue/organs that eventually results in some form of complication/toxicity (perhaps in combination with other factors), which doctors can later detect, i.e. we have an event.
These metrics can, as stated, be calculated in advance, before the patient even has his or her treatment, based on the simulated plan. So I can ask: "How much dose does the hottest 2% of the rectum volume receive?" and calculate that for each patient. I could also ask: "How large a part of the rectum volume receives x amount of dose?", and so on. And we can do that in many steps if we like (e.g. x amount of dose, x + 5, x + 10, and so on), and get a different answer for each. However, as you can imagine, these different steps will be highly correlated (at least most of them).
However, these steps/thresholds matter, because they are what we base constraints on. For example, we know that certain organs shouldn't receive more than x amount of dose, or else the patient will most likely die, or a toxicity will occur. So we do want to seek out these "thresholds", if they have an effect of course. Today, most choices are based on single constraints, not on risk, so the idea is to put these thresholds into a model, together with (perhaps) other important factors.
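To make the two kinds of metrics concrete, here is a minimal sketch (in Python, with hypothetical function names, assuming the plan has been reduced to a list of per-voxel doses with equal voxel volumes) of the two families, often written Dx% and Vx:

```python
import math

def d_percent(doses, pct):
    """Dx%: the dose (Gy) received by at least the hottest `pct` percent
    of the structure's volume (e.g. D2% for the rectum)."""
    s = sorted(doses, reverse=True)
    k = max(1, math.ceil(len(s) * pct / 100.0))  # hottest pct of voxels
    return s[k - 1]

def v_dose(doses, threshold):
    """Vx: the percent of the structure's volume receiving at least
    `threshold` Gy."""
    return 100.0 * sum(1 for d in doses if d >= threshold) / len(doses)

# Toy structure of 100 equal-volume voxels with doses 1..100 Gy:
doses = list(range(1, 101))
d2 = d_percent(doses, 2)      # dose to the hottest 2% of the volume
v50 = v_dose(doses, 50.0)     # percent of volume receiving >= 50 Gy
```

Stepping the threshold (V50, V55, V60, ...) on the same dose distribution is exactly what produces the highly correlated columns described above.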

And here is my problem: how does one build a model with these highly correlated variables? In my current data set, where I am trying to build a prediction model, I have maybe 10 "normal" variables (i.e. not calculated, but actually collected patient characteristics, or variables that stand alone rather than belonging to a group of similar calculated variables like those described above), and then 70 other variables of the types above.

It's a rather large data set (at least within my field): almost 1200 patients, with 80 events (the response variable). But I am really unsure how to approach this. Ideally I would like to end up with a few variables (2-5): some of the stand-alone variables, plus one "best" variable from each of the highly correlated groups.

I have tried stepwise selection (which I know I shouldn't). But that doesn't reduce the set much, and it doesn't help me pick the "right" variable from the correlated groups, since many of them are still left. I have tried a redundancy analysis, but that removes far too much, and the variables that remain seem off to me (at least compared to what others usually report in articles). I have also tried PCA, but the result was a first principal component loading on almost all the variables from the correlated groups. And I have tried a few other things that didn't work out the way I wanted, including following what others have done before, for example as described in the M&M of this paper (https://sci-hub.tw/10.1016/j.radonc.2016.04.005 (just type the captcha to see the article)).

But maybe I am hoping for too much of a reduction, I don't know.

So basically I am looking for advice on how to proceed, if that is even possible. I can post my data set if that would help.

I think someone will give you an answer more tailored to the problem, but my initial thought is to recommend sparse principal component analysis, e.g. the R package `pcaPP`.
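This is not the `pcaPP` implementation itself, just a minimal pure-Python sketch of the idea behind sparse PCA: a rank-1 power iteration with soft-thresholding of the loadings (in the spirit of the sPCA-rSVD approach). The toy data and the penalty `lam` are made up for illustration; the point is that the L1 penalty drives loadings of weakly related columns to exactly zero, so a correlated group of dose metrics is summarized by a few nonzero loadings:

```python
import math

def soft_threshold(w, lam):
    # Shrink each loading toward zero; entries with |w_j| <= lam become exactly 0.
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in w]

def sparse_pc1(X, lam, n_iter=200):
    """First sparse principal loading vector via alternating rank-1
    fits with an L1 soft-threshold on the loadings."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # center columns
    v = [1.0 / math.sqrt(p)] * p                # start from equal loadings
    for _ in range(n_iter):
        u = [sum(Xc[i][j] * v[j] for j in range(p)) for i in range(n)]
        nu = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / nu for x in u]                 # left singular direction
        w = [sum(Xc[i][j] * u[i] for i in range(n)) for j in range(p)]
        w = soft_threshold(w, lam)              # sparsify the loadings
        nw = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / nw for x in w]
    return v

# Toy data: columns 0-2 are noisy copies of one signal (a correlated
# "dose-metric group"), columns 3-4 are unrelated.
X = [[math.sin(i),
      0.9 * math.sin(i) + 0.05 * math.cos(i),
      1.1 * math.sin(i),
      0.2 * math.cos(3 * i),
      0.2 * math.sin(5 * i)] for i in range(40)]

v = sparse_pc1(X, lam=1.0)
# Loadings concentrate on columns 0-2; columns 3-4 are exactly zero.
```

In practice one would of course use an established implementation (such as the suggested `pcaPP` in R) and tune the sparsity penalty; the sketch only shows why sparse loadings make it easier to pick a representative variable per correlated group than ordinary PCA does.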