Design ideas for non-inferiority trials where event rates are low

In oncology, we often encounter situations where the event rate is quite low and de-escalation trials are planned to reduce toxicity and potentially improve QoL. In cancer sites with a large patient population, non-inferiority trials using traditional frequentist approaches are possible. However, when the cancer is less common, this becomes a problem. My question is how to design these trials safely.

A practical example.

We know that in patients with carcinoma cervix who are node negative (i.e., the cancer has not metastasized to the pelvic nodes), the risk of recurrence in the pelvic nodes is quite low (3 - 5%). Most of the toxicity that patients encounter during treatment (nausea, vomiting, diarrhea, etc.) is related to the volume of bowel and bone marrow exposed to radiation. We have high-quality data from several randomized trials as well as prospective studies showing that reducing the dose to these areas can result in meaningful reductions in toxicity and improvement in patients' quality of life. Not only acute toxicity but also late toxicity is reduced.

Traditionally we have been prescribing a pelvic radiotherapy dose of 45 Gy in 25 fractions over 5 weeks to these patients (Gy is the unit of dose, and fractions is the number of treatments over which the total dose is divided). Given the low event rates, it is tempting to design trials in which dose de-escalation is considered.

A standard frequentist design non-inferiority trial would involve in excess of 2500 patients (depending on the non-inferiority margin). This is obviously inefficient and difficult to conduct.
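To illustrate how strongly the sample size depends on the margin, here is a minimal sketch using the usual normal-approximation formula for a non-inferiority comparison of two proportions. It treats recurrence as a simple binomial outcome (ignoring time to event), assumes equal allocation, and uses a 5% rate in both arms; the margins, the one-sided alpha of 0.025, and the 90% power are illustrative assumptions, not values from any protocol.

```python
from math import ceil
from statistics import NormalDist


def ni_sample_size_per_arm(p_control, p_experimental, margin,
                           alpha=0.025, power=0.90):
    """Approximate per-arm n for non-inferiority on a risk difference.

    Normal approximation, equal allocation, one-sided alpha.
    H0: p_experimental - p_control >= margin.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    variance = (p_control * (1 - p_control)
                + p_experimental * (1 - p_experimental))
    # Distance between the margin and the assumed true difference.
    effect = margin - (p_experimental - p_control)
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical absolute margins; both arms assumed to have a 5% event rate.
for margin in (0.05, 0.03, 0.02):
    n = ni_sample_size_per_arm(0.05, 0.05, margin)
    print(f"margin {margin:.0%}: {n} per arm, {2 * n} total")
```

Because the required n scales with the inverse square of the margin, halving the margin roughly quadruples the trial size.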

What is currently done

A good example of a similar trial design is the UPGRADE-RT trial (Uniform FDG-PET guided GRAdient Dose prEscription to reduce late Radiation Toxicity (UPGRADE-RT): study protocol for a randomized clinical trial with dose reduction to the elective neck in head and neck squamous cell carcinoma, BMC Cancer). The study protocol gives the following details for the sample size calculation:

  1. Primary endpoint: Reduction in toxicity - ‘normalcy of diet’ at 1 year after treatment, measured using the performance status scale for patients with head and neck cancer (PSS-HN).

This study was designed to detect a 10-point difference on the PSS-HN ‘normalcy of diet’ score at 12 months after radiation therapy with a power of 90% at a two-sided significance level of 0.05. An average ‘normalcy of diet’ score of 70 is expected after standard treatment. To achieve this significance level with an unequal randomization ratio (2:1), a total of 300 patients needs to be included.

  2. Secondary endpoint: actuarial rate of recurrence in electively irradiated lymph nodes at 2 years after treatment.

The current rate of recurrence in electively irradiated lymph nodes was estimated to be 5% at 2 years after treatment [17]. An equal rate of recurrence is expected in the intervention arm, despite elective dose de-escalation. A recurrence rate of ≥10% will be considered clinically relevant and unacceptable. This difference can be detected with the number of patients planned for the primary outcome of the study and a one-sided α = 0.10.

The question

As can be seen above, the results of such trials are difficult to implement in practice because they allow the experimental arm (dose reduction) to have a nodal failure rate that is double that of the control arm. Most oncologists will not accept a failure rate this high. In practice, however, the actual absolute difference in recurrence rates between arms in most studies is around 1 - 3%.

What needs to come out of the study is how probable it is that we will see a failure rate in excess of, say, 5%, given the result of the trial. I would like to know whether it is possible to design trials that allow us to draw this conclusion. Maybe using Bayesian trial designs? Note that recurrence is actually a time-to-event endpoint in these studies and not a binomial endpoint.
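As a rough sketch of the kind of statement I have in mind, the snippet below computes the posterior probability that the failure rate exceeds a threshold after observing a given number of failures. It is deliberately simplified: it treats recurrence as binomial (which, as noted, it is not) and assumes a flat Beta(1, 1) prior. The function name, the arm size of 150, and the event counts are hypothetical.

```python
from math import comb


def prob_rate_exceeds(events, n, threshold, prior_a=1, prior_b=1):
    """P(rate > threshold | data) under a Beta prior with integer parameters.

    The posterior is Beta(a, b) with a = prior_a + events and
    b = prior_b + n - events. For integer a and b,
    P(rate > t) = P(Binomial(a + b - 1, t) <= a - 1).
    """
    a = prior_a + events
    b = prior_b + n - events
    m = a + b - 1
    return sum(comb(m, j) * threshold ** j * (1 - threshold) ** (m - j)
               for j in range(a))


# Hypothetical results: failures observed among 150 de-escalated patients.
for events in (3, 6, 9):
    p = prob_rate_exceeds(events, 150, 0.05)
    print(f"{events}/150 failures: P(rate > 5%) = {p:.3f}")
```

A real design would need a time-to-event model and a carefully justified prior; this only shows the form of the conclusion.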

An even better endpoint may be a composite endpoint where we estimate the probability that the patient will be alive, disease free and toxicity free at, say, 2 - 3 years. How much worse can we expect the dose de-escalated arm to be?


Bayesian designs can always help, but in this case the worth of using a high-resolution endpoint will swamp any other statistical issue. An outcome variable carrying one bit of information has the lowest possible amount of information, other than an endpoint that no one has (incidence = 0). A high-resolution composite endpoint would make a world of difference sample-size-wise and would help us get away from seeking miracles, as is done when effect sizes are specified that are more than clinically relevant.

It’s a useful mind game to create an ideal endpoint then try to approximate that with a realistic protocol. Start by considering a patient-oriented outcome scale with 100 levels that is shown to have high test-retest reliability, and use it to make weekly assessments for 2 years, with clinical events overriding the 100 levels, adding levels 101 and 102. The power and interpretability gains will be dramatic. Then scale it back to realism seeking the best bang for the buck.


Thanks Dr Harrell.
To understand this clearly I will go through the example study protocol. Please do correct any amateurish mistakes I make along the way.


Two group parallel randomized trial with equal allocation in each group.


Patient with node negative cervical cancer planned for definitive radiotherapy.


Control Arm : Pelvic external beam radiotherapy to a dose of 45 Gy in 25 fractions over 5 weeks along with 5 cycles of concurrent chemotherapy and brachytherapy.
Experimental Arm : Pelvic external beam radiotherapy to a dose of 41.4 Gy in 23 fractions over 5 weeks along with 5 cycles of concurrent chemotherapy and brachytherapy.


Difference between Control Arm and Experimental arm at 2 years for survival and patient reported outcomes.


A composite patient reported endpoint created using the EORTC QLQ C30 and CX24 scales.
The EORTC QLQ C30 questionnaire has 30 items related to the following domains:

  1. Global health QoL
  2. 5 functional domains
  3. 8 symptom domains
  4. 1 on financial difficulty

Recently Giesinger et al. have also proposed a summary score for the EORTC QLQ C30 that incorporates information from 27 of the 30 items and ranges between 0 and 100, with higher scores indicating a better outcome (better function and fewer symptoms).

The EORTC CX24 questionnaire has 24 items and was designed to capture the symptom burden of patients with cervical cancer. It gathers additional information on:

  1. Symptom experience (13 items)
  2. Body image
  3. Sexual / vaginal function
  4. Lymphedema
  5. Peripheral neuropathy
  6. Menopausal symptoms
  7. Sexual activity and sexual enjoyment

Note that the proposed PROM questionnaire set captures information on most of the common symptoms that patients with cervical cancer face due to the disease and its treatment. Each of the domains is again scored from 0 to 100, where 100 represents the best outcome.

Additionally there are the following clinical outcomes of interest:

  1. Death due to any cause (anticipated in about 10% - 15% patients at 2 years)
  2. Distant metastases (anticipated in about 10 - 20% of patients at 2 years)
  3. Para-aortic nodal recurrence (anticipated in about 5 - 8% patients at 2 years). A portion of these recurrences can be salvaged (approximately 20%)
  4. Pelvic nodal recurrence: anticipated to be about 3 - 5% at 2 years. This is more difficult to salvage than para-aortic nodal recurrence, although local treatment approaches like SBRT may be used.
  5. Local recurrence: anticipated to be about 2 - 3% at 2 years. If isolated, patients can be salvaged with exenterative surgery.

Note that the local recurrence rates are unlikely to differ, as the dose received by the cervix will be kept the same with the use of brachytherapy.

Also noteworthy is that patients may have various combinations of recurrences.

The ideal state would be one where the patient has the highest PROM with none of the recurrences.


We generally have PROMs completed weekly during the course of treatment (so 5 weeks). After that patients can complete PROMs on each followup visit which would be done three monthly for the first 2 years. At each visit we have:

  1. Whether the patient is dead or alive
  2. Whether the patient has a recurrence at any of the specified sites (multiple sites possible).
  3. PROM scores of the:
    a. 15 item summary score from EORTC QLQ C30 (0 - 100)
    b. 13 item symptom experience score from EORTC QLQ CX24 (0 - 100)
    c. Additional items from EORTC QLQ CX24 - each with score 0 - 100
    d. Global QoL from EORTC QLQ C30 (0 - 100) - this is a very poor and coarse differentiator and it takes a lot to make it change.

How can we combine these to make the composite endpoint? Also, a weekly assessment of QoL during follow-up will not be practical. However, a three-monthly or even a monthly assessment is feasible.

Another question is how we should weight the outcomes. Death and disease recurrence are of course always worse than a small change in PROM.

These kinds of measurements are suited to ordinal longitudinal models. Clinical event overrides such as death are easy to handle, and no spacing is assumed between levels of an ordinal response variable. More difficult is how to combine more than one scale into a composite scale. This can be done by an expert ranking assessment exercise or by purely statistical means. Or just pick the “best” PROM scale.

Sometimes the best approach is to regress patient subjective global assessments on the multiple scales using ordinal regression, to find out how to weight the various scales and possibly how to transform them.

Thanks once again Dr Harrell.
To understand this correctly, I will work through an example. Suppose we give the following ordinal ranks to the clinical outcomes:

  1. Death = 0
  2. Alive with distant metastases = 1
  3. Alive with para-aortic nodal metastases = 2
  4. Alive with pelvic nodal metastases = 3
  5. Alive with local recurrence = 4
  6. Alive without disease = 5

This is easy to understand.
To simplify the PROM part, we can take a single 15 item summary score from the EORTC QLQ C30, whose score will range between 0 and 100, with 100 representing the best possible outcome. Conceivably, when the patient is dead the PROM score is also zero.
Can we then define the endpoint like this?

  1. Death = 0.0
  2. Alive with distant metastases = 1.x
  3. Alive with para-aortic nodal metastases = 2.x
  4. Alive with pelvic nodal metastases = 3.x
  5. Alive with local recurrence = 4.x
  6. Alive without disease = 5.x
    where x represents the PROM score at each time point.

Or should we define it like so?

  1. Death = 0
  2. Alive with distant metastases = 1
  3. Alive with para-aortic nodal metastases = 2
  4. Alive with pelvic nodal metastases = 3
  5. Alive with local recurrence = 4
  6. Alive and disease free = 5 + PROM score.

In this example, PROM scores are ignored completely if the patient has any recurrence.
Which of these would be the better way to define this endpoint?

Keep in mind we are using integers for convenience. The integers are not used in the analysis unless you try to estimate mean outcomes. The score $Y$ could be $100 - \text{PROM}$, overridden by $100 + k$, where $k$ is a 1-5 integer capturing the worst clinical outcome in a given time period, with $k = 1, \ldots, 5$ corresponding to local recurrence, pelvic, para-aortic, distant, death. PROM is in no need of a definition when there is an event override. A primary estimand would be expected time in $Y < y \mid Tx=B, X$ minus expected time in $Y < y \mid Tx=A, X$ for a variety of $y$ and $X$. These are computed by summing state occupancy probabilities from a fitted state transition ordinal longitudinal model. For example, we may be interested in $\Pr(Y < 50 \mid t, Tx, X)$, which is the probability of being alive and disease free and having PROM $> 50$ at time $t$. Taking the area under the probability curve over time gives the mean time in state. If death were the only event, simple arithmetic translates this to restricted mean survival time.
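The "area under the probability curve" step can be sketched as follows. The occupancy probabilities here are made-up numbers standing in for what a fitted state transition model would return at one covariate setting; the function name and the trapezoidal rule are illustrative choices, not part of any particular package.

```python
def mean_time_in_state(times, probs):
    """Area under a state occupancy probability curve (trapezoidal rule)."""
    area = 0.0
    for i in range(1, len(times)):
        area += (times[i] - times[i - 1]) * (probs[i] + probs[i - 1]) / 2
    return area


months = [0, 3, 6, 12, 18, 24]
# Hypothetical values of Pr(Y < 50 | t, Tx) for each arm; not from a
# fitted model or from any trial.
p_tx_a = [0.80, 0.62, 0.70, 0.72, 0.70, 0.68]
p_tx_b = [0.80, 0.70, 0.76, 0.76, 0.73, 0.70]

time_a = mean_time_in_state(months, p_tx_a)
time_b = mean_time_in_state(months, p_tx_b)
print(f"Expected months with Y < 50 over 24 months: A={time_a:.1f}, B={time_b:.1f}")
print(f"Difference (B - A): {time_b - time_a:.1f} months")
```

If the curve were a survival curve (death as the only event), the same area would be the restricted mean survival time over the 24 months.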

If X does not interact with Tx, the statistical evidence for any benefit (e.g., Bayes posterior probability of any Tx benefit) is identical for all baseline covariate settings X. When one wants to quantify evidence for a non-trivial benefit, that evidence is dependent on X. Patients who are sicker at baseline will logically have more room for more absolute benefit than a less sick patient.


Thanks for that elaborate explanation, Dr Harrell.
To understand this more clearly I will take the scores of a single patient in the study

| Time point | PROM score | Event | Y value |
|---|---|---|---|
| 0 months | 67 | None | 33 |
| 3 months | 54 | None | 46 |
| 6 months | 64 | None | 36 |
| 12 months | 70 | None | 30 |
| 15 months | NA | Unknown | Missing |
| 18 months | NA | Unknown | Missing |
| 20 months | 40 | Pelvic nodal recurrence | 102 |
| 36 months | NA | Death | 105 |

In this example, the patient did not turn up for the 15 and 18 month follow-up visits, presented with a recurrence at 20 months, and passed away at 36 months.

In this way we will have a series of longitudinal scores for each patient in the trial. Please let me know if I have understood this correctly.
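For concreteness, here is a small sketch of how I would code the Y value for each visit, following the 100 - PROM scheme with event overrides of 100 + level. The function and dictionary names are my own, and the visit records simply restate the example patient above.

```python
# Event overrides: 100 + level, with level 1-5 ordered by severity.
EVENT_LEVELS = {
    "local recurrence": 1,
    "pelvic nodal recurrence": 2,
    "para-aortic nodal recurrence": 3,
    "distant metastases": 4,
    "death": 5,
}


def ordinal_y(prom, event):
    """Return Y for one visit, or None when nothing was observed."""
    if event is not None:
        # A clinical event overrides the PROM score entirely.
        return 100 + EVENT_LEVELS[event]
    if prom is None:
        return None  # missed visit
    return 100 - prom


# (month, PROM score, event) for the example patient.
visits = [
    (0, 67, None), (3, 54, None), (6, 64, None), (12, 70, None),
    (15, None, None), (18, None, None),
    (20, 40, "pelvic nodal recurrence"), (36, None, "death"),
]

y_values = [ordinal_y(prom, event) for _, prom, event in visits]
for (month, _, _), y in zip(visits, y_values):
    print(f"{month:>2} months: Y = {y}")
```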


That is correct, with the only improvement being that you have partial information if you know which events did not occur at 15 and 18 months. If the only information you had was that the patient was still alive, then their ordinal value would be left censored at 104. It’s not much information but it still counts in the log-likelihood function.
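One way to picture this partial information is as bounds on Y for each visit, sketched below under the same 100 - PROM coding with overrides 101-105. A fully observed visit has equal bounds, and a visit where only the worst outcomes are ruled out is left censored at the highest level still possible. The names are illustrative, and the sketch only handles the case where the ruled-out events are the most severe ones.

```python
EVENT_LEVELS = {
    "local recurrence": 1,
    "pelvic nodal recurrence": 2,
    "para-aortic nodal recurrence": 3,
    "distant metastases": 4,
    "death": 5,
}


def y_interval(prom=None, event=None, ruled_out=()):
    """Return (lower, upper) bounds on Y for one visit."""
    if event is not None:
        y = 100 + EVENT_LEVELS[event]
        return (y, y)
    if prom is not None:
        y = 100 - prom
        return (y, y)
    # Nothing measured: step down from the worst level while it is ruled out.
    excluded = {EVENT_LEVELS[name] for name in ruled_out}
    level = 5
    while level >= 1 and level in excluded:
        level -= 1
    return (0, 100 + level)


print(y_interval(prom=70))                # fully observed PROM
print(y_interval(ruled_out=("death",)))   # only known to be alive
print(y_interval())                       # nothing known at this visit
```

A left-censored visit with bounds (0, u) would then contribute Pr(Y <= u) to the likelihood rather than the probability of a single level.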

Advantages of this approach are the ability to handle missed visits, to handle partial information on one visit, and to not have to decide whether an early less severe event is worse than a later less severe event.