BBR Session 7: Study Design, and Crossover Experiment Example

I would see confidence intervals on odds ratios as the optimal endpoint. With a polytomous logistic regression, we have one odds ratio for each outcome group except the reference (one fewer than the number of groups). Family-wise adjustment of confidence intervals is, I believe, less common than family-wise adjustment of p-values, but to me it is just the other side of the same coin.

I would say that the choice concerning family-wise adjustment is meaningless prior to decision making, although this could be argued depending on what one considers decision making to be. During decision making, hypothesis testing at either a family-wise or a separate error rate is sub-optimal because neither optimizes relative to cost/benefit. In a Bayesian framework, the optimal method would be to integrate cost/benefit over the posterior.

Unfortunately, the lack of consensus around family-wise adjustment consumes a lot of time and thought. You might recommend that everyone always specify a cost function and integrate over the posterior, but in many cases the potential decisions are unclear to those conducting the research.

So is the best advice to spread: flip a coin to decide on family-wise adjustment? At least people wouldn’t have to waste time thinking about whether or not to do it.

I think the problem with this is that people actually use hypothesis tests to make decisions. As a result, it seems that if you want to reduce the chance of people making really bad decisions based on p-values (or posterior probabilities) or confidence intervals (or credible intervals), you would want those quantities to be calibrated to at least a plausible cost/benefit scenario for your audience.

This is roughly how I sometimes think about whether to adjust for multiple comparisons: do we expect the error rates to be so unbalanced (e.g., Type 1 errors having a much higher probability) that we need to do something to bring the two roughly into the same ballpark? If so, then adjust.

I think a good systematic way to suggest the analysis be done is:

  • Stage 1: Do not adjust for multiplicity as a general rule, as long as there is no selective reporting.
  • Stage 2: Report expected error rates given stated assumptions, and/or present a toy decision-making example based on a simple cost/benefit function that sits right in the middle of the typical use case for your audience.

The decision-making example could even be constructed so that the decision (yes/no on the y-axis) is plotted as a function of some fixed cost (on the x-axis), assuming such a simple cost function could be even mildly representative.
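
To make the Stage 2 display concrete, here is a minimal R sketch. Everything in it is hypothetical: the posterior draws are simulated and the utility is simply benefit minus cost for treating, zero for not treating.

# Toy decision display: treat vs not, as a function of an assumed fixed cost.
set.seed(1)
post_draws <- rnorm(4000, mean = 0.3, sd = 0.2)   # stand-in posterior of the effect

costs    <- seq(0, 0.6, by = 0.01)                # grid of assumed fixed costs
eu_treat <- mean(post_draws) - costs              # expected utility of treating
treat    <- as.numeric(eu_treat > 0)              # 1 = treat, 0 = do not treat

plot(costs, treat, type = "s", yaxt = "n",
     xlab = "Assumed fixed cost", ylab = "Decision")
axis(2, at = 0:1, labels = c("No", "Yes"))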

Great thinking. I really like the consideration of cost/benefit in the decision rule as opposed to using an \alpha-level rejection region. Cost/benefit uses proper conditioning whereas null hypothesis testing by itself uses improper conditioning (doesn’t condition on the data but conditions on what you could never know: the status of the null hypothesis).

Regarding the discussion we started in the chat during the session, where I stated that the point estimate of the effect should not be relied upon, especially if the posterior or the utility function is asymmetric, I thought of a clearer reason not to rely on the point estimate.

Suppose that the utility function was simple and symmetric, that the posterior distribution was symmetric, and that the credible interval (and confidence interval) for the effect of interest was very wide, consistent with a small N. You would still have an optimum decision that maximizes expected utility, but that decision is almost as bad as flipping a coin: the expected utilities for, say, two possible decisions are almost the same. We’d like to make the best available decision, but we would also like to be confident in it, i.e., to know that the decision was not a 0.51/0.49 one.
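
A tiny hypothetical R sketch of that point, with made-up numbers: under a symmetric utility and a very wide posterior there is still an expected-utility-maximizing decision, but it wins by a hair.

# Hypothetical numbers only: a symmetric utility and a very wide posterior.
set.seed(2)
wide_post <- rnorm(10000, mean = 0.02, sd = 1)   # small N -> very wide posterior

# Decision A pays 1 when the effect is positive, decision B when it is negative.
eu_A <- mean(wide_post > 0)
eu_B <- mean(wide_post < 0)
round(c(A = eu_A, B = eu_B), 2)   # about 0.51 vs 0.49: an optimum, but barely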

Regarding the point estimate, I would say it depends. If it is a forced-choice decision, then the uncertainty is irrelevant after integrating over the posterior (it just so happens that in the case we discussed, the integration doesn’t depend on the width of the posterior, just the mean). However, I’d have to agree with you in the more real-world case where the decision is not a forced choice (e.g., you could decide one way or the other, or you could wait for more data); then the uncertainty is completely relevant, indicating that a future decision may be substantially improved with more information.

Concerning standard use of cost/benefit as opposed to family-wise adjustment, I wonder if you think it would make sense (at some point) to start a separate thread, maybe titled: Retiring multiple comparisons adjustments based on improper conditioning: providing statistical model descriptions in a standardized unadjusted format and accounting for multiple comparisons through simplistic decision-making scenarios. Maybe this is too extreme; it’s just a thought. The decision of when to use multiple comparisons adjustments has been bothering me for some time.

The only part I’ll disagree with is that I think the flatness of the posterior is essential to understanding how much confidence one should have during decision making, whether or not time is of the essence. Regarding your second paragraph, having a separate topic on multiplicity adjustment is probably a good idea.

I would earnestly appreciate clarification or elaboration of the admonition in BBR Section 5.12.5 (minute 40:35 of the video): “Never use change from baseline as the response variable except in a non-randomized pre-post design (the weakest of all designs)”. This reiterates the emphatic caution about change from baseline (CfB) expressed elsewhere in BBR; RMS; Altman in BMJ, 2001; etc. But, as expressed here, the qualification “except in a non-randomized pre-post design” seems to imply permission or an exemption in the case of observational pre-post designs, which perplexes and confuses me.

Observational pre-post designs with CfB analysis are very prevalent in health research and pharma (especially now with expanding interest in the impact of treatment on PROs), and are likely a great source of scientific noise, confusion, and waste. But like significance testing, p-values, etc., CfB is so prevalent that it is assumed to be natural, a conventional standard, and legitimate; making a cogent case for eschewing CfB and embracing alternative approaches is a challenge for advocates of progressive scientific methodology.

I should note that many, if not most, of these CfB analyses are in uncontrolled, single-arm observational cohorts. While it is almost axiomatic that CfB is NOT the same as the treatment effect, many construe CfB as indicative and significant, even in uncontrolled studies with pre-post comparisons.

In the case of non-randomized pre-post designs, modeling the raw outcome variable of interest (post) with adjustment for baseline (pre) and other covariates to account for heterogeneity and extraneous variance (ANCOVA) is recommended by Frank and others.
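
For concreteness, a minimal sketch of that ANCOVA-style analysis using the rms package (the data frame and variable names are hypothetical, not from any of the sources above):

require(rms)
# Hypothetical variables: post and pre are the follow-up and baseline values of
# the raw outcome; group, age, sex are other covariates to adjust for.
dd <- datadist(d); options(datadist = "dd")
f <- ols(post ~ rcs(pre, 4) + group + age + sex, data = d)
summary(f)   # effects adjusted for baseline and covariates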

Q1: Is there a reason for the qualification “except in a non-randomized pre-post design”? Is there something I do not understand that makes this instance extenuating?

Q2: Are there other approaches that should be considered or advocated for, in addition to ANCOVA?

Q4: What are the essential/major principles underlying the criticism of CfB that will most trenchantly discourage attachment to this practice?

Q5: Can anyone share or direct me to simulations that help illustrate the defects, risks, and principles underlying the criticism of CfB (e.g., https://rpsychologist.com/treatment-response-subgroup)? Or compelling parables: examples of epic failures of CfB?

It will be a difficult campaign to effectively change understanding and practice. I am very grateful for advice, guidance, and input.

I am very very grateful for BBR!!


Replying to myself: to my embarrassment, I see that this query has already been substantially addressed in another thread on datamethods.org, and I refer people to it here: Exemplary QOL analyses that avoid change-from-baseline blunders?.

Additional advice still welcome.

Thank you.

Drew, here’s my take on your main question:

  • Pre-post designs should not be used
  • If someone threatens you if you don’t analyze a pre-post design, and you feel compelled to give in to the threat, your only choice is to compute CfB and do a Wilcoxon signed-rank test or one-sample t-test

Hidden in bullet point 2 is that you must have evidence that CfB is a valid change score, i.e., that the variable was properly transformed before subtraction so that the CfB is independent of B.
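
A minimal sketch of one way to check that requirement, assuming a hypothetical data frame d with paired pre and post columns:

# If the change score depends strongly on the baseline, the variable likely needs
# transforming before subtraction (or CfB should be abandoned altogether).
d$cfb <- d$post - d$pre
plot(d$pre, d$cfb, xlab = "Baseline (B)", ylab = "Change from baseline (CfB)")
cor.test(d$pre, d$cfb, method = "spearman")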


Is this a reasonable proposition? Assuming several outcomes with no inherent ordering, i.e., true co-primary outcomes, we can say:
if they are asking different clinical questions then there is no need to adjust, as per Cook & Farewell, and we usher away multiplicity concerns; and if they are getting at the same thing then we obviate multiplicity by combining them into a composite, which becomes our primary outcome (and we analyse them separately as secondary outcomes).

I must check Mayo’s SIST …

I think that’s a fair statement. But it may be a little clearer to think about it in this additional way:

  • If one wants to answer separate questions with a pre-specified clinical/public health/patient importance ordering, answer those questions separately without multiplicity adjustment
  • If one is using a frequentist design and wants to show that there exists an endpoint for which the treatment provides a benefit (AKA fishing expedition or exploratory study), then use a very conservative multiplicity adjustment
  • For the latter if using a Bayesian approach, pre-specify the skepticism for each endpoint and trust all the posterior distributions as they stand. Or capitalize on Bayes to put more structure on the problem to answer more interesting clinical questions such as:
    • What is the evidence that the treatment improves more than 4 out of 12 outcomes?
    • What is the evidence that the treatment dramatically improves any outcome or provides any improvement on more than 4 out of 12 outcomes?

Such Bayesian options are described further here.
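
For instance, given posterior draws for all 12 endpoint effects, those two questions reduce to simple summaries of the joint posterior. A hypothetical sketch (the draws below are simulated placeholders, and the “dramatic” threshold is arbitrary):

# Hypothetical: draws is a (posterior draws) x 12 matrix of endpoint effects,
# one column per endpoint, positive = benefit; simulated here as a placeholder.
set.seed(3)
draws <- matrix(rnorm(4000 * 12, mean = 0.1, sd = 0.3), ncol = 12)

n_improved      <- rowSums(draws > 0)     # endpoints improved, per posterior draw
p_more_than_4   <- mean(n_improved > 4)   # P(treatment improves > 4 of 12 outcomes)
p_dramatic_or_4 <- mean(apply(draws, 1, max) > 1 | n_improved > 4)  # "dramatic" = effect > 1 (arbitrary)
c(p_more_than_4 = p_more_than_4, p_dramatic_or_4 = p_dramatic_or_4)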


Section 5.12.5 (and a previous session that referred to Section 4.1.2) describes using proportional odds models to analyse a composite outcome made up of a continuous Y and categorical outcomes (e.g., physical function measured 0-100, with 101 = too sick and 102 = death).
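
A minimal sketch of how such an outcome might be coded and fit with rms::orm, purely to make the idea concrete (the variable names and override indicators are hypothetical):

require(rms)
# Hypothetical coding following the description above: physical function 0-100,
# with 101 = too sick to assess and 102 = death added as distinct ordinal levels.
d$y <- with(d, ifelse(dead, 102, ifelse(too_sick, 101, pf)))
f <- orm(y ~ treatment + age, data = d)
f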

Can anyone recommend any further reading on this (eg, a comparison of this versus other methods) and/or good examples of this method used in practice?

See this preprint


Thanks. Any references to work on its use with continuous outcomes, like the creatinine example in the course notes, would also be welcome (e.g., comparisons with other methods of dealing with not-applicable/other cases, such as hurdle models).

Hi All,

I want to hijack this thread to present a study design that I am working on and to seek input from the community on the best analysis approach.

My study aims to analyze changes in some key variables after patients switch from placebo to active treatment. This is essentially a pre-post single-arm design. We are defining ‘baseline’ (t0) as the time of the switch, with a follow-up assessment at 16 weeks post switch (t1). The variables to be assessed are both continuous and categorical.

My approach is to report changes in mean with confidence intervals for continuous variables and differences in proportions for binary outcomes.

For continuous outcomes, a paired t-test will be performed and a confidence interval will be generated by inverting the t-test. I am hoping there is an implementation in the BBR notes that seamlessly executes this step. :slight_smile:
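
For what it’s worth, base R’s t.test() with paired = TRUE already reports the confidence interval obtained by inverting the paired t-test, so no extra machinery is needed (column names below are hypothetical):

# Paired t-test for the t0 vs t1 continuous measurements; the printed CI is the
# inverted paired t-test interval.
t.test(d$t1_value, d$t0_value, paired = TRUE, conf.level = 0.95)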

For categorical variable summaries, I want to know whether prop.test() will yield the right statistic in the context of dependence, since we will be using paired data. My reading of Agresti shows that the difference-in-proportions statistic and its standard error derivation only assume independence of the outcome, not of the predictor, so I was wondering whether this is an appropriate test to report. Probably reporting McNemar’s test will be the best approach, but I do not know of a function in R that will invert the McNemar statistic and report a 95% CI for the difference.

Looking forward to other insights and suggestions.

For continuous variables you might use the more robust and efficient Wilcoxon signed-rank test and its associated point estimate, the pseudomedian (the median of all pairwise averages of the pre-to-post differences). For binary variables McNemar’s test does seem applicable, and you can get a confidence interval for a probability that is part of this test. If memory serves, McNemar’s is based on the following, if X is pre and Y is post: P(Y > X | Y \neq X) = P(\Delta > 0 | \Delta \neq 0) where \Delta = Y - X. You can compute Wilson’s confidence interval on this probability after removing all the tied observations. You can also get confidence intervals on unconditional probabilities, e.g., P(Y = X).
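
A minimal R sketch of those suggestions (variable names are hypothetical; Hmisc::binconf is one way to get the Wilson interval):

require(Hmisc)   # for binconf()
# Hypothetical columns: pre/post are continuous, pre_bin/post_bin are 0/1.

# Wilcoxon signed-rank test with the pseudomedian and its confidence interval
wilcox.test(d$post, d$pre, paired = TRUE, conf.int = TRUE)

# McNemar-style analysis of the paired binary outcome
delta     <- d$post_bin - d$pre_bin        # -1, 0, or 1 per subject
n_untied  <- sum(delta != 0)
n_greater <- sum(delta > 0)
binconf(n_greater, n_untied, method = "wilson")   # Wilson CI for P(Y > X | Y != X)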


Hi Professor @f2harrell,
Apologies for the mistake, sometimes it’s not clear to me where to post my questions. I hope this is now placed in the correct topic. Please let me know if that’s not the case.

50 subjects received 2 interventions. The order of the interventions was randomized. The subjects were assessed before each intervention (Pre), immediately after it (Post_1), and then one more time (Post_2). Each experimental session (Pre + Intervention + Post_1 + Post_2) lasted about 1 hour and took place on a different day (2 days in total). The attached plot shows the mean response (with a 95% nonparametric confidence band) at each time point for each intervention.

[Rplot: mean response over time, with 95% nonparametric confidence bands, by intervention]

The interventions are the same, but one is a “stronger” version of the other.

I’m a bit confused by this design, as it includes 2 within-subjects factors, namely intervention and assessment time.

The goals are:

  • To test for an intervention effect from Pre to Post_1;
  • To test for a different intervention effect between the 2 interventions;
  • To test for a “recovery” effect from Post_1 to Post_2 (i.e., what value the response returns to);
  • To test for a different “recovery” effect between the 2 interventions;

I tried to model these data using the approach suggested here: Biostatistics for Biomedical Research - 7  Nonparametric Statistical Tests

For example:

f <- ols(response ~ intervention * time + id)

Where “intervention” and “time” are modeled as categorical variables.

But I’m not sure this is correct, because I have 3 measurements per subject for each intervention, whereas in the BBR example each subject has only 1 measurement per condition (drug).

I think I could reduce the 3 measurements per subject to only 1, using for instance the AUC of each subject’s curve. But then I don’t know how I could address the goals above: by comparing the mean AUC between the 2 interventions I would simply be testing for an overall difference between the interventions. I think this would be a limitation of any summary measure.
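
(If I did go the summary-measure route, the per-subject, per-intervention AUC could be computed with a simple trapezoid rule; the long-format column names below are hypothetical.)

# Trapezoidal AUC of the 3 measurements per subject and intervention.
trap_auc <- function(time, y) {
  o <- order(time)
  sum(diff(time[o]) * (head(y[o], -1) + tail(y[o], -1)) / 2)  # trapezoid rule
}
auc <- do.call(rbind, lapply(split(d, list(d$id, d$intervention), drop = TRUE),
  function(g) data.frame(id = g$id[1], intervention = g$intervention[1],
                         auc = trap_auc(g$time, g$response))))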

Does anyone have any suggestion?

Thank you.

Dear Professor @f2harrell,
Section 7.9 of BBR says that regression can be used to analyze multi-period crossover studies. Are there any examples available in BBR, RMS, or here? I’m struggling with the case above.

No examples. Take a look at Stephen Senn’s book on crossover studies. I’ll bet the prevailing way to handle that is to use subject random intercepts, just as you can do with 2 periods.
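
As one possible concretization of that suggestion (not taken from BBR; variable names are hypothetical), a subject random-intercept model could look like:

library(lme4)
# Subject random intercepts, with intervention and time as categorical fixed
# effects; only one of several reasonable ways to handle the within-subject
# correlation in this design.
fit <- lmer(response ~ intervention * time + (1 | id), data = d)
summary(fit)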


Ok Professor,
Thank you!

Dear Professor @f2harrell,

I’m a bit confused with regard to the analysis presented in Section 7.8 of BBR. There we have a case of a PO model being applied to a type of experiment where one of the factors (surface) seems to be a within-subjects factor. However, the model does not include ‘subject’ as a predictor, and thus it does not seem to take into account the repeated measurements within subject:

f <- orm(y ~ sex + surface, data=d, x=TRUE, y=TRUE)

Am I missing something?

Thank you for your time.

I was assuming that the two surfaces yield uncorrelated observations.
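
If one did want to guard against within-subject correlation there, one option (my own suggestion, not from the notes) is a cluster sandwich covariance estimate on the same fit:

# Cluster sandwich (robust) covariance, clustering on a hypothetical subject
# identifier id; requires the fit to have been stored with x=TRUE, y=TRUE as above.
g <- robcov(f, cluster = d$id)
g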
