BBR Session 7: Study Design, and Crossover Experiment Example

This is a place for questions, answers, and discussion about session 7 of the Biostatistics for Biomedical Research course, airing 2019-12-20 and dealing with study design and a two-period crossover experiment. Session topics are listed here. The video will be in the YouTube BBRcourse Channel and will be directly available here after the broadcast.

Study design areas covered in this session include:

  • Sizing a pilot study using precision of SD
  • Problems with standardized effect sizes
  • Choice of effect size
  • Multiple hypotheses and reasons not to multiplicity-adjust
  • More about ordinal outcomes
  • Study design big picture
  • Crossover experiment example

Pre-session question: Stephen Senn has shown that in a two-period two-treatment crossover design, the estimated carryover effect has a correlation of 0.5 with the estimated treatment effect, so testing for a carryover effect and doing something about it results in gross inflation of type I error for the treatment effect test. This approach implies that carryover effect is either trusted or ignored, with nothing in between. What about a Bayesian approach in which a skeptical prior is placed on the carryover effect? In effect this will limit the confounding of the treatment effect by the carryover effect. Is this reasonable? Has anyone done this? It would be nice to get a posterior probability of existence of any carryover effect that is properly skeptical.
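The type I error inflation from "testing for carryover and doing something about it" can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Senn's own derivation: it assumes the classic two-stage rule (pre-test carryover at the 0.10 level on subject totals, then use the first-period parallel-group comparison if carryover is flagged and the within-subject crossover comparison otherwise), normal data with no treatment, period, or carryover effect, and known variances so that z-tests can be used. The function name, sample size, and standard deviations are all hypothetical.

```python
import math
import random
from statistics import NormalDist

def avg(values):
    return sum(values) / len(values)

def two_stage_rejects(n, sd_subject, sd_error, rng):
    """One null AB/BA trial; True if the two-stage rule declares a treatment effect."""
    crit_carry = NormalDist().inv_cdf(1 - 0.10 / 2)  # carryover pre-test, alpha = 0.10
    crit_treat = NormalDist().inv_cdf(1 - 0.05 / 2)  # treatment test, alpha = 0.05

    def simulate_sequence():
        # n subjects, each with a random subject effect shared by both periods
        period1, period2 = [], []
        for _ in range(n):
            subject = rng.gauss(0, sd_subject)
            period1.append(subject + rng.gauss(0, sd_error))
            period2.append(subject + rng.gauss(0, sd_error))
        return period1, period2

    ab1, ab2 = simulate_sequence()
    ba1, ba2 = simulate_sequence()

    # Stage 1: carryover test compares subject totals between the two sequences.
    total_ab = avg([a + b for a, b in zip(ab1, ab2)])
    total_ba = avg([a + b for a, b in zip(ba1, ba2)])
    se_total = math.sqrt(2 * (4 * sd_subject**2 + 2 * sd_error**2) / n)
    carryover_flagged = abs((total_ab - total_ba) / se_total) > crit_carry

    if carryover_flagged:
        # Stage 2a: fall back to a parallel-group test on period 1 only.
        z = (avg(ab1) - avg(ba1)) / math.sqrt(2 * (sd_subject**2 + sd_error**2) / n)
    else:
        # Stage 2b: usual crossover test on within-subject period differences.
        diff_ab = avg([a - b for a, b in zip(ab1, ab2)])
        diff_ba = avg([a - b for a, b in zip(ba1, ba2)])
        z = (diff_ab - diff_ba) / math.sqrt(2 * (2 * sd_error**2) / n)
    return abs(z) > crit_treat

rng = random.Random(20191220)
n_sims = 10000
rate = sum(two_stage_rejects(12, 1.0, 1.0, rng) for _ in range(n_sims)) / n_sims
print(f"Type I error of the two-stage procedure: {rate:.3f} (nominal 0.05)")
```

Under these assumptions the rejection rate should come out noticeably above the nominal 0.05, because the first-period test is only used in exactly those trials where the correlated carryover statistic is already extreme.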


You will find a brief discussion of the issue in section 6 here. Andy Grieve did some interesting work on this but in my opinion it falls short of being fully Bayesian. That would require at least two things: a) paying close attention to time in the light of biology (how long is the time from when the previous treatment was given to when the current treatment is measured?); b) realising that, other things being equal, the smaller the treatment effect, the smaller the carry-over effect is likely to be. I am not aware that anybody has dealt with this properly.


This is extremely helpful Stephen. Thanks! I had never thought of the likelihood that the carryover effect would be smaller the smaller the treatment effect. It would seem reasonable to have priors set up such that (1) the carryover effect must be lower than the treatment effect and (2) the probability density for carryover is increasing as you move from the treatment effect to zero carryover.


Here are references provided by @AndrewPGrieve:


Trying to understand how ordinal logistic regression cuts through the multiplicity problem when we relax the proportional odds assumption. What if we completely relax the assumption?

That’s a good question. If you assume proportional odds (PO), then you have a one d.f. test that clearly has no multiplicity problem. If you were to completely relax the assumption, i.e., to not use a partial proportional odds model, you’d have polytomous logistic regression with k-1 d.f. for treatment in a likelihood ratio \chi^2 test. This test would also have a perfect multiplicity adjustment. It’s only when you don’t connect all the endpoints into a single model that you’d even entertain the need for a multiplicity correction.
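To make the degrees-of-freedom count concrete, here is a sketch of the two models, assuming an outcome Y with k ordered levels and a single binary treatment indicator. The notation is illustrative and not taken from the session notes.

$$
\text{PO:}\quad \operatorname{logit} P(Y \ge j \mid \text{trt}) = \alpha_j + \beta\,\text{trt}, \qquad j = 2, \dots, k
$$

$$
\text{Polytomous:}\quad \log \frac{P(Y = j \mid \text{trt})}{P(Y = 1 \mid \text{trt})} = \alpha_j + \beta_j\,\text{trt}, \qquad j = 2, \dots, k
$$

The PO model has a single treatment parameter \beta (1 d.f.). The polytomous model has \beta_2, \dots, \beta_k, and the likelihood ratio test of H_0: \beta_2 = \dots = \beta_k = 0 is one joint test with k-1 d.f. rather than k-1 separate tests.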

I would see confidence intervals on odds ratios as the optimal endpoint. If we have the polytomous logistic regression, we now have one odds ratio for each outcome group (minus 1). Family-wise adjustment on CI I believe is less typical than family-wise adjustment of p-values, but to me it seems just the other side of the coin.

I would say that the choice concerning family-wise adjustment is meaningless prior to decision-making, although this could be argued if you consider what constitutes decision making. During decision making, hypothesis testing at either a family-wise or separate error rate is sub-optimal because neither optimizes relative to cost/benefit. In a Bayesian framework, the optimal method would be to integrate cost/benefit over the posterior.

Unfortunately, the lack of consensus around family-wise adjustment takes a lot of time and thought energy. You might recommend that everyone always consider a cost function and integrate over the posterior, but in many cases the potential decisions are unclear to those conducting the research.

So is the best idea to spread information: flip a coin to decide on family-wise adjustment. At least people wouldn’t have to waste time thinking about whether or not to do it.

I think the problem with this is that people actually use hypothesis testing to make decisions. As a result, it seems that if you want to reduce the chance that people are going to make really bad decisions based on p-values (posterior probabilities) or confidence intervals (credible intervals), you would want them to be calibrated relative to at least a plausible cost/benefit scenario for your audience.

This is sometimes how I roughly think about whether to adjust for multiple comparisons: do we expect the error rates to be so unbalanced (e.g., Type 1 errors have so much higher probability) that we need to do something to get the two roughly in the ballpark of each other? If so, then adjust.

I think a good systematic way to suggest analysis be done is: Stage 1: Do not adjust for multiplicity as a general rule as long as there is no selective reporting; Stage 2: Report expected error rates given stated assumptions and/or have a toy decision making example based off of a simple cost/benefit function that seems to fit right in the middle of the typical use-case for your audience. The decision making example could even be constructed where the decision (yes/no on the y-axis) is plotted as a function of some fixed cost (on the x axis) assuming that such a simple cost function could be even mildly representative.
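A toy version of the Stage 2 decision example might look like the following. It is a minimal sketch with entirely hypothetical numbers: a normal posterior for the treatment effect, a fixed gain that is realised only if the true effect exceeds a minimal clinically important difference, and a fixed cost of adopting the treatment. The names `MCID`, `GAIN`, and `decide` are made up for illustration.

```python
from statistics import NormalDist

# All numbers below are hypothetical, chosen only to illustrate the idea.
posterior = NormalDist(mu=2.0, sigma=1.5)  # posterior for the treatment effect
MCID = 1.0    # smallest effect that would matter clinically
GAIN = 10.0   # benefit realised if the true effect exceeds MCID

def expected_net_benefit(cost):
    """Expected gain from adopting the treatment, minus its fixed cost."""
    p_worthwhile = 1 - posterior.cdf(MCID)  # posterior P(effect > MCID)
    return GAIN * p_worthwhile - cost

def decide(cost):
    """Adopt the treatment only when the expected net benefit is positive."""
    return expected_net_benefit(cost) > 0

# Decision (yes/no) as a function of fixed cost, as a text table instead of a plot.
for cost in [0, 2, 4, 6, 8, 10]:
    print(f"cost = {cost:>2}  adopt treatment: {'yes' if decide(cost) else 'no'}")
```

The decision flips from yes to no at the cost where the expected gain is used up, which is the step function the plot described above would show.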

Great thinking. I really like the consideration of cost/benefit in the decision rule as opposed to using an \alpha-level rejection region. Cost/benefit uses proper conditioning whereas null hypothesis testing by itself uses improper conditioning (doesn’t condition on the data but conditions on what you could never know: the status of the null hypothesis).

Regarding the discussion we started on chat during the session, where I stated that the point estimate of the effect should not be relied upon especially if the posterior or utility function is asymmetric, I thought of a clearer reason why not to rely on the point estimate.

Suppose that the utility function was symmetric and simple and that the posterior distribution was symmetric, and the credible interval (and confidence interval) for the effect of interest was very wide, consistent with having a small N. Then you would still have an optimum decision that maximized expected utility, but that decision is almost as bad as flipping a coin. The expected utilities of, say, two possible decisions are almost the same. We’d like to make the best available decision but we would also like to be confident, i.e., to know that the decision was not a 0.51/0.49 one.

Regarding the point estimate, I would say it depends. If it is a forced-choice decision, then the uncertainty is irrelevant after integrating over the posterior (it just so happens that in the case we discussed, the integration doesn’t depend on the width of the posterior, just the mean). However, I’d have to agree with you in the more real-world case: if the decision is not a forced choice (e.g., you could make the decision one way or the other, or you could wait for more data), then the uncertainty is completely relevant, indicating that our future decision may be substantially improved with more information.
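Both sides of this exchange can be seen in a small numerical sketch. It assumes a normal posterior for the effect and a linear utility (utility of treating equals the true effect, utility of not treating is zero); the posterior means and SDs are hypothetical.

```python
from statistics import NormalDist

def summarize(mu, sigma):
    """Return (expected utility gain from treating, posterior P(treating is better))."""
    post = NormalDist(mu, sigma)
    # With linear utility the expected gain from treating is just the posterior mean.
    expected_gain = post.mean - 0.0
    # Probability that treating really is the better choice, i.e. effect > 0.
    p_treat_better = 1 - post.cdf(0.0)
    return expected_gain, p_treat_better

wide = summarize(mu=0.1, sigma=4.0)     # small N: very wide posterior
narrow = summarize(mu=0.1, sigma=0.05)  # large N: narrow posterior

print(f"wide:   expected gain = {wide[0]:.2f}, P(treat better) = {wide[1]:.2f}")
print(f"narrow: expected gain = {narrow[0]:.2f}, P(treat better) = {narrow[1]:.2f}")
```

The expected gain, and hence the forced-choice decision, is identical in the two cases, while the probability that the chosen action is actually the better one is close to a coin flip for the wide posterior and close to 1 for the narrow one.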

Concerning standard use of cost/benefit as opposed to family-wise adjustment, I wonder if you think it would make sense (at some point) to start a separate thread, maybe titled: Retiring multiple comparisons adjustments based on improper conditioning, providing statistical model descriptions in standardized unadjusted format and accounting for multiple comparisons through simplistic decision making scenarios. Maybe this is too extreme. It’s just a thought. The decision of when to use multiple comparisons adjustments has been bothering me for some time.

The only part I’ll disagree with is that I think the flatness of the posterior is essential in understanding how much confidence one should have during decision making, whether time is of the essence or not. Regarding your second paragraph, having a separate topic on multiplicity adjustment is probably a good idea.

I would earnestly appreciate clarification or elaboration of the admonition in BBR section 5.12.5 (minute 40.35 of the video): “Never use change from baseline as the response variable except in a non-randomized pre-post design (the weakest of all designs)”. This reiterates the emphatic caution about change-from-baseline (CfB) expressed elsewhere in BBR; RMS; Altman in BMJ, 2001; etc. But, as expressed here, the qualification ‘except in non-randomized pre-post design’ seems to imply permission or exemption in the case of observational pre-post designs, which perplexes/confuses me.

Observational pre-post designs with CfB analysis are very prevalent in health research and pharma (especially now with expanding interest in the impact of treatment on PROs), and likely a great source of scientific noise, confusion and waste. But like significance testing, p-values, etc., CfB is so prevalent it is assumed to be natural, a conventional standard, and legitimate; and making a cogent case for eschewing CfB and embracing alternative approaches is a challenge for advocates of progressive scientific methodology.

I should note that many/most of these CfB analyses are in uncontrolled, single-arm observational cohorts. While it is almost axiomatic that CfB is NOT = treatment effect, many construe CfB as indicative and significant, even in uncontrolled studies with pre-post comparisons.

In the case of ‘non-randomized pre-post designs’ modeling the raw outcome variable of interest (post) with adjustment for baseline (pre) and other covariates to account for heterogeneity and extraneous variance (ANCOVA) is recommended by Frank and others.

Q1: Is there a reason for the qualification “except in a non-randomized pre-post design”? Is there something I do not understand that makes this instance extenuating?

Q2: Are there other approaches that should be considered or advocated for in addition to ANCOVA?

Q4: What are the essential/major principles underlying the criticism of CfB that will most trenchantly discourage attachment to this practice?

Q5: Can anyone share/direct me to simulations that help illustrate the defects, risks and principles underlying the criticism of CfB (or compelling parables, e.g., examples of epic failures of CfB)?

It will be a difficult campaign to effectively change understanding and practice. I am very grateful for advice, guidance, and input.

I am very very grateful for BBR!!


Replying to myself: to my embarrassment, I see that this query has already been substantially addressed in another thread, and I refer people to it here: Exemplary QOL analyses that avoid change-from-baseline blunders?.

Additional advice still welcome.

Thank you.

Drew, here’s my take on your main question:

  • Pre-post designs should not be used
  • If someone threatens you if you don’t analyze a pre-post design, and you feel compelled to give in to the threat, your only choice is to compute CfB and do a Wilcoxon signed-rank test or one-sample t-test

Hidden in bullet point 2 is that you must have evidence that CfB is a valid change score, i.e., that the variable was properly transformed before subtraction so that the CfB is independent of B.
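The requirement that CfB be independent of B is easy to violate. The simulation below is a minimal sketch, with hypothetical numbers, of an uncontrolled cohort in which there is no treatment effect at all: each patient has a stable true score, and baseline and follow-up are two noisy measurements of it. Patients are "enrolled" only if their baseline exceeds a cutoff, as often happens in practice.

```python
import math
import random

rng = random.Random(2019)
n = 20000

baseline, change = [], []
for _ in range(n):
    true_score = rng.gauss(50, 10)        # stable underlying value
    b = true_score + rng.gauss(0, 10)     # noisy baseline measurement
    y = true_score + rng.gauss(0, 10)     # noisy follow-up, no treatment effect
    baseline.append(b)
    change.append(y - b)

def corr(x, y):
    """Pearson correlation computed from scratch."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((c - my) ** 2 for c in y)
    return sxy / math.sqrt(sxx * syy)

# CfB is clearly not independent of baseline in the full cohort.
r = corr(baseline, change)

# Enrolling only patients with high baseline produces an apparent "improvement".
enrolled = [c for b, c in zip(baseline, change) if b > 60]
mean_cfb_enrolled = sum(enrolled) / len(enrolled)

print(f"corr(CfB, baseline) = {r:.2f}")
print(f"mean CfB among enrolled (baseline > 60) = {mean_cfb_enrolled:.1f}")
```

In this setup CfB is negatively correlated with baseline, and the enrolled group shows a sizeable mean change purely from regression to the mean, so a signed-rank or one-sample t-test on CfB would "detect" an effect that does not exist.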


is this a reasonable proposition: assuming several outcomes with no inherent ordering, ie true co-primary outcomes, we can say:
if they are asking different clinical Qs then no need to adjust as per cook & farrell and we usher away multiplicity concerns; and if they are getting at the same thing then we obviate multiplicity by combining them in a composite which becomes our primary outcome (we analyse them separately as secondary outcomes)

i must check Mayo’s SIST …

I think that’s a fair statement. But it may be a little more clear to think about it in this additional way:

  • If one wants to answer separate questions with a pre-specified clinical/public health/patient importance ordering, answer those questions separately without multiplicity adjustment
  • If one is using a frequentist design and wants to show that there exists an endpoint for which the treatment provides a benefit (AKA fishing expedition or exploratory study), then use a very conservative multiplicity adjustment
  • For the latter if using a Bayesian approach, pre-specify the skepticism for each endpoint and trust all the posterior distributions as they stand. Or capitalize on Bayes to put more structure on the problem to answer more interesting clinical questions such as:
    • What is the evidence that the treatment improves more than 4 out of 12 outcomes?
    • What is the evidence that the treatment dramatically improves any outcome or provides any improvement on more than 4 out of 12 outcomes?

Such Bayesian options are described further here.
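Questions like the two above are straightforward to answer once posterior draws are available, since each draw can simply be checked against the condition. The following is a sketch only: the helper names are made up, the threshold for a "dramatic" improvement is hypothetical, and the draws are simulated as independent normals to stand in for real MCMC output (where the outcomes would typically be correlated; the per-draw counting handles that without change).

```python
import random

def prob_more_than(draws, k):
    """Posterior P(more than k outcomes improved). Each draw holds one effect per outcome."""
    hits = sum(1 for d in draws if sum(e > 0 for e in d) > k)
    return hits / len(draws)

def prob_any_large_or_many(draws, large, k):
    """Posterior P(any effect exceeds `large`, or more than k outcomes improved)."""
    hits = sum(
        1 for d in draws
        if any(e > large for e in d) or sum(e > 0 for e in d) > k
    )
    return hits / len(draws)

# Hypothetical stand-in for MCMC output: 4000 draws for 12 outcomes.
rng = random.Random(7)
posterior_means = [0.3] * 6 + [0.0] * 6   # hypothetical effects, positive = benefit
draws = [[rng.gauss(m, 0.25) for m in posterior_means] for _ in range(4000)]

p_many = prob_more_than(draws, k=4)
p_either = prob_any_large_or_many(draws, large=1.0, k=4)
print(f"P(improves more than 4 of 12 outcomes) = {p_many:.3f}")
print(f"P(dramatic improvement on any, or improvement on more than 4) = {p_either:.3f}")
```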


Section 5.12.5 (and in a previous session that referred to section 4.1.2) describes using proportional odds models to analyse a composite outcome made up of a continuous Y and categorical outcomes (eg, physical function measured 0-100 with 101=too sick or 102=death).

Can anyone recommend any further reading on this (eg, a comparison of this versus other methods) and/or good examples of this method used in practice?

See this preprint


Thanks. Any references to work on its use with continuous outcomes like the creatinine example in the course notes would also be welcome (eg, compared to other methods of dealing with not applicable/other cases (eg, hurdle models?)).