This is a topic for questions, answers, and discussions about session 1 of the Biostatistics for Biomedical Research web course airing on 2019-10-04. Session topics are listed here.
I added references for good experimental design and clinical study design books in Section 3.2.1.
Using means on ordinal data is a long-running debate, but IMO it makes no mathematical sense. Because an ordinal measure carries only order, no spacing information, you cannot assume its variance is finite, and you cannot appeal to the CLT to assume that the spacings are “on average” equal either.
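A tiny demonstration of the spacing problem: under two equally valid monotone codings of the same 3-level scale, the comparison of two hypothetical groups by their means reverses. (Python sketch; the data and both codings are made up.)

```python
# The mean of ordinal codes depends on the arbitrary spacing of the codes:
# a monotone recoding of the levels can reverse a comparison of group means.
import numpy as np

x = np.array([1, 1, 3, 1, 1, 3])  # group X on a 3-level ordinal scale
y = np.array([2, 2, 2, 2, 2, 2])  # group Y

recode = {1: 1, 2: 2, 3: 10}      # a different, equally valid monotone coding
x2 = np.vectorize(recode.get)(x)
y2 = np.vectorize(recode.get)(y)

print(x.mean(), y.mean())    # 1.67 vs 2.0 -> X below Y under coding 1, 2, 3
print(x2.mean(), y2.mean())  # 4.0 vs 2.0 -> X above Y under coding 1, 2, 10
```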
Could someone expand on the information content section (3.6.1 in the course notes)? How exactly does the range in SBP of 50-250 contain 7 bits of information? The number 250 is 11111010 in binary, which is 8 bits, if I’m not mistaken. Thank you.
The CLT does not apply here. We are not talking about tests and confidence intervals but rather about estimation and choice of summary metrics like means.
If the ordinal Y variable is a count, then the mean is a good idea. If the spacings “almost” mean something then the mean can also be useful. If the spacings are subjective (e.g., unsatisfied, somewhat satisfied, …, very satisfied) the mean can still be a good idea, in the sense that alternatives to the mean are worse. For example, one can get confusing results by reporting all cell proportions in a 5-point visual-analog scale, but the mean is as good a summary of the 5 proportions as we know of.
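To illustrate that trade-off with hypothetical 5-point data: the five cell proportions per group are hard to scan and compare, while the mean compresses each group to a single interpretable number.

```python
# Two hypothetical groups on a 5-point scale: all cell proportions vs. the mean.
import numpy as np

rng = np.random.default_rng(3)
levels = np.arange(1, 6)
group1 = rng.choice(levels, 200, p=[0.10, 0.25, 0.30, 0.20, 0.15])
group2 = rng.choice(levels, 200, p=[0.08, 0.20, 0.28, 0.26, 0.18])

for g in (group1, group2):
    props = np.bincount(g, minlength=6)[1:] / g.size  # proportions at levels 1..5
    print("proportions:", props.round(2), " mean:", g.mean().round(2))
```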
It is not straightforward to compute the amount of information. I was not thinking of SBP as going 0-250 but as going 50-250. Your calculation of 8 bits may be more reasonable.
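For what it’s worth, the usual back-of-the-envelope version, assuming SBP recorded to the nearest mmHg: the information capacity is log2 of the number of distinguishable values, not the bit length of the largest value.

```python
# Information capacity = log2(number of distinguishable values)
from math import log2

print(log2(250 - 50 + 1))  # 201 possible values over 50-250: ~7.65 bits
print(log2(256))           # 8 bits would distinguish 256 values
```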
@daszlosek - this is a paper that was discussed on datamethods.org (I just can’t find the thread now), which uses a combination of nice techniques including accelerated failure time models for survival modeling and fitting with restricted cubic splines. Using these techniques the authors create a nomogram/model to predict tumor progression-free survival for patients with GI neuroendocrine tumors and then validate it in a separate distinct cohort. It seems to apply a lot of great techniques pretty rigorously.
There are a couple of relevant points the authors derive from their statistical treatment. As background, neuroendocrine tumors are typically divided into three main prognostic categories based on the fraction of cells within the tumor that express a protein called Ki-67, which marks cells that are actively dividing. The model in the paper suggests (as @f2harrell notes many times in his notes) that this dichotomization of Ki-67 in fact has no biological basis: the prognostic impact of Ki-67 is continuous, and it is NON-linear AND time-dependent. So this is precisely the situation where, strictly speaking, we are losing a lot of information.
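To make the modeling approach concrete, here is a minimal sketch (not the authors’ code) of an accelerated failure time model with a restricted cubic spline for Ki-67, covering the continuous/nonlinear part only, not the time-dependence. It assumes the lifelines and patsy packages; the simulated data and all variable names are hypothetical.

```python
# Weibull AFT model with a restricted (natural) cubic spline for Ki-67.
import numpy as np
import pandas as pd
from patsy import dmatrix
from lifelines import WeibullAFTFitter

rng = np.random.default_rng(1)
n = 300
ki67 = rng.uniform(0, 60, n)  # hypothetical Ki-67 index, in percent

# Simulate progression-free survival with a nonlinear effect of Ki-67
T = rng.exponential(scale=np.exp(3 - 0.04 * ki67 + 0.0005 * ki67 ** 2))
E = (T < 60).astype(int)      # administrative censoring at 60 months
T = np.minimum(T, 60)

# Restricted cubic spline basis for Ki-67, 4 d.f., intercept removed
rcs = dmatrix("cr(ki67, df=4) - 1", {"ki67": ki67}, return_type="dataframe")
rcs.columns = [f"rcs{i}" for i in range(rcs.shape[1])]

df = pd.concat([rcs, pd.DataFrame({"T": T, "E": E})], axis=1)
aft = WeibullAFTFitter().fit(df, duration_col="T", event_col="E")
aft.print_summary()
```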
In clinical practice, when I see patients with these tumors in my clinic, having three categories gives me a rough heuristic that can serve as a guide for discussion with the patients. The medical oncologists use this number as one of the criteria to decide how aggressively to treat the tumors, with current guidelines suggesting that patients with Ki-67 > 20% should be considered to have true neuroendocrine carcinomas eligible for cytotoxic chemotherapy. The model derived in this paper suggests that the picture is more complex, as expected.
This is not fully related to the first session of BBR so you may want to move it to a new topic, e.g. “Case studies in flexible multivariable modeling”. But the paper looks excellent and I’ve added it to my file of modeling case studies. Session 1 did deal with dichotomization of Y.
For the question at the end on dichotomizing response variables: what if the data are compositional? Say, the proportions of given tumor patterns in a slide as percentages of total mass. I know that these methods have been developed/explored in geology, but have biostatisticians done much work on them?
I have seen “total unit mass” as a possibility, but that comes with its own assumptions.
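For reference, the standard geology-derived approach (Aitchison’s) moves proportions out of the constrained simplex with log-ratios before modeling. A minimal sketch of the centered log-ratio transform, with hypothetical tumor-pattern proportions:

```python
# Centered log-ratio (CLR): log of each part relative to the geometric mean,
# turning constrained proportions into unconstrained coordinates.
import numpy as np

def clr(x, eps=1e-6):
    x = np.asarray(x, dtype=float) + eps  # guard against zero parts
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

patterns = np.array([0.60, 0.25, 0.10, 0.05])  # proportions of total tumor mass
print(clr(patterns))                           # components sum to ~0
```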
Is it generally good advice to present the median instead of the arithmetic mean, and would the median (or mode) give rise to issues similar to those that come from using the cell proportions? Any literature you can point to would be appreciated.
I’ve generally stayed away from means for less than interval-scaled measurements (“How much do you agree with this statement?”, for example) and favored the median (but haven’t actually used the mode). Interested to hear some thoughts about this.
We cover some of this in Session 2. I use the median but it’s not very efficient in many cases. I only use the mode for categorical variables. I find that the mean still works pretty well when the variable is not interval-scaled, just in the sense that summaries other than the mean are more confusing or too multi-dimensional for certain discrete ordinal variables with more than 3 levels.
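A quick illustration of that inefficiency on a hypothetical 5-level item: the sample median sticks to a single level and misses a modest shift in the distribution that the mean picks up.

```python
# On a discrete 5-point scale the sample median is coarse, while the mean
# tracks modest distributional shifts. Both distributions are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
levels = np.arange(1, 6)
a = rng.choice(levels, 500, p=[0.10, 0.20, 0.30, 0.25, 0.15])  # group A
b = rng.choice(levels, 500, p=[0.07, 0.18, 0.30, 0.27, 0.18])  # group B, shifted up

print("means:  ", a.mean(), b.mean())          # mean detects the shift (~3.15 vs ~3.31)
print("medians:", np.median(a), np.median(b))  # almost surely both 3.0
```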
Section 3.7 Preprocessing (and 31:00 of the recording) mentions that change from baseline can be appropriate in certain crossover studies. Can you expand on this? In what situations would it be appropriate, and why? Is this assuming a separate baseline measurement per period, or a single baseline before the first period?
Good question about something I didn’t say very clearly. In a two-period crossover study we often compute the treatment B minus treatment A difference. But I was misleading, in that neither of these measurements is really a baseline value; they are both post-treatment or during-treatment measurements.
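For concreteness, a minimal sketch of that within-subject B-minus-A contrast on simulated (hypothetical) data; a full crossover analysis would also account for period and carryover effects.

```python
# Within-subject contrast in a two-period crossover: each subject contributes
# (measurement on B) - (measurement on A); neither is a true baseline.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 40
subject = rng.normal(0, 5, n)               # between-subject variability
y_A = 120 + subject + rng.normal(0, 3, n)   # measured on treatment A
y_B = 117 + subject + rng.normal(0, 3, n)   # measured on treatment B (true effect -3)

d = y_B - y_A                               # within-subject differences
print(d.mean())                             # estimated treatment effect
print(stats.ttest_1samp(d, 0))              # paired analysis as one-sample t-test
```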
I was interested in the comment in Section 3.7 regarding adjusting measurements in an animal experiment on the basis of measurements in the control group. When comparing different infecting strains under a given treatment condition in an infection model, this might be difficult because the strains might have different fitness, and therefore growth rates, under control conditions. Is there a more appropriate way to deal with this issue?