BBR Session 3: Statistical Graphics and Tables

f2harrell · October 15, 2019, 2:39pm

This is a topic for questions, answers, and discussions about session 3 of the Biostatistics for Biomedical Research web course airing on 2019-10-18. Session topics are listed here.

Sorry I didn’t get to the brief section on tables. I’ve added a new short video about tables here.

Sorry about the audio problem. It was corrected at the 2:30 point in the long video.

daszlosek · October 18, 2019, 3:18am

I have tried twice (with no success) to convince co-authors that extended box plots increase the level of information about the distribution (and that having such information is useful!). In the end, we settled for violin plots with overlaid box plots instead. Their objection is that extended box plots are harder to interpret.

I was wondering if anyone else has had difficulty when attempting to incorporate new visuals (half-violin plots, extended box plots, spike histograms, etc.) into a field that has mainly used tables, scatter plots and box plots to describe its populations?

f2harrell · October 18, 2019, 4:50pm

It’s a great question. I haven’t faced much resistance.

I would stick with spike histograms supplemented by quantile intervals and mean, because such histograms are immediately interpretable, expose bimodality and digit preference, and work for all sample sizes.
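A minimal sketch of what such a display summarizes, in plain Python rather than the R code used in the course. The function names and the blood-pressure-like values are made up for illustration; they are not from the course materials. Each distinct value gets its own spike, so heaping on round numbers (digit preference) stays visible, and the quantiles and mean are computed alongside.

```python
from collections import Counter
import statistics

def spike_histogram(values, resolution=1.0):
    """Count observations at each value rounded to `resolution` (one spike per value)."""
    spikes = Counter(round(v / resolution) * resolution for v in values)
    return dict(sorted(spikes.items()))

def quantile_summary(values):
    """Mean plus selected quantiles; cut points fall at 5%, 10%, ..., 95%."""
    q = statistics.quantiles(values, n=20)
    return {"mean": statistics.fmean(values),
            "q05": q[0], "q25": q[4], "median": q[9], "q75": q[14], "q95": q[18]}

# Hypothetical toy data with heaping on multiples of 10
toy = [120, 120, 120, 118, 122, 130, 130, 130, 130, 125,
       140, 140, 140, 135, 150, 150, 160, 128, 120, 110]

spikes = spike_histogram(toy)
stats = quantile_summary(toy)

# Text rendering: one row per distinct value, spike length = frequency
for value, count in spikes.items():
    print(f"{value:6.0f} {'|' * count}")
print(stats)
```

With only 20 observations the spikes at 120, 130 and 140 already stand out, which a box plot of the same data would hide.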

mdooris · October 20, 2019, 6:52am

Thank you for providing this course (particularly offline). In the 3rd session, the cumulative event curve difference “1/2CL” was referred to. As I understood it, the band was centered at the midpoint of the two curves, with the confidence interval for the difference superimposed, and the confidence limits crossing the curves represented a ‘significant difference’. I just wanted to clarify how it was calculated and displayed (and interpreted). It certainly seems a better way than the individual CIs.

f2harrell · October 20, 2019, 12:06pm

1. Gather the set of all distinct failure times in either treatment group. These are the only points at which either of the two KM estimates changes.
2. Compute the two Kaplan-Meier cumulative incidence estimates (1 - KM estimates) at all of these points.
3. Take their midpoints.
4. Compute the two standard errors at all the same time points and compute the standard error of the difference (square root of sum of squared standard errors).
5. Half the width of the confidence intervals for differences is approximately 1.96 * standard error of the difference.
6. Compute two step functions equal to the midpoints +/- 0.5 * 1.96 * SE of difference.
7. Plot the shaded polygon with these two step function boundaries.

See the R survdiffplot function with conf="diffbands" for details.
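The steps above can be sketched as follows. This is an illustrative Python version, not the code in survdiffplot: the function names and toy data are hypothetical, and it assumes the plain Greenwood standard error on the survival scale, which may differ from what the R function uses.

```python
import math

def km_cuminc(times, events):
    """Map each failure time to (1 - KM estimate, Greenwood SE) for one group."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, greenwood = 1.0, 0.0
    out = {}
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for tt, e in data if tt == t]   # everyone leaving the risk set at t
        d = sum(tied)                             # number of failures at t
        if d > 0:
            surv *= 1 - d / n_at_risk
            if n_at_risk > d:
                greenwood += d / (n_at_risk * (n_at_risk - d))
            out[t] = (1 - surv, surv * math.sqrt(greenwood))
        n_at_risk -= len(tied)
        i += len(tied)
    return out

def step_value(curve, t):
    """Right-continuous step function; (0, 0) before the first failure."""
    earlier = [s for s in curve if s <= t]
    return curve[max(earlier)] if earlier else (0.0, 0.0)

def half_width_bands(group_a, group_b, z=1.96):
    """Rows of (time, cuminc A, cuminc B, SE of difference, lower, upper)."""
    ca, cb = km_cuminc(*group_a), km_cuminc(*group_b)
    rows = []
    for t in sorted(set(ca) | set(cb)):           # all distinct failure times
        (fa, sa), (fb, sb) = step_value(ca, t), step_value(cb, t)
        se_diff = math.sqrt(sa ** 2 + sb ** 2)
        mid = (fa + fb) / 2
        half = 0.5 * z * se_diff                  # half of the CI half-width
        rows.append((t, fa, fb, se_diff, mid - half, mid + half))
    return rows

# Hypothetical toy data: (follow-up times, event indicators with 1 = failure)
group_a = ([2, 3, 5, 7, 8, 11], [1, 1, 0, 1, 1, 0])
group_b = ([4, 6, 9, 10, 12, 13], [1, 0, 1, 0, 1, 0])
rows = half_width_bands(group_a, group_b)
for t, fa, fb, se, lo, hi in rows:
    print(f"t={t:>2}  A={fa:.3f}  B={fb:.3f}  band=[{lo:.3f}, {hi:.3f}]")
```

Because the band is the midpoint +/- half of 1.96 * SE, a curve lies outside the band at a given time exactly when the absolute difference between the two curves exceeds 1.96 * SE, that is, when the approximate 95% confidence interval for the difference excludes zero at that time point. The comparison is pointwise, not simultaneous over the whole curve.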

mdooris · October 20, 2019, 12:28pm

@f2harrell thank you. I saw the code (lines 403 onwards). Very instructive.

f2harrell · October 20, 2019, 4:09pm

Regarding the section I did not get to—Statistical Tables—see link to a separate new video at the top of this topic.

med_stat · October 23, 2019, 12:12pm

@f2harrell, is there any chance you can cover visualizations of association involving qualitative variables?

There is one book of which I’m aware by Friendly and Meyer (Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data) that covers some of these ideas, but I think this might be a helpful topic for BBR given the descriptive power graphics provide.

f2harrell · October 23, 2019, 12:22pm

It’s a worthwhile area but I don’t have any plans to cover that at present. I’m sure the reference you gave does a much better job than I could for qualitative variables.

med_stat · October 23, 2019, 12:30pm

Understood, thank you!

dmap02 · October 25, 2019, 7:18pm

Hi,
This may be too general a question (I can post elsewhere if so), but I’ve been working through the graphics chapter and am trying to apply what I’m learning to my current data, a matrix of counts where samples are rows and species are columns.

I had never really thought about “ties” before this class. A hexagonal binning plot on my data revealed a lot of ties. Given many ties, what should we do when analyzing our data?

I imagine trying to figure out if there is a systematic source that can account for the ties is at the top of the list (i.e., if there are many 0s). What else should be a part of a workflow when a dataset is full of ties?

f2harrell · October 28, 2019, 1:17pm

Good question. There may be different issues for formal analysis but to address the question just with respect to best graphical representation, you want to start with the most “raw” unbinned data you can obtain. Are there lots of ties in the primary raw data? And note that hex binning will create ties—too many ties if the bins are large.

The main answer may be to think carefully about the binning algorithm, if binning of the raw data is required. This thinking may dictate that zeros be kept in their own bin.
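As a rough illustration of that idea, the sketch below first measures ties in the raw counts and only then bins, giving zero its own bin. It is plain Python with made-up counts; the function names and the bin width of 5 are arbitrary choices for the example, not a recommendation.

```python
from collections import Counter

def tie_summary(values):
    """How many observations share their value with at least one other observation."""
    counts = Counter(values)
    n = len(values)
    tied = sum(c for c in counts.values() if c > 1)
    return {"n": n,
            "distinct": len(counts),
            "prop_tied": tied / n,
            "prop_zero": counts.get(0, 0) / n}

def bin_with_zero(values, width):
    """Bin positive integer counts into intervals of `width`; zeros keep their own bin."""
    bins = Counter()
    for v in values:
        if v == 0:
            bins["0"] += 1
        else:
            lo = ((v - 1) // width) * width + 1   # bins are 1..width, width+1..2*width, ...
            bins[f"{lo}-{lo + width - 1}"] += 1
    return dict(bins)

# Hypothetical species counts for one sample
toy_counts = [0, 0, 0, 1, 1, 2, 3, 3, 7, 12, 0, 5]

ties = tie_summary(toy_counts)
binned = bin_with_zero(toy_counts, width=5)
print(ties)
print(binned)
```

Comparing the tie summary on the raw data with the bin counts shows how much of the apparent tying is already in the data and how much the binning adds.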

albertoca · November 8, 2019, 7:01pm

I’ve just finished watching the video from chapter 3 on charts, together with my colleagues at the hospital. We really liked it. It was also fun to see Frank at his beach house. Very nice teacher.
I would like to ask if anyone can explain the code for Figure 4.10 (Dot chart showing proportion of subjects having adverse events by treatment and differences). It is very useful to replace the ugly toxicity tables.
Many thanks to Frank and everyone.

albertoca · November 8, 2019, 11:12pm

I discovered the answer here: https://rdrr.io/cran/HH/man/ae.dotplot7a.html
Nice plot. The Amit graph.

f2harrell · November 8, 2019, 11:53pm

That’s not the software used in the example but that is excellent software.

tkaczye · November 10, 2019, 3:40pm

Thank you for the mechanistic explanation, Professor Harrell. However, I’m having trouble understanding what you said in the lecture about what it means if either cumulative incidence curve crosses the 1/2 confidence-interval width shaded area centered on the mean. Is there a linearity property of confidence intervals (CI) that is being applied, such that the 95% CI width for half a difference is exactly half the 95% CI width for the full difference? If I understand correctly, when the default shaded area in this plot falls within the individually plotted curves, there is 95% certainty that there is no difference in the curves’ incidence rates. However, when the grey area crosses either curve, I don’t see what, if anything, you can say with 95% confidence.

ddhll · January 8, 2020, 4:57am

Thanks Frank and all involved for a stellar course. I am able to replicate the quantiles, mean and dispersion graphic with the support2 dataset. Wondering if you would mind sharing the code for the spike histograms with hovertext for statistical summary?

f2harrell · January 8, 2020, 1:44pm

Go to https://www.fharrell.com/talk/rmedicine19 and click on Code.

arthur_albuquerque · November 6, 2021, 4:18pm

Dear @f2harrell, perhaps this is a silly question but I cannot find the answer anywhere else.

Is there a way to remove the label automatically generated by rms::survdiffplot() on the bottom of the plot? In this case, it would be the “treat = No pacemaker - treat=Pacemaker”.

Thanks!

f2harrell · November 7, 2021, 1:07pm

I don’t think I included an option to suppress that. You can set a smaller bottom margin to keep it from appearing; play with par(mar=...).