What dataset to use for descriptive summaries: long vs wide

CP3 · September 22, 2025, 4:16pm

Hi Frank and team.

I am generating a descriptive summary table for participants in my study.

The outcome is BMI, a continuous variable measured repeatedly. The measurements are unevenly spaced, and each participant may contribute multiple BMI values at different ages.
In the raw dataset, time is recorded as a continuous variable representing the participant’s age at the time of measurement.


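library(dplyr)   # group_by(), summarise(), %>%
library(tidyr)   # pivot_wider()

set.seed(1)      # arbitrary seed so the simulated data are reproducible

n_subject <- 10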
subject_ids  <- paste0("C", sprintf("%03d", 1:n_subject))

n_visits <- sample(3:6, n_subject, replace = TRUE)

# dataset
df_list <- lapply(seq_along(subject_ids), function(i) {
  n <- n_visits[i]
  data.frame(
    Participant_ID = subject_ids[i],
    Age     = sort(round(runif(n, min = 0, max = 9), 2)),  # continuous ages
    BMI     = round(rnorm(n, mean = 16, sd = 2), 2)
  )
})

df <- do.call(rbind, df_list)
rownames(df) <- NULL

The scientist has requested the following data processing steps:

Discretize time into whole-year bins (0, 1, 2, 3, 4, …).

# Categorize age into whole-year bins (0-0.99, 1-1.99, 2-2.99, etc.) using cut()

df$Age_Group <- cut(
  df$Age,
  breaks = seq(0, 10, 1),   # 0–1, 1–2, …, 9–10
  right  = FALSE,           # include lower bound, exclude upper
  labels = 0:9
)
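
As a quick check of how `cut()` treats the bin edges with `right = FALSE`, here is a minimal sketch using a few made-up ages (the vector `ages` is illustrative only, not study data). For non-negative ages below 10, the bin labels should agree with `floor()`.

```r
# Illustrative ages chosen to sit on and around the bin edges (not study data)
ages <- c(0, 0.99, 1, 1.99, 8.5, 9)

bins <- cut(
  ages,
  breaks = seq(0, 10, 1),
  right  = FALSE,   # intervals are [0,1), [1,2), ..., [9,10)
  labels = 0:9
)

# cut() returns a factor; go through character to recover the numeric bin
bin_num <- as.integer(as.character(bins))

data.frame(ages, bin_num, floor_age = floor(ages))
```

Note that `Age_Group` is a factor, so converting it with `as.integer()` directly would return the level codes (1 to 10) rather than the bin labels.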

Within each time bin, if a participant has more than one BMI value, take the average of those values. (If a participant has exactly one BMI value in a given bin, keep it as is.)


df_avg <- df %>%
  group_by(Participant_ID, Age_Group) %>%
  summarise(
    BMI_avg = mean(BMI, na.rm = TRUE),
    .groups = "drop"
  )

df_avg

Transform the dataset to wide format, so that each participant is represented by a single row, with one column per age bin (e.g., bmi_0, bmi_1, bmi_2, …).


# transform long → wide
df_wide <- df_avg %>%
  pivot_wider(
    id_cols = Participant_ID,
    names_from = Age_Group,
    values_from = BMI_avg,
    names_prefix = "bmi_"
  )

The transformed dataset has one row per participant, with BMI values either averaged within a bin (if the participant has several values there) or carried over directly (if there is only one value in that bin). Bins with no measurement for a participant are NA.
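
To see how much the binning and averaging changes the data, one option is to count the raw measurements behind each participant-by-bin cell. The sketch below uses a small made-up data frame (`toy`) and base R only; `floor()` stands in for the `cut()` step, which assumes non-negative ages.

```r
# Made-up long-format data: participant C001 has two values in year 1
toy <- data.frame(
  Participant_ID = c("C001", "C001", "C001", "C002", "C002"),
  Age            = c(0.40, 1.10, 1.80, 2.30, 5.60),
  BMI            = c(17.2, 16.1, 15.5, 16.8, 15.9)
)
toy$Age_Group <- floor(toy$Age)   # whole-year bin (assumes Age >= 0)

# Number of raw measurements behind each participant-by-bin cell
cell_n <- aggregate(BMI ~ Participant_ID + Age_Group, data = toy, FUN = length)
names(cell_n)[3] <- "n_raw"

n_rows_long <- nrow(toy)              # raw measurements in the long data
n_cells     <- nrow(cell_n)           # non-missing cells after averaging
n_averaged  <- sum(cell_n$n_raw > 1)  # cells that are means of 2+ values

c(n_rows_long = n_rows_long, n_cells = n_cells, n_averaged = n_averaged)
```

Any gap between `n_rows_long` and `n_cells` is the number of measurements that no longer appear as separate values in the wide table.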

For analysis, the scientists want to use the transformed wide dataset, with one row per participant.

My question is: which dataset should I use to generate descriptive summaries of the participants' BMI characteristics? Should I use the long format with all the BMI values, or the transformed wide dataset with one row per participant?

Thanks.

-Clif

f2harrell · September 23, 2025, 11:48am

There is no need to discard any BMIs. Instead, keep the dataset irregular, tall, and thin. Also, it is usually better to model weight as a function of time and baseline height.
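
One possible reading of this advice for the descriptive table, as a sketch only: summarize the long data at two levels, once per measurement and once per participant, without binning. The data frame `toy` and the column names in `per_id` are made up for illustration; this is base R and not necessarily the approach either poster had in mind.

```r
# Made-up long-format data, one row per measurement
toy <- data.frame(
  Participant_ID = c("C001", "C001", "C001", "C002", "C002"),
  Age            = c(0.40, 1.10, 1.80, 2.30, 5.60),
  BMI            = c(17.2, 16.1, 15.5, 16.8, 15.9)
)

# Measurement-level summary: every recorded BMI contributes
summary(toy$BMI)

# Participant-level summary: number of measurements and age span per person
per_id <- do.call(rbind, lapply(split(toy, toy$Participant_ID), function(d) {
  data.frame(
    Participant_ID = d$Participant_ID[1],
    n_measurements = nrow(d),
    first_age      = min(d$Age),
    last_age       = max(d$Age),
    mean_BMI       = mean(d$BMI)
  )
}))
rownames(per_id) <- NULL
per_id
```

Reporting both levels makes it clear how many participants and how many measurements sit behind each number in the table.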

CP3 · September 23, 2025, 2:11pm

Thank you Frank, will do.

f2harrell · September 23, 2025, 7:49pm

In terms of manipulating such datasets some R code tips may be found here.

CP3 · September 24, 2025, 2:46pm

Great resource. Thanks Frank.