What dataset to use for descriptive summaries: long vs wide

Hi Frank and team.

I am generating a descriptive summary table for participants in my study.

The outcome is BMI, a continuous variable with repeated measures. Measurements are unevenly spaced, and each participant may contribute multiple BMI values at different ages.
In the raw dataset, time is recorded as a continuous variable representing the participant’s age at the time of measurement.


set.seed(123)  # arbitrary seed so the simulated example is reproducible
n_subject <- 10
subject_ids  <- paste0("C", sprintf("%03d", 1:n_subject))

n_visits <- sample(3:6, n_subject, replace = TRUE)

# dataset
df_list <- lapply(seq_along(subject_ids), function(i) {
  n <- n_visits[i]
  data.frame(
    Participant_ID = subject_ids[i],
    Age     = sort(round(runif(n, min = 0, max = 9), 2)),  # continuous ages
    BMI     = round(rnorm(n, mean = 16, sd = 2), 2)
  )
})

df <- do.call(rbind, df_list)
rownames(df) <- NULL

The scientist has requested the following data processing steps:

  1. Discretize time into whole-year bins (0, 1, 2, 3, 4, …).
# Categorize age into whole-year bins (0-0.99, 1-1.99, 2-2.99, etc.) using cut()

df$Age_Group <- cut(
  df$Age,
  breaks = seq(0, 10, 1),   # 0–1, 1–2, …, 9–10
  right  = FALSE,           # include lower bound, exclude upper
  labels = 0:9
)
  2. Within each time bin, if a participant has more than one BMI value, take the average of those values. (If a participant has exactly one BMI value in a given bin, keep it as is.)

library(dplyr)  # provides %>%, group_by(), and summarise()

df_avg <- df %>%
  group_by(Participant_ID, Age_Group) %>%
  summarise(
    BMI_avg = mean(BMI, na.rm = TRUE),
    .groups = "drop"
  )

df_avg
  3. Transform the dataset to wide format, so that each participant is represented by a single row, with one column per age bin (e.g., bmi_0, bmi_1, bmi_2, …).

library(tidyr)  # provides pivot_wider()

# transform long → wide
df_wide <- df_avg %>%
  pivot_wider(
    id_cols = Participant_ID,
    names_from = Age_Group,
    values_from = BMI_avg,
    names_prefix = "bmi_"
  )

The transformed dataset has one row per participant, with BMI values either averaged within bins (if multiple exist) or carried over directly (if only one value in each time bin).
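
To make concrete what the binning and averaging change, here is a minimal sketch on a small hypothetical dataset (the `long` data frame and its values are made up for illustration, not taken from the study). It uses base R only and counts, per participant, how many age bins end up empty and how many cells were formed by averaging several BMI values.

```r
# Hypothetical long-format data: two participants, irregular ages
long <- data.frame(
  Participant_ID = c("C001", "C001", "C001", "C002", "C002"),
  Age = c(0.40, 0.85, 3.20, 1.10, 5.75),
  BMI = c(15.2, 16.0, 17.1, 14.8, 16.4)
)

# Same whole-year binning as above: [0,1), [1,2), ..., [9,10)
long$Age_Group <- cut(long$Age, breaks = seq(0, 10, 1),
                      right = FALSE, labels = 0:9)

# Participant-by-bin matrix of mean BMI; bins with no measurement are NA
wide <- tapply(long$BMI, list(long$Participant_ID, long$Age_Group), mean)

# Number of raw measurements behind each cell of the wide layout
counts <- table(long$Participant_ID, long$Age_Group)

rowSums(is.na(wide))  # empty bins per participant
sum(counts > 1)       # cells where several BMIs were collapsed into one mean
```

In a layout like this, most cells of the wide table are typically missing, and the original measurement ages are no longer recoverable from the bin labels.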

For the analysis, the scientists want to use the transformed wide-format dataset, with one row per participant.

My question is: which dataset should I use to generate descriptive summaries of participants' BMI characteristics? Should I use the long format with all the BMI values, or the transformed wide format with one row per participant?

Thanks.

-Clif

There is no need to discard any BMIs. Instead, keep the dataset irregular, tall, and thin. Also, it is usually better to model weight as a function of time and baseline height.
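
As a rough sketch of what descriptive summaries on the tall and thin data might look like, the following base R example works directly on long-format rows. The `long` data frame is hypothetical and only mimics the shape of the simulated data in the question; the choice of summaries (number of measurements, first and last age, overall BMI distribution) is an assumption about what the table should contain.

```r
# Hypothetical long-format data in the same shape as the simulated df
set.seed(1)  # arbitrary seed, for reproducibility only
long <- data.frame(
  Participant_ID = rep(c("C001", "C002", "C003"), times = c(3, 4, 2)),
  Age = round(runif(9, min = 0, max = 9), 2),
  BMI = round(rnorm(9, mean = 16, sd = 2), 2)
)

# Participant-level description: number of BMIs and age span of follow-up
n_obs <- table(long$Participant_ID)
per_subject <- data.frame(
  Participant_ID = names(n_obs),
  n_bmi     = as.vector(n_obs),
  age_first = as.vector(tapply(long$Age, long$Participant_ID, min)),
  age_last  = as.vector(tapply(long$Age, long$Participant_ID, max))
)

# Observation-level description: every recorded BMI contributes, none averaged away
bmi_summary <- summary(long$BMI)

per_subject
bmi_summary
```

Because nothing is binned or averaged, every measurement and its exact age remain available for plots or models of BMI against age.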


Thank you Frank, will do.

In terms of manipulating such datasets, some R code tips may be found here.


Great resource. Thanks Frank.