Search code examples
rpivotreshapelong-format-datawide-format-data

How to reshape wide format data to long format?


I have a wide format dataset, which contains year variables (yr1, yr2, yr3), and duration variables (yr1_time, yr2_time, yr3_time). yr1 ranges from 2023 to 2025. yr2 and yr3 equals to yr1+1 or +2, respectively.

id<-rep(c(1:20),times=1)
df1<-data.frame(id)
df1$yr1 <- sample(2022:2025, length(df1$id), replace=TRUE)
df1$yr1_time <- rnorm(n = 20, mean = 0, sd = 0.6)
df1$yr2 <- df1$yr1+1
df1$yr2_time <- rnorm(n = 20, mean = 0, sd = 0.6)
df1$yr3 <- df1$yr1+2
df1$yr3_time <- rnorm(n = 20, mean = 0, sd = 0.6)

print(df1)

#   id  yr1    yr1_time  yr2    yr2_time  yr3    yr3_time
# 1   1 2023 -0.18649844 2024  1.41458053 2025 -1.12031610
# 2   2 2025 -0.01977439 2026  0.68985414 2027 -0.69038076
# 3   3 2023 -0.08855173 2024  0.76039453 2025 -0.36913641
# 4   4 2023  0.28576478 2024 -0.35622031 2025  0.89810598
# 5   5 2024 -0.42831014 2025 -1.28914071 2026  0.44912268
# 6   6 2023 -1.02487195 2024 -0.27391726 2025 -0.62189347
# 7   7 2024  0.16888122 2025 -0.10572896 2026 -0.43966363
# 8   8 2025  0.80350550 2026  0.41403554 2027 -1.41913317
# 9   9 2023  0.59990953 2024 -0.42688373 2025 -0.73899889

How to shape the wide format to the long format? Here is my expected output:

   id  yr    yr_time  
    1 2023 -0.18649844 
    1 2024 1.41458053
    1 2025 -1.12031610
    2 2025 -0.01977439 
    2 2026  0.68985414
    2 2027 -0.69038076



Thanks!


Solution

  • You can do this by using pivot_longer() from the tidyr package to convert the data to long format, then several functions from the dplyr package to get to the final format you want.

    library(dplyr)
    library(tidyr)
    
    df1 |> 
      # Convert data to long format
      pivot_longer(cols = starts_with("yr")) |> 
      mutate(
        # If a row represents a year, assign its `value` to the `year` column
        year = if_else(name %in% c("yr1", "yr2", "yr3"), value, NA_real_),
        # Move all the values up one row (this moves the `yr_time` for each year 
        # into the same row as the corresponding year)
        yr_time = lead(value)
      ) |>
      # Remove the rows that don't contain a year value, since those rows aren't 
      # needed (and now contain the wrong `yr_time` values)
      filter(!is.na(year)) |> 
      # Arrange columns in the order you asked for
      select(id, year, yr_time)
    

    If I haven't explained each step well, try running each step separately and you should be able to see how it works.