Search code examples
rreshapereshape2

Reshape large dataset from wide to long with two ID variables


I want to change my data from long to wide format using two ID variables.

I have the below code that works with the below example dataset. However, when I run this code with a much larger dataset that I am working with, the code runs for a very long time and doesn't seem to finish running. When I use one ID variable the code runs fine, but I need to include two.

Is there a more efficient way of changing from long to wide format?

(I've also thought about creating an ID variable based on ID1 and ID2 for the purposes of converting from long to wide. Perhaps this is the best solution?)

Wide.vars <- names(df[,c("Date","V1")])


### 1. Reshape from wide to long format with two ID variables
df_wide <- reshape(as.data.frame(df),                                  
                     idvar = c("ID1","ID2"), 
                     direction = "wide",
                     v.names = Wide.vars,
                     timevar = "Timepoint")

Example data below (note that the dimensions of the example dataset are 15 rows 5 columns, whereas the dataset I'm working with is 15658 rows by 99 columns).

df <- structure(list(ID1 = c(5643923L, 5643923L, 5643923L, 3914822L, 
3914822L, 3914822L, 3914822L, 1156115L, 1506426L, 7183921L, 4753447L, 
4606792L, 8492773L, 8492773L, 8492773L), ID2 = c("02179", 
"02179", "04101", "00819", "00819", "00819", "00819", 
"01904", "01127", "00475", "02084", "04118", "15553", 
"15553", "15553"), Date = structure(c(16731, 16731, 
16731, 16732, 16733, 16733, 16733, 16733, 16733, 16733, 16733, 
16733, 16734, 16734, 16734), class = "Date"), Timepoint = structure(c(1L, 
3L, 1L, 1L, 3L, 4L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 4L), .Label = c("baseline", 
"wave0.5", "wave1", "wave2", "wave3", "wave4"), class = "factor"), V1 = c(0, 8, 4, 9.5, 7, 7, 12, 9, 11, 8.4, 
    7.8, 6.6, 5, 5.5, 8.9)), row.names = c(NA, 
-15L), groups = structure(list(CP1_t_210 = structure(1L, .Label = c("baseline", 
"wave0.5", "wave1", "wave2", "wave3", "wave4"), class = "factor"), 
    .rows = structure(list(1:15), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))



Solution

  • data.table is usually faster, you can try using dcast from it.

    library(data.table)
    dcast(setDT(df), ID1+ID2~Timepoint, value.var = c('Date', 'V1'))
    

    As suggested by @Mark Davies pivot_wider can also help.

    tidyr::pivot_wider(df, names_from = Timepoint, values_from = c(Date, V1))