How to restructure data from one observation per row to one row per ID (with multiple columns) in R?


Let's say I have a data frame with 3 ID columns and one column of interest. Each row represents one observation. Some IDs have multiple observations, i.e., multiple rows.

df <- data.frame(id1 = c(  1,   2,   3,   4,   4), 
                 id2 = c( 11,  12,  13,  14,  14), 
                 id3 = c(111, 112, 113, 114, 114), 
                 variable_of_interest = c(13, 24, 35, 31, 12))

  id1 id2 id3 variable_of_interest
1   1  11 111                   13
2   2  12 112                   24
3   3  13 113                   35
4   4  14 114                   31
5   4  14 114                   12

My goal is to restructure it so that there is one row per ID, keeping the 3 ID columns and naming the new columns "variable_of_interest1", "variable_of_interest2":

  id1 id2 id3 variable_of_interest1 variable_of_interest2
1   1  11 111                    13                    NA
2   2  12 112                    24                    NA
3   3  13 113                    35                    NA
4   4  14 114                    31                    12

The solution might need reshape2 and its dcast function, but so far I have not been able to work it out.


Solution

  • We can number the observations within each combination of the 'id' columns and then reshape to wide format with pivot_wider

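    library(dplyr)
    library(stringr)
    library(tidyr)
    library(data.table)  # only needed for rowid()
    df %>%
      # number the observations within each id1/id2/id3 combination
      mutate(ind = str_c('variable_of_interest', rowid(id1, id2, id3))) %>%
      # one column per observation number
      pivot_wider(names_from = ind, values_from = variable_of_interest)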
    

    -output

    # A tibble: 4 x 5
    #    id1   id2   id3 variable_of_interest1 variable_of_interest2
    #  <dbl> <dbl> <dbl>                 <dbl>                 <dbl>
    #1     1    11   111                    13                    NA
    #2     2    12   112                    24                    NA
    #3     3    13   113                    35                    NA
    #4     4    14   114                    31                    12
    

    Another option is dcast from data.table

    library(data.table)
    # setDT() converts df to a data.table by reference
    dcast(setDT(df), id1 + id2 + id3 ~
      paste0('variable_of_interest', rowid(id1, id2, id3)),
          value.var = 'variable_of_interest')
    

    -output

    #    id1 id2 id3 variable_of_interest1 variable_of_interest2
    #1:   1  11 111                    13                    NA
    #2:   2  12 112                    24                    NA
    #3:   3  13 113                    35                    NA
    #4:   4  14 114                    31                    12
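
Both approaches above follow the same two steps: number the observations within each ID combination, then spread those numbers into columns. As a rough sketch of that idea without extra packages, the same result can probably be reached in base R with ave() and reshape(). This is an alternative, not the method from the answer above, and the column name obs is a hypothetical helper introduced only for illustration.

```r
# Example data from the question
df <- data.frame(id1 = c(  1,   2,   3,   4,   4),
                 id2 = c( 11,  12,  13,  14,  14),
                 id3 = c(111, 112, 113, 114, 114),
                 variable_of_interest = c(13, 24, 35, 31, 12))

# Step 1: number the observations within each id1/id2/id3 combination.
# 'obs' is a hypothetical helper column, not part of the original data.
df$obs <- ave(df$variable_of_interest, df$id1, df$id2, df$id3,
              FUN = seq_along)

# Step 2: spread the numbered observations into columns.
# sep = "" appends the observation number directly to the column name.
wide <- reshape(df,
                idvar     = c("id1", "id2", "id3"),
                timevar   = "obs",
                direction = "wide",
                sep       = "")

print(wide)
```

IDs with fewer observations than the maximum are padded with NA, as in the pivot_wider and dcast results.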