Search code examples
rtidyrreshape2

Converting data from wide to long format when id variables are encoded in column header


I am relatively new to R and have data in wide format as follows

subject_id   age    sex  treat1.1.param1    treat1.1.param2   treat1.2.param1   treat1.2.param2
-----------------------------------------------------------------------------------------------
1             23     M         1                  2                  3                   4
2             25     W         5                  6                  7                   8

which is data on several subjects for which we have for a given treatment (here treat1) measures several parameters (here param1 and param2) over multiple rounds of repeated measurements (here round 1 and round 2). The information which treatment, round and parameter the entry for this subject belongs to is coded in the column header as exemplified above.

I would like to have the data in long format exemplified as follows:

subject_id  age sex treatment   round       param1      param2
------------------------------------------------------------------------------------------
1           23   M   treat1      1           1          2
1           23   M   treat1      2           3          4
2           25   W   treat1      1           5          6
2           25   W   treat1      2           7          8

That is, the id variable to identify a single observation are subject_id, treatment, round. But since the latter two variables are encoded in the column headers using dots as separators, I don't know how to move from the wide to long format as above. All tries with standard examples (using reshape2 or tidyr) have failed. Since in reality, I have 12 treatments with each 30 rounds and about 50 parameters per round, a relatively manual way of doing it would not help me too much.


Solution

  • We can use pivot_longer from tidyr specifying the names_to and names_pattern argument.

    tidyr::pivot_longer(df, 
                        cols = starts_with("treat"), 
                        names_to = c("treatmeant", "round", ".value"), 
                        names_pattern =  "(\\w+)\\.(\\d+)\\.(\\w+)")
    
    #  subject_id   age sex   treatmeant round param1 param2
    #       <int> <int> <fct> <chr>      <chr>  <int>  <int>
    #1          1    23 M     treat1     1          1      2
    #2          1    23 M     treat1     2          3      4
    #3          2    25 W     treat1     1          5      6
    #4          2    25 W     treat1     2          7      8
    

    data

    df <- structure(list(subject_id = 1:2, age = c(23L, 25L), sex = structure(1:2, 
    .Label = c("M", "W"), class = "factor"), 
    treat1.1.param1 = c(1L, 5L), treat1.1.param2 = c(2L, 6L), 
    treat1.2.param1 = c(3L, 7L), treat1.2.param2 = c(4L, 8L)), 
    class = "data.frame", row.names = c(NA, -2L))