Search code examples
rtransformation

Transforming wide data to long format with multiple variables


this may have a simple answer but after after a few hours of searching I still cannot find it. Basically I need to turn a wide dataset to a long format dataset but with multiple variables. My dataset structure looks like this:

df1 <- data.frame(id = c(1,2,3),
                  sex = c("M","F","M"),
                  day0s = c(21,25,15),
                  day1s = c(20,30,18),
                  day2s = c(18,18,17),
                  day0t = c(2,5,7),
                  day1t = c(3,6,5),
                  day2t = c(3,8,7))
df1
 id sex day0s  day1s  day2s day0t  day1t  day2t
 1   M    21     20     18     2      3      3
 2   F    25     30     18     5      6      8
 3   M    15     18     17     7      5      7

Basically 3 subjects have done a math test (s) and history test (t) every day for 3 days. I tried to use gather from tidyr to turn it into long form, but I don't know how to assign the mt and ht variables to the same day. I also coded a new variable day with just day0 = 0, day1 = 1 and day2 = 2.

dfl <- df1 %>%
  gather(day, value, - c(id,sex))
dfl
id sex  variable value  day
1   M   day0s     21    0
1   M   day1s     20    1
1   M   day2s     18    2
1   M   day0t      2    0
1   M   day1t      3    1
1   M   day2t      3    2
2   F   day0s     25    0
2   F   day1s     30    1
2   F   day2s     18    2
2   F   day0t      5    0
2   F   day1t      6    1
2   F   day2t      8    2
3   M   day0s     15    0
3   M   day1s     18    1
3   M   day2s     17    2
3   M   day0t      7    0
3   M   day1t      5    1
3   M   day2t      7    1

Ideally in the end it should look like this.

id sex   day   s     t
1   M     0    21    2
1   M     1    20    3
1   M     2    18    3
2   F     0    25    5
2   F     1    30    6
2   F     2    18    8
3   M     0    15    7
3   M     1    18    5
3   M     2    17    7

Do you please have any suggestions on how to achieve this?


Solution

  • You can use {tidyr}'s pivot_longer here.

    If your actual variables are named a bit differently, you can adapt the regex to your case. Here you can try out and adapt accordingly . (Note that in R the backslash has to be escaped, therefore the double backslash in \\d+ and \\w+)

    In general, the names_pattern argument works by matching the regex within the parenthesis with the names_to argument, so that here:

    • (\\d+) -> becomes variable day. Regex \d+ matches 1 or more digits.
    • (\\w+) -> becomes ".value". Regex \w+ matches 1 or more word character. Thanks to r2evans for pointing out the ".value" argument that spares one further reshape. The documentation states that .value "tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output)." While I don't fully grasp the documentation explanation, the results are that the matching regex are mapped to the variable names in the output data.
    library(dplyr)
    library(tidyr)
    
    df1 <- data.frame(id = c(1,2,3),
                      sex = c("M","F","M"),
                      day0mt = c(21,25,15),
                      day1mt = c(20,30,18),
                      day2mt = c(18,18,17),
                      day0ht = c(2,5,7),
                      day1ht = c(3,6,5),
                      day2ht = c(3,8,7))
    
    df1
    #>   id sex day0mt day1mt day2mt day0ht day1ht day2ht
    #> 1  1   M     21     20     18      2      3      3
    #> 2  2   F     25     30     18      5      6      8
    #> 3  3   M     15     18     17      7      5      7
    
    df1 %>%
      pivot_longer(cols = starts_with("day"),
                   names_pattern = "day(\\d+)(\\w+)",
                   names_to = c("day", ".value"))
    #> # A tibble: 9 x 5
    #>      id sex   day      mt    ht
    #>   <dbl> <chr> <chr> <dbl> <dbl>
    #> 1     1 M     0        21     2
    #> 2     1 M     1        20     3
    #> 3     1 M     2        18     3
    #> 4     2 F     0        25     5
    #> 5     2 F     1        30     6
    #> 6     2 F     2        18     8
    #> 7     3 M     0        15     7
    #> 8     3 M     1        18     5
    #> 9     3 M     2        17     7
    

    Created on 2021-06-20 by the reprex package (v2.0.0)

    Note that in newer versions of tidyr, gather and spread are deprecated and replaced by pivot_longer and pivot_wider.