this may have a simple answer but after after a few hours of searching I still cannot find it. Basically I need to turn a wide dataset to a long format dataset but with multiple variables. My dataset structure looks like this:
df1 <- data.frame(id = c(1,2,3),
sex = c("M","F","M"),
day0s = c(21,25,15),
day1s = c(20,30,18),
day2s = c(18,18,17),
day0t = c(2,5,7),
day1t = c(3,6,5),
day2t = c(3,8,7))
df1
id sex day0s day1s day2s day0t day1t day2t
1 M 21 20 18 2 3 3
2 F 25 30 18 5 6 8
3 M 15 18 17 7 5 7
Basically 3 subjects have done a math test (s) and history test (t) every day for 3 days. I tried to use gather from tidyr to turn it into long form, but I don't know how to assign the mt and ht variables to the same day. I also coded a new variable day with just day0 = 0, day1 = 1 and day2 = 2.
dfl <- df1 %>%
gather(day, value, - c(id,sex))
dfl
id sex variable value day
1 M day0s 21 0
1 M day1s 20 1
1 M day2s 18 2
1 M day0t 2 0
1 M day1t 3 1
1 M day2t 3 2
2 F day0s 25 0
2 F day1s 30 1
2 F day2s 18 2
2 F day0t 5 0
2 F day1t 6 1
2 F day2t 8 2
3 M day0s 15 0
3 M day1s 18 1
3 M day2s 17 2
3 M day0t 7 0
3 M day1t 5 1
3 M day2t 7 1
Ideally in the end it should look like this.
id sex day s t
1 M 0 21 2
1 M 1 20 3
1 M 2 18 3
2 F 0 25 5
2 F 1 30 6
2 F 2 18 8
3 M 0 15 7
3 M 1 18 5
3 M 2 17 7
Do you please have any suggestions on how to achieve this?
You can use {tidyr}'s pivot_longer
here.
If your actual variables are named a bit differently, you can adapt the regex to your case. Here you can try out and adapt accordingly . (Note that in R the backslash has to be escaped, therefore the
double backslash in \\d+
and \\w+
)
In general, the names_pattern
argument works by matching the regex within the parenthesis with the names_to
argument, so that here:
(\\d+)
-> becomes variable day. Regex \d+
matches 1 or more digits.(\\w+)
-> becomes ".value". Regex \w+
matches 1 or more word character. Thanks to r2evans for pointing out the ".value" argument that spares one further reshape. The documentation states that .value
"tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output)." While I don't fully grasp the documentation explanation, the results are that the matching regex are mapped to the variable names in the output data.library(dplyr)
library(tidyr)
df1 <- data.frame(id = c(1,2,3),
sex = c("M","F","M"),
day0mt = c(21,25,15),
day1mt = c(20,30,18),
day2mt = c(18,18,17),
day0ht = c(2,5,7),
day1ht = c(3,6,5),
day2ht = c(3,8,7))
df1
#> id sex day0mt day1mt day2mt day0ht day1ht day2ht
#> 1 1 M 21 20 18 2 3 3
#> 2 2 F 25 30 18 5 6 8
#> 3 3 M 15 18 17 7 5 7
df1 %>%
pivot_longer(cols = starts_with("day"),
names_pattern = "day(\\d+)(\\w+)",
names_to = c("day", ".value"))
#> # A tibble: 9 x 5
#> id sex day mt ht
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 M 0 21 2
#> 2 1 M 1 20 3
#> 3 1 M 2 18 3
#> 4 2 F 0 25 5
#> 5 2 F 1 30 6
#> 6 2 F 2 18 8
#> 7 3 M 0 15 7
#> 8 3 M 1 18 5
#> 9 3 M 2 17 7
Created on 2021-06-20 by the reprex package (v2.0.0)
Note that in newer versions of tidyr, gather
and spread
are deprecated and replaced by pivot_longer
and pivot_wider.