Search code examples
rdplyrpivottidyr

Pivot_longer() dataset by splitting the header's character strings into multiple columns


My data set contains values for multiple scenarios (S1, S2, S3) across multiple years, for a linking variable called link_col.

link_col <- c("A0001", "A0002", "A0003")
S1_ABC_2019 <- c(1000, 300, 5000)
S1_ABC_2020 <- c(1000, 300, 5000)
S1_ABC_2021 <- c(1000, 300, 5000)
S2_ABC_2012 <- c(800, 600, 2000)
S2_ABC_2013 <- c(800, 600, 2000)
S2_ABC_2019 <- c(800, 600, 2000)
S3_ABC_2013 <- c(600, 100, 900)
S3_ABC_2020 <- c(600, 100, 900)
S3_ABC_2021 <- c(600, 100, 900)

data <- data.frame(link_col, S1_ABC_2019, S1_ABC_2020, S1_ABC_2021, S2_ABC_2012,
                   S2_ABC_2013, S2_ABC_2019, S3_ABC_2013, S3_ABC_2020, S3_ABC_2021)

Using dplyr want to pivot_longer() this dataset by condensing these wide columns into one each for scenario, year, and value, to make figures in ggplot. This is what I'm after:

    link_col Scenario Year Value
1      A0001       S1 2019  1000
2      A0002       S1 2020   300
3      A0003       S1 2021  5000
4      A0001       S2 2012  1000
5      A0002       S2 2013   300
6      A0003       S2 2019  5000
7      A0001       S3 2013  1000
8      A0002       S3 2020   300
9      A0003       S3 2021  5000
10     A0001       S1 2019   800
11     A0002       S1 2020   600
12     A0003       S2 2013  2000
13     A0001       S2 2019   800
14     A0002       S3 2015   600
15     A0003       S3 2021  2000
16     A0001       S1 2019   800
17     A0002       S1 2020   600
18     A0003       S1 2021  2000
19     A0001       S2 2012   600
20     A0002       S2 2013   100
21     A0003       S2 2019   900
22     A0001       S3 2013   600
23     A0002       S3 2020   100
24     A0003       S3 2021   900
25     A0001       S1 2019   600
26     A0002       S1 2020   100
27     A0003       S2 2013   900
28     A0001       S2 2019  1100
29     A0002       S3 2015   400
30     A0003       S3 2021  4800
31     A0001       S1 2019  1100
32     A0002       S1 2020   400
33     A0003       S1 2021  4800
34     A0001       S2 2012   900
35     A0002       S2 2013   600
36     A0003       S2 2019  2100
37     A0001       S3 2013   900
38     A0002       S3 2020   600
39     A0003       S3 2021  2100
40     A0001       S1 2019   500
41     A0002       S1 2020   200
42     A0003       S2 2013   800
43     A0001       S2 2019   500
44     A0002       S3 2015   200
45     A0003       S3 2021   800

I've tried this but it doesn't work, and I don't really understand the use of the names_pattern parameter. Any help would be appreciated!

data2 <- pivot_longer(data, 
                      cols = link_col, starts_with("S"),
                      names_to = c("Scenario", "Year"),
                      names_pattern = "S(\\d+)_ABC_2(\\d+)",
                      values_to = Value)

Solution

  • You were pretty close:

    library(tidyr)
    pivot_longer(data, cols = -link_col,
                 names_pattern = "(S\\d+)_ABC_(2\\d+)",
                 names_to = c("Scenario", "Year"),
                 values_to = "Value")
    

    output

    # # A tibble: 27 × 4
    #    link_col Scenario Year  Value
    #    <chr>    <chr>    <chr> <dbl>
    #  1 A0001    S1       2019   1000
    #  2 A0001    S1       2020   1000
    #  3 A0001    S1       2021   1000
    #  4 A0001    S2       2012    800
    #  5 A0001    S2       2013    800
    #  6 A0001    S2       2019    800
    #  7 A0001    S3       2013    600
    #  8 A0001    S3       2020    600
    #  9 A0001    S3       2021    600
    # 10 A0002    S1       2019    300
    # # ℹ 17 more rows
    # # ℹ Use `print(n = ...)` to see more rows
    

    Some explanations:

    • In names_pattern, the parentheses are used to capture groups, and names_to will name those groups. You want one group for the Scenario, which is captured by including the "S", and your second group is for the years.
    • The cols argument needs all columns to pivot into a longer format, so here it's everything but the link_col.