Search code examples
rtidyrreshape

Reshape table from wide to long - nested keys


The goal is to convert this dataframe from wide to long format. The thing that I cannot figure out is how to handle this nested structure. My dataframe looks like this:

df <- data.frame(dwelling=c('A', 'B', 'C', 'D'),
                 nearbypanels2020_250m=c(12, 15, 19, 19),
                 nearbypanels2019_250m=c(22, 29, 18, 12),
                 nearbypanels2020_500m=c(23, 16, 22, 33),
                 nearbypanels2019_500m=c(24, 31, 19, 14),
                 heatpump2020=c(20,25,21,19),
                 heatpump2019=c(19,24,18,12))

The desired end result looks like the table below. For clarification: the 'nearbypanels' columns indicate the number of solar panels in a certain year, and radius of a dwelling, which needs to be accounted for. The heat pump variable has no radius indication, so will only get one column in the end result.

dwelling heatpump nearbypanels_250m nearbypanels_500m
A2019 19 22 24
A2020 20 12 23
B2019 24 29 31
B2020 25 15 16
C2019 18 18 19
C2020 21 19 22
etc .. ... ..

Solution

  • I would use pivot_longer() from the tidyr package. First, though, you would need to regularize the naming scheme so, for example, the year is the last four digits of the name. I did this by searching for four digits in a row and then moving them to the end of the string. I did this with gsub("(.*)(\\d{4})(.*)", paste("\\1","\\3", "\\2", sep=""), names(df)), which searches for anything (which is saved in \\1), then four digits (which are saved in \\2) and then anything else (if it exists, stored in \\3). Then, I paste the first and third values together then attach the year at the end. I use these to reset the names of the data frame. Then, in the pivot_longer() specification, you can identify that one of the variables that moves to long format is the year (i.e., the last four digits of the variable name) and that otherwise, the stems (the values before the year) should remain variables (separate columns) in the data frame.

    library(dplyr)
    library(tidyr)
    df <- data.frame(dwelling=c('A', 'B', 'C', 'D'),
                     nearbypanels2020_250m=c(12, 15, 19, 19),
                     nearbypanels2019_250m=c(22, 29, 18, 12),
                     nearbypanels2020_500m=c(23, 16, 22, 33),
                     nearbypanels2019_500m=c(24, 31, 19, 14),
                     heatpump2020=c(20,25,21,19),
                     heatpump2019=c(19,24,18,12)) 
    df %>%
      setNames(gsub("(.*)(\\d{4})(.*)", 
                  paste("\\1","\\3", "\\2", sep=""), 
                  names(df))) %>% 
      pivot_longer(-dwelling, 
                   names_pattern="(.*)(\\d{4})$", 
                   names_to=c(".value", "year")) %>% 
      mutate(dwelling = paste(dwelling, year, sep="")) %>% 
      select(-year)
    #> # A tibble: 8 × 4
    #>   dwelling nearbypanels_250m nearbypanels_500m heatpump
    #>   <chr>                <dbl>             <dbl>    <dbl>
    #> 1 A2020                   12                23       20
    #> 2 A2019                   22                24       19
    #> 3 B2020                   15                16       25
    #> 4 B2019                   29                31       24
    #> 5 C2020                   19                22       21
    #> 6 C2019                   18                19       18
    #> 7 D2020                   19                33       19
    #> 8 D2019                   12                14       12
    

    Created on 2023-06-28 with reprex v2.0.2