The goal is to convert this dataframe from wide to long format. The thing that I cannot figure out is how to handle this nested structure. My dataframe looks like this:
df <- data.frame(dwelling=c('A', 'B', 'C', 'D'),
nearbypanels2020_250m=c(12, 15, 19, 19),
nearbypanels2019_250m=c(22, 29, 18, 12),
nearbypanels2020_500m=c(23, 16, 22, 33),
nearbypanels2019_500m=c(24, 31, 19, 14),
heatpump2020=c(20,25,21,19),
heatpump2019=c(19,24,18,12))
The desired end result looks like the table below. For clarification: the 'nearbypanels' columns indicate the number of solar panels in a certain year, and radius of a dwelling, which needs to be accounted for. The heat pump variable has no radius indication, so will only get one column in the end result.
dwelling | heatpump | nearbypanels_250m | nearbypanels_500m |
---|---|---|---|
A2019 | 19 | 22 | 24 |
A2020 | 20 | 12 | 23 |
B2019 | 24 | 29 | 31 |
B2020 | 25 | 15 | 16 |
C2019 | 18 | 18 | 19 |
C2020 | 21 | 19 | 22 |
etc | .. | ... | .. |
I would use pivot_longer()
from the tidyr
package. First, though, you would need to regularize the naming scheme so, for example, the year is the last four digits of the name. I did this by searching for four digits in a row and then moving them to the end of the string. I did this with gsub("(.*)(\\d{4})(.*)", paste("\\1","\\3", "\\2", sep=""), names(df))
, which searches for anything (which is saved in \\1
), then four digits (which are saved in \\2
) and then anything else (if it exists, stored in \\3
). Then, I paste the first and third values together then attach the year at the end. I use these to reset the names of the data frame. Then, in the pivot_longer()
specification, you can identify that one of the variables that moves to long format is the year (i.e., the last four digits of the variable name) and that otherwise, the stems (the values before the year) should remain variables (separate columns) in the data frame.
library(dplyr)
library(tidyr)
df <- data.frame(dwelling=c('A', 'B', 'C', 'D'),
nearbypanels2020_250m=c(12, 15, 19, 19),
nearbypanels2019_250m=c(22, 29, 18, 12),
nearbypanels2020_500m=c(23, 16, 22, 33),
nearbypanels2019_500m=c(24, 31, 19, 14),
heatpump2020=c(20,25,21,19),
heatpump2019=c(19,24,18,12))
df %>%
setNames(gsub("(.*)(\\d{4})(.*)",
paste("\\1","\\3", "\\2", sep=""),
names(df))) %>%
pivot_longer(-dwelling,
names_pattern="(.*)(\\d{4})$",
names_to=c(".value", "year")) %>%
mutate(dwelling = paste(dwelling, year, sep="")) %>%
select(-year)
#> # A tibble: 8 × 4
#> dwelling nearbypanels_250m nearbypanels_500m heatpump
#> <chr> <dbl> <dbl> <dbl>
#> 1 A2020 12 23 20
#> 2 A2019 22 24 19
#> 3 B2020 15 16 25
#> 4 B2019 29 31 24
#> 5 C2020 19 22 21
#> 6 C2019 18 19 18
#> 7 D2020 19 33 19
#> 8 D2019 12 14 12
Created on 2023-06-28 with reprex v2.0.2