Search code examples
rloopsdata-munging

Iterating across sequences of columns (of same crop-related data) to build tidy rbind dataset in R


I am working with a messy wide format dataset of CROP data where each crop has six associated variables (crop type, crop acreage, nitrogen in soil, nitrogen applied, organic v. conventional, and notes). There are 50 sets of these 6 variables, 300 columns total, with the unfortunate naming convention: c.0. | a.0. | s.0. | f.0. | o.0. | r.0. |, .... , | c.23. | a.23 | s.23. | f.23. |o.23. | r.23. |.

I want to iterate across sequences of columns (the 6 columns associated with each crop), saving each iteration, or individual crop, as a list (or df) with the common colnames c(CROP, CROP_ACREAGE, SOIL_N, APP_N, TYPE, NOTES). Once all of the crops have been iterated through, I want to build a tidy df by rbinding the 50 individual crops datasets.

Below is an example dataset with only two sets of crop columns, four growers, across three years:

data <- tribble(
~grower, ~ YEAR, ~c.0., ~a.0., ~s.0., ~f.0., ~o.0., ~r.0., ~c.10., ~a.10, ~s.10., ~f.10., ~o.10., ~r.10.,
"Bob", 2014, "Kale, Baby", 7.0, 87.0, 126.0, "C", "", "Carrot", 16.0, 47.8, 137.0, "O", "",
"Janet", 2015, "Broccoli", 18.0, 68.2,162.0, "O", "", "Garlic", 25.0, 9.1, 152.3, "C", "",
"Chris", 2014, "Cabbage", 34.2, 8.6, 200.7, "C", "", "Cauliflower", 105.2, 113.0, 199.4, "O", "",
"Ted", 2016, "Kale", 12.2, 11.9, 120.2, "C", "", "Lettuce, Head", 55.2, 113.0, 166.5, "C", "NY"
)

I've built code that will work... but that will be SUPER inefficient:

build <- function(choice, iter){
  
  a <- data %>% select(choice) %>% 
  rename(CROP = 3,
         ACREAGE = 4,
         N_SOIL = 5,
         N_APPLIED = 6,
         O_C = 7,
         NOTES = 8)
  
  saveRDS(a, file = paste0("./intermediate-data/",iter,".RDS"))

}

build(choice = c(1:8), iter = "crop1")
build(choice = c(1:2,9:14), iter = "crop2")
# use build() function until all 50 crops have been built into separate df's

Then I'd pull these all in from the intermediate-data directory and rbind().

But I know there has got to be a more efficient way... where I can iterate across sets of 6-crop-columns without specifying them directly. Any ideas?


Solution

  • We may reshape to 'long' format with pivot_longer

    library(dplyr)
    data %>% 
       pivot_longer(cols = contains("."), names_to = c(".value", "grp"), 
             names_pattern = "^([a-z])\\.(\\d+)\\.") %>%
       rename(CROP = c, ACREAGE = a, N_SOIL = s, N_APPLIED = f, O_C = o, NOTES = r)
    

    -output

    # A tibble: 12 × 9
       grower  YEAR grp   CROP          ACREAGE N_SOIL N_APPLIED O_C   NOTES
       <chr>  <dbl> <chr> <chr>           <dbl>  <dbl>     <dbl> <chr> <chr>
     1 Bob     2014 0     Kale, Baby        7     87        126  C     ""   
     2 Bob     2014 10    Carrot           NA     47.8      137  O     ""   
     3 Bob     2014 <NA>  <NA>             NA     NA         NA  <NA>   <NA>
     4 Janet   2015 0     Broccoli         18     68.2      162  O     ""   
     5 Janet   2015 10    Garlic           NA      9.1      152. C     ""   
     6 Janet   2015 <NA>  <NA>             NA     NA         NA  <NA>   <NA>
     7 Chris   2014 0     Cabbage          34.2    8.6      201. C     ""   
     8 Chris   2014 10    Cauliflower      NA    113        199. O     ""   
     9 Chris   2014 <NA>  <NA>             NA     NA         NA  <NA>   <NA>
    10 Ted     2016 0     Kale             12.2   11.9      120. C     ""   
    11 Ted     2016 10    Lettuce, Head    NA    113        166. C     "NY" 
    12 Ted     2016 <NA>  <NA>             NA     NA         NA  <NA>   <NA>
    >