Search code examples
rdplyrtidyversecoalescedata-dictionary

How to effectively append datasets using a dictionary (with R/dplyr)? / How to coalesce 'all columns with duplicate names'?


I have a series of data sets and a dictionary to bring these together. But I'm struggling to figure out how to automate this.

Suppose this data and dictionary (actual one is much longer, thus I want to automate):

mtcarsA <- mtcars[1:5,] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[6:10,] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()

dic <- tibble(true_name  = c("mpg_true", "cyl_true"), 
              nameA = c("mpgA", "cyl_A"), 
              nameB = c("mpg_B", "B_cyl")
)

I want these datasets (from years A and B) appended to one another, and then to have the names changed or coalesced to the 'true_name' values.

I can bring the data sets together into mtcars_all, and then I tried recoding the column names with the dictionary as follows


mtcars_all <- bind_rows((mtcarsA, mtcarsB)

recode_colname <- function(df, tn=dic$true_name, fname){
  colnames(df) <-  dplyr::recode(colnames(df),
                                !!!setNames(as.character(tn), fname))
  return(df)
  }

mtcars_all <- mtcars_all %>%
  recode_colname(fname=dic$nameA) %>%
  recode_colname(fname=dic$nameB)

But then I get duplicate columns. Of course I could coalesce each of these duplicate columns by name, but there will be many of these in my real case, so I want to automate 'coalesce all columns with duplicate names'.

I'm giving the entire problem here because perhaps someone also has a better solution for 'using a data dictionary'.


Solution

  • You can create a named vector to replace column names.

    library(tidyverse)
    
    pmap(dic, ~setNames(..1, paste0(c(..2, ..3), collapse = '|'))) %>%
      flatten_chr() -> val
    
    val
    # mpgA|mpg_B cyl_A|B_cyl 
    # "mpg_true"  "cyl_true" 
    

    And apply it on list of dataframes and combine them.

    list(mtcarsA,mtcarsB) %>%
      map_df(function(x) x %>% rename_with(~str_replace_all(.x, val)))
    
    #   mpg_true cyl_true  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #      <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    # 1     21          6  160    110  3.9   2.62  16.5     0     1     4     4
    # 2     21          6  160    110  3.9   2.88  17.0     0     1     4     4
    # 3     22.8        4  108     93  3.85  2.32  18.6     1     1     4     1
    # 4     21.4        6  258    110  3.08  3.22  19.4     1     0     3     1
    # 5     18.7        8  360    175  3.15  3.44  17.0     0     0     3     2
    # 6     18.1        6  225    105  2.76  3.46  20.2     1     0     3     1
    # 7     14.3        8  360    245  3.21  3.57  15.8     0     0     3     4
    # 8     24.4        4  147.    62  3.69  3.19  20       1     0     4     2
    # 9     22.8        4  141.    95  3.92  3.15  22.9     1     0     4     2
    #10     19.2        6  168.   123  3.92  3.44  18.3     1     0     4     4