Search code examples
rdplyrpiperename

How to rename multiple columns that have numeric column names using dplyr?


Setup

Suppose I have a dataframe in which several columns have actual numeric values as the column names instead of text-based column names. For example, the column names would be `2015` instead of "2015".

Here is a reproducible example:

my_df = structure(list(Col1 = c('a', 'b', 'c'), 
                       Col2 = c('d', 'e', 'f'), 
                       `2015` = c('g','h','i'), 
                       `2016` = c('j','k','l'), 
                       `2017` = c('m','n','o'), 
                       `2018` = c('p','q','r'), 
                       `2019` = c('s','t','u'), 
                       `2020` = c('v','w','x'), 
                       `2021` = c('y','z','zz')), 
               row.names = c(NA, -3L), 
               class = c("tbl_df", "tbl", "data.frame"))

Question

Suppose further that I want to rename all of the numeric column names to something friendlier. Say, for example, from `2015` to "XYZ_2015" (note the conversion to string) and similar conversions to columns `2016`, `2017`, ... `2021`.

How can I perform this column renaming in a way that uses dplyr's piping operator (%>%) and doesn't require me writing them all out manually?

My current solution

Up till now, my approach has been to do this "manually", renaming each column individually:

new_df = my_df %>%
  rename(XYZ_2015 = `2015`,
         XYZ_2016 = `2016`,
         XYZ_2017 = `2017`,
         XYZ_2018 = `2018`,
         XYZ_2019 = `2019`,
         XYZ_2020 = `2020`,
         XYZ_2021 = `2021`)

However, this method is cumbersome and prone to errors. Is there a way for me to do so in a more automated way? I feel like a for-loop would work here, but I can't figure out the syntax to make the for-loop's variable play nice with the the backticks.

For example, I've tried this:

for(year in 2015:2021){
  print(year)
  new_colname = paste0('XYZ_',year)
  my_df = my_df %>% rename(`new_colname` = `year`)
}

But this yields an error:

Error in `stop_subscript()`:
! Can't rename columns that don't exist.
x Location 2015 doesn't exist.
i There are only 9 columns.
---
Backtrace:
  1. my_df %>% rename(new_colname = year)
  3. dplyr:::rename.data.frame(., new_colname = year)
  4. tidyselect::eval_rename(expr(c(...)), .data)
  5. tidyselect:::rename_impl(...)
  6. tidyselect:::eval_select_impl(...)
 15. tidyselect:::vars_select_eval(...)
 16. tidyselect:::loc_validate(pos, vars, call = error_call)
 17. vctrs::vec_as_location(pos, n = length(vars))
 18. vctrs `<fn>`()
 19. vctrs:::stop_subscript_oob(...)
 20. vctrs:::stop_subscript(...)

Solution

  • We can use rename_with and paste (str_c) the prefix XYZ_ with the column name (.x) only for those column names that matches with 4 digit (\\d{4}) column name from start (^) to end ($) of string

    library(dplyr)
    library(stringr)
    my_df %>%
       rename_with(~ str_c("XYZ_", .x), matches("^\\d{4}$"))
    

    -output

    # A tibble: 3 × 9
      Col1  Col2  XYZ_2015 XYZ_2016 XYZ_2017 XYZ_2018 XYZ_2019 XYZ_2020 XYZ_2021
      <chr> <chr> <chr>    <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
    1 a     d     g        j        m        p        s        v        y       
    2 b     e     h        k        n        q        t        w        z       
    3 c     f     i        l        o        r        u        x        zz      
    

    Note that rename_with usage is

    rename_with(.data, .fn, .cols = everything(), ...)

    and .cols are specified as

    .cols - <<tidy-select>> Columns to rename; defaults to all columns.

    This implies that we can make use of any of the tidy-select helper functions (matches/starts_with/ends_with/everything()) etc for selecting the columns


    Or with base R

    names(my_df) <- sub("^X", "XYZ_", make.names(names(my_df)))