
Efficient way to add numbers to alphanumeric strings in R


I have a data.frame with ids composed of sequences of alphanumeric characters (e.g., id = c("A001", "A002", "B013")). I was looking for a simple function in stringr or stringi that would do arithmetic on these strings (id + 1 should return c("A002", "A003", "B014")).

I wrote a custom function that does the trick, but I have a feeling there must be a better, more efficient, or package-provided way to achieve this.

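library(dplyr)  # %>%, mutate(), enquo()
library(tidyr)  # separate(), unite()

str_add_n <- function(df, string, n, width = 3) {

  # capture the bare column name for tidy evaluation
  string <- enquo(string)

  df %>%
    # split at the boundary between the letters and the digits
    separate(!!string,
             into = c("text", "num"),
             sep = "(?<=[A-Za-z])(?=[0-9])",
             remove = FALSE
    ) %>%
    # add n to the numeric part, then left-pad with zeros back to `width`
    mutate(num = as.numeric(num) + n,
           num = stringr::str_pad(as.character(num),
                                  width = width,
                                  side = "left",
                                  pad = "0")
    ) %>%
    # glue the two parts back together
    unite(next_string, text:num, sep = "")
}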
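
For reference, here is a rough base R sketch of what the lookaround pattern in sep does, run on a plain vector instead of through tidyr::separate(). The variable names (ids, pos, text, num, next_string) are only illustrative, and it assumes every id is letters followed by digits:

```r
ids <- c("A001", "A002", "B013")

# zero-width position that is preceded by a letter and followed by a digit
pos <- as.integer(regexpr("(?<=[A-Za-z])(?=[0-9])", ids, perl = TRUE))

text <- substr(ids, 1, pos - 1)       # letters before the boundary
num  <- substr(ids, pos, nchar(ids))  # digits from the boundary onward

# increment, then zero-pad back to the original number of digits
next_string <- paste0(text,
                      sprintf(paste0("%0", nchar(num), "d"),
                              as.integer(num) + 1L))
next_string
# [1] "A002" "A003" "B014"
```

The pattern matches no characters itself; it only marks the position where the split happens, which is why nothing is lost from either piece.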

Let's make a toy df

df <- data.frame(id = c("A001", "A002", "B013"))
str_add_n(df, id, 1)
    id next_string
1 A001        A002
2 A002        A003
3 B013        B014

Again, this works, but I'm wondering whether there's a better way to do it. All tweaks welcome!

UPDATE

Based on the suggested answers, I ran some benchmarks, and the two come very close. I lean toward str_add_n_2 (I renamed it so that both versions could run side by side, and I took the suggestion of adding x <- as.character(x)).

microbenchmark::microbenchmark(question = str_add_n(df, id, 1),
 answer = df %>% mutate_at(vars(id), funs(str_add_n_2(., 1))),
 string_add = df %>% mutate_at(vars(id), funs(string_add(as.character(.)))))

Which yields

Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval cld
   question 4.312094 4.448391 4.695276 4.570860 4.755748 10.29253   100   c
     answer 2.932146 3.017874 3.191262 3.117627 3.240688  8.24967   100 a  
 string_add 3.388442 3.466466 3.699363 3.534416 3.682762  9.05441   100  b 

More tweaks are welcome!


Solution

  • I'd suggest it's easier to define the function on a vector of strings rather than hard-coding it to look for columns in a data frame; for the latter, you can always use something like mutate_at(vars(id, ...), funs(str_add_n)).

    str_add_n <- function(x, n = 1L) {
      gr <- gregexpr("\\d+", x)
      reg <- regmatches(x, gr)
      widths <- nchar(reg)
      regmatches(x, gr) <- sprintf(paste0("%0", widths, "d"), as.integer(reg) + n)
      x
    }
    
    vec <- c("A001", "A002", "B013")
    str_add_n(vec)
    # [1] "A002" "A003" "B014"
    

    If in a frame:

    df <- data.frame(id = c("A001", "A002", "B013"), x = 1:3,
                     stringsAsFactors = FALSE)
    library(dplyr)
    df %>%
      mutate_at(vars(id), funs(str_add_n(., 3)))
    #     id x
    # 1 A004 1
    # 2 A005 2
    # 3 B016 3
    

    Caveat: this silently assumes x is a true character vector, not a factor. A possible defensive tactic is to add x <- as.character(x) at the top of the function definition.


    Update: mutate_at has been superseded; the preferred approach uses across:

    df %>% mutate(across(c(id), ~ str_add_n(., 3)))
    

    or more directly

    df %>% mutate(id = str_add_n(id, 3))
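
Following up on the factor caveat, here is a minimal sketch of a defensive variant. The name str_add_n_safe is made up for illustration. It coerces the input with as.character() first and, unlike the function above, uses regexpr() so that only the first run of digits in each string is touched:

```r
str_add_n_safe <- function(x, n = 1L) {
  x <- as.character(x)             # factors become plain strings
  m <- regexpr("[0-9]+", x)        # first run of digits in each element
  digits <- regmatches(x, m)       # the matched digits, as character
  # one format per element ("%03d" for "001") keeps the original zero padding
  fmt <- paste0("%0", nchar(digits), "d")
  regmatches(x, m) <- sprintf(fmt, as.integer(digits) + as.integer(n))
  x
}

str_add_n_safe(factor(c("A001", "A002", "B013")), 3)
# [1] "A004" "A005" "B016"
```

Note that the width in the format is a minimum, not a cap, so an id that overflows its digits simply grows: str_add_n_safe("A999") gives "A1000".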