Issue with stringr::str_replace function when patterns are present as a column in dataframe


I need to replace each value in column a of a data frame with the corresponding element of column b.

Column a : c("abc*^!", "abcde+", "abcde123********++++++", "Post TCZ 6 hours (+-3hrs)")

Column b : c('xx', 'yy', 'zz', 'aa')

If I call stringr::str_replace_all directly, it does not work, because the * and + symbols in column a are treated as regex metacharacters. To work around this, I wrote a function that escapes the special characters so that the pattern matching, and therefore the string replacement, works as expected. However, it looks like str_replace does not like reading the patterns from a column in the data frame. Is there a way to achieve this?
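
A minimal sketch of the behaviour described above, using one of the values from column a. When the pattern is passed as a plain string, stringr interprets it as a regular expression, so the trailing + acts as a quantifier on e instead of matching a literal plus sign. The variable name `result` is only for illustration.

```r
library(stringr)

# As a regex, "abcde+" means "abcd" followed by one or more "e".
# The literal "+" at the end of the input is not part of the match,
# so it is left behind after the replacement.
result <- str_replace_all("abcde+", "abcde+", "yy")
result
#> [1] "yy+"
```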

Note: I need to keep this approach because it extends existing code that is in git and used by many other teams.

Here is the evidence that the regex patterns created by the escape_spl_chars function (column VISIT_INSTACE1) are the correct matching patterns for column a. Please let me know if anyone can shed some light on this.

library(dplyr)  # provides %>% and mutate()

x <- data.frame(a = c("abc*^!", "abcde+", "abcde123********++++++", "Post TCZ 6 hours (+-3hrs)"),
                b = c('xx', 'yy', 'zz', 'aa'), stringsAsFactors = FALSE)

escape_spl_chars <- function(arg1){
  
  return_x <- sapply(strsplit(arg1, "", fixed = TRUE), function(y) {
    pasted_chars <- sapply(y, function(char) {
      # Convert character to ASCII code
      ascii_code <- as.integer(charToRaw(char))
      
      # Check if it's a special character and escape it
      if ((ascii_code >= 33 & ascii_code <= 47) | 
          (ascii_code >= 58 & ascii_code <= 64) | 
          (ascii_code >= 91 & ascii_code <= 96) | 
          (ascii_code >= 123 & ascii_code <= 126)) {
        return(paste0("\\\\", char))  # Escape special character
      } else if (ascii_code == 32){
        return(paste0("\\\\", 's')) # Escape space character
      } else {
        return(char)  # Return normal character
      }
    })
    
    # Collapse the characters back into a single string
    paste0(pasted_chars, collapse = "")
  })
  
  return(return_x)
  
}

x1 <- x %>% mutate(VISIT_INSTACE1 = escape_spl_chars(a))

x1 <- x1 %>% dplyr::mutate(newvisitcode = stringr::str_replace_all(a, stringr::str_trim(VISIT_INSTACE1), b))

Solution

  • Wrap the pattern in fixed() so that it is matched as a literal string instead of being interpreted as a regex. No escaping is needed.

    library(stringr)
    library(dplyr)
    
    x %>% mutate(newvisitcode=str_replace_all(string=a, pattern=fixed(a), replacement=b))
    
                              a  b newvisitcode
    1                    abc*^! xx           xx
    2                    abcde+ yy           yy
    3    abcde123********++++++ zz           zz
    4 Post TCZ 6 hours (+-3hrs) aa           aa
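
As a side note on why the escaped patterns in the question may not have matched: in R source, "\\\\" is a string of two backslash characters. So paste0("\\\\", char) appears to build a regex that means "a literal backslash" followed by the original metacharacter, which is still unescaped. A minimal sketch, assuming a single backslash ("\\" in R source) is what escape_spl_chars intended. The variable names are only for illustration.

```r
library(stringr)

# Two backslash characters followed by "*": to the regex engine this is
# an escaped backslash with a "*" quantifier, not a literal asterisk.
double_escaped <- paste0("\\\\", "*")
nchar(double_escaped)
#> [1] 3

# One backslash character followed by "*": an escaped, literal asterisk.
single_escaped <- paste0("\\", "*")
nchar(single_escaped)
#> [1] 2

# The single-backslash form matches the literal "*" in the input.
replaced <- str_replace_all("abc*", single_escaped, "-")
replaced
#> [1] "abc-"
```

If escaping is preferred over fixed(), recent stringr releases also provide str_escape() for this purpose. Whether it is available depends on the installed version.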