Tags: r, regex, string

Combining Regex and Non-Regex in the same function


I have a dataframe:

mydf <- data.frame(
  col1 = c("54", "abc", "123", "54 abc", "zzz", "a", "99"),
  col2 = c("100", "200", "300", "400", "500", "600", "700"),
  stringsAsFactors = FALSE
)

In this dataframe, I want to replace all elements with NA unless they meet one of these conditions:

  • consist only of digits (e.g. keep "54", discard "54 abc")
  • exactly match an element of target_string

I was not sure how to do this in R using apply, so I tried to write a loop:

target_string <- c("a", "zzz")

replace_with_na_old <- function(df, target_string) {
  for (i in 1:nrow(df)) {
    for (j in 1:ncol(df)) {
      value <- df[i, j]
      if (!grepl("^[0-9]+$", value) && !(value %in% target_string)) {
        df[i, j] <- NA
      }
    }
  }
  return(df)
}

mydf_cleaned_old <- replace_with_na_old(mydf, target_string)

Is there another way to do this?

Note: Here is a variant that replaces the exact %in% match with %like%-style pattern matching, using grepl on each element of target_string:

replace_with_na_new <- function(df, target_string) {
  for (i in 1:nrow(df)) {
    for (j in 1:ncol(df)) {
      value <- df[i, j]
      if (!grepl("^[0-9]+$", value) && !any(sapply(target_string, function(pattern) grepl(pattern, value)))) {
        df[i, j] <- NA
      }
    }
  }
  return(df)
}

Solution

  • You already have the logic needed for this check; all you need to do is vectorize it.

    replace_with_na <- function(value, target_string) {
      value[!(grepl('^\\d+$', value) | value %in% target_string)] <- NA
      value
    }
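
    To see why the one-liner works, it may help to build the logical mask step by step on col1 alone. This is only an illustrative sketch; the intermediate names is_number, is_target, and keep are not part of the function above.

    ```r
    target_string <- c("a", "zzz")
    x <- c("54", "abc", "123", "54 abc", "zzz", "a", "99")

    # TRUE where the whole string consists of digits
    is_number <- grepl("^\\d+$", x)
    # TRUE where the string exactly equals one of the targets
    is_target <- x %in% target_string

    # Keep an element if either condition holds; blank out the rest
    keep <- is_number | is_target
    x[!keep] <- NA
    x
    # [1] "54"  NA    "123" NA    "zzz" "a"   "99"
    ```

    Both grepl and %in% return one logical value per element, so no explicit loop over rows is needed.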
    

    Now you can apply this function to each column using any of the apply* functions in base R, for example lapply:

    new_df <- mydf
    new_df[] <- lapply(mydf, replace_with_na, target_string)
    new_df
    
    #  col1 col2
    #1   54  100
    #2 <NA>  200
    #3  123  300
    #4 <NA>  400
    #5  zzz  500
    #6    a  600
    #7   99  700
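
    If you would rather not define a helper, one possible alternative is to build a logical matrix with sapply and use it to index the data frame directly. This is a sketch that assumes every column is character, as in mydf; the names keep and new_df2 are only illustrative.

    ```r
    mydf <- data.frame(
      col1 = c("54", "abc", "123", "54 abc", "zzz", "a", "99"),
      col2 = c("100", "200", "300", "400", "500", "600", "700"),
      stringsAsFactors = FALSE
    )
    target_string <- c("a", "zzz")

    # One logical column per data frame column: TRUE means "keep this cell"
    keep <- sapply(mydf, function(x) grepl("^\\d+$", x) | x %in% target_string)

    # A logical matrix with the same dimensions can index a data frame
    new_df2 <- mydf
    new_df2[!keep] <- NA
    new_df2
    ```

    This is the same indexing idiom as df[is.na(df)] <- 0, only with a custom condition.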
    

    Or, if you prefer dplyr, you can use across for a similar result (the \(x) lambda shorthand requires R 4.1 or later).

    library(dplyr)
    mydf %>% mutate(across(everything(), \(x) replace_with_na(x, target_string)))
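
    The function above covers the exact-match (%in%) version. For the pattern-matching variant from the note in the question, one way to vectorize it is to collapse the patterns into a single alternation for grepl. This is a rough sketch: replace_with_na_pattern and target_patterns are hypothetical names, and it assumes each element of target_patterns is a valid regular expression.

    ```r
    replace_with_na_pattern <- function(value, target_patterns) {
      # With no patterns, nothing can match; an empty regex would match everything
      if (length(target_patterns) > 0) {
        matches_pattern <- grepl(paste(target_patterns, collapse = "|"), value)
      } else {
        matches_pattern <- rep(FALSE, length(value))
      }
      # Keep all-digit strings and anything matching at least one pattern
      value[!(grepl("^\\d+$", value) | matches_pattern)] <- NA
      value
    }

    x <- c("54", "abc", "xyz", "54 abc", "zzz")
    replace_with_na_pattern(x, c("a", "zzz"))
    # [1] "54"     "abc"    NA       "54 abc" "zzz"
    ```

    Note that substring matching is looser than %in%: with the patterns "a" and "zzz", both "abc" and "54 abc" are kept because they contain "a". It can be applied per column with lapply in the same way as replace_with_na.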