Search code examples
rdplyrgrepl

R grepl with dynamic search pattern


I have a data frame, df, having one column of different names. I have variable data frames, e.g. search_df or search_df1 containing search words which I would like to search via regex in the name column. If the word has been found write it into a new column, e.g. df_final$which_word_search_df. If more than one word has been found I would like to paste the results together. The result should look like df_final.

# load packages
pacman::p_load(tidyverse)

# words I would like to search for
search_df <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))

# data frame which is the basis for my search
df <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"))

# how I expect the final result to look like
df_final <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"),
                       which_word_search_df = c("apple", "apple; peach", "peach", "peach", NA, NA),
                       which_word_search_df1 = c(NA, NA, "peach", "peach", "banana", "banana"))

That is my current solution but as you can see it is not dynamic. I type in manually every search word instead of automatically going through all the search words.

df_trial <- df %>% 
  mutate(which_search_word_trial = ifelse(grepl("apple", name, ignore.case = T), "apple", ""),
         which_search_word_trial = ifelse(grepl("peach", name, ignore.case = T), 
                                          paste(which_search_word_trial, "peach", sep = ";"), which_search_word_trial)
  )

The example I am sharing is just a minimal one. For the actual use case df will have ~200k rows and my search_df will have ~1k rows.


Solution

  • We can do the following.

    library(dplyr)
    library(stringr)
    
    df %>%
      mutate(which_word_search_df = str_extract_all(name,str_c(search_df$search_words, collapse = '|')),
             which_word_search_df1 = str_extract_all(name, str_c(search_df1$search_words, collapse = '|')))
    
    #         name which_word_search_df which_word_search_df1
    # 1   apple123                apple                      
    # 2 applepeach         apple, peach                 peach
    # 3  peachtime                peach                 peach
    # 4    peachab                peach                 peach
    # 5  bananarrr                                     banana
    # 6   bananaxy                                     banana