r, regex, string, tidyverse, stringr

Extract *all* possible patterns in a variable


I have a large variable containing strings (words). I need to extract all substrings that match any of the patterns listed in a separate vector.

library(tidyverse)

df <- data.frame(Word = c("hope", "freeze", "free"))

patterns <- "hope|freeze|free|du|li|un|de|em|bi|en|im|ro|gi|ai|ag|wo|ab|di|ac|eu|ic|se|al|ob|ig|es|ef|sy|ep|ec|y|u|e|o|a|h|i"

df %>%
  mutate(simple = str_extract_all(Word, patterns))

However, the result appears to depend on the order of the alternatives in the pattern: the function returns whichever alternative is listed first and matches. For example, with the order shown above (longest alternatives first), the result is:

    Word simple
1   hope   hope
2 freeze freeze
3   free   free

If the order is reversed (i.e., ascending order with respect to length), the result changes:

patterns2 <- "y|u|e|o|a|h|i|du|li|un|de|em|bi|en|im|ro|gi|ai|ag|wo|ab|di|ac|eu|ic|se|al|ob|ig|es|ef|sy|ep|ec|hope|freeze|free"

df %>%
  mutate(simple = str_extract_all(Word, patterns2))

    Word  simple
1   hope h, o, e
2 freeze  freeze
3   free    free
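
The order dependence can be reproduced on a much smaller pattern. The sketch below assumes stringr's default ICU regex engine, which tries the alternatives left to right at each position, keeps the first one that succeeds, and resumes searching after the end of that match. The shortened patterns and the variable names are illustrative only.

```r
library(stringr)

# Longest alternative first: "hope" matches at position 1 and consumes
# the whole word, so nothing is left for "h", "o" or "e" to match.
long_first <- str_extract_all("hope", "hope|h|o|e")[[1]]

# Shortest alternatives first: "h" wins at position 1, then "o", then "e"
# ("p" matches nothing), so "hope" never gets a chance to match.
short_first <- str_extract_all("hope", "h|o|e|hope")[[1]]

long_first   # "hope"
short_first  # "h" "o" "e"
```

Because matches returned by a single regex search never overlap, no ordering of the alternatives can return both "hope" and its parts in one call.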

Is there a way to get all potential patterns, regardless of the order of the patterns? Here's the desired output:

    Word  simple
1   hope  h, o, e, hope
2 freeze  freeze
3   free    free

Solution

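  • Split the pattern string on "|" into a vector of sub-patterns, then test each sub-pattern against each word separately and keep the ones that match. Because every sub-pattern is checked independently, the order of the patterns no longer matters and overlapping matches are all kept.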

    pat_vec <- str_split_1(patterns, fixed('|'))
    
    df %>%
      mutate(simple = lapply(Word, \(x) pat_vec[str_which(x, pat_vec)]))
    
    #     Word          simple
    # 1   hope   hope, e, o, h
    # 2 freeze freeze, free, e
    # 3   free         free, e
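
For setups where str_split_1() (added in stringr 1.5.0) or the \(x) lambda syntax (R 4.1.0 and later) is not available, the same idea can be sketched in base R. This is a variant under assumptions, not part of the answer above: find_all is a hypothetical helper name, the sub-patterns are treated as literal strings rather than regexes, and the matches are collapsed into one comma-separated string per word instead of a list column.

```r
df <- data.frame(Word = c("hope", "freeze", "free"), stringsAsFactors = FALSE)

patterns <- "hope|freeze|free|du|li|un|de|em|bi|en|im|ro|gi|ai|ag|wo|ab|di|ac|eu|ic|se|al|ob|ig|es|ef|sy|ep|ec|y|u|e|o|a|h|i"

# Split on the literal "|" to get one sub-pattern per element
pat_vec <- strsplit(patterns, "|", fixed = TRUE)[[1]]

# Hypothetical helper: return every sub-pattern that occurs in `word`
# as a literal substring (fixed = TRUE disables regex interpretation)
find_all <- function(word) {
  found <- vapply(pat_vec, grepl, logical(1), x = word, fixed = TRUE)
  pat_vec[found]
}

# Collapse each vector of matches into a single comma-separated string
df$simple <- vapply(df$Word, function(w) toString(find_all(w)),
                    character(1), USE.NAMES = FALSE)

df
#     Word          simple
# 1   hope   hope, e, o, h
# 2 freeze freeze, free, e
# 3   free         free, e
```

The matches come back in the order in which the sub-patterns appear in pat_vec, not in the order in which they occur in the word.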