r dplyr filter tidyverse stringr

str_detect for multiple patterns within each row


Is there a way to filter rows based on a match on two strings? For example, I want to get all the rows whose name contains both won and le.

df <- data.frame(name = c("Cathy le", "won Xion le", "Matt le won", "stephen won", "James Matt"),
                 value = 5:9)

name    value
<chr>   <int>
Cathy le    5
won Xion le 6
Matt le won 7
stephen won 8
James Matt  9

The output that I am looking for is;

name    value
<chr>   <int>
won Xion le 6
Matt le won 7

If I try df %>% filter(str_detect(name, "won|le")), the result is as follows, because | performs an OR:

name    value
<chr>   <int>
Cathy le    5
won Xion le 6
Matt le won 7
stephen won 8

What I am looking for is something like "won&&le". Can I achieve this using str_detect?


Solution

  • Here are a few different ways of doing it:

    library(dplyr)    # filter()
    library(stringr)  # str_detect()

    filter(df, str_detect(name, "won"), str_detect(name, "le"))  # multiple str_detect calls; filter() ANDs its conditions
    filter(df, str_detect(name, "(?=.*won)(?=.*le)"))            # lookaheads
    filter(df, str_detect(name, "won.*le|le.*won"))              # either order (Jared's first answer)
    filter(df, str_detect(name, "won") & str_detect(name, "le")) # explicit &, similar to the first approach
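
    As a quick check, the sketch below runs the four approaches on the sample data from the question's table and compares the results. The object names (both_args, lookahead, either_order, and_op) are only illustrative, and the data frame is rebuilt from the printed table, so treat it as an assumption about the original data.

    ```r
    library(dplyr)
    library(stringr)

    # Sample data rebuilt from the table shown in the question (assumed)
    df <- data.frame(
      name  = c("Cathy le", "won Xion le", "Matt le won", "stephen won", "James Matt"),
      value = 5:9,
      stringsAsFactors = FALSE
    )

    # Four ways to require both "won" and "le" in the same string
    both_args    <- filter(df, str_detect(name, "won"), str_detect(name, "le"))
    lookahead    <- filter(df, str_detect(name, "(?=.*won)(?=.*le)"))
    either_order <- filter(df, str_detect(name, "won.*le|le.*won"))
    and_op       <- filter(df, str_detect(name, "won") & str_detect(name, "le"))

    # Each approach should keep the same rows: "won Xion le" and "Matt le won"
    print(both_args)
    ```

    The alternation pattern needs one branch per ordering, so it grows quickly with more than two words; the separate str_detect calls and the lookaheads each add one term per extra word.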
    

    To match whole words only, and not strings that are part of larger words, you can add a word boundary (\b) on either side of each word you're looking for, as Jared commented. Inside an R string the backslash has to be escaped, so it is written as "\\b", e.g.:

    filter(df, str_detect(name, "(?=.*\\bwon\\b)(?=.*\\ble\\b)"))
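
    To see what the word boundaries change, here is a small sketch on hypothetical data (df2, loose, and strict are made-up names, not from the question) in which "le" appears only as the start of a longer word in one row:

    ```r
    library(dplyr)
    library(stringr)

    # Hypothetical data: "leuig" contains "le" only as part of a longer word
    df2 <- data.frame(
      name  = c("won Xion le", "won Xion leuig"),
      value = 1:2,
      stringsAsFactors = FALSE
    )

    # Without boundaries, "le" inside "leuig" counts as a match
    loose <- filter(df2, str_detect(name, "(?=.*won)(?=.*le)"))

    # With \\b on both sides, only the standalone word "le" counts
    strict <- filter(df2, str_detect(name, "(?=.*\\bwon\\b)(?=.*\\ble\\b)"))

    print(loose$name)   # both rows
    print(strict$name)  # only "won Xion le"
    ```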