r, regex, string, tidyverse

Find the first matching word from a vector in a string column


I need to know which of the words in a vector comes first in a string. I need to run this code on a large data frame with millions of records.

Here is my sample data, df:

    df <- data.frame(ID = c(1, 2, 3),
                     Text = c("A basket of fruits having apples, green bananas, and peaches",
                              "A basket of fruits having green bananas, apples, and peaches",
                              "A basket of fruits having peaches, green bananas, and apples"))

The words I am looking to match are in a vector:

    vec <- c("green bananas", "apples", "peaches")

I want a result column holding the first matching word for each record, like this:

    df$Result
    "apples", "green bananas", "peaches"

Solution

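  • You can use regmatches + regexpr as below. The pattern joins the words with "|" (green bananas|apples|peaches), and the regex engine reports the match that starts earliest in each string, which is the word that comes first. Note that regmatches drops elements with no match, so this form assumes every row contains at least one of the words.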

    transform(
        df,
        Result = regmatches(Text, regexpr(paste0(vec, collapse = "|"), Text))
    )
    

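    or str_extract from stringr (with dplyr for mutate), which returns NA for rows with no match

    library(dplyr)
    library(stringr)

    df %>%
        mutate(Result = str_extract(Text, paste0(vec, collapse = "|")))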
    

    which gives

      ID                                                         Text        Result
    1  1 A basket of fruits having apples, green bananas, and peaches        apples
    2  2 A basket of fruits having green bananas, apples, and peaches green bananas
    3  3 A basket of fruits having peaches, green bananas, and apples       peaches
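
If some rows might contain none of the words, or the words might contain regex metacharacters such as "." or "(", the base R approach can be adapted along the following lines. This is only a sketch: `first_match` is a hypothetical helper name, not part of the original answer. It assumes PCRE matching (`perl = TRUE`) so that each word can be quoted literally with `\Q...\E`, and it assumes no word itself contains `\E`.

```r
# Hypothetical helper (not from the original answer): for each element of
# `text`, return the word from `words` that starts earliest, or NA if none occurs.
first_match <- function(text, words) {
  # Quote each word literally with \Q...\E (PCRE), then join with "|".
  pattern <- paste0("\\Q", words, "\\E", collapse = "|")

  m <- regexpr(pattern, text, perl = TRUE)   # -1 where nothing matches
  hit <- !is.na(m) & m > 0                   # rows that do have a match

  out <- rep(NA_character_, length(text))
  out[hit] <- regmatches(text, m)            # regmatches() returns matches only
  out
}

vec <- c("green bananas", "apples", "peaches")
texts <- c("A basket of fruits having apples, green bananas, and peaches",
           "A basket of fruits having green bananas, apples, and peaches",
           "A basket of fruits having peaches, green bananas, and apples",
           "An empty basket")

res <- first_match(texts, vec)
print(res)
# "apples" "green bananas" "peaches" NA
```

With a data frame this would be used as `df$Result <- first_match(df$Text, vec)`. One design point to keep in mind: when two words can start at the same position (for example "apple" and "apples"), PCRE takes whichever is listed first in the pattern, so longer words should come first in `vec` if the longer match is wanted.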