Find documents that include one of a list of words in R

I have two dataframes: msnbc contains a column of news transcripts called text and dictionary contains a column of words called search. I want to return a new dataframe that includes all rows of msnbc where the text field contains one or more words from the search column. Toy data:

msnbc <- data.frame(id=c(1,2,3), text=c("hello world", "goodbye world","hello friends"))
dictionary <- data.frame(search=c("hello","lorem","ipsum","dolor"))

The new dataset should include the first and third rows of msnbc because their text contains one of the words from dictionary$search.

My first thought was to use str_detect, but I could not find a way to pass a vector of strings as the pattern so that any one of them counts as a match. My other idea was to use filter somehow, but I'm not sure how to implement it:

new_msnbc <- msnbc %>%
    filter(dictionary$search %in% text)

But this doesn't work as intended. What is the best way to do this? Bonus points for a tidyverse solution.


  • It appears you can do this with filter and grepl:

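    library(dplyr)

    # Collapse the search terms into one regex ("hello|lorem|ipsum|dolor")
    # and keep the rows whose text matches any of them
    result <- msnbc %>%
      filter(grepl(paste(dictionary$search, collapse="|"), text))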
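
One caveat with a plain "|" pattern is that it matches substrings, so "hello" would also match inside a longer word such as "othello". The following is a minimal sketch of one way to restrict matches to whole words, written in base R so it runs without extra packages. It assumes the search terms contain no regex metacharacters; the word-boundary anchors and the case-insensitive matching are additions for illustration, not something the question asked for.

```r
# Toy data from the question
msnbc <- data.frame(id = c(1, 2, 3),
                    text = c("hello world", "goodbye world", "hello friends"))
dictionary <- data.frame(search = c("hello", "lorem", "ipsum", "dolor"))

# Build a single alternation pattern wrapped in word boundaries (\b),
# so each term only matches as a whole word.
# Assumption: the search terms are plain words without regex metacharacters.
pattern <- paste0("\\b(", paste(dictionary$search, collapse = "|"), ")\\b")

# Keep the rows whose text matches at least one search term,
# ignoring case (an assumption; drop ignore.case if case matters).
result <- msnbc[grepl(pattern, msnbc$text, ignore.case = TRUE, perl = TRUE), ]

print(result)
```

The same pattern should drop into the dplyr version as filter(grepl(pattern, text, perl = TRUE)), or into stringr as filter(str_detect(text, pattern)).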