Check if a string contains at least n words out of a list of words in R


I only found solutions in Python / Java for this question.

I have a data.frame with press articles and the corresponding dates. I also have a list of keywords that I want to check each article against.

df <- data.frame(c("2015-05-06", "2015-05-07", "2015-05-08", "2015-05-09"), 
                 c("Articel does not contain a key word", "Articel does contain the key word revenue", "Articel does contain two keywords revenue and margin","Articel does not contain the key word margin"))
colnames(df) <- c("date","article")

key.words <- c("revenue", "margin", "among others")

I came up with a solution that works if I only want to check whether at least one of the keywords appears in an article:

article.containing.keyword <- dplyr::filter(df, grepl(paste(key.words, collapse="|"), article))

This works well, but what I am actually looking for is a solution where I can set a threshold along the lines of "an article must contain at least n keywords to pass the filter". For example, with n = 2, an article must contain at least two keywords to be selected. The desired output would look like this:

  date       article
3 2015-05-08 Articel does contain two keywords revenue and margin

Solution

  • You could use stringr::str_count, which counts the keyword matches in each article:

    stringr::str_count(df$article, paste(key.words, collapse="|"))
    [1] 0 1 2 1
    

    This translates to a filter as follows:

    article.containing.keyword <- dplyr::filter(df, stringr::str_count(article, paste(key.words, collapse="|")) >= 2)
            date                                              article
    1 2015-05-08 Articel does contain two keywords revenue and margin
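
One caveat with the approach above: str_count counts every match, so an article that mentions "revenue" twice would also reach a count of 2. If the threshold should apply to the number of distinct keywords, a base R sketch along the following lines might help. Note that count_distinct_keywords is a hypothetical helper name, not part of any package, and the sketch assumes the keywords should be matched literally (fixed = TRUE) rather than as regular expressions.

```r
# Hypothetical helper (not from the original answer): counts how many
# *distinct* keywords occur in each text, so repeated mentions of the
# same keyword are only counted once.
count_distinct_keywords <- function(texts, keywords) {
  # One logical vector per keyword; fixed = TRUE matches the keyword literally.
  hits <- vapply(keywords,
                 function(k) grepl(k, texts, fixed = TRUE),
                 logical(length(texts)))
  # vapply returns a plain vector for a single text, so restore the
  # texts-by-keywords matrix shape before summing across keywords.
  hits <- matrix(hits, nrow = length(texts))
  rowSums(hits)
}

# Same example data as in the question.
df <- data.frame(
  date = c("2015-05-06", "2015-05-07", "2015-05-08", "2015-05-09"),
  article = c("Articel does not contain a key word",
              "Articel does contain the key word revenue",
              "Articel does contain two keywords revenue and margin",
              "Articel does not contain the key word margin"),
  stringsAsFactors = FALSE
)
key.words <- c("revenue", "margin", "among others")

n <- 2  # threshold: minimum number of distinct keywords
article.containing.keyword <- df[count_distinct_keywords(df$article, key.words) >= n, ]
```

On the example data this selects the same single article as the str_count filter. The two approaches differ only when a keyword repeats: for a text such as "revenue up, revenue down", the helper returns 1, whereas str_count with the combined pattern returns 2.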