I only found solutions in Python / Java for this question.
I have a data.frame with press articles and the corresponding dates. I further have a list of keywords that I want to check each article for.
df <- data.frame(c("2015-05-06", "2015-05-07", "2015-05-08", "2015-05-09"),
c("Articel does not contain a key word", "Articel does contain the key word revenue", "Articel does contain two keywords revenue and margin","Articel does not contain the key word margin"))
colnames(df) <- c("date","article")
key.words <- c("revenue", "margin", "among others")
I came up with a solution that works if I only want to check whether at least one of the keywords appears in an article:
article.containing.keyword <- dplyr::filter(df, grepl(paste(key.words, collapse="|"), article))
This works well, but what I am actually looking for is a solution where I can set a threshold, along the lines of "an article must contain at least n keywords to pass the filter". For example, an article must contain at least n = 2 keywords to be selected. The desired output would look like this:
date article
3 2015-05-08 Articel does contain two keywords revenue and margin
You could use stringr::str_count:
stringr::str_count(df$article, paste(key.words, collapse="|"))
[1] 0 1 2 1
This translates to a filter as follows:
article.containing.keyword <- dplyr::filter(df, stringr::str_count(article, paste(key.words, collapse="|")) >= 2)
date article
1 2015-05-08 Articel does contain two keywords revenue and margin
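One caveat: str_count counts matches of the combined pattern, not distinct keywords, so an article that mentions "revenue" twice would also reach a count of 2. If you need at least n different keywords, here is a minimal base-R sketch. It assumes the keywords should be matched as literal substrings (hence fixed = TRUE), and the names n.distinct and result are only illustrative, not from the question.

```r
# Same example data as in the question.
df <- data.frame(
  date = c("2015-05-06", "2015-05-07", "2015-05-08", "2015-05-09"),
  article = c("Articel does not contain a key word",
              "Articel does contain the key word revenue",
              "Articel does contain two keywords revenue and margin",
              "Articel does not contain the key word margin"),
  stringsAsFactors = FALSE
)
key.words <- c("revenue", "margin", "among others")

# For each keyword, test every article once (TRUE/FALSE), then add the
# logical vectors up: the sum is the number of distinct keywords found.
# Assumption: keywords are literal substrings, not regular expressions.
n.distinct <- Reduce(`+`, lapply(key.words, function(k) {
  grepl(k, df$article, fixed = TRUE)
}))

# Keep the articles that contain at least 2 different keywords.
result <- df[n.distinct >= 2, ]
print(result)
```

With the example data this keeps the same single article as the str_count version. The two approaches only differ once a keyword is repeated within an article.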