I have a list of phrases, and a list of the most frequent terms found in those phrases. I want to filter the original list, keeping only strings that contain one of the terms from my second list.
Here is what I have so far:
#Load tm; set data source, format for use, check consistency
library(tm)
MyData <- c('Create company email', 'email for business', 'free trial', 'corporate pricing', 'email cost')
#Create corpus from the character vector
corpus <- Corpus(VectorSource(MyData))
#Clean corpus
cleanset1 <- tm_map(corpus, content_transformer(tolower))
cleanset2 <- tm_map(cleanset1, removeNumbers)
cleanset3 <- tm_map(cleanset2, removeWords, stopwords('english'))
cleanset4 <- tm_map(cleanset3, removePunctuation)
#Convert to Term Document Matrix
tdm <- TermDocumentMatrix(cleanset4)
#Find Freq
freqterms <- as.list(findFreqTerms(tdm, 20))
At this point I have a list of the most frequent terms (found with the tm package) and my original list. What would be the best way to remove every value from the original list that doesn't contain at least one of the terms in freqterms?
Would something along the lines of
filtered <- MyData[!(MyData %in% freqterms)]
work?
If I am understanding your data structure correctly, freqterms is a list where each element is just a term. If so, it may be easier to convert freqterms to a vector.
freqterms <- unlist(freqterms)
You will likely need grep (or grepl) to look for your frequent terms in your data. %in% only tests whether a whole element is exactly equal to one of the terms, so it will not find a term inside a longer phrase.
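To make the difference concrete, here is a small sketch. The vectors phrases and terms are made-up illustrations, not your actual data.

```r
# Illustrative data only; 'phrases' and 'terms' are hypothetical names.
phrases <- c("create company email", "free trial", "email")
terms <- c("email", "pricing")

# %in% compares whole elements, so only the phrase that is
# exactly "email" counts as a match.
exact <- phrases %in% terms

# grepl looks for the term anywhere inside each phrase.
partial <- grepl("email", phrases, fixed = TRUE)
```

Here exact is TRUE only for the third element, while partial is TRUE for the first and third.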
You first need to combine freqterms into a single regular expression.
freqterms.regex <- paste0("(", paste0(freqterms, collapse="|"), ")")
This will put your frequent terms in the format "(term1|term2|term3|...)". You can then use this as the pattern with grepl to keep only the entries in MyData that have a match.
matches <- MyData[grepl(MyData, pattern=freqterms.regex)]
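You may need to make the regex more stringent depending on what your MyData and freqterms look like. As written, the pattern matches a term anywhere inside a longer word, and the match is case-sensitive, so lower-cased terms from tm will not match capitalised words in the original phrases.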
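One possible way to tighten it, as a rough sketch rather than the definitive fix: wrap the alternation in word boundaries and ignore case. The sample vectors and the name strict.regex are assumptions for illustration, and perl = TRUE is just my choice of regex engine for \\b.

```r
# Illustrative inputs; in the question these would come from
# MyData and findFreqTerms().
MyData <- c("Create company email", "Email cost", "free trial",
            "corporate pricing", "emailing tips")
freqterms <- c("email", "pricing")  # hypothetical frequent terms

# Loose pattern from above: case-sensitive, matches inside longer words.
freqterms.regex <- paste0("(", paste0(freqterms, collapse = "|"), ")")
loose <- MyData[grepl(freqterms.regex, MyData)]

# \\b requires a word boundary on both sides of the term, so "email"
# no longer matches inside "emailing".
strict.regex <- paste0("\\b(", paste0(freqterms, collapse = "|"), ")\\b")

# ignore.case = TRUE lets lower-cased terms match "Email".
strict <- MyData[grepl(strict.regex, MyData, ignore.case = TRUE, perl = TRUE)]
```

With this sample, loose keeps "emailing tips" but misses "Email cost", while strict does the opposite. If your terms could contain regex metacharacters, they would need escaping before being pasted together.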