Tags: r, text-mining, tm, word-frequency, qdap

Matching a list of phrases to a corpus of documents and returning phrase frequency


I have a list of phrases and a corpus of documents. There are 100k+ phrases and 60k+ documents in the corpus. A given phrase may or may not be present in the corpus. I want to find the term frequency of each phrase that is present in the corpus.

An example dataset:

Phrases <- c("just starting", "several kilometers", "brief stroll", "gradually boost", "5 miles", "dark night", "cold morning")
Doc1 <- "If you're just starting with workout, begin slow."
Doc2 <- "Don't jump in brain initial and then try to operate several kilometers without the need of worked out well before."
Doc3 <- "It is possible to end up injuring on your own and carrying out more damage than good."
Doc4 <- "Instead start with a brief stroll and gradually boost the duration along with the speed."
Doc5 <- "Before you know it you'll be working 5 miles without any problems."

I am new to text analytics in R and have approached this problem along the lines of Tyler Rinker's solution to the question "R Text Mining: Counting the number of times a specific word appears in a corpus?".

Here's my approach so far:

library(tm)
library(qdap)
Docs <- c(Doc1, Doc2, Doc3, Doc4, Doc5)
text <- removeWords(Docs, stopwords("english"))
text <- removePunctuation(text)
text <- tolower(text)
corp <- Corpus(VectorSource(text))
Phrases <- tolower(Phrases)
word.freq <- apply_as_df(corp, termco_d, match.string=Phrases)
mcsv_w(word.freq, dir = NULL, open = T, sep = ", ", dataframes = NULL,
        pos = 1, envir = as.environment(pos))

When I export the results to CSV, the output only tells me whether phrase 1 is present in any of the docs.

I'm expecting output like the following (excluding the non-matching phrases):

Docs      Phrase1     Phrase2    Phrase3    Phrase4    Phrase5
1         0           1          2          0          0
2         1           0          0          1          0

Solution

  • I tried your approach and can't reproduce the problem.

    Using:

    library(tm)
    library(qdap)
    Docs <- c(Doc1, Doc2, Doc3, Doc4, Doc5)
    text <- removeWords(Docs, stopwords("english"))
    text <- removePunctuation(text)
    text <- tolower(text)
    corp <- Corpus(VectorSource(text))
    Phrases <- tolower(Phrases)
    word.freq <- apply_as_df(corp, termco_d, match.string = Phrases)
    mcsv_w(word.freq, dir = NULL, open = T, sep = ", ", dataframes = NULL,
            pos = 1, envir = as.environment(pos))
    

    I get the following CSV:

    docs  word.count  term(just starting)  term(several kilometers)  term(brief stroll)  term(gradually boost)  term(5 miles)  term(dark night)  term(cold morning)
    1     7           1                    0                         0                   0                      0              0                 0
    2     12          0                    1                         0                   0                      0              0                 0
    3     7           0                    0                         0                   0                      0              0                 0
    4     9           0                    0                         1                   1                      0              0                 0
    5     7           0                    0                         0                   0                      0              0                 0
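
For comparison, here is a minimal base R sketch of the per-document phrase counts the question asks for, with phrases that never match dropped from the result. It is not the qdap approach used above. `count_phrases` is a hypothetical helper, it matches each phrase as a literal substring of the lowercased text, and it skips the stopword and punctuation removal, so its counts may differ from the `termco_d` output. It is also not tuned for 100k+ phrases.

```r
# Example data from the question
Phrases <- c("just starting", "several kilometers", "brief stroll",
             "gradually boost", "5 miles", "dark night", "cold morning")
Docs <- c(
  "If you're just starting with workout, begin slow.",
  "Don't jump in brain initial and then try to operate several kilometers without the need of worked out well before.",
  "It is possible to end up injuring on your own and carrying out more damage than good.",
  "Instead start with a brief stroll and gradually boost the duration along with the speed.",
  "Before you know it you'll be working 5 miles without any problems."
)

# Hypothetical helper (not part of tm or qdap): counts non-overlapping
# literal occurrences of each phrase in each document, case-insensitively.
count_phrases <- function(docs, phrases) {
  docs <- tolower(docs)
  phrases <- tolower(phrases)

  # One column per phrase, one row per document
  counts <- vapply(phrases, function(p) {
    hits <- gregexpr(p, docs, fixed = TRUE)  # -1 marks "no match"
    vapply(hits, function(h) sum(h > 0L), integer(1))
  }, integer(length(docs)))

  # Force a matrix even when there is only one document
  counts <- matrix(counts, nrow = length(docs),
                   dimnames = list(as.character(seq_along(docs)), phrases))

  # Keep only phrases that occur at least once in the corpus
  counts[, colSums(counts) > 0, drop = FALSE]
}

result <- count_phrases(Docs, Phrases)
print(result)
```

Because matching is by substring, a phrase such as "5 miles" would also be counted inside "15 miles". Whether that matters depends on your data, so treat the matching rule here as an assumption to check.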