
Find frequency of specific words for individual documents in corpus - R, TermDocumentMatrix, TM


For a research project, I have read PDF documents into R and created a corpus and a TermDocumentMatrix. I want to check the frequency of specific words in each document in my corpus. The code below gives me the kind of matrix I want, with word frequencies by document, but it only covers high-frequency terms (those occurring at least 100 times), not terms that I choose.

# opinions.tdm is my TermDocumentMatrix
ft <- findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)
as.matrix(opinions.tdm[ft, ])

I found the code below in another comment. It lets me look up the frequency of specific terms, but it sums the counts across all documents. How do I adapt it so that I get the frequency of specific terms within each document rather than across the whole corpus?

library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))


tdm <- TermDocumentMatrix(crude)

# Turn the tdm into a dense matrix and create a frequency vector.
freq <- rowSums(as.matrix(tdm))
freq["crude"]
# crude
#    21
freq["oil"]
# oil
#  85

Solution

  • Skip the rowSums step and index the matrix directly by term. Each row holds one term's counts, with one column per document.

    term_matrix <- as.matrix(tdm)
    term_matrix["crude",]
    # 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 
    #   2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0 
    # 502 543 704 708 
    #   0   2   0   1 
    term_matrix["oil",]
    # 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 
    #   5  12   2   1   1   7   3   3   5   9   5   4   5   4   3   4 
    # 502 543 704 708 
    #   5   3   3   1
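
To look up several terms at once, the same row indexing can take a character vector. The following is a minimal sketch on the `crude` data, not a definitive recipe: the names `wanted`, `present`, and `per_doc` are illustrative, and `"notaword"` is a made-up term included only to show what happens when a term is missing from the matrix.

```r
library(tm)

data("crude")
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
tdm <- TermDocumentMatrix(crude)

# Terms of interest. "notaword" is a hypothetical term that does not occur
# in the corpus; indexing by it directly would fail with
# "subscript out of bounds", so it is filtered out first.
wanted <- c("crude", "oil", "notaword")

term_matrix <- as.matrix(tdm)
present <- intersect(wanted, rownames(term_matrix))

# drop = FALSE keeps the result a matrix even when only one term matches.
per_doc <- term_matrix[present, , drop = FALSE]
per_doc
```

The result has one row per matched term and one column per document. Note that `as.matrix()` builds a dense matrix, so for a large corpus it may be cheaper to subset the sparse object first, as in `as.matrix(tdm[present, ])`, which is the same indexing pattern used with `findFreqTerms` in the question.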