Find frequency of a custom word in R TermDocumentMatrix using TM package


I turned about 50,000 rows of varchar data into a corpus and then cleaned it with the tm package, getting rid of stopwords, punctuation, and numbers.

I then turned it into a TermDocumentMatrix and used the functions findFreqTerms and findMostFreqTerms to run text analysis. findMostFreqTerms returns the most common words and the number of times each one shows up in the data.

However, I want a function that searches for a given "word" and returns how many times that word appears in the TermDocumentMatrix.

Is there a function in tm that achieves this? Or do I have to convert my data to a data.frame and use a different package and function?


Solution

  • Since you have not given a reproducible example, I will give one using the crude dataset available in the tm package.

    You can do it in (at least) two different ways. Anything that turns a sparse matrix into a dense matrix can use a lot of memory, so I will give you two options. The first is more memory friendly because it works directly on the sparse tdm. The second first transforms the tdm into a dense matrix and then creates a frequency vector.

    library(tm)
    data("crude")
    crude <- as.VCorpus(crude)
    crude <- tm_map(crude, stripWhitespace)
    crude <- tm_map(crude, removePunctuation)
    crude <- tm_map(crude, content_transformer(tolower))
    crude <- tm_map(crude, removeWords, stopwords("english"))
    
    
    tdm <- TermDocumentMatrix(crude)
    
    # Option 1: a tdm (or dtm) is a simple_triplet_matrix from the slam package,
    # so the row for a single term can be summed without creating a dense matrix.
    my_func <- function(data, word){
      slam::row_sums(data[data$dimnames$Terms == word, ])
    }
    
    my_func(tdm, "crude")
    # crude 
    #    21 
    my_func(tdm, "oil")
    # oil 
    #  85 
    
    # Option 2: turn the tdm into a dense matrix and create a named frequency vector.
    # This can use a lot of memory on a large corpus.
    freq <- rowSums(as.matrix(tdm))
    freq["crude"]
    # crude 
    #    21 
    freq["oil"]
    # oil 
    #  85 
    

    Edit, as requested in a comment:

    # All words starting with "cru". Adjust the regex to find what you need.
    freq[grep("^cru", names(freq))]
    # crucial   crude 
    #       2      21 
    
    # Several specific words at once
    freq[c("crude", "oil")]
    # crude   oil 
    #    21    85 
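
    For several words at once without building the dense matrix, the sparse approach can be wrapped in a small helper. This is only a sketch: word_counts is a hypothetical name, not a function from tm, and it assumes that reporting 0 for a word missing from the tdm is the behaviour you want. It relies only on Terms() from tm and row_sums() from slam.

```r
library(tm)

# Hypothetical helper (not part of tm): total count per requested word,
# computed on the sparse tdm. Words absent from the tdm are reported as 0.
word_counts <- function(tdm, words) {
  words  <- unique(words)
  counts <- setNames(numeric(length(words)), words)
  # Row positions of the requested words; NA when a word is not a term
  idx <- match(words, Terms(tdm))
  idx <- idx[!is.na(idx)]
  if (length(idx) > 0) {
    sums <- slam::row_sums(tdm[idx, ])
    counts[names(sums)] <- sums
  }
  counts
}

# Same preprocessing as in the example above
data("crude")
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(crude)

res <- word_counts(tdm, c("crude", "oil", "notaword_xyz"))
print(res)
```

    Only the rows for the requested words are selected before summing, so memory use stays low even when the full tdm is large.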