
row_sums vs findFreqTerms for subsetting TermDocMatrix to include words with a given min frequency


My question is straightforward. I have a (binary) TDM and I want to reduce the number of rows so that it includes only those terms that appear in at least two documents.

I thought that these two methods would produce the same result in a binary matrix:

> rowTotals = row_sums(tdm)
> dtm2 <- tdm[which(rowTotals > 2),]
> dtm2
<<TermDocumentMatrix (terms: 208361, documents: 763717)>>
Non-/sparse entries: 34812736/159094025101
Sparsity           : 100%
Maximal term length: 154
Weighting          : binary (bin)

> #alternative probably faster:
> atleast2 <- findFreqTerms(tdm, lowfreq = 2)
> dtm2 <- tdm[atleast2,]
> dtm2
<<TermDocumentMatrix (terms: 340436, documents: 763717)>>
Non-/sparse entries: 35076886/259961683726
Sparsity           : 100%
Maximal term length: 308
Weighting          : binary (bin)

Yet they do not. Could you help me figure out why?


Solution

  • They produce exactly the same result once the selection criteria match. The mistake is in your second snippet: findFreqTerms(tdm, lowfreq = 2) keeps terms with a frequency of 2 or more, while rowTotals > 2 in the first snippet keeps only terms with a frequency of 3 or more. If you make sure both selection criteria are the same, you will see that they produce the same result. See the code example below, followed by a speed comparison.

    library(tm)
    
    data("crude")
    crude <- as.VCorpus(crude)
    crude <- tm_map(crude, stripWhitespace)
    crude <- tm_map(crude, content_transformer(tolower))
    crude <- tm_map(crude, removePunctuation)
    crude <- tm_map(crude, removeNumbers)
    crude <- tm_map(crude, removeWords, stopwords("english"))
    
    tdm <- TermDocumentMatrix(crude)
    
    # via row_totals
    row_totals <- slam::row_sums(tdm)
    dtm_via_rowtotals <- tdm[which(row_totals > 2), ]
    dtm_via_rowtotals
    
    <<TermDocumentMatrix (terms: 237, documents: 20)>>
    Non-/sparse entries: 864/3876
    Sparsity           : 82%
    Maximal term length: 13
    Weighting          : term frequency (tf)
    
    # via findFreqTerms
    freq_terms <- findFreqTerms(tdm, lowfreq = 3)
    dtm_via_freq_terms <- tdm[freq_terms, ]
    dtm_via_freq_terms
    
    <<TermDocumentMatrix (terms: 237, documents: 20)>>
    Non-/sparse entries: 864/3876
    Sparsity           : 82%
    Maximal term length: 13
    Weighting          : term frequency (tf)
    

    Are they the same?

    all.equal(dtm_via_rowtotals, dtm_via_freq_terms)
    [1] TRUE
    

    Speed:

    microbenchmark::microbenchmark(row_totals = {rowtotals <- slam::row_sums(tdm); dtm_via_rowtotals <- tdm[which(rowtotals > 2),]},
                                   freq_terms = {freq_terms <- findFreqTerms(tdm, lowfreq = 3); dtm_via_freq_terms <- tdm[freq_terms, ]},
                                   times = 1000L)
    
    Unit: milliseconds
           expr    min     lq     mean median      uq     max neval
     row_totals 1.5039 1.6347 1.885161 1.7106 1.86085  9.3405  1000
     freq_terms 1.5696 1.6895 2.039345 1.7760 1.93525 99.0942  1000
    

    The selection via row_totals is slightly faster (median 1.71 ms vs 1.78 ms), but only because findFreqTerms itself uses row_sums to get the frequencies and has a few extra lines of code that check whether you passed it a document term matrix and whether the frequencies you requested are actual numbers.
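
    To make the off-by-one concrete, here is a minimal base R sketch on a toy binary matrix. It is not tm's actual code: find_freq_terms_sketch is a hypothetical stand-in that assumes, consistent with the example above, that findFreqTerms keeps terms whose row sum lies between lowfreq and highfreq inclusive. The toy matrix and its term names are made up for illustration.

    ```r
    # Toy binary term-document matrix: rows are terms, columns are documents.
    tdm_toy <- matrix(
      c(1, 0, 0, 0,   # "alpha" appears in 1 document
        1, 1, 0, 0,   # "beta"  appears in 2 documents
        1, 1, 1, 0,   # "gamma" appears in 3 documents
        1, 1, 1, 1),  # "delta" appears in 4 documents
      nrow = 4, byrow = TRUE,
      dimnames = list(Terms = c("alpha", "beta", "gamma", "delta"),
                      Docs  = paste0("doc", 1:4))
    )

    # Hypothetical stand-in for findFreqTerms (assumption: inclusive bounds
    # applied to the row sums, plus a basic check that the bounds are numeric).
    find_freq_terms_sketch <- function(m, lowfreq = 0, highfreq = Inf) {
      stopifnot(is.numeric(lowfreq), is.numeric(highfreq))
      totals <- rowSums(m)
      names(totals)[totals >= lowfreq & totals <= highfreq]
    }

    # Strict comparison: drops terms that appear in exactly 2 documents.
    strict_gt_2 <- rownames(tdm_toy)[rowSums(tdm_toy) > 2]

    # Inclusive comparison: keeps terms that appear in 2 or more documents.
    inclusive_ge_2 <- rownames(tdm_toy)[rowSums(tdm_toy) >= 2]

    # Inclusive lower bound of 2, as in the second snippet of the question.
    via_sketch_2 <- find_freq_terms_sketch(tdm_toy, lowfreq = 2)

    print(strict_gt_2)
    print(inclusive_ge_2)
    print(identical(inclusive_ge_2, via_sketch_2))
    ```

    For the original goal (terms that appear in at least two documents), the matching row-sum condition is therefore >= 2 rather than > 2.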