r bigdata sparse-matrix tm term-document-matrix

row_sums vs findFreqTerms for subsetting TermDocMatrix to include words with a given min frequency

My question is straightforward. I have a (binary) TDM and I want to reduce the number of rows so that it includes only those terms that appear in at least two documents.

I thought that these two methods would produce the same result in a binary matrix:

> rowTotals = row_sums(tdm)
> dtm2 <- tdm[which(rowTotals > 2),]
> dtm2
<<TermDocumentMatrix (terms: 208361, documents: 763717)>>
Non-/sparse entries: 34812736/159094025101
Sparsity           : 100%
Maximal term length: 154
Weighting          : binary (bin)

> #alternative probably faster:
> atleast2 <- findFreqTerms(tdm, lowfreq = 2)
> dtm2 <- tdm[atleast2,]
> dtm2
<<TermDocumentMatrix (terms: 340436, documents: 763717)>>
Non-/sparse entries: 35076886/259961683726
Sparsity           : 100%
Maximal term length: 308
Weighting          : binary (bin)

Yet they do not. Could you help me figure out why?

Solution

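Both methods produce exactly the same result once the thresholds match. In your code they do not: rowTotals > 2 keeps the terms with a frequency of 3 or more, while findFreqTerms(tdm, lowfreq = 2) keeps the terms with a frequency of 2 or more. Since you want terms that appear in at least two documents, the first filter is the one that is off by one; it should be rowTotals >= 2 (or rowTotals > 1). If you make sure both selection criteria are the same, you will see that they return the same matrix. See the code example below, which uses a threshold of 3 in both methods, followed by a speed comparison.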

library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, removeWords, stopwords("english"))

tdm <- TermDocumentMatrix(crude)

# via row_totals
row_totals <- slam::row_sums(tdm)
dtm_via_rowtotals <- tdm[which(row_totals > 2), ]
dtm_via_rowtotals

<<TermDocumentMatrix (terms: 237, documents: 20)>>
Non-/sparse entries: 864/3876
Sparsity           : 82%
Maximal term length: 13
Weighting          : term frequency (tf)

# via findFreqTerms
freq_terms <- findFreqTerms(tdm, lowfreq = 3)
dtm_via_freq_terms <- tdm[freq_terms, ]
dtm_via_freq_terms

<<TermDocumentMatrix (terms: 237, documents: 20)>>
Non-/sparse entries: 864/3876
Sparsity           : 82%
Maximal term length: 13
Weighting          : term frequency (tf)

Are they the same?

all.equal(dtm_via_rowtotals, dtm_via_freq_terms)
[1] TRUE
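
For the threshold in the question itself (terms in at least two documents), a minimal sketch of the matched filters could look like the following. It assumes the same crude example data rather than the original corpus, and it uses weightBin to mimic the binary weighting from the question; the names tdm_bin, tdm_rowsums and tdm_freq are made up for this illustration.

```r
library(tm)

data("crude")
crude <- as.VCorpus(crude)

# Binary weighting: each cell is 1 if the term occurs in the document,
# so a row sum equals the number of documents containing the term.
tdm_bin <- TermDocumentMatrix(crude, control = list(weighting = weightBin))

# "At least two documents" means >= 2, which is what lowfreq = 2 selects.
row_totals  <- slam::row_sums(tdm_bin)
tdm_rowsums <- tdm_bin[which(row_totals >= 2), ]
tdm_freq    <- tdm_bin[findFreqTerms(tdm_bin, lowfreq = 2), ]

# Compare the terms kept by each selection.
identical(Terms(tdm_rowsums), Terms(tdm_freq))
```

With the comparison written as >= 2, both selections should keep the same set of terms.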

Speed:

microbenchmark::microbenchmark(row_totals = {rowtotals <- slam::row_sums(tdm); dtm_via_rowtotals <- tdm[which(rowtotals > 2),]},
                               freq_terms = {freq_terms <- findFreqTerms(tdm, lowfreq = 3); dtm_via_freq_terms <- tdm[freq_terms, ]},
                               times = 1000L)

Unit: milliseconds
       expr    min     lq     mean median      uq     max neval
 row_totals 1.5039 1.6347 1.885161 1.7106 1.86085  9.3405  1000
 freq_terms 1.5696 1.6895 2.039345 1.7760 1.93525 99.0942  1000

The selection via row_totals is slightly faster, but the difference is negligible. It exists because findFreqTerms itself uses row_sums to get the totals and only adds a few extra lines of code: it checks whether you passed it a document-term matrix or a term-document matrix and whether the frequencies you requested are actual numbers.
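
To make that concrete, here is a rough sketch of the behaviour just described. It is not the tm source code; find_freq_terms_sketch is a hypothetical helper written for this illustration, and it assumes that a document-term matrix is simply transposed so that terms end up on the rows.

```r
library(tm)

# Hypothetical helper, not part of tm: mimics the described behaviour
# of findFreqTerms (input checks plus a row_sums based filter).
find_freq_terms_sketch <- function(x, lowfreq = 0, highfreq = Inf) {
  stopifnot(
    inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
    is.numeric(lowfreq),
    is.numeric(highfreq)
  )
  # Terms must be on the rows, so transpose a document-term matrix.
  if (inherits(x, "DocumentTermMatrix")) x <- t(x)
  totals <- slam::row_sums(x)
  # Keep the names of the terms whose total lies within the bounds.
  names(totals)[totals >= lowfreq & totals <= highfreq]
}

data("crude")
tdm <- TermDocumentMatrix(as.VCorpus(crude))

# Compare the sketch against the real function for the same threshold.
identical(find_freq_terms_sketch(tdm, lowfreq = 3),
          findFreqTerms(tdm, lowfreq = 3))
```

Note that the lower bound is inclusive (>=), which is why lowfreq = 3 corresponds to row_totals > 2 and lowfreq = 2 corresponds to row_totals > 1.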