My question is straightforward. I have a binary TDM and I want to keep only the terms (rows) that appear in at least two documents.
I thought that these two methods would produce the same result on a binary matrix:
> rowTotals = row_sums(tdm)
> dtm2 <- tdm[which(rowTotals > 2),]
> dtm2
<<TermDocumentMatrix (terms: 208361, documents: 763717)>>
Non-/sparse entries: 34812736/159094025101
Sparsity : 100%
Maximal term length: 154
Weighting : binary (bin)
> #alternative probably faster:
> atleast2 <- findFreqTerms(tdm, lowfreq = 2)
> dtm2 <- tdm[atleast2,]
> dtm2
<<TermDocumentMatrix (terms: 340436, documents: 763717)>>
Non-/sparse entries: 35076886/259961683726
Sparsity : 100%
Maximal term length: 308
Weighting : binary (bin)
Yet they do not. Could you help me figure out why?
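They do produce the exact same result once the thresholds match. Your two snippets use different cutoffs: rowTotals > 2 in the first keeps all the terms with a frequency of 3 or more, while lowfreq = 2 in the second keeps all the terms with a frequency of 2 or more. Since you want terms that appear in at least two documents, the first snippet should use rowTotals >= 2. If you make sure both selection criteria are the same, you will see that they produce the same result. See the code example below, which also includes a speed comparison.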
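To make the off-by-one concrete, here is a minimal sketch on a toy binary matrix. It uses a plain base R matrix as a stand-in for the TDM, and the term names and values are made up for illustration:

```r
# Toy binary term-document matrix (terms in rows, documents in columns).
# Term names and values are invented for this illustration.
m <- matrix(c(1, 0, 0,
              1, 1, 0,
              1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("once", "twice", "thrice"),
                            c("d1", "d2", "d3")))

totals <- rowSums(m)

# Strictly more than 2 documents: drops terms found in exactly 2.
strict <- names(totals)[totals > 2]

# At least 2 documents: the cutoff that lowfreq = 2 corresponds to.
at_least_two <- names(totals)[totals >= 2]

print(strict)
print(at_least_two)
```

The term that occurs in exactly two documents is the one the two cutoffs disagree on, which is the kind of term missing from the smaller of your two results.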
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(crude)
# via row_totals
row_totals <- slam::row_sums(tdm)
dtm_via_rowtotals <- tdm[which(row_totals > 2),]
<<TermDocumentMatrix (terms: 237, documents: 20)>>
Non-/sparse entries: 864/3876
Sparsity : 82%
Maximal term length: 13
Weighting : term frequency (tf)
# via findFreqTerms
freq_terms <- findFreqTerms(tdm, lowfreq = 3)
dtm_via_freq_terms <- tdm[freq_terms, ]
<<TermDocumentMatrix (terms: 237, documents: 20)>>
Non-/sparse entries: 864/3876
Sparsity : 82%
Maximal term length: 13
Weighting : term frequency (tf)
Are they the same?
all.equal(dtm_via_rowtotals, dtm_via_freq_terms)
[1] TRUE
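Note that the example above uses term-frequency weighting, so the row sums count occurrences rather than documents. For the original goal (terms that appear in at least two documents), a sketch along these lines should work, assuming the matrix is built with tm's weightBin weighting as in the question. The object names are illustrative only:

```r
library(tm)
data("crude")

# Binary weighting: a cell is 1 if the term occurs in the document, else 0,
# so each row sum is the number of documents containing the term.
tdm_bin <- TermDocumentMatrix(crude, control = list(weighting = weightBin))

# Keep terms found in at least two documents (note >= 2, not > 2).
doc_counts <- slam::row_sums(tdm_bin)
tdm_at_least_2 <- tdm_bin[which(doc_counts >= 2), ]

# The same selection through findFreqTerms.
tdm_at_least_2_alt <- tdm_bin[findFreqTerms(tdm_bin, lowfreq = 2), ]

all.equal(tdm_at_least_2, tdm_at_least_2_alt)
```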
Speed:
microbenchmark::microbenchmark(row_totals = {rowtotals <- slam::row_sums(tdm); dtm_via_rowtotals <- tdm[which(rowtotals > 2),]},
freq_terms = {freq_terms <- findFreqTerms(tdm, lowfreq = 3); dtm_via_freq_terms <- tdm[freq_terms, ]},
times = 1000L)
Unit: milliseconds
expr min lq mean median uq max neval
row_totals 1.5039 1.6347 1.885161 1.7106 1.86085 9.3405 1000
freq_terms 1.5696 1.6895 2.039345 1.7760 1.93525 99.0942 1000
The selection via row_totals is slightly faster, but only because findFreqTerms itself uses row_sums to get the totals and adds a few extra lines of code that check whether you passed it a document-term (or term-document) matrix and whether the frequencies you requested are actual numbers.
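As a rough illustration of that overhead, the function below paraphrases what findFreqTerms does. It is a hypothetical re-implementation written for this answer (the name find_freq_terms_sketch is made up), not the package's actual source, so check the tm code if you need the exact behaviour:

```r
library(tm)

# Hypothetical re-implementation for illustration; not the package source.
find_freq_terms_sketch <- function(x, lowfreq = 0, highfreq = Inf) {
  # Input checks: the extra work compared with calling row_sums directly.
  stopifnot(
    inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")),
    is.numeric(lowfreq),
    is.numeric(highfreq)
  )
  # Put terms in rows so that row sums are per-term totals.
  if (inherits(x, "DocumentTermMatrix")) x <- t(x)
  totals <- slam::row_sums(x)
  # Bounds are inclusive: lowfreq = 3 matches totals > 2 for integer counts.
  names(totals)[totals >= lowfreq & totals <= highfreq]
}

data("crude")
tdm <- TermDocumentMatrix(crude)
sketch_terms <- find_freq_terms_sketch(tdm, lowfreq = 3)
identical(sketch_terms, findFreqTerms(tdm, lowfreq = 3))
```

Because the lower bound is inclusive, lowfreq = n corresponds to row_totals >= n, which is the comparison to use when filtering by hand.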