My document-term matrix contains empty documents, and I need to remove them. This is the code I used to build the DocumentTermMatrix:
tweets_dtm_tfidf <- DocumentTermMatrix(tweet_corpus, control = list(weighting = weightTfIdf))
And this is the warning message I am getting:
Warning message:
In weighting(x) :
empty document(s): 823 3795 4265 7252 7295 7425 8240 8433 9303 12160 12278 14465 15166 15485 15933 20775 21666 21807 26131 27039 34035 34050 34101
I tried removing these empty documents using this code:
rowTotals <- apply(tweets_dtm_tfidf , 1, sum)
dtm_tfidf <- tweets_dtm_tfidf[rowTotals> 0, ]
Here is the error I get when I try to remove them:
> rowTotals <- apply(tweets_dtm_tfidf , 1, sum)
Error: cannot allocate vector of size 6.8 Gb
Any idea on how to go about this? Thanks for any suggestions in advance.
Calling apply with sum converts your sparse matrix into a dense matrix, which eats up a lot of memory when the sparse matrix is big.
The apply function is not needed here, because there are functions written for sparse matrices. Since the dtm is a simple_triplet_matrix, you can use row_sums from slam.
The following should work.
rowTotals <- slam::row_sums(tweets_dtm_tfidf)
dtm_tfidf <- tweets_dtm_tfidf[rowTotals > 0, ]
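To try the idea on something small first, here is a minimal sketch. The three-document toy corpus and the names toy_corpus, toy_dtm and toy_dtm_clean are made up for illustration and are not from your data. It assumes the tm and slam packages are installed.

```r
library(tm)    # VCorpus, VectorSource, DocumentTermMatrix
library(slam)  # row_sums for simple_triplet_matrix objects

# Hypothetical toy data: the second document is deliberately empty.
docs <- c("sparse matrices save memory", "", "memory is limited")
toy_corpus <- VCorpus(VectorSource(docs))

# Build the document-term matrix (plain term frequencies here).
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Row totals computed on the sparse representation, no dense copy.
row_totals <- slam::row_sums(toy_dtm)

# Keep only the documents that contain at least one term.
toy_dtm_clean <- toy_dtm[row_totals > 0, ]

nrow(toy_dtm)        # documents before filtering
nrow(toy_dtm_clean)  # documents after filtering
```

The same two lines (row_sums, then the logical row subset) are what you would run on tweets_dtm_tfidf.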
But remember that anything you do to get your data out of the sparse matrix format might result in a big, memory-hungry object if you have a lot of words. You might want to use removeSparseTerms before moving on.
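As a rough sketch of what removeSparseTerms does, the example below uses the crude corpus that ships with tm. The 0.9 threshold and the names crude_dtm and small_dtm are arbitrary choices for illustration, so pick a value that suits your own data.

```r
library(tm)    # DocumentTermMatrix, removeSparseTerms, crude data set
library(slam)  # row_sums

data("crude")  # small example corpus bundled with tm
crude_dtm <- DocumentTermMatrix(crude)

# Drop terms that are absent from more than 90% of the documents.
# 0.9 is an illustrative threshold, not a recommendation.
small_dtm <- removeSparseTerms(crude_dtm, 0.9)

ncol(crude_dtm)  # number of terms before
ncol(small_dtm)  # number of terms after

# Dropping terms can leave some documents without any terms,
# so filter out empty rows again afterwards.
small_dtm <- small_dtm[slam::row_sums(small_dtm) > 0, ]
```

With fewer columns, a later conversion to a dense matrix or data frame has a much better chance of fitting in memory.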