Tags: r, data-analysis, text-processing, tm

How to remove the empty documents from the Document Term Matrix in R


My document-term matrix contains empty documents, and I need to remove them. This is the code I used to build the DocumentTermMatrix:

 tweets_dtm_tfidf <- DocumentTermMatrix(tweet_corpus, control = list(weighting = weightTfIdf))

And this is the warning message I am getting:

Warning message:
In weighting(x) :
  empty document(s): 823 3795 4265 7252 7295 7425 8240 8433 9303 12160 12278 14465 15166 15485 15933 20775 21666 21807 26131 27039 34035 34050 34101

I tried removing these empty documents using this code:

rowTotals <- apply(tweets_dtm_tfidf , 1, sum)
dtm_tfidf   <- tweets_dtm_tfidf[rowTotals> 0, ]

Here is the error I get when trying to remove them:

> rowTotals <- apply(tweets_dtm_tfidf , 1, sum)

Error: cannot allocate vector of size 6.8 Gb

Any idea on how to go about this? Thanks for any suggestions in advance.


Solution

  • Calling apply with sum converts your sparse matrix into a dense matrix, which uses a lot of memory when the sparse matrix is large.
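
    To give a feel for why the dense copy is so expensive, here is a rough estimate of the memory a dense numeric matrix would need compared with its sparse counterpart. The dimensions below are hypothetical, picked only for illustration; they are not taken from the question.

    ```r
    library(slam)

    # Hypothetical dimensions (an assumption, not the asker's real data)
    n_docs  <- 35000
    n_terms <- 26000

    # A sparse matrix of that shape holding only three non-zero cells
    m <- simple_triplet_matrix(i = c(1, 2, 3), j = c(1, 2, 3), v = c(1, 1, 1),
                               nrow = n_docs, ncol = n_terms)

    # A dense double matrix stores 8 bytes for every cell, zero or not
    dense_gb <- prod(dim(m)) * 8 / 1024^3
    dense_gb

    # The sparse representation only stores the non-zero cells
    format(object.size(m), units = "Kb")
    ```

    With dimensions in this range the dense copy runs to several gigabytes, while the sparse object stays tiny.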

    The apply function is not needed anyway, because there are dedicated functions for sparse matrices. Since a DocumentTermMatrix is a simple_triplet_matrix, you can use row_sums from the slam package.

    The following should work.

    rowTotals <- slam::row_sums(tweets_dtm_tfidf)
    dtm_tfidf <- tweets_dtm_tfidf[rowTotals > 0, ]
    
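    If you want to check the approach on something small first, here is a minimal sketch on a toy corpus. The documents and the names docs, corpus, dtm, and dtm_nonempty are made up for illustration; the second document is deliberately empty.

    ```r
    library(tm)
    library(slam)

    # Toy corpus (hypothetical data); the second document is empty
    docs   <- c("sparse matrices save memory", "", "dense matrices use memory")
    corpus <- VCorpus(VectorSource(docs))
    dtm    <- DocumentTermMatrix(corpus)

    # row_sums works on the sparse triplet representation directly,
    # so no dense copy of the matrix is created
    row_totals   <- slam::row_sums(dtm)
    dtm_nonempty <- dtm[row_totals > 0, ]

    nDocs(dtm)           # number of documents before filtering
    nDocs(dtm_nonempty)  # number of documents after filtering
    ```

    The same two lines should apply unchanged to a tf-idf weighted matrix, since an empty document has no entries and therefore a row sum of zero.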

    Keep in mind that anything that takes your data out of the sparse representation can produce a very large object in memory if you have a lot of terms. You might want to use removeSparseTerms to drop rare terms before moving on.
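
    As a sketch of that last suggestion, the toy example below drops rare terms with removeSparseTerms. The documents are made up, and the threshold of 0.6 is an arbitrary assumption; a suitable value depends on your data.

    ```r
    library(tm)

    # Toy corpus (hypothetical data)
    docs   <- c("apple banana", "apple cherry", "apple banana date", "apple")
    corpus <- VCorpus(VectorSource(docs))
    dtm    <- DocumentTermMatrix(corpus)

    # Drop terms whose sparsity exceeds the threshold, i.e. terms that
    # occur in too few documents; 0.6 is an illustrative value only
    dtm_small <- removeSparseTerms(dtm, sparse = 0.6)

    nTerms(dtm)        # number of terms before
    nTerms(dtm_small)  # number of terms after
    ```

    Dropping terms can leave some documents without any remaining terms, so it may be worth running the row_sums filter again afterwards.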