I have built a big document-term matrix with the package RTextTools.
Now I am trying to count the number of terms in each row of the matrix so that I can remove empty documents before performing topic modeling.
My code gives no errors when I apply it to a sample of my corpus, which yields a smaller matrix, but when I try to compute the row totals for the matrix produced from my entire corpus (~75,000 tweets) I get the following error message:
Error in vector(typeof(x$v), nr * nc) :
  vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
This is my code:
matrix <- create_matrix(data$clean_text, language="french", stemWords=TRUE, removeStopwords=TRUE, removeNumbers=TRUE, stripWhitespace=TRUE, toLower=TRUE, removePunctuation=TRUE, minWordLength=3)
rowTotals <- apply(matrix, 1, sum)
If I try with a matrix of 25,000 documents I get the following error:
rowTotals <- apply(matrix, 1, sum)
Error: cannot allocate vector of size 7.1 Gb
You might be able to work around this if you keep your data in the dtm, which uses a sparse matrix representation that is much more memory-efficient than a regular matrix.
The reason the apply function gives an error is that it converts the sparse matrix into a regular matrix (the matrix object in your question; by the way, it's poor style to give data objects names that are also names of functions, especially base functions). This means R has to allocate memory for every cell of the dtm, zeros included, and dtms are typically mostly zeros, so that's a lot of memory holding nothing but zeros. With a sparse matrix R doesn't need to store any of the zeros. That also matches your error messages: for the full ~75,000-document corpus the product nr * nc of rows and columns exceeds R's largest integer (2^31 - 1) and overflows to NA, while for 25,000 documents it fits but the resulting dense matrix alone would need the 7.1 Gb that R cannot allocate.
Here are the first few lines of the source for apply; see the last line shown here for the conversion to a regular matrix:
apply
function (X, MARGIN, FUN, ...)
{
    FUN <- match.fun(FUN)
    dl <- length(dim(X))
    if (!dl)
        stop("dim(X) must have a positive length")
    if (is.object(X))
        X <- if (dl == 2L)
            as.matrix(X)    # this is where your memory gets filled with zeros
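To make the cost of that conversion concrete, here is a minimal sketch with the tm package (the toy corpus below is made up for illustration; your real dtm comes from create_matrix, which also returns a tm DocumentTermMatrix):
library(tm)

# hypothetical five-document toy corpus, including one empty document
docs <- c("first document text", "second document file", "another file", "", "more text")
dtm  <- DocumentTermMatrix(Corpus(VectorSource(docs)))

length(dtm$v)             # the sparse object stores only the non-zero counts
nrow(dtm) * ncol(dtm) * 8 # a dense double matrix needs 8 bytes for every cell, zeros included
That second number is the nr * nc allocation that blows up once you have ~75,000 documents and a large vocabulary.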
So how to avoid that conversion? Here's one way to loop over the rows to get their sums while keeping the sparse matrix format:
sapply(seq(nrow(matrix)), function(i) sum(matrix[i,]))
[1] 2 1 2 2 1
Subsetting this way preserves the sparse format and does not convert the object to the more memory-expensive regular matrix representation. We can check the representation:
str(matrix[1,])
List of 6
$ i : int [1:2] 1 1
$ j : int [1:2] 1 3
$ v : num [1:2] 1 1
$ nrow : int 1
$ ncol : int 6
$ dimnames:List of 2
..$ Docs : chr "1"
..$ Terms: chr [1:6] "document" "file" "first" "second" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
So in the sapply call we are always working on a sparse matrix. And even if sum (or whatever function you use there) does some kind of conversion, it's only going to be converting one row of the dtm, rather than the entire thing.
The general principle when working with largish text data in R is to keep your dtm as a sparse matrix and then you should be able to keep within memory limits.
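Putting that together for your goal of removing empty documents, here is a rough sketch; it assumes your dtm is stored in an object called dtm (rather than matrix), and it uses row_sums from the slam package, which tm itself relies on for its sparse format, as a shortcut for the sapply loop above:
library(slam)

row_totals   <- row_sums(dtm)                  # sparse row sums, no dense conversion
# equivalently: row_totals <- sapply(seq(nrow(dtm)), function(i) sum(dtm[i, ]))
dtm_nonempty <- dtm[which(row_totals > 0), ]   # keep only documents with at least one term
Subsetting with which() keeps the result as a sparse DocumentTermMatrix, so it can be passed straight on to your topic modeling step.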