Tags: r, dataframe, lda, quanteda

How to remove zero entries in a DFM when the matrix is too big for usual manipulation?


I have the following problem: I converted a corpus into a dfm, and this dfm has some zero entries (documents with no remaining features) that I need to remove before fitting an LDA model. I would usually do as follows:

OutDfm <- dfm_trim(
  dfm(corpus,
      tolower = TRUE,
      remove = c(stopwords("english"), stopwords("german"),
                 stopwords("french"), stopwords("italian")),
      remove_punct = TRUE,
      remove_numbers = TRUE,
      remove_separators = TRUE,
      stem = TRUE,
      verbose = TRUE),
  min_docfreq = 5
)

Creating a dfm from a corpus input...
   ... lowercasing
   ... found 272,912 documents, 112,588 features
   ... removed 613 features
   ... stemming features (English)
, trimmed 27491 feature variants
   ... created a 272,912 x 84,515 sparse dfm
   ... complete. 
Elapsed time: 78.7 seconds.


# remove zero-entries
raw.sum=apply(OutDfm,1,FUN=sum)
which(raw.sum == 0)
OutDfm = OutDfm[raw.sum!=0,]

However, when I try to perform the last operations I get: Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105, which suggests that the matrix is too large to be manipulated this way.

Is there anyone who has met and solved this issue before? Any alternative strategy to remove the 0 entries?

Thanks a lot!


Solution

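  • Your apply with sum converts the dfm from a sparse matrix into a dense matrix before calculating the row sums. With 272,912 documents and 84,515 features, the dense version would need roughly 23 billion cells, which is why the Cholmod error reports the problem as too large.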

    You can use slam::row_sums, since slam functions work on sparse matrices, but better yet, just use quanteda::dfm_subset to keep only the documents with more than 0 tokens.

    dfm_subset(OutDfm, ntoken(OutDfm) > 0)
    

    Example showing how it works, keeping only documents with more than 5,000 tokens (ntoken(x) > 5000):

    library(quanteda)
    x <- corpus(data_corpus_inaugural)
    x <- dfm(x)
    x
    Document-feature matrix of: 58 documents, 9,360 features (91.8% sparse) and 4 docvars.
                     features
    docs              fellow-citizens  of the senate and house representatives : among vicissitudes
      1789-Washington               1  71 116      1  48     2               2 1     1            1
    
    # subset based on amount of tokens.
    dfm_subset(x, ntoken(x) > 5000)
    Document-feature matrix of: 3 documents, 9,360 features (84.1% sparse) and 4 docvars.
                   features
    docs            fellow-citizens  of the senate and house representatives : among vicissitudes
      1841-Harrison              11 604 829      5 231     1               4 1     3            0
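
If you want to see the idea without quanteda, here is a minimal sketch of the same filtering on a plain sparse matrix from the Matrix package. The toy matrix, its dimension names, and the variable names are made up for illustration. It assumes the sparse-aware rowSums method from Matrix, which computes the row totals without building a dense copy; since a dfm is built on Matrix sparse classes, the same approach should carry over, but I have only shown it here on a generic sparse matrix.

```r
library(Matrix)

# Toy sparse "document-feature" matrix: 4 documents x 5 features.
# Document 3 has no non-zero entries, mimicking an empty document
# left over after trimming features. (Hypothetical data.)
m <- sparseMatrix(
  i = c(1, 1, 2, 4, 4),
  j = c(1, 3, 2, 4, 5),
  x = c(2, 1, 3, 1, 4),
  dims = c(4, 5),
  dimnames = list(paste0("doc", 1:4), paste0("feat", 1:5))
)

# With Matrix loaded, rowSums() dispatches to a sparse-aware method,
# so the matrix is never converted to a dense one (unlike apply()).
row_totals <- rowSums(m)

# Keep only the rows (documents) with at least one non-zero count.
m_nonempty <- m[row_totals > 0, , drop = FALSE]

print(rownames(m_nonempty))
```

The key difference from apply(OutDfm, 1, sum) is that apply first coerces its input to an ordinary dense matrix, while the row-sum call above only touches the stored non-zero values.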