I have the following problem: I converted a corpus into a dfm, and this dfm has some zero entries (documents whose rows sum to zero) that I need to remove before fitting an LDA model. I would usually do as follows:
OutDfm <- dfm_trim(dfm(corpus, tolower = TRUE, remove = c(stopwords("english"), stopwords("german"), stopwords("french"), stopwords("italian")), remove_punct = TRUE, remove_numbers = TRUE, remove_separators = TRUE, stem = TRUE, verbose = TRUE), min_docfreq = 5)
Creating a dfm from a corpus input...
... lowercasing
... found 272,912 documents, 112,588 features
... removed 613 features
... stemming features (English)
, trimmed 27491 feature variants
... created a 272,912 x 84,515 sparse dfm
... complete.
Elapsed time: 78.7 seconds.
# remove zero-entries
raw.sum=apply(OutDfm,1,FUN=sum)
which(raw.sum == 0)
OutDfm = OutDfm[raw.sum!=0,]
However, when I try to perform the last operations I get: Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
which suggests that the matrix is too large to be manipulated.
Has anyone met and solved this issue before? Is there an alternative strategy for removing the zero entries?
Thanks a lot!
Your apply with sum transforms the dfm from a sparse matrix into a dense matrix in order to calculate the row sums.
Either use slam::row_sums, since slam functions work on sparse matrices, or, better yet, just use quanteda::dfm_subset to select all the documents with more than 0 tokens.
dfm_subset(OutDfm, ntoken(OutDfm) > 0)
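To see why the sparse-aware route avoids the error, here is a minimal sketch that mirrors the three lines from the question on a toy sparse matrix from the Matrix package. The matrix m and the variable names are made up for illustration; the same pattern should apply to OutDfm, on the assumption that a dfm is built on the Matrix package's sparse classes.

```r
library(Matrix)

# Toy stand-in for a dfm: 4 "documents" x 3 "features", row 2 is all zeros
m <- sparseMatrix(i = c(1, 1, 3, 4),
                  j = c(1, 2, 3, 1),
                  x = c(2, 1, 5, 1),
                  dims = c(4, 3))

# Matrix::rowSums works on the sparse representation directly,
# unlike apply(m, 1, sum), which first coerces m to a dense matrix
row_totals <- Matrix::rowSums(m)

# Which rows are empty, and the matrix without them
empty_rows <- which(row_totals == 0)
m_nonempty <- m[row_totals != 0, , drop = FALSE]
```

The result is still a sparse matrix, so memory use stays proportional to the number of non-zero cells rather than to rows times columns.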
Example to show how it works, keeping only documents with more than 5,000 tokens (ntoken(x) > 5000):
library(quanteda)
x <- corpus(data_corpus_inaugural)
x <- dfm(x)
x
Document-feature matrix of: 58 documents, 9,360 features (91.8% sparse) and 4 docvars.
features
docs fellow-citizens of the senate and house representatives : among vicissitudes
1789-Washington 1 71 116 1 48 2 2 1 1 1
# subset based on amount of tokens.
dfm_subset(x, ntoken(x) > 5000)
Document-feature matrix of: 3 documents, 9,360 features (84.1% sparse) and 4 docvars.
features
docs fellow-citizens of the senate and house representatives : among vicissitudes
1841-Harrison 11 604 829 5 231 1 4 1 3 0
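The inaugural corpus has no empty documents, so here is a small sketch of the zero-token case itself. The three toy texts and the names toks, empty_docs and x_nonempty are invented for illustration, and the sketch assumes a quanteda version where the dfm is built from a tokens object.

```r
library(quanteda)

# Three toy documents; d2 is empty and so has zero tokens
toks <- tokens(c(d1 = "a b c", d2 = "", d3 = "b c"))
x <- dfm(toks)

# Names of the documents that would be dropped
# (the equivalent of which(raw.sum == 0) in the question)
empty_docs <- docnames(x)[ntoken(x) == 0]

# Keep only documents with at least one token
x_nonempty <- dfm_subset(x, ntoken(x) > 0)
ndoc(x_nonempty)
```

Checking empty_docs before subsetting is useful if you need to drop the same documents from any metadata you keep outside the dfm.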