r, dataframe, text-mining, quanteda, topicmodels

Error in LDA(cdes, k = K, method = "Gibbs", control = list(verbose = 25L, : Each row of the input matrix needs to contain at least one non-zero entry


I have a large dataset with almost 90 columns and about 200k observations. One of the columns contains descriptions, so it is text only. About 100 of the descriptions are NA.

I tried Pablo Barbera's topic model code from GitHub, because I need to fit a topic model to these descriptions.

OUTPUT

library(topicmodels)
library(quanteda)

des <- subset(finalMSI, !is.na(description), select=c(description))
corpus_des <- corpus(des$description)
df_des <- dfm(corpus_des, remove=stopwords("spanish"), verbose=TRUE,
              remove_punct=TRUE, remove_numbers=TRUE)
cdes <- dfm_trim(df_des, min_docfreq = 2)

# estimate LDA with K topics
K <- 20
lda <- LDA(cdes, k = K, method = "Gibbs", 
           control = list(verbose=25L, seed = 123, burnin = 100, iter = 500))

Error in LDA(cdes, k = K, method = "Gibbs", control = list(verbose = 25L, : Each row of the input matrix needs to contain at least one non-zero entry

Since my subset contains no NAs, I don't understand this error message (this is my first time using this package).


Solution

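  • It looks like some of your documents are empty, in the sense that they contain no counts of any feature. This can happen even when no description is NA: removing stopwords, punctuation and numbers, and then trimming features with min_docfreq = 2, can leave a document with no features at all.

    You can remove them with:

    cdes <- dfm_trim(df_des, min_docfreq = 2)
    # keep only documents that still have at least one token after trimming
    cdes <- dfm_subset(cdes, ntoken(cdes) > 0)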
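
To see how a non-NA description can still end up as an empty row, here is a minimal sketch on a toy corpus. The three texts and the object names (txt, m_trim, m_ok, dropped) are made up for illustration and are not from the question's data. It assumes a recent quanteda where tokens() is called before dfm().

```r
library(quanteda)

# Hypothetical toy data: d3 shares no word with the other documents
txt <- c(d1 = "casa grande con jardin",
         d2 = "casa grande con piscina",
         d3 = "apartamento centrico")

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
m <- dfm(toks)

# Keep only features that occur in at least 2 documents
m_trim <- dfm_trim(m, min_docfreq = 2)

# Token count per document after trimming; d3 drops to 0
print(ntoken(m_trim))

# Record which documents became empty, then drop them
dropped <- docnames(m_trim)[ntoken(m_trim) == 0]
m_ok <- dfm_subset(m_trim, ntoken(m_trim) > 0)

print(dropped)
print(docnames(m_ok))
```

Keeping a vector such as dropped may help later if the topic assignments need to be matched back to rows of the original data frame, since the filtered matrix has fewer rows than the input.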