I have a large dataset with almost 90 columns and about 200k observations. One of the columns contains descriptions, so it is text only. About 100 of the descriptions are NA.
I tried Pablo Barbera's topic model code from GitHub, since I need to fit a topic model:
library(topicmodels)
library(quanteda)
des <- subset(finalMSI, !is.na(description), select=c(description))
corpus_des <- corpus(des$description)
df_des <- dfm(corpus_des, remove = stopwords("spanish"), verbose = TRUE,
              remove_punct = TRUE, remove_numbers = TRUE)
cdes <- dfm_trim(df_des, min_docfreq = 2)
# estimate LDA with K topics
K <- 20
lda <- LDA(cdes, k = K, method = "Gibbs",
           control = list(verbose = 25L, seed = 123, burnin = 100, iter = 500))
Error in LDA(cdes, k = K, method = "Gibbs", control = list(verbose = 25L, : Each row of the input matrix needs to contain at least one non-zero entry
As I don't have any NAs in my subset, I don't understand this error message (this is my first time using this package).
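It looks like some of your documents are empty, in the sense that they contain no counts of any feature. This is unrelated to NAs: a non-missing description can still end up with zero features once stopwords, punctuation, and numbers are removed and dfm_trim() drops the features that occur in fewer than two documents.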
You can remove them with:
cdes <- dfm_trim(df_des, min_docfreq = 2)
# subset in a second step, so that ntoken() sees the trimmed dfm
cdes <- dfm_subset(cdes, ntoken(cdes) > 0)
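To see how documents can become empty even though none of them is NA, here is a minimal sketch on a toy corpus. The texts, the document names, and the four-word stopword list are made up for illustration, and the tokens() then dfm() workflow assumes a reasonably recent quanteda version; adapt it to your own data.

```r
library(quanteda)

# Toy corpus (hypothetical): d2 holds only stopwords, d3 only numbers
# and punctuation, so both are non-missing but carry no usable features.
txt <- c(d1 = "casa grande con terraza y casa con piscina",
         d2 = "de la y",
         d3 = "123 !!!",
         d4 = "terraza grande con piscina")

toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
# Hypothetical mini stopword list, standing in for stopwords("spanish")
toks <- tokens_remove(toks, c("de", "la", "y", "con"))
toy_dfm <- dfm_trim(dfm(toks), min_docfreq = 2)

# Names of the documents left with zero features after preprocessing
empty_docs <- docnames(toy_dfm)[ntoken(toy_dfm) == 0]
print(empty_docs)

# Keep only the documents that still contain at least one feature
toy_dfm <- dfm_subset(toy_dfm, ntoken(toy_dfm) > 0)
print(ndoc(toy_dfm))
```

Running the same check on cdes before subsetting tells you how many descriptions are dropped. If you later want to attach topics back to finalMSI, docnames(cdes) shows which documents survived.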