I have this code to fit a topic model with the R wrapper for MALLET:
docs <- mallet.import(DF$document, DF$text, stop_words)
mallet_model <- MalletLDA(num.topics = 4)
mallet_model$loadDocuments(docs)
mallet_model$train(100)
I have used the tm package to read my documents, which are txt files in a directory:
myCorpus <- Corpus(DirSource("data")) # a directory of txt files
The corpus can't be used as input for mallet.import, so how do I get from the tm corpus myCorpus above to the DF that mallet.import needs?
You can use tidy data principles to process your text and get it ready for input into mallet, with one row per document, as described here.
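As a minimal sketch of that conversion (not necessarily the linked approach), the tm corpus can be turned into a one-row-per-document data frame with tidytext's tidy() method for corpora, which returns the document name in an id column and the full text in a text column. The column names document and text mirror the question's DF. The name stop_file is an illustrative assumption, as is the choice to write tidytext's stop_words list to a temporary file because mallet.import accepts the stop list as a file path.

```r
library(tm)
library(tidytext)
library(mallet)

myCorpus <- Corpus(DirSource("data"))  # a directory of txt files

# tidy() on a tm corpus gives one row per document
tidy_corpus <- tidy(myCorpus)

# Keep only the two columns mallet.import() needs, named as in the question
DF <- data.frame(
  document = tidy_corpus$id,
  text = tidy_corpus$text,
  stringsAsFactors = FALSE
)

# Write a stop list to a temporary file (assumption: tidytext's stop_words)
stop_file <- tempfile()
writeLines(stop_words$word, stop_file)

docs <- mallet.import(DF$document, DF$text, stop_file)
```

From here, docs can be passed to mallet_model$loadDocuments() as in the question.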
Also, there are tidiers for the mallet package in tidytext, and you can use them to analyze the output of mallet topic modeling:
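library(dplyr)     # for rename()
library(tidytext)  # for the tidy() and augment() methods for mallet models
# word-topic pairs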
tidy(mallet_model)
# document-topic pairs
tidy(mallet_model, matrix = "gamma")
# augment() needs the word column to be named "term";
# word_counts is assumed to hold one row per document-word pair
term_counts <- rename(word_counts, term = word)
augment(mallet_model, term_counts)
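The word_counts table passed to rename() is not built in the snippet above. One way to produce it, sketched here under the assumption that DF has document and text columns as in the question, is to tokenize the text and count words per document. The toy DF below is hypothetical stand-in data so the example runs on its own.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for DF: one row per document
DF <- tibble(
  document = c("a.txt", "b.txt"),
  text = c("topic models find topics", "models of text and topics")
)

word_counts <- DF %>%
  unnest_tokens(word, text) %>%            # one row per word occurrence
  anti_join(stop_words, by = "word") %>%   # drop tidytext's built-in stop words
  count(document, word, sort = TRUE)       # columns: document, word, n
```

The resulting table has the document and word columns that the rename() and augment() calls rely on.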