Reading documents with r-tm to use with r-mallet


I have this code to fit a topic model with the R wrapper for MALLET:

library(mallet)

docs <- mallet.import(DF$document, DF$text, stop_words)

mallet_model <- MalletLDA(num.topics = 4)
mallet_model$loadDocuments(docs)
mallet_model$train(100)

I have used the tm package to read my documents, which are txt files in a directory:

library(tm)

myCorpus <- Corpus(DirSource("data")) # a directory of txt files

The corpus can't be used as input for mallet.import, so how do I get from the tm corpus myCorpus above to a data frame DF, with document and text columns, that I can pass to mallet.import?


Solution

  • You can use tidy data principles to process your text and get it ready for input into mallet, with one row per document, as described here.

    Also, there are tidiers for the mallet package in tidytext, and you can use them to analyze the output of mallet topic modeling:

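    library(dplyr)
    library(tidytext)

    # word-topic pairs (per-topic word probabilities, the default "beta" matrix)
    tidy(mallet_model)

    # document-topic pairs (per-document topic probabilities)
    tidy(mallet_model, matrix = "gamma")

    # column needs to be named "term" for "augment";
    # word_counts is assumed to hold per-document word counts with a "word" column
    term_counts <- rename(word_counts, term = word)
    augment(mallet_model, term_counts)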
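
To get the one-row-per-document data frame in the first place, something along these lines may work. This is only a sketch: corpus_to_df is a hypothetical helper, not a function from tm, mallet, or tidytext. It assumes that names() on the corpus returns the document ids and that as.character() on each element returns that document's lines. A named list stands in for the corpus here so the snippet runs without tm, so check the result against your own myCorpus before relying on it.

```r
# Hypothetical helper (not part of tm, mallet, or tidytext): collapse each
# document to a single string and return one row per document.
# Assumes names(corpus) gives the document ids and as.character(corpus[[i]])
# gives the lines of document i, which holds for a named list.
corpus_to_df <- function(corpus) {
  text <- vapply(
    seq_along(corpus),
    function(i) paste(as.character(corpus[[i]]), collapse = " "),
    character(1)
  )
  data.frame(document = names(corpus), text = text, stringsAsFactors = FALSE)
}

# Stand-in for a corpus: a named list with one character vector per file.
toy <- list("a.txt" = c("first line", "second line"), "b.txt" = "only line")
DF <- corpus_to_df(toy)

# With the objects from the question this would become (untested assumption):
# DF   <- corpus_to_df(myCorpus)
# docs <- mallet.import(DF$document, DF$text, stop_words)
```

The document and text columns are named to match the DF$document and DF$text arguments in the question, so the resulting data frame should slot straight into the mallet.import call.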