r, quanteda, topicmodels

Convert stm's processed document format into a dtm (structural topic modeling)


I used the textProcessor and prepDocuments functions from the stm package to clean a corpus. Now I would like to convert the resulting object (a list of word indices plus a vocabulary) into a standard document-term matrix (or a quanteda document-feature matrix) so that I can apply the LDA function from topicmodels and compare the resulting topics with those from stm.

library(stm)
library(topicmodels)

processed <- textProcessor(poliblog5k.docs,
                           metadata = poliblog5k.meta,
                           language = "en")

prepped <- prepDocuments(processed$documents,
                         processed$vocab,
                         processed$meta,
                         lower.thresh = 20)

LDA(processed)
LDA(prepped)

> Error in x != vector(typeof(x), 1L)

LDA(processed$documents)
LDA(prepped$documents)

> Error in !all.equal(x$v, as.integer(x$v)) 
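
For context on the errors above: LDA expects a document-term matrix (or something coercible to a slam simple_triplet_matrix), whereas stm keeps each document as a two-row integer matrix, with vocabulary indices in the first row and counts in the second. The sketch below uses a made-up three-word vocabulary and two hypothetical documents (not the poliblog5k data) and only base R, to show what the conversion has to do.

```r
# Toy corpus in the stm layout. The vocabulary and documents are
# hypothetical and exist only to illustrate the structure.
vocab <- c("economy", "tax", "vote")
documents <- list(
  doc1 = rbind(c(1L, 3L), c(2L, 1L)),  # "economy" x2, "vote" x1
  doc2 = rbind(c(2L), c(4L))           # "tax" x4
)

# Dense documents-by-terms matrix of zeros, to be filled from the lists.
counts <- matrix(0L, nrow = length(documents), ncol = length(vocab),
                 dimnames = list(names(documents), vocab))

# Row 1 of each document holds column positions, row 2 holds the counts.
for (d in seq_along(documents)) {
  counts[d, documents[[d]][1, ]] <- documents[[d]][2, ]
}

counts
```

A dense matrix like this is only practical for small corpora; for real data a sparse representation is the better target.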

Solution

  • I had the same problem. I transformed the output of prepDocuments into a one-term-per-document-per-row format and then applied the cast_dtm function from the {tidytext} package (cast_dfm works the same way if you want a quanteda dfm instead).

    library(topicmodels)
    library(tidyverse)
    library(tidytext)
    library(magrittr)
    library(stm)
    
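    # Convert the output of prepDocuments() into a DocumentTermMatrix.
    stm_to_dtm <- function(out){
      # each document is a 2-row matrix (term index, count); transpose to one row per term
      tibble(out_doc = out$documents %>% map(t)) %>%
        mutate(out_doc = out_doc %>% map(set_colnames, c("term", "n"))) %>%
        mutate(out_doc = out_doc %>% map(as_tibble)) %>%
        # use the row position as the document id
        rownames_to_column(var = "document") %>%
        unnest(cols = out_doc) %>%
        # replace term indices with the actual vocabulary words
        mutate(term = out$vocab[term]) %>%
        cast_dtm(document, term, n)
    }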
    
    temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
    meta <- temp$meta
    vocab <- temp$vocab
    docs <- temp$documents
    out <- prepDocuments(docs, vocab, meta)
    
    prepped <- stm_to_dtm(out)
    
    > prepped
    <<DocumentTermMatrix (documents: 341, terms: 462)>>
    Non-/sparse entries: 3149/154393
    Sparsity           : 98%
    Maximal term length: 11
    Weighting          : term frequency (tf)
    
    > LDA(prepped, k = 5)
    A LDA_VEM topic model with 5 topics.
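
As a possible alternative to the tidytext route above, stm also ships a convertCorpus helper that can return the documents as a slam simple_triplet_matrix, which is one of the input types LDA accepts. The following is a minimal sketch on the same gadarian data; the object names dtm_slam and lda_fit and the seed value are my own choices, and I have not checked the behavior across package versions.

```r
library(stm)          # gadarian, textProcessor, prepDocuments, convertCorpus
library(topicmodels)  # LDA

temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
out  <- prepDocuments(temp$documents, temp$vocab, temp$meta)

# type = "slam" asks for a simple_triplet_matrix (documents in rows, terms in columns)
dtm_slam <- convertCorpus(out$documents, out$vocab, type = "slam")

# The seed is an arbitrary value, set only to make the run repeatable.
lda_fit <- LDA(dtm_slam, k = 5, control = list(seed = 1234))
```

This skips the long-format detour entirely, at the cost of ending up with a bare sparse matrix rather than a tm DocumentTermMatrix object.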