Tags: r, matrix, tm, topic-modeling, text-analysis

STM: how to keep metadata when converting from tm to stm document-term matrix?


I'm trying to run structural topic models (using the stm package) on a document-term matrix that was prepared with the tm package.

I built a corpus with the tm package that contains the following metadata:

library(tm)

myReader2 <- readTabular(mapping=list(content="text", id="id", sentiment = "sentiment"))
text_corpus2 <- VCorpus(DataframeSource(bin_stm_df), readerControl = list(reader = myReader2))

meta(text_corpus2[[1]])
  id       : 11
  sentiment: negative
  language : en

After some text cleaning, I saved the result as clean_corpus2 (metadata still present). I then converted it to a document-term matrix and read that in as an stm-compatible object:

library(stm)

chat_DTM2 <- DocumentTermMatrix(clean_corpus2, control = list(wordLengths = c(3, Inf)))
DTM2 <- removeSparseTerms(chat_DTM2 , 0.990)
DTM_st <- readCorpus(DTM2, type = "slam")

So far, so good. However, when I try to pull the metadata out of the stm-compatible object, it is gone:

docsTM <- DTM_st$documents # works fine
vocabTM <- DTM_st$vocab # works fine
metaTM <- DTM_st$meta # returns NULL

> metaTM
NULL

How do I keep the metadata from the tm-generated corpus in the stm-compatible document-term matrix? Any suggestions welcome, thanks.


Solution

  • How about trying the quanteda package?

    Without the ability to access your object, I cannot guarantee this works verbatim, but it should:

    library("quanteda")
    
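    # create the corpus; every column except "text" becomes a document variable
    text_corpus3 <- corpus(bin_stm_df, text_field = "text")

    # tokenize, then convert to a document-feature matrix
    # (recent quanteda versions need a tokens object here);
    # cleaning options can be added, see ?tokens
    chat_DTM3 <- dfm(tokens(text_corpus3))

    # similar to tm::removeSparseTerms(): keep terms found in at least 1% of
    # documents, which corresponds to sparsity = 0.990
    DTM3 <- dfm_trim(chat_DTM3, min_docfreq = 0.01, docfreq_type = "prop")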
    
    # convert to STM format
    DTM_st <- convert(DTM3, to = "stm")
    
    # then it's all there
    docsTM <- DTM_st$documents 
    vocabTM <- DTM_st$vocab    
    metaTM <- DTM_st$meta      # should return the data.frame of document variables
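
To check that the document variables really survive the conversion, here is a minimal, self-contained sketch on toy data. The data frame toy_df and its contents are made up for illustration (only the column names follow the question), and it assumes a recent quanteda where dfm() takes a tokens object:

```r
library(quanteda)

# Hypothetical stand-in for bin_stm_df: one row per document
toy_df <- data.frame(
  id = c(11, 12, 13),
  text = c("the service was slow and unhelpful",
           "great support and quick answers",
           "slow answers but friendly support"),
  sentiment = c("negative", "positive", "neutral"),
  stringsAsFactors = FALSE
)

# Columns other than "text" are stored as document variables
toy_corpus <- corpus(toy_df, text_field = "text")

# Tokenize and build the document-feature matrix
toy_dfm <- dfm(tokens(toy_corpus, remove_punct = TRUE))

# Convert to the list format that stm works with
toy_stm <- convert(toy_dfm, to = "stm")

names(toy_stm)   # should include "documents", "vocab" and "meta"
toy_stm$meta     # data.frame carrying the id and sentiment columns
```

If this behaves the same way on the real data, DTM_st$meta can be passed as the data argument of stm(), for example together with prevalence = ~ sentiment. As far as I know, convert() can drop documents that end up empty after trimming, so it is safer to use the returned meta than the original data frame when rows need to stay aligned with the documents.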