
Converting Quanteda dfm to stm


I convert a tm corpus into a quanteda corpus, build a dfm from it, and then convert the dfm to stm format. This code was working fine until 15 minutes ago; all I did was add some more words to a custom removal list (myRMlist). I'm baffled. Any suggestions?

data(tmCorpus, package = "tm") 
Qcorpus <- corpus(tmCorpus)
summary(Qcorpus, showmeta=TRUE)

myRMlist <- readLines("myremovelist2.txt", encoding = "UTF-8")
Qcorpus.dfm <- dfm(Qcorpus, remove = myRMlist ) 
Qcorpus.dfm <- dfm(Qcorpus.dfm, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove = stopwords("en"), stem = FALSE)
Qcorpus.dfm <- dfm(Qcorpus.dfm, remove = stopwords(("es")))
Qcorpus.stm <- convert(Qcorpus.dfm, to = "stm")

Error in convert(Qcorpus.dfm, to = "stm") : unused argument (to = "stm")

Solution

  • It's hard to reproduce your error since I don't have all of your inputs, but I recreated a set of custom words to remove, and the conversion worked for me.

    There is a better way to get there, though: first create the tokens object and remove your word list from it, then construct the dfm, and finally convert it to the stm format.

    library("quanteda", warn.conflicts = FALSE)
    ## Package version: 2.0.2
    ## Parallel computing: 2 of 8 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    # set up data
    data(crude, package = "tm")
    Qcorpus <- corpus(crude)
    # simulate words to remove, not supplied
    myRMlist <- readLines(textConnection(c("and", "or", "but", "of")))
    
    # conversion works
    stm_input_stm <- Qcorpus %>%
      tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>%
      tokens_remove(pattern = c(myRMlist, stopwords("en"))) %>%
      dfm() %>%
      convert(to = "stm")
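
If a dfm already exists, as in the question, features can also be dropped from it afterwards with dfm_remove() rather than by calling dfm() on a dfm again. A minimal sketch, assuming a made-up two-document corpus and removal list (toy_corpus, toy_dfm and my_rm_list are illustrative names, not taken from the question):

```r
library("quanteda")

# Illustrative data only: two short texts and a small custom removal list
toy_corpus <- corpus(c(d1 = "Oil prices rose and OPEC met in 1987.",
                       d2 = "Crude oil or gas, but not coal."))
my_rm_list <- c("and", "or", "but", "of")

# Tokenize with the cleaning options, then build the dfm once
toy_dfm <- dfm(tokens(toy_corpus, remove_punct = TRUE, remove_numbers = TRUE))

# Drop the custom words and English stopwords from the existing dfm
toy_dfm <- dfm_remove(toy_dfm, pattern = c(my_rm_list, stopwords("en")))

# Inspect the features that remain
featnames(toy_dfm)
```

Removing at the tokens stage, as in the pipeline above, is usually the tidier choice; dfm_remove() is mainly useful when the dfm has already been built.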
    

    However, there is no need to convert to the stm format at all, since stm::stm() can take a dfm as input directly:

    # stm can take a dfm directly
    stm_input_dfm <- Qcorpus %>%
      tokens(remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE) %>%
      tokens_remove(pattern = c(myRMlist, stopwords("en"))) %>%
      dfm()
    
    library("stm")
    ## stm v1.3.5 successfully loaded. See ?stm for help. 
    ##  Papers, resources, and other materials at structuraltopicmodel.com
    
    stm(stm_input_dfm, K = 5)
    ## Beginning Spectral Initialization 
    ##   Calculating the gram matrix...
    ##   Finding anchor words...
    ##      .....
    ##   Recovering initialization...
    ##      .........
    ## Initialization complete.
    ## ....................
    ## Completed E-Step (0 seconds). 
    ## Completed M-Step. 
    ## Completing Iteration 1 (approx. per word bound = -6.022) 
    ## ....................
    ## Completed E-Step (0 seconds). 
    ## Completed M-Step. 
    ## Completing Iteration 2 (approx. per word bound = -5.480, relative change = 9.000e-02) 
    ## ....................
    ## Completed E-Step (0 seconds). 
    ## Completed M-Step. 
    ## Completing Iteration 3 (approx. per word bound = -5.386, relative change = 1.708e-02) 
    ## ....................
    ## Completed E-Step (0 seconds). 
    ## Completed M-Step. 
    ## Completing Iteration 4 (approx. per word bound = -5.370, relative change = 2.987e-03) 
    ## ....................
    ## Completed E-Step (0 seconds). 
    ## Completed M-Step. 
    ## Completing Iteration 5 (approx. per word bound = -5.367, relative change = 6.841e-04) 
    ## Topic 1: said, mln, oil, last, billion 
    ##  Topic 2: oil, dlrs, said, crude, price 
    ##  Topic 3: oil, said, power, ship, crude 
    ##  Topic 4: oil, opec, said, prices, market 
    ##  Topic 5: oil, said, one, futures, mln 
    ## ....................
    ## Completed E-Step (0 seconds). 
    ## Completed M-Step. 
    ## Completing Iteration 6 (approx. per word bound = -5.366, relative change = 1.601e-04) 
    ## ....................
    ## Completed E-Step (0 seconds). 
    ## Completed M-Step. 
    ## Completing Iteration 7 (approx. per word bound = -5.366, relative change = 5.444e-05) 
    ## ....................
    ## Completed E-Step (0 seconds). 
    ## Completed M-Step. 
    ## Completing Iteration 8 (approx. per word bound = -5.365, relative change = 1.856e-05) 
    ## ....................
    ## Completed E-Step (0 seconds). 
    ## Completed M-Step. 
    ## Model Converged
    ## A topic model with 5 topics, 20 documents and a 971 word dictionary.
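
As for the original error: the asker's session is not shown, so this is only a guess, but "unused argument (to = "stm")" means that the convert() R found does not accept a `to` argument. That can happen when another attached package defines its own convert() and masks quanteda's. A hedged sketch of how one might check for this and sidestep it (toy_dfm and toy_stm are illustrative names):

```r
library("quanteda")

# Which package does the convert() found on the search path come from?
# If this is not "quanteda", another package is masking it.
environmentName(environment(convert))

# Every place R can find an object named "convert"
getAnywhere("convert")$where

# Calling the function through its namespace avoids any masking
toy_dfm <- dfm(tokens(c(d1 = "oil prices rose", d2 = "crude oil fell")))
toy_stm <- quanteda::convert(toy_dfm, to = "stm")
names(toy_stm)
```

If masking turns out to be the cause, writing quanteda::convert() in the original code, or attaching quanteda after the conflicting package, should be enough.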