How to convert a tokens object into a corpus object


I have a corpus object that I converted into a tokens object. I then filtered the tokens to remove some words and to unify the spelling of others. For the rest of my workflow I need a corpus object again. How can I construct one from the tokens object?


Solution

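  • You can paste each document's tokens back into a single string and build a new corpus from the result. Note that this does not restore the original text: removed tokens are gone, and every remaining token, punctuation included, ends up separated by a single space. (It may also not be the best approach if your only reason for going back to a corpus is to use corpus_reshape().)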

    library("quanteda")
    ## Package version: 3.1.0
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    txt <- c(
      "This is an example.",
      "This, a second example."
    )
    
    corp <- corpus(txt)
    
    toks <- tokens(corp) %>%
      tokens_remove(stopwords("en"))
    toks
    ## Tokens consisting of 2 documents.
    ## text1 :
    ## [1] "example" "."      
    ## 
    ## text2 :
    ## [1] ","       "second"  "example" "."
    
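    # Collapse each document's tokens into one space-separated string,
    # then build a new corpus from the resulting named character vector
    vapply(toks, paste, FUN.VALUE = character(1), collapse = " ") %>%
      corpus()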
    ## Corpus consisting of 2 documents.
    ## text1 :
    ## "example ."
    ## 
    ## text2 :
    ## ", second example ."
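
If the original corpus carries document variables, the same idea can probably be extended to keep them. The following is only a sketch under some assumptions: the document names (d1, d2) and the docvar "source" are made up for illustration, and it relies on corpus() accepting a docvars argument for character input and on docvars() returning the variables attached to the tokens object.

```r
library("quanteda")

# Hypothetical example data: two named documents with one docvar, "source"
corp <- corpus(
  c(d1 = "This is an example.", d2 = "This, a second example."),
  docvars = data.frame(source = c("a", "b"))
)

# Tokenize and filter as before (nested calls, so no pipe is needed)
toks <- tokens_remove(tokens(corp), stopwords("en"))

# One space-separated string per document; as.list() turns the tokens
# object into a plain named list of character vectors first
txt_new <- vapply(as.list(toks), paste, FUN.VALUE = character(1), collapse = " ")

# Rebuild the corpus, passing the document variables along explicitly
corp_new <- corpus(txt_new, docvars = docvars(toks))
docvars(corp_new)
```

The document names come along as the names of the character vector, so only the docvars need to be handed over separately.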