
Is there a simple way to reshape a token object to documents in quanteda?


I am trying to clean some text data, and after tokenising and e.g. removing punctuation, I want to transform the tokens object back into a vector/dataframe/corpus.

My current approach is:

library(quanteda)
library(dplyr)

raw <- c("This is text #1.", "And a second document...")
tokens <- raw %>% tokens(remove_punct = TRUE)
docs <- lapply(tokens, toString) %>% gsub(pattern = ",", replacement = "")

Is there a more "quanteda" or at least a simpler way to do this?


Solution

  • This is how I would do it. It preserves the docnames as element names in your output vector. (You can add USE.NAMES = FALSE if you don't want to keep them.)

    > sapply(tokens, function(x) paste(as.character(x), collapse = " "))
                      text1                   text2 
          "This is text #1" "And a second document"
    

    You don't need library(dplyr) for this approach.
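
    Since the question also mentions a dataframe and a corpus, here is a minimal sketch of how the same pasted vector could be carried over to those two forms. The names toks, docs, df and corp are only illustrative (toks is used instead of tokens to avoid masking the function), and as.list() is called explicitly on the assumption that you want a plain named list before pasting.

    ```r
    library(quanteda)

    raw <- c("This is text #1.", "And a second document...")
    toks <- tokens(raw, remove_punct = TRUE)

    # One string per document; vapply() keeps the docnames as element names
    docs <- vapply(as.list(toks), paste, character(1), collapse = " ")

    # As a data frame with one row per document
    df <- data.frame(doc_id = names(docs), text = unname(docs),
                     stringsAsFactors = FALSE)

    # As a corpus; the names of the character vector become the docnames
    corp <- corpus(docs)
    ```

    vapply() is used here rather than sapply() only because it guarantees a character vector with one element per document.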