I am trying to clean some text data, and after tokenising and e.g. removing punctuation, I want to transform the tokens object into a vector/dataframe/corpus.
My current approach is:
library(quanteda)
library(dplyr)
raw <- c("This is text #1.", "And a second document...")
tokens <- raw %>% tokens(remove_punct = TRUE)
docs <- lapply(tokens, toString) %>% gsub(pattern = ",", replacement = "")
Is there a more "quanteda" or at least a simpler way to do this?
This is how I would do it; it preserves the docnames as element names in the output vector. (You can add USE.NAMES = FALSE if you don't want to keep them.)
> sapply(tokens, function(x) paste(as.character(x), collapse = " "))
            text1                   text2
"This is text #1" "And a second document"
You don't need library(dplyr) here.
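If you also want the data frame or corpus form mentioned in the question, something along these lines should work. This is only a sketch of the same idea as the sapply() call above: the names toks, docs_vec, docs_df and docs_corp are made up for illustration, and I have swapped in vapply() so the result is guaranteed to be a character vector.

```r
library(quanteda)

raw <- c("This is text #1.", "And a second document...")

# Tokenise and drop punctuation; called `toks` here (illustrative name)
# so the object does not share a name with the tokens() function
toks <- tokens(raw, remove_punct = TRUE)

# One space-separated string per document, with docnames kept as element names
docs_vec <- vapply(as.list(toks), paste, character(1), collapse = " ")

# Data frame with the docnames in their own column
docs_df <- data.frame(doc_id = names(docs_vec),
                      text = unname(docs_vec),
                      stringsAsFactors = FALSE)

# Back to a corpus; corpus() takes the docnames from the names of the vector
docs_corp <- corpus(docs_vec)
```

Going through a named character vector first keeps the document names attached in all three forms, which is handy if you need to join the cleaned text back to other metadata later.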