Back-transform word tokens to a sentence-based corpus in quanteda after preprocessing

I want to preprocess my text data using the {quanteda} package in R. To do so, I am creating a corpus, which is then tokenized and preprocessed (e.g. lowercase, remove punctuation, etc.).

Ideally, I would then want to restore the initial sentence structure of the corpus, whilst keeping the document variables, because I am following a string-of-words approach in the analysis.

library(quanteda)
library(magrittr)  # provides the %>% pipe used below

# Create an example corpus.
my_corpus <- corpus(c("This is a sentence. \n\nThis is another sentence.", 
                      "This is the first sentence of the second document.",
                      "This is yet another ... ••• *** sentence."))

# Set docvars.
docvars(my_corpus) <- data.frame(doc_id = 1:3, author = c("A", "B", "C"))

# Three documents and four sentences.
ndoc(my_corpus)
nsentence(my_corpus)

# Tokenize and preprocess.
my_tokens <- my_corpus %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower()

my_tokens

# Docvars are still present.
docvars(my_tokens)

I could then paste the tokens of each document back into a single string to restore the text structure. However, doing so loses my docvars:

# Back-transform to sentences.
my_corpus.clean <- vapply(my_tokens, paste, FUN.VALUE = character(1), collapse = " ") %>%
  corpus()

# Docvars are lost.
docvars(my_corpus.clean)

The preprocessing worked and so did restoring the sentence structure, but I no longer have my docvars. I could add them back to the new corpus object (docvars(...) <- ...), but I am afraid that the docvar values will no longer correspond to the right documents.
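The alignment worry can be tested directly. The following is a minimal sketch, not a definitive recipe: it uses a small stand-in corpus (the names `toks`, `txt`, and `rebuilt` and the example texts are made up for illustration) and assumes a quanteda version in which `as.list()` on a tokens object returns a list named by document. If the names of the pasted texts match `docnames()` of the tokens object, the docvars are in the same order as the texts.

```r
library(quanteda)

# Small stand-in corpus with one docvar (hypothetical data for illustration).
toks <- corpus(c(d1 = "First text here.", d2 = "Second text here."),
               docvars = data.frame(author = c("A", "B"))) |>
  tokens(remove_punct = TRUE) |>
  tokens_tolower()

# Paste each document's tokens back into a single string.
# as.list() on a tokens object gives one named element per document.
txt <- vapply(as.list(toks), paste, FUN.VALUE = character(1), collapse = " ")

# The names of the pasted texts should match the document names,
# so docvars taken from `toks` are in the same order as `txt`.
stopifnot(all(names(txt) == docnames(toks)))

# With the order confirmed, re-attaching the docvars by position is safe.
rebuilt <- corpus(txt)
docvars(rebuilt) <- docvars(toks)
```

The check stops with an error if the names ever diverge, for example after documents have been dropped or reordered between tokenizing and pasting.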

Is there a way to transform the tokens object back to a sentence-based object that avoids losing the docvars?

Solution

When rebuilding the corpus, pass the docvars of the tokens object to corpus() through its docvars argument:

# back-transform to sentences.
my_corpus.clean <- vapply(my_tokens, paste, FUN.VALUE = character(1), collapse = " ") |>
    corpus(docvars = docvars(my_tokens))

# docvars are present
my_corpus.clean
#> Corpus consisting of 3 documents and 2 docvars.
#> text1 :
#> "this is a sentence this is another sentence"
#> 
#> text2 :
#> "this is the first sentence of the second document"
#> 
#> text3 :
#> "this is yet another sentence"

docvars(my_corpus.clean)
#>   doc_id author
#> 1      1      A
#> 2      2      B
#> 3      3      C
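Note that the result above has one text per document, not per sentence. If one row per sentence is what the analysis needs, one possible approach is to split the corpus into sentences before tokenizing, using `corpus_reshape()`. The sketch below is an assumption about what "sentence-based" should mean here, not part of the original answer; the example texts and the names `corp`, `corp_sent`, `toks_sent`, and `clean_sent` are invented for illustration.

```r
library(quanteda)

# Hypothetical two-document corpus with one docvar.
corp <- corpus(c("This is a sentence. This is another sentence.",
                 "A second document."),
               docvars = data.frame(author = c("A", "B")))

# Split into one "document" per sentence; docvars are carried along.
corp_sent <- corpus_reshape(corp, to = "sentences")

# Preprocess at the sentence level.
toks_sent <- corp_sent |>
  tokens(remove_punct = TRUE) |>
  tokens_tolower()

# Paste tokens back together, one string per sentence, and
# re-attach the docvars from the tokens object.
clean_sent <- vapply(as.list(toks_sent), paste,
                     FUN.VALUE = character(1), collapse = " ") |>
  corpus(docvars = docvars(toks_sent))

ndoc(clean_sent)     # counts sentences, not the original documents
docvars(clean_sent)  # `author` is repeated for each sentence of a document
```

Because the sentence split happens before punctuation is removed, the sentence boundaries survive preprocessing, and each sentence keeps the docvars of the document it came from.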