r quanteda data-preprocessing

Back-transform word tokens to a sentence-based corpus in quanteda after preprocessing


I want to preprocess my text data using the {quanteda} package in R. To do so, I create a corpus, which is then tokenized and preprocessed (e.g. lowercasing and removing punctuation).

Ideally, I would then like to restore the initial sentence structure of the corpus while keeping the document variables, because my analysis follows a string-of-words approach.

# Load quanteda; magrittr provides the %>% pipe used below.
library(quanteda)
library(magrittr)

# Create an example corpus.
my_corpus <- corpus(c("This is a sentence. \n\nThis is another sentence.", 
                      "This is the first sentence of the second document.",
                      "This is yet another ... ••• *** sentence."))

# Set docvars.
docvars(my_corpus) <- data.frame(doc_id = 1:3, author = c("A", "B", "C"))

# Three documents and four sentences.
ndoc(my_corpus)
nsentence(my_corpus)

# Tokenize and preprocess.
my_tokens <- my_corpus %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower()

my_tokens

# Docvars are still present.
docvars(my_tokens)

I can then restore the sentence structure as follows, but doing so loses my docvars:

# Back-transform to sentences.
my_corpus.clean <- vapply(my_tokens, paste, character(1), collapse = " ") %>% corpus()

# Docvars are lost.
docvars(my_corpus.clean)

The preprocessing worked and so did restoring the sentence structure, but I no longer have my docvars. I could add them back to the new corpus object (docvars(...) <- ...), but I am afraid that the docvar values would no longer correspond to the right documents.

Is there a way to transform the tokens object back to a sentence-based object that avoids losing the docvars?


Solution

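  • Pass the docvars of the tokens object to corpus() through its docvars argument. vapply() keeps the documents in their original order, so each row of docvars(my_tokens) still matches its document: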

    # back-transform to sentences.
    my_corpus.clean <- vapply(my_tokens, paste, character(1), collapse = " ") |>
        corpus(docvars = docvars(my_tokens))
    
    # docvars are present
    my_corpus.clean
    #> Corpus consisting of 3 documents and 2 docvars.
    #> text1 :
    #> "this is a sentence this is another sentence"
    #> 
    #> text2 :
    #> "this is the first sentence of the second document"
    #> 
    #> text3 :
    #> "this is yet another sentence"
    
    docvars(my_corpus.clean)
    #>   doc_id author
    #> 1      1      A
    #> 2      2      B
    #> 3      3      C
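
If you are still worried about docvars drifting out of alignment, one option is to wrap the step in a small helper and check the result explicitly. The sketch below is only an illustration: tokens_to_corpus() is a hypothetical name, not a quanteda function, and it assumes that every document should be collapsed into a single space-separated string, as in the question.

```r
library(quanteda)

# Same example corpus and docvars as in the question.
my_corpus <- corpus(c("This is a sentence. \n\nThis is another sentence.",
                      "This is the first sentence of the second document.",
                      "This is yet another ... ••• *** sentence."))
docvars(my_corpus) <- data.frame(doc_id = 1:3, author = c("A", "B", "C"))

my_tokens <- tokens_tolower(tokens(my_corpus, remove_punct = TRUE))

# Hypothetical helper (not part of quanteda): collapse each document's tokens
# into one string and carry over the document names and docvars.
tokens_to_corpus <- function(toks) {
  texts <- vapply(as.list(toks), paste, character(1), collapse = " ")
  corpus(texts, docnames = docnames(toks), docvars = docvars(toks))
}

my_corpus_clean <- tokens_to_corpus(my_tokens)

# Both checks stop with an error if names or docvars were shuffled or dropped.
stopifnot(
  identical(docnames(my_corpus_clean), docnames(my_tokens)),
  identical(docvars(my_corpus_clean, "author"), docvars(my_tokens, "author"))
)
```

Because the document names are carried over as well, you can always look up a document by name in both objects and compare its docvars directly.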