r regex gsub quanteda

Replacing a character with \n in a regex then turning the text into a quanteda corpus


I have some text that I have OCR'd. The OCR put a lot of newlines (\n) where they were not supposed to be, but it also missed a lot of newlines that were supposed to be there.

I want to remove the existing newlines and replace them with spaces. Then replace specific characters with newlines in the raw text. Then turn the documents into a corpus in quanteda.

I can create a basic corpus, but the trouble is that I can't then break it up into paragraphs. If I use
corpus_reshape(corps, to = "paragraphs", use_docvars = TRUE), it does not break up the document.

If I use corpus_segment(corps, pattern = "\n"), I get an error.

rm(list=ls(all=TRUE))
library(quanteda)
library(readtext)

# Here is a sample Text
sample <- "Hello my name is Christ-
ina. 50 Sometimes we get some we-


irdness

Hello my name is Michael, 
sometimes we get some weird,


 and odd, results-- 50 I want to replace the 
 50s
"



# Removing the existing breaks
sample <- gsub("\n", " ", sample)
sample <- gsub(" {2,}", " ", sample)
# Adding new breaks
sample <- gsub("50", "\n", sample)

# I can create a corpus
corps <- corpus(sample, compress = FALSE)
summary(corps, 1)

# But I can't change to paragraphs
corp_para <- corpus_reshape(corps, to ="paragraphs", use_docvars = TRUE)
summary(corp_para, 1)

# Segmenting on "\n" gives an error
corp_segmented <- corpus_segment(corps, pattern = "\n")

# The \n characters are in both documents.... 
corp_para$documents$texts
sample

Solution

  • I recommend using regular expression replacement to clean your text before making it into a corpus. The trick with your text is to figure out where you want to remove newlines and where you want to keep them. I'm guessing from your question that you want to replace the occurrences of "50" with newlines, but also probably join the words split by a hyphen and a newline. You probably also want to keep two newlines between texts?

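    Many users prefer the simpler interface of the stringr package, but I've always tended to use stringi (on which stringr is built) instead. It allows for vectorized replacement, so you can pass a vector of patterns and a matching vector of replacements in one function call. With vectorize_all = FALSE, each pattern is applied in turn to the result of the previous replacement, so the order of the patterns matters.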

    library("stringi")
    
    sample2 <- stri_replace_all_regex(sample, c("\\-\\n+", "\\n+", "50"), c("", "\n", "\n"),
      vectorize_all = FALSE
    )
    cat(sample2)
    ## Hello my name is Christina. 
    ##  Sometimes we get some weirdness
    ## Hello my name is Michael, 
    ## sometimes we get some weird,
    ##  and odd, results-- 
    ##  I want to replace the 
    ##  
    ## s
    

    Here, you match "\\n" as a regular expression pattern but use just "\n" as the (literal) replacement.
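    To make the pattern/replacement distinction concrete, here is a small base R sketch. The strings `x` and `"first;second"` are made-up illustrations, not part of your data, and base `gsub()` stands in for stringi here.

```r
# "\n" is one character (a newline); "\\n" is two (a backslash and an "n")
nchar("\n")   # 1
nchar("\\n")  # 2

x <- "line one\nline two"

# As a pattern, the regex engine reads backslash-n as "match a newline"
a <- gsub("\\n", " ", x)

# A literal newline character in the pattern also matches itself
b <- gsub("\n", " ", x)

identical(a, b)  # TRUE: both give "line one line two"

# As a replacement, pass the literal newline character
y <- gsub(";", "\n", "first;second")
cat(y)
```

    Either spelling works as the pattern; for the replacement, the literal "\n" is the safe choice.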

    There are two newlines before the last "s" in the replaced text because a) there was already one before "50s" in the original (right after "replace the"), and b) we added one by replacing "50" with a new "\n".
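    If it helps to see where each newline comes from, the three replacements can be traced one at a time with base R's `gsub()`. This is only a sketch on a shortened, hypothetical excerpt (`excerpt`) of the sample text, applying the same three patterns in the same order.

```r
# Shortened stand-in for the sample: a hyphenated break, a blank line,
# and the trailing "50s"
excerpt <- "we-\n\n\nirdness\n\nthe \n 50s\n"

# Step 1: join words split by a hyphen followed by one or more newlines
step1 <- gsub("-\\n+", "", excerpt)
# "weirdness\n\nthe \n 50s\n"

# Step 2: collapse each remaining run of newlines into a single newline
step2 <- gsub("\\n+", "\n", step1)
# "weirdness\nthe \n 50s\n"

# Step 3: replace every "50" with a newline
step3 <- gsub("50", "\n", step2)
# "weirdness\nthe \n \ns\n"

cat(step3)
```

    After step 2 a newline still sits after "the ", and step 3 adds a second one where "50" used to be, which is why "s" ends up preceded by two newlines (with a space between them).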

    Now you can construct a corpus with quanteda::corpus(sample2).
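    If you want one document per line of the cleaned text, one option (an assumption on my part, not the only way) is to split the string yourself and pass the resulting character vector to corpus(). In this sketch, `cleaned` is a shortened, hypothetical stand-in for `sample2`, and the quanteda step is guarded so the snippet also runs where the package is not installed.

```r
# Shortened stand-in for the cleaned text (sample2 in the code above)
cleaned <- "Hello my name is Christina. \n Sometimes we get some weirdness\nHello my name is Michael"

# Split on single newlines, trim surrounding whitespace, drop empty pieces
paras <- trimws(strsplit(cleaned, "\n", fixed = TRUE)[[1]])
paras <- paras[nzchar(paras)]

# Each element of the character vector becomes one document
if (requireNamespace("quanteda", quietly = TRUE)) {
  corp_para <- quanteda::corpus(paras)
  print(quanteda::ndoc(corp_para))
}
```

    Splitting before building the corpus keeps the paragraph logic in plain string handling, where it is easy to inspect with cat() or print().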