r nlp quanteda

How to read text files in quanteda, storing each line as a document


I have texts stored in several files.
Within each file, every line is a document (the text of a blog post, the text of a tweet, etc.).
If I read the files with the readtext package in the default way shown in its documentation and examples, the content of each file becomes a single document instead of each line becoming its own document.

My goal is to build a quanteda corpus with each line stored as a document.
I am using readtext because it is a companion package to quanteda, but using it is not a strict requirement.

I would like to avoid manually splitting the original files into smaller files, each corresponding to a single line.


Solution

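  • Method 1: use readLines() in combination with list.files():

    library(quanteda)

    txt <- character()
    # full.names = TRUE returns paths that include the folder, so readLines()
    # finds the files regardless of the current working directory
    for (f in list.files("your-folder", full.names = TRUE)) {
       txt <- c(txt, readLines(f))
    }
    corp <- corpus(txt)  # each element of txt (one line) becomes a document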
    

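  • Method 2: read each file as a single document, then split the documents at line breaks using corpus_segment():

    library(quanteda)
    library(readtext)
    corp <- corpus(readtext("your-folder"))  # one document per file
    corp_line <- corpus_segment(corp, "\n", extract_pattern = FALSE, pattern_position = "after")  # one document per line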
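
A variant of Method 1 that may be useful when you want to trace each document back to its source: the sketch below uses only base R to build a named character vector, with one element per non-blank line. The helper name read_lines_as_docs, the demo folder, and the "<file name>.<line index>" naming scheme are assumptions made for illustration; they are not part of quanteda or readtext. Dropping blank lines is also an assumption, so remove that step if empty lines should count as documents.

```r
# Hypothetical helper (not part of quanteda or readtext): read every file in
# a folder and return a named character vector with one element per line.
read_lines_as_docs <- function(folder, pattern = NULL) {
  files <- list.files(folder, pattern = pattern, full.names = TRUE)
  if (length(files) == 0) return(character())
  per_file <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    lines <- lines[nzchar(trimws(lines))]  # assumption: blank lines are not documents
    # Name each kept line "<file name>.<index among kept lines>"
    setNames(lines, sprintf("%s.%d", basename(f), seq_along(lines)))
  })
  unlist(per_file)
}

# Demo on two small temporary files (made-up content)
demo_dir <- file.path(tempdir(), "line-docs-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first blog post", "second blog post"), file.path(demo_dir, "blogs.txt"))
writeLines(c("first tweet", "", "second tweet"), file.path(demo_dir, "tweets.txt"))

txt <- read_lines_as_docs(demo_dir)
print(txt)

# Build the corpus only if quanteda is installed; the names of a character
# vector are used as document names
if (requireNamespace("quanteda", quietly = TRUE)) {
  corp <- quanteda::corpus(txt)
}
```

Because the vector is named, the resulting corpus should carry document names such as blogs.txt.1, which makes it easier to tell which file a given line came from than with the unnamed vector in Method 1.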