
Corpus reading from pdf OR text in R


I have a large list of files I want to read into R as a Corpus. All of the files used to be PDFs, but I recently realized that some of them will be plain-text (.txt) files.

Before I had the text files, I simply created a list of the PDF files in the directory and read them using the Corpus function with readerControl:

library(tm)

getwd()
# "\\.pdf$" matches only names ending in ".pdf"
files <- list.files(pattern = "\\.pdf$")
corp <- Corpus(URISource(files),
               readerControl = list(reader = readPDF))

I've tried to create a combined list of PDF and txt files, but I can't find a way to make readerControl use one reader for PDFs and another for txt files:

files1 <- list.files(pattern = "pdf$")
files2 <- list.files(pattern = "txt$")
files<-c(files1,files2)

corp <- Corpus(URISource(files),
               readerControl = list(reader = c(readPDF,readPlain)))

Any ideas on how to solve this issue? I thought about merging two Corpus objects, one built with reader = readPDF and the other with reader = readPlain. But since I am new to text mining, I am not sure what the best practice is.


Solution

  • Do it the easier way using the readtext package. If your mix of .txt and .pdf files is in the same subdirectory (call it path_to_your_files/), you can read them all in with readtext() and then convert the result into a tm Corpus. readtext() automagically recognises the different input file types and converts them into UTF-8 text for text analysis in R. (The rtext object created here is a special type of data.frame that includes a document identifier column and a column called text that contains the converted text contents of your input documents.)

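    rtext <- readtext::readtext("path_to_your_files/*")
    # VectorSource() lives in tm as well, so it needs the tm:: prefix
    # unless library(tm) has been called
    corp <- tm::Corpus(tm::VectorSource(rtext[["text"]]))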
    

    readtext objects can also be used directly with the quanteda package as inputs to quanteda::corpus() if you want to try an alternative to tm.
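
For completeness, the two-corpus idea raised in the question can also be made to work with tm alone, since c() concatenates tm corpora. The following is only a minimal sketch under some assumptions: read_mixed_corpus() is a hypothetical helper name (it is not part of tm), the files are told apart purely by their .pdf or .txt extension, and a PDF engine that readPDF can use (for example the pdftools package) is installed.

```r
library(tm)

# Hypothetical helper (not part of tm): build a single corpus from the
# .pdf and .txt files found in `dir`, using one reader per file type.
read_mixed_corpus <- function(dir = ".") {
  pdf_files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  txt_files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

  parts <- list()
  if (length(pdf_files) > 0) {
    # mode = "" passes only the file location on to readPDF, which
    # extracts the text itself instead of receiving raw lines.
    parts[[length(parts) + 1]] <- VCorpus(
      URISource(pdf_files, mode = ""),
      readerControl = list(reader = readPDF)
    )
  }
  if (length(txt_files) > 0) {
    parts[[length(parts) + 1]] <- VCorpus(
      URISource(txt_files),
      readerControl = list(reader = readPlain)
    )
  }

  # c() on tm corpora appends their documents; returns NULL if the
  # directory contains neither file type.
  do.call(c, parts)
}
```

Calling corp <- read_mixed_corpus("path_to_your_files") would then give one corpus with the PDF documents first and the text documents after them. Compared with readtext this takes more code, but it keeps everything inside tm and lets each file type have its own reader options.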