Tags: r, memory, lapply, text-mining, corpus

Memory problems when using lapply for corpus creation


My eventual goal is to transform thousands of PDFs into a corpus / document-term matrix in order to do some topic modeling. I am using the pdftools package to import my PDFs and the tm package to prepare my data for text mining. I managed to import and transform one individual PDF like this:

library(pdftools)
library(tm)

txt <- pdf_text("pdfexample.pdf")

#create corpus
txt_corpus <- Corpus(VectorSource(txt))

# Some basic text prep, with tm_map(), like:
txt_corpus <- tm_map(txt_corpus, tolower)

# create document term matrix
dtm <- DocumentTermMatrix(txt_corpus)

However, I am completely stuck when it comes to automating this process, and I have only limited experience with loops or apply functions. My approach runs into memory problems when converting the raw pdf_text() output into a corpus, even though I tested my code with only 5 PDF files (1.5 MB in total). R tried to allocate a vector of more than half a GB, which does not seem right to me at all. My attempt looks like this:

# Create a list of all pdf paths
file_list <- list.files(path = "mydirectory",
                 full.names = TRUE,
                 pattern = "name*", # to import only specific pdfs
                 ignore.case = FALSE)

# Run a function that reads the pdf of each of those files:
all_files <- lapply(file_list, FUN = function(files) {
             pdf_text(files)
             })

all_files_corpus = lapply(all_files,
                          FUN = Corpus(DirSource())) # That's where I run into memory issues

Am I doing something fundamentally wrong? I'm not sure whether this is just a memory issue or whether there are easier approaches to my problem. At least from what I have gathered, lapply should be a lot more memory efficient than looping, but maybe there is more to it. I've tried to solve this on my own for days now, but nothing has worked.

Grateful for any advice/hint on how to proceed!

Edit: I tried to execute the lapply with only one PDF and R crashed again, even though I have no capacity problems at all when using the code mentioned first.


Solution

  • You can write a function that performs the series of steps you want to execute on each PDF. (In your original attempt, FUN = Corpus(DirSource()) evaluates Corpus(DirSource()) right away, reading whatever is in the current working directory, and hands the resulting corpus rather than a function to lapply, which is likely what blew up your memory.)

    pdf_to_dtm <- function(file) {
      # read the pdf, one character string per page
      txt <- pdf_text(file)
      # create corpus
      txt_corpus <- Corpus(VectorSource(txt))
      # some basic text prep with tm_map(); content_transformer() keeps the corpus structure intact
      txt_corpus <- tm_map(txt_corpus, content_transformer(tolower))
      # create document term matrix
      dtm <- DocumentTermMatrix(txt_corpus)
      dtm
    }
    

    Use lapply() to apply the function to each file:

    file_list <- list.files(path = "mydirectory",
                     full.names = TRUE,
                     pattern = "name*", # to import only specific pdfs
                     ignore.case = FALSE)
    
    all_files_corpus <- lapply(file_list, pdf_to_dtm)
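
    This returns a list with one document-term matrix per PDF. If you need a single matrix with one row per document for topic modeling, one option is to collapse each PDF into one string first and build a single corpus. A minimal sketch, assuming each PDF should become exactly one document and reusing file_list from above:

    # collapse each pdf's pages into one string per file
    all_text <- vapply(file_list,
                       function(f) paste(pdf_text(f), collapse = " "),
                       character(1))

    # one corpus with one document per pdf
    all_corpus <- Corpus(VectorSource(all_text))
    all_corpus <- tm_map(all_corpus, content_transformer(tolower))

    # single document-term matrix, one row per pdf
    dtm_all <- DocumentTermMatrix(all_corpus)

    Building the matrix in one pass like this also avoids having to merge per-file matrices afterwards.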