
How to apply officer::read_docx to whole folder


I am attempting to scan many documents in order to reorganize their text into a standard format. This involves either extracting the table with docxtractr and the body text separately with textreadr, or using officer::docx_summary to label the body and table text for easier manipulation. For this problem, I'm using officer::read_docx and officer::docx_summary. The test documents are .docx files that contain nonsense text before and after a table holding text and numbers.

My code is:

dir <- "C:/path/to/documents"
filenames <- list.files(path = dir, pattern = "*.docx", full.names = TRUE)
docxtest <- officer::docx_summary(lapply(filenames, officer::read_docx))

Ideally it would produce a list of dataframes that contain the docx_summary information. I tried to use lapply on a list of filenames, but the output list gives an error when trying to view:

Error in names[[i]]: subscript out of bounds.

Solution

  • officer::docx_summary expects a single object returned by officer::read_docx; it does not accept a list of them. Call both functions on each file inside lapply instead:

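    # pattern is a regular expression, not a glob, so anchor it on the extension
    filenames <- list.files(path = dir, pattern = "\\.docx$", full.names = TRUE)
    # read and summarise one file at a time; the result is a list of data frames
    docxtest <- lapply(filenames, function(x) officer::docx_summary(officer::read_docx(x)))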
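
The per-file approach above can be tried end to end with the sketch below, which does not depend on any existing folder. It is only an illustration under stated assumptions: the temporary folder, the demo file names, the sample text and the `summaries` variable are all made up for this example. It writes two small documents, summarises each one, names the list elements by file, and then separates body paragraphs from table cells using the `content_type` column of the summary.

```r
library(officer)

# Hypothetical throwaway folder and files, created only so the sketch is runnable
tmp_dir <- file.path(tempdir(), "docx_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:2) {
  doc <- read_docx()  # empty document based on officer's default template
  doc <- body_add_par(doc, paste("Intro text", i))
  doc <- body_add_table(doc, data.frame(id = 1:2, label = c("a", "b")))
  doc <- body_add_par(doc, paste("Closing text", i))
  print(doc, target = file.path(tmp_dir, sprintf("demo_%d.docx", i)))
}

# One docx_summary() call per file, giving a list of data frames
filenames <- list.files(tmp_dir, pattern = "\\.docx$", full.names = TRUE)
summaries <- lapply(filenames, function(x) docx_summary(read_docx(x)))

# Name each element after its source file so results can be traced back
names(summaries) <- basename(filenames)

# Split one summary into body text and table text
first <- summaries[[1]]
body_text  <- first[first$content_type == "paragraph", ]
table_text <- first[first$content_type == "table cell", ]
```

Naming the list by file name keeps each summary traceable to its source document, which helps when the reorganized text has to be written back out per file.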