I am trying to scan many documents in order to reorganize their text into a standard format. This involves either extracting the table with docxtractr and the body text separately with textreadr, or using officer::docx_summary to label the body and table text for easier manipulation. For this problem, I'm using officer::read_docx and officer::docx_summary. The test documents are .docx files that contain nonsense text before and after a table that includes text and numbers.
My code is:
dir <- "C:/path/to/documents"
filenames <- list.files(path = dir, pattern = "*.docx", full.names = TRUE)
docxtest <- officer::docx_summary(lapply(filenames, officer::read_docx))
Ideally it would produce a list of data frames that contain the docx_summary information. I tried to use lapply on a list of filenames, but the output list gives an error when I try to view it:
Error in names[[i]]: subscript out of bounds.
officer::docx_summary expects a single object returned by officer::read_docx; it does not accept a list of them. Call both functions inside lapply so that each file is read and summarized in turn:
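# pattern is a regular expression, so escape the dot and anchor the extension
filenames <- list.files(path = dir, pattern = "\\.docx$", full.names = TRUE)
# read and summarize one file at a time; the result is a list of data frames
docxtest <- lapply(filenames, function(x) officer::docx_summary(officer::read_docx(x)))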
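To try the per-file pattern without touching real documents, here is a minimal self-contained sketch. The two temporary .docx files, their contents, and the names tmp_dir, summaries, body_text and table_text are made up for illustration. It assumes that docx_summary reports a content_type column with the values "paragraph" and "table cell", which is how I understand its output.

```r
library(officer)

# Build two small throwaway .docx files so the sketch runs on its own.
# (Hypothetical content; real documents would come from your own folder.)
tmp_dir <- file.path(tempdir(), "docx_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:2) {
  doc <- read_docx()
  doc <- body_add_par(doc, paste("Intro text", i))
  doc <- body_add_table(doc, data.frame(item = c("a", "b"), value = c(1, 2)))
  doc <- body_add_par(doc, paste("Closing text", i))
  print(doc, target = file.path(tmp_dir, paste0("test", i, ".docx")))
}

filenames <- list.files(tmp_dir, pattern = "\\.docx$", full.names = TRUE)

# One docx_summary() data frame per file, named after the file it came from
summaries <- lapply(filenames, function(f) docx_summary(read_docx(f)))
names(summaries) <- basename(filenames)

# Separate body paragraphs from table cells in the first document
first <- summaries[[1]]
body_text  <- first[first$content_type == "paragraph", ]
table_text <- first[first$content_type == "table cell", ]
```

Naming the list elements by file name makes it easier to trace each summary back to its source document, for example with summaries[["test1.docx"]].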