r dataframe docx

Write a function in R to process docx files


I have a folder that contains *.docx files. I want to turn the script below into some kind of loop or function that reads all the docx files, but I don't really know how to write an R function. Could someone please guide me?

library(docxtractr)
real_world <- read_docx("C:/folder/doc1.docx")
docx_tbl_count(real_world)
tbls <- docx_extract_all_tbls(real_world)
a <- as.data.frame(tbls)

Ideally, it should append the new tables to the result every time another document is extracted.

Thanks, Peddie


Solution

  • Edit: For this answer I assumed that the OP did not use the term "function" in the sense of an R function, but simply meant a procedure that solves the problem.

    #### load packages ####
    library(docxtractr)
    library(plyr)
    
    #### load data ####
    # define path of dir
    pathto <- "stackoverflow/41251392/example/"
    # get path of every .docx-file in dir
    # (pattern is a regular expression, not a glob, so anchor the extension)
    filelist <- list.files(path = pathto, pattern = "\\.docx$", full.names = TRUE)
    # read every file with docxtractr::read_docx()
    tablelist <- lapply(filelist, read_docx)
    # extract every table from every file with docxtractr::docx_extract_all_tbls()
    tables <- lapply(tablelist, docx_extract_all_tbls)
    
    #### append data to create one data.frame #### 
    # combine extracted tables with plyr::ldply()
    ldply(lapply(tables, function(x) {ldply(x, data.frame)}), data.frame)
    

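    The last line is a bit difficult to read: the inner ldply() row-binds all tables found in one document into a single data.frame, and the outer ldply() then row-binds those per-document results into one data.frame. Columns that are missing from a table are filled with NA. Take a look at ?plyr::ldply for details.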
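
Since the question asks for an R function, the steps above could also be wrapped up roughly as follows. This is only a sketch under a few assumptions: the names read_docx_tables() and bind_fill() are made up for this example, bind_fill() is a small base-R stand-in for the row-binding that plyr does, and the source_file column is an optional extra that was not part of the original script. Only docxtractr functions already used above (read_docx, docx_tbl_count, docx_extract_all_tbls) are called.

```r
# Hypothetical helper: row-bind data frames whose columns may differ,
# filling columns a table does not have with NA.
bind_fill <- function(dfs) {
  if (length(dfs) == 0) return(data.frame())
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    d <- as.data.frame(d, stringsAsFactors = FALSE)
    for (col in setdiff(all_cols, names(d))) {
      d[[col]] <- rep(NA, nrow(d))
    }
    d[all_cols]  # same column order for every table
  })
  out <- do.call(rbind, padded)
  rownames(out) <- NULL
  out
}

# Hypothetical wrapper: read every .docx file in `dir` and return all
# of their tables stacked into one data.frame.
read_docx_tables <- function(dir) {
  files <- list.files(dir, pattern = "\\.docx$", full.names = TRUE)
  per_file <- lapply(files, function(f) {
    doc <- docxtractr::read_docx(f)
    # skip documents that contain no tables
    if (docxtractr::docx_tbl_count(doc) == 0) return(NULL)
    out <- bind_fill(docxtractr::docx_extract_all_tbls(doc))
    # optional extra column: remember which file each row came from
    out$source_file <- rep(basename(f), nrow(out))
    out
  })
  bind_fill(Filter(Negate(is.null), per_file))
}
```

With the folder from the question, a call would look like all_tables <- read_docx_tables("C:/folder"). If the tables in the documents have different headers, the result gets one column per distinct header, so it is worth checking names(all_tables) afterwards.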