
Ignore errors in readtext r


I am trying to extract the text from a large number of docx files (1500) stored in one folder, using readtext (after creating a list of the files with list.files).

You can find similar examples here: https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html

I am getting errors with some files (examples below). The problem is that when an error occurs, the extraction process stops. I can identify the problematic file by setting verbosity = 3, but then I have to restart the extraction to find the next problematic file.

My question: is there a way to keep the process running when an error is encountered?

I set ignore_missing_files = TRUE, but this did not fix the problem.

Examples of the errors encountered:

write error in extracting from zip file
Error: 'C:\Users--- c/word/document.xml' does not exist.

Sorry for not posting a reproducible example; I do not know how to post one that involves large docx files. Here is the code:

library(readtext)
 
data_files <- list.files(path = "PATH", full.names = TRUE, recursive = TRUE)   # PATH = the path to the folder where the documents are located
extracted_texts <- readtext(data_files, docvarsfrom = "filepaths", dvsep = "/", verbosity = 3, ignore_missing_files = TRUE) # this is to extract the text in the files

 
write.csv2(extracted_texts, file = "data/text_extracts.csv", fileEncoding = "UTF-8") # this is to export the files into csv


Solution

  • Let's first put together a reproducible example:

    download.file("https://file-examples-com.github.io/uploads/2017/02/file-sample_1MB.docx", "test1.docx")
    writeLines("", "test2.docx")
    

    The first file should be a proper docx file; the second one is rubbish (an essentially empty text file that merely carries a .docx extension).

    I would wrap readtext in a small function that deals with the errors and warnings:

    readtext_safe <- function(f) {
      # Turn any error or warning into the sentinel string "fail"
      out <- tryCatch(readtext::readtext(f),
                      error = function(e) "fail",
                      warning = function(w) "fail")
      if (identical(out, "fail")) {
        # Log the problematic file and return NULL explicitly
        write(f, "errored_files.txt", append = TRUE)
        return(NULL)
      }
      out
    }
    

    Note that I treat errors and warnings the same, which might not be what you actually want. We can use this function to loop through your files:

    files <- list.files(pattern = "\\.docx$", ignore.case = TRUE, full.names = TRUE)
    
    x <- lapply(files, readtext_safe)
    x
    #> [[1]]
    #> readtext object consisting of 1 document and 0 docvars.
    #> # Description: df[,2] [1 × 2]
    #>   doc_id     text               
    #>   <chr>      <chr>              
    #> 1 test1.docx "\"Lorem ipsu\"..."
    #> 
    #> [[2]]
    #> NULL
    

    In the resulting list, failed files simply have a NULL entry, as nothing is returned for them. I like to write out a list of these errored files, and the function above creates a txt file that looks like this:

    readLines("errored_files.txt")
    #> [1] "./test2.docx"
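
    To get from that list back to a single table that can be exported as in the question, the NULL entries have to be dropped and the rest stacked. The following is only a minimal sketch under stated assumptions: the list `x` here is a stand-in built from plain data frames so that the snippet runs without any docx files, and the output path in `tempdir()` is hypothetical. Since readtext objects are data frames, the same `Filter` and `rbind` steps should carry over to the real `x`, but I have not verified that for every combination of docvars (results with differing docvar columns would not stack cleanly).

    ```r
    # Stand-in for the result of lapply(files, readtext_safe):
    # plain data frames replace readtext objects (an assumption),
    # and NULL marks a file that failed to read.
    x <- list(
      data.frame(doc_id = "test1.docx", text = "Lorem ipsum", stringsAsFactors = FALSE),
      NULL,
      data.frame(doc_id = "test3.docx", text = "Dolor sit amet", stringsAsFactors = FALSE)
    )

    # Drop the NULL entries left behind by failed files
    ok <- Filter(Negate(is.null), x)

    # Stack the remaining one-row results into a single data frame
    # (this yields NULL if every file failed)
    extracted_texts <- do.call(rbind, ok)

    # Report how many files were read and how many failed
    n_failed <- length(x) - length(ok)
    message(length(ok), " read, ", n_failed, " failed")

    # Export as in the question; the output path is hypothetical
    out_file <- file.path(tempdir(), "text_extracts.csv")
    write.csv2(extracted_texts, file = out_file, fileEncoding = "UTF-8")
    ```

    The names of the files counted as failed are the ones logged in errored_files.txt, so they can be inspected or retried separately.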