Search code examples
rxlsxpassword-protection

Importing a password protected xlsx file into R


I found an old thread (How do you read a password protected excel file into r?) that recommended that I use the following code to read in a password protected file:

install.packages("excel.link")

library("excel.link")

dat <- xl.read.file("TestWorkbook.xlsx", password = "pass", write.res.password="pass")

dat

However, when I try to do this my R immediately crashes. I've tried removing the write.res.password argument, and that doesn't seem to be the issue. I have a hunch that excel.link might not work with the newest version of R, so if you know of any other ways to do this I'd appreciate the advice.

EDIT: Using read.xlsx generates this error:

    Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class),  : 
     
    org.apache.poi.poifs.filesystem.OfficeXmlFileException: 
The supplied data appears to be in the Office 2007+ XML. 
You are calling the part of POI that deals with OLE2 Office Documents. 
You need to call a different part of POI to process this data (eg XSSF instead of HSSF)

Solution

  • You can remove the password of the excel file without knowing it with the following function (adapted version of code available at https://www.r-bloggers.com/2018/05/remove-password-protection-from-excel-sheets-using-r/)

    remove_Password_Protection_From_Excel_File <- function(dir, file, bool_XLSXM = FALSE)
    {
      initial_Dir <- getwd()
      setwd(dir)
    
      # file name and path after removing protection
      if(bool_XLSXM == TRUE)
      {
        file_unlocked <- stringr::str_replace(basename(file), ".xlsm$", "_unlocked.xlsm")
    
      }else
      {
        file_unlocked <- stringr::str_replace(basename(file), ".xlsx$", "_unlocked.xlsx")
      }
    
      file_unlocked_path <- file.path(dir, file_unlocked)
    
      # create temporary directory in project folder
      # so we see what is going on
      temp_dir <- "_tmp"
    
      # remove and recreate _tmp folder in case it already exists
      unlink(temp_dir, recursive = TRUE)
      dir.create(temp_dir)
    
      # unzip Excel file into temp folder
      unzip(file, exdir = temp_dir)
    
      # get full path to XML files for all worksheets
      worksheet_paths <- list.files(paste0(temp_dir, "/xl/worksheets"), full.name = TRUE, pattern = ".xml")
    
      # remove the XML node which contains the sheet protection
      # We might of course use e.g. xml2 to parse the XML file, but this simple approach will suffice here
      for(ws in worksheet_paths)
      {
        file_Content <- readLines(ws, encoding = "windows1")
    
        # the "sheetProtection" node contains the hashed password "<sheetProtection SOME INFO />"
        # we simply remove the whole node
        out <- str_replace(file_Content, "<sheetProtection.*?/>", "")
        writeLines(out, ws)
      }
    
      worksheet_Protection_Paths <- paste0(temp_dir, "/xl/workbook.xml")
      file_Content <- readLines(worksheet_Protection_Paths , encoding = "windows1")
      out <- stringr::str_replace(file_Content, "<workbookProtection.*?/>", "")
      writeLines(out, worksheet_Protection_Paths)
    
      # create a new zip, i.e. Excel file, containing the modified XML files
      old_wd <- setwd(temp_dir)
      files <- list.files(recursive = T, full.names = F, all.files = T, no.. = T)
    
      # as the Excel file is a zip file, we can directly replace the .zip extension by .xlsx
      zip::zip(file_unlocked_path, files = files) # utils::zip does not work for some reason
      setwd(old_wd)
    
      # clean up and remove temporary directory
      unlink(temp_dir, recursive = T)
      setwd(initial_Dir)
    }
    

    Once the password is removed, you can read the Excel file. This approach works for me.