
R: possible truncation of >= 4GB file


I have a 370 MB zip file that contains a single 4.2 GB CSV file.

I did:

unzip("year2015.zip", exdir = "csv_folder")

And I got this message:

1: In unzip("year2015.zip", exdir = "csv_folder") :
  possible truncation of >= 4GB file

Have you experienced that before? How did you solve it?


Solution

  • I agree with @Sixiang.Hu's answer: R's unzip() won't work reliably with files of 4 GB or more.

    As for "How did you solve it?": I've tried a few different tricks, and in my experience anything that relies on R's built-ins (almost) invariably identifies the end-of-file (EOF) marker too early, before the actual end of the file, so the extracted file comes out truncated.
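One related option, offered as a sketch rather than a guaranteed fix: utils::unzip() accepts an unzip argument that hands extraction to an external program instead of the internal code. The helper name external_unzip below is hypothetical, and whether the external program itself copes with files of 4 GB or more depends on how it was built.

```r
# Hypothetical helper (not part of R): find an external unzip program,
# preferring the path stored in getOption("unzip") when it is set.
external_unzip <- function() {
  candidate <- getOption("unzip", default = "")
  if (!is.character(candidate) || length(candidate) != 1L ||
      !nzchar(candidate) || identical(candidate, "internal")) {
    # Fall back to searching the PATH; Sys.which() typically returns ""
    # when nothing is found.
    candidate <- unname(Sys.which("unzip"))
  }
  if (!nzchar(candidate)) {
    stop("no external unzip program found")
  }
  candidate
}

# Usage, with the file names from the question:
# unzip("year2015.zip", exdir = "csv_folder", unzip = external_unzip())
```

If no external program is found, the helper stops with an error instead of silently falling back to the internal method.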

    I run into this with a set of files I process nightly. To handle it consistently and in an automated fashion, I wrote the function below to wrap the UNIX unzip command. This is essentially what you would do by calling unzip through system(), but it gives you a bit more flexibility in its behavior and lets you check for errors more systematically.

    decompress_file <- function(directory, file, .file_cache = FALSE) {

        if (isTRUE(.file_cache)) {
            print("decompression skipped")
            return(invisible(NULL))
        }

        # Set working directory for decompression
        # simplifies unzip directory location behavior;
        # `file` is therefore resolved relative to `directory`.
        # on.exit() resets the working directory even if unzip fails.
        wd <- setwd(directory)
        on.exit(setwd(wd), add = TRUE)

        # Run decompression and capture stdout as a character vector
        decompression <-
            system2("unzip",
                    args = c("-o", # overwrite existing files without prompting
                             file),
                    stdout = TRUE)

        # uncomment to delete archive once decompressed
        # file.remove(file)

        # Test for success criteria
        # change the search depending on
        # your implementation
        # isTRUE() guards against empty output from unzip
        if (isTRUE(grepl("Warning message", tail(decompression, 1)))) {
            print(decompression)
        }

        invisible(decompression)
    }
    

    Notes:

    The function does a few things that I like and recommend:

    • uses system2 over system because the documentation says "system2 is a more portable and flexible interface than system"
    • separates the directory and file arguments, and changes the working directory to the directory argument; depending on your system, unzip (or your choice of decompression tool) gets really finicky about decompressing archives outside the working directory
      • it's not pure, but resetting the working directory afterwards is a nice step toward the function having fewer side effects
      • you can technically do it without this, but in my experience it's easier to make the function more verbose than to deal with generating file paths and remembering unzip CLI flags
    • I set it to use the -o flag to automatically overwrite when rerun, but you could supply any number of arguments
    • includes a .file_cache argument which allows you to skip decompression
      • this comes in handy if you're testing a process which runs on the decompressed file, since 4GB+ files tend to take some time to decompress
    • includes a file.remove() call, commented out in this instance; if you know you don't need the archive after decompressing, you can remove it inline
    • the system2 call captures stdout in decompression, a character vector
      • an if + grepl check at the end looks for warnings in the last line of stdout, and prints the full stdout if it finds that expression
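
The stdout search in the notes above can be complemented by checking the exit status: when system2() is called with stdout = TRUE and the command exits with a non-zero status, R attaches a "status" attribute to the returned character vector. The following is a minimal sketch, assuming a Unix-alike system with an Info-ZIP style unzip on the PATH. The names exit_status and unzip_checked are hypothetical, and treating every non-zero status as a failure is an assumption you may want to relax.

```r
# Hypothetical helper: system2(..., stdout = TRUE) attaches a "status"
# attribute to its result only when the command exits with a non-zero
# status, so a missing attribute is read as success (0).
exit_status <- function(output) {
  status <- attr(output, "status")
  if (is.null(status)) 0L else as.integer(status)
}

# Hypothetical wrapper: extract `zipfile` into `exdir` with the external
# unzip program, using -d instead of changing the working directory.
unzip_checked <- function(zipfile, exdir = ".") {
  output <- suppressWarnings(
    system2("unzip",
            args = c("-o", shQuote(zipfile), "-d", shQuote(exdir)),
            stdout = TRUE, stderr = TRUE)
  )
  status <- exit_status(output)
  if (status != 0L) {
    # Assumption: any non-zero status counts as a failure here.
    stop("unzip exited with status ", status, ":\n",
         paste(output, collapse = "\n"))
  }
  invisible(output)
}

# Usage, with the file names from the question:
# unzip_checked("year2015.zip", exdir = "csv_folder")
```

Passing the target directory with -d avoids touching the working directory at all, at the cost of having to quote both paths yourself.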