I have a 370MB zip file and the content is a 4.2GB csv file.
I did:
unzip("year2015.zip", exdir = "csv_folder")
And I got this message:
1: In unzip("year2015.zip", exdir = "csv_folder") :
possible truncation of >= 4GB file
Have you experienced that before? How did you solve it?
I agree with @Sixiang.Hu's answer, R's unzip() won't work reliably with files greater than 4GB.
To get at how did you solve it?: I've tried a few different tricks with it, and in my experience the result of anything using R's built-ins is (almost) invariably an incorrect identification of the end-of-file (EOF) marker before the actual end of the file.
I deal with this issue in a set of files I process on a nightly basis, and to deal with it consistently and in an automated fashion, I wrote the function below to wrap the UNIX unzip. This is basically what you're doing with system(unzip()), but gives you a bit more flexibility in its behavior, and allows you to check for errors more systematically.
decompress_file <- function(directory, file, .file_cache = FALSE) {
if (.file_cache == TRUE) {
print("decompression skipped")
} else {
# Set working directory for decompression
# simplifies unzip directory location behavior
wd <- getwd()
setwd(directory)
# Run decompression
decompression <-
system2("unzip",
args = c("-o", # include override flag
file),
stdout = TRUE)
# uncomment to delete archive once decompressed
# file.remove(file)
# Reset working directory
setwd(wd); rm(wd)
# Test for success criteria
# change the search depending on
# your implementation
if (grepl("Warning message", tail(decompression, 1))) {
print(decompression)
}
}
}
Notes:
The function does a few things, which I like and recommend:
system2
over system because the documentation says "system2 is a more portable and flexible interface than system" directory
and file
arguments, and moves the working directory to the directory
argument; depending on your system, unzip (or your choice of decompression tool) gets really finicky about decompressing archives outside the working directory
.file_cache
argument which allows you to skip decompression
if
+ grepl
check at the end looks for warnings in the stdout, and prints the stdout if it finds that expression