The following code in R returns a data frame which includes filenames, unpacked length in bytes and the date (without extracting the files).
unzip(path_to_zip, list = TRUE)
I am wondering how I can also extract the size of the packed (compressed) files or alternatively the compression ratio for each file.
I am using a Windows 7 machine.
Thanks!
Using the unzip() function as is, you cannot: by default, it uses an internal C function that lists only the name, length, and date of each file. However, it can also call an external unzip executable, which does provide verbose information (with -v). To use this, you'll need a modified copy of R's unzip() function. The rest of this answer is an exercise in "Use the source, Luke", showing that existing functions can be extended when needed.
unzip2 <- function (zipfile, files = NULL, list = FALSE, list.verbose = FALSE, overwrite = TRUE,
                    junkpaths = FALSE, exdir = ".", unzip = "internal", setTimes = FALSE) {
    if (identical(unzip, "internal")) {
        if (!list && !missing(exdir))
            dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
        res <- .External(utils:::C_unzip, zipfile, files, exdir, list,
                         overwrite, junkpaths, setTimes)
        if (list) {
            dates <- as.POSIXct(res[[3]], "%Y-%m-%d %H:%M", tz = "UTC")
            data.frame(Name = res[[1]], Length = res[[2]], Date = dates,
                       stringsAsFactors = FALSE)
        }
        else invisible(attr(res, "extracted"))
    }
    else {
        WINDOWS <- .Platform$OS.type == "windows"
        if (!is.character(unzip) || length(unzip) != 1L || !nzchar(unzip))
            stop("'unzip' must be a single character string")
        zipfile <- path.expand(zipfile)
        if (list) {
            dashl <- if (list.verbose) "-lv" else "-l"
            res <- if (WINDOWS)
                system2(unzip, c(dashl, shQuote(zipfile)), stdout = TRUE)
            else system2(unzip, c(dashl, shQuote(zipfile)), stdout = TRUE,
                         env = c("TZ=UTC"))
            l <- length(res)
            res2 <- res[-c(1, 3, l - 1, l)]
            con <- textConnection(res2)
            on.exit(close(con))
            z <- read.table(con, header = TRUE, as.is = TRUE)
            dt <- paste(z$Date, z$Time)
            formats <- if (max(nchar(z$Date) > 8))
                c("%Y-%m-%d", "%d-%m-%Y", "%m-%d-%Y")
            else c("%m-%d-%y", "%d-%m-%y", "%y-%m-%d")
            slash <- any(grepl("/", z$Date))
            if (slash)
                formats <- gsub("-", "/", formats)
            formats <- paste(formats, "%H:%M")
            for (f in formats) {
                zz <- as.POSIXct(dt, tz = "UTC", format = f)
                if (all(!is.na(zz)))
                    break
            }
            z[, "Date"] <- zz
            z <- z[, colnames(z) != "Time"]
            nms <- c("Name", "Length", "Date")
            z[, c(nms, setdiff(colnames(z), nms))]
        }
        else {
            args <- c("-oq", shQuote(zipfile))
            if (length(files))
                args <- c(args, shQuote(files))
            if (exdir != ".")
                args <- c(args, "-d", shQuote(exdir))
            system2(unzip, args, stdout = NULL, stderr = NULL,
                    invisible = TRUE)
            invisible(NULL)
        }
    }
}
In this, I modified lines 1 (arguments), 6 (utils:::), 21-25 (dashl), and 45, and added lines 46-47 (column choosing). The rest is from the original R unzip function.
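To see what the listing branch does with the external program's output, here is a minimal sketch of just the parsing step. The res vector is an illustrative stand-in for what system2() might capture from unzip -lv (reconstructed from the example archive used in this answer, not captured from a real run), so treat the exact spacing and the totals line as assumptions.

```r
# Illustrative stand-in for the lines captured from `unzip -lv` (assumed layout).
res <- c(
    "Archive:  bashdotfiles.zip",
    " Length   Method    Size  Cmpr    Date    Time   CRC-32   Name",
    "--------  ------  ------- ---- ---------- ----- --------  ----",
    "    8269  Defl:N     2717  67% 2017-02-20 11:31 99c8d736  .bash_history",
    "     220  Defl:N      158  28% 2016-04-23 05:36 6ce3189b  .bash_logout",
    "    3771  Defl:N     1740  54% 2016-04-23 05:36 ab254644  .bashrc",
    "--------          -------  ---                            -------",
    "   12260             4615  62%                            3 files"
)

# Drop the "Archive:" line, both dashed rules, and the totals line,
# leaving the header row plus one row per file.
l <- length(res)
res2 <- res[-c(1, 3, l - 1, l)]

# read.table() splits on whitespace; "CRC-32" becomes the syntactic name "CRC.32".
z <- read.table(text = res2, header = TRUE, as.is = TRUE)
str(z)
```

Because the split is on whitespace, this approach (like the original function) would presumably misparse file names that contain spaces.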
By default, unzip2 behaves exactly like unzip, meaning it will not give you what you want. To get your desired results, you need to (a) tell it where your external unzip.exe is located, and (b) tell it you want verbose output. (Feel free to modify the above definition to change the defaults.)
Note that on Windows, unzip.exe is typically not installed by default. It is included in Rtools, Git-for-Windows, and msys2. You may need to add its directory to your PATH so that Sys.which("unzip") finds the executable.
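One way to check this up front is sketched below. find_unzip is a hypothetical helper (not part of R), and the candidate paths are only guesses at common install locations; adjust them to wherever Rtools, Git-for-Windows, or msys2 actually lives on your machine.

```r
# Hypothetical helper: prefer an unzip found on PATH, then fall back to a few
# candidate locations. The default paths are assumptions, not guaranteed.
find_unzip <- function(candidates = c(
    "C:/Rtools/bin/unzip.exe",
    "C:/Program Files/Git/usr/bin/unzip.exe"
)) {
    exe <- unname(Sys.which("unzip"))   # "" when nothing is found on PATH
    if (nzchar(exe)) return(exe)
    hit <- candidates[file.exists(candidates)]
    if (length(hit)) hit[[1]] else ""
}

unzip_exe <- find_unzip()
if (!nzchar(unzip_exe)) {
    message("No external unzip found; only the internal listing is available.")
}
```

The resulting unzip_exe can then be passed as the unzip argument in place of Sys.which("unzip").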
This uses the (default) internal C function, so only the name, length, and date are returned.
unzip2("~/bashdotfiles.zip", list = TRUE)
#            Name Length                Date
# 1 .bash_history   8269 2017-02-20 03:31:00
# 2  .bash_logout    220 2016-04-22 22:36:00
# 3       .bashrc   3771 2016-04-22 22:36:00
This uses the external executable, and is functionally identical (though notice the dates are different due to internal UTC conversion ... this could be fixed with a little more effort).
unzip2("~/bashdotfiles.zip", list = TRUE, unzip = Sys.which("unzip"))
#            Name Length                Date
# 1 .bash_history   8269 2017-02-20 11:31:00
# 2  .bash_logout    220 2016-04-23 05:36:00
# 3       .bashrc   3771 2016-04-23 05:36:00
Finally, the augmented listing:
unzip2("~/bashdotfiles.zip", list = TRUE, list.verbose = TRUE, unzip = Sys.which("unzip"))
#            Name Length                Date Method Size Cmpr   CRC.32
# 1 .bash_history   8269 2017-02-20 11:31:00 Defl:N 2717  67% 99c8d736
# 2  .bash_logout    220 2016-04-23 05:36:00 Defl:N  158  28% 6ce3189b
# 3       .bashrc   3771 2016-04-23 05:36:00 Defl:N 1740  54% ab254644
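Since the Cmpr column comes back as text (e.g. "67%"), you will probably want a numeric version. The sketch below rebuilds the relevant columns of the listing above by hand so that it runs without an archive; the column names CmprPct and Ratio are my own additions, not something unzip reports.

```r
# Hand-built copy of the relevant columns from the verbose listing above.
listing <- data.frame(
    Name   = c(".bash_history", ".bash_logout", ".bashrc"),
    Length = c(8269, 220, 3771),     # unpacked size in bytes
    Size   = c(2717, 158, 1740),     # packed (compressed) size in bytes
    Cmpr   = c("67%", "28%", "54%"),
    stringsAsFactors = FALSE
)

# Option 1: strip the percent sign from the reported (rounded) value.
listing$CmprPct <- as.numeric(sub("%", "", listing$Cmpr, fixed = TRUE))

# Option 2: recompute the unrounded space saving from the two sizes,
# guarding against zero-length files.
listing$Ratio <- ifelse(listing$Length > 0,
                        1 - listing$Size / listing$Length,
                        NA_real_)

listing[, c("Name", "Length", "Size", "CmprPct", "Ratio")]
```

With a real archive, the same two assignments can be applied directly to the data frame returned by unzip2(..., list = TRUE, list.verbose = TRUE).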