
How to read sizes of packed files inside zip archive using R


The following code in R returns a data frame which includes filenames, unpacked length in bytes and the date (without extracting the files).

unzip(path_to_zip, list = TRUE)

I am wondering how I can also extract the size of the packed (compressed) files or alternatively the compression ratio for each file.

I am using a Windows 7 machine.

Thanks!


Solution

  • With unzip() as shipped, you cannot: by default it calls an internal C function that returns only the name, uncompressed length, and date of each entry. It can, however, call an external unzip executable, and that does provide the verbose information (with -v). To use it this way, you need a modified copy of R's unzip() function. The rest of this answer is an exercise in "use the source, Luke", showing that the existing functions can be extended when needed.

    unzip2 <- function (zipfile, files = NULL, list = FALSE, list.verbose = FALSE, overwrite = TRUE, 
                        junkpaths = FALSE, exdir = ".", unzip = "internal", setTimes = FALSE) {
        if (identical(unzip, "internal")) {
            if (!list && !missing(exdir)) 
                dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
            res <- .External(utils:::C_unzip, zipfile, files, exdir, list, 
                overwrite, junkpaths, setTimes)
            if (list) {
                dates <- as.POSIXct(res[[3]], "%Y-%m-%d %H:%M", tz = "UTC")
                data.frame(Name = res[[1]], Length = res[[2]], Date = dates, 
                    stringsAsFactors = FALSE)
            }
            else invisible(attr(res, "extracted"))
        }
        else {
            WINDOWS <- .Platform$OS.type == "windows"
            if (!is.character(unzip) || length(unzip) != 1L || !nzchar(unzip)) 
                stop("'unzip' must be a single character string")
            zipfile <- path.expand(zipfile)
            if (list) {
                dashl <- if (list.verbose) "-lv" else "-l"
                res <- if (WINDOWS) 
                    system2(unzip, c(dashl, shQuote(zipfile)), stdout = TRUE)
                else system2(unzip, c(dashl, shQuote(zipfile)), stdout = TRUE, 
                    env = c("TZ=UTC"))
                l <- length(res)
                res2 <- res[-c(1, 3, l - 1, l)]
                con <- textConnection(res2)
                on.exit(close(con))
                z <- read.table(con, header = TRUE, as.is = TRUE)
                dt <- paste(z$Date, z$Time)
                formats <- if (max(nchar(z$Date) > 8)) 
                    c("%Y-%m-%d", "%d-%m-%Y", "%m-%d-%Y")
                else c("%m-%d-%y", "%d-%m-%y", "%y-%m-%d")
                slash <- any(grepl("/", z$Date))
                if (slash) 
                    formats <- gsub("-", "/", formats)
                formats <- paste(formats, "%H:%M")
                for (f in formats) {
                    zz <- as.POSIXct(dt, tz = "UTC", format = f)
                    if (all(!is.na(zz))) 
                      break
                }
                z[, "Date"] <- zz
                z <- z[, colnames(z) != "Time"]
                nms <- c("Name", "Length", "Date")
                z[, c(nms, setdiff(colnames(z), nms))]
            }
            else {
                args <- c("-oq", shQuote(zipfile))
                if (length(files)) 
                    args <- c(args, shQuote(files))
                if (exdir != ".") 
                    args <- c(args, "-d", shQuote(exdir))
                system2(unzip, args, stdout = NULL, stderr = NULL, 
                    invisible = TRUE)
                invisible(NULL)
            }
        }
    }
    

    In this, I modified lines 1 (arguments), 6 (utils:::), 21-25 (dashl), and 45, and added lines 46-47 (column selection). The rest comes from the original R unzip() function.
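
    To see what the external branch does with a verbose listing, the parsing step can be tried on its own. The sketch below feeds a hand-written sample through the same line-dropping and read.table() steps used in unzip2(). The sample is laid out the way Info-ZIP's unzip -lv typically prints, which is an assumption; other builds may format the listing differently, and the sample text is illustrative only.

    ```r
    # Hand-written sample of a verbose listing (assumed Info-ZIP layout).
    res <- c(
      "Archive:  bashdotfiles.zip",
      " Length   Method    Size  Cmpr    Date    Time   CRC-32   Name",
      "--------  ------  ------- ---- ---------- ----- --------  ----",
      "    8269  Defl:N     2717  67% 2017-02-20 11:31 99c8d736  .bash_history",
      "     220  Defl:N      158  28% 2016-04-23 05:36 6ce3189b  .bash_logout",
      "    3771  Defl:N     1740  54% 2016-04-23 05:36 ab254644  .bashrc",
      "--------          -------  ---                            -------",
      "   12260             4615  62%                            3 files"
    )

    # Drop the archive banner, the dashed rule under the header, and the two
    # trailing total lines, as unzip2() does.
    l <- length(res)
    res2 <- res[-c(1, 3, l - 1, l)]

    # Parse the remaining header and data rows as a whitespace-separated table.
    z <- read.table(text = res2, header = TRUE, as.is = TRUE)
    str(z)
    ```

    Because the rows are split on whitespace, a file name containing spaces would not parse cleanly with this approach; that limitation applies to the external branch of unzip2() as well.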

    By default, unzip2() behaves exactly like unzip(), so it will not give you what you want. To get the desired result, you need to (a) tell it where your external unzip.exe is located, and (b) tell it to be verbose. (Feel free to modify the definition above to change the defaults.)

    Note that on Windows, unzip.exe is typically not installed by default. It is included in Rtools, Git for Windows, and MSYS2. You may need to add its directory to the PATH so that Sys.which("unzip") finds the executable.
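
    One way to handle a missing executable is to check the PATH first and then try a few known install locations. This is a minimal sketch: find_unzip() is a hypothetical helper, not part of R, and the default candidate paths are guesses that should be adjusted to the local installation.

    ```r
    # Return the path to an unzip executable, or "" if none is found.
    # `candidates` holds fallback locations to try; the defaults are
    # hypothetical examples, not guaranteed install paths.
    find_unzip <- function(candidates = c("C:/Rtools/bin/unzip.exe",
                                          "C:/Program Files/Git/usr/bin/unzip.exe"),
                           on_path = Sys.which("unzip")) {
      # Sys.which() typically returns "" when the program is not on the PATH.
      if (nzchar(on_path)) return(unname(on_path))
      # Otherwise take the first candidate that exists on disk.
      hits <- candidates[file.exists(candidates)]
      if (length(hits)) hits[[1]] else ""
    }

    unzip_exe <- find_unzip()
    if (!nzchar(unzip_exe)) message("No unzip executable found; add one to the PATH.")
    ```

    The resulting unzip_exe can then be passed as the unzip argument of unzip2() in place of Sys.which("unzip").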

    The following call uses the default internal C function, so it returns only the name, length, and date.

    unzip2("~/bashdotfiles.zip", list = TRUE)
    #            Name Length                Date
    # 1 .bash_history   8269 2017-02-20 03:31:00
    # 2  .bash_logout    220 2016-04-22 22:36:00
    # 3       .bashrc   3771 2016-04-22 22:36:00
    

    This uses the external executable and is functionally identical (though note that the dates differ because of the internal UTC conversion; this could be fixed with a little more effort).

    unzip2("~/bashdotfiles.zip", list = TRUE, unzip = Sys.which("unzip"))
    #            Name Length                Date
    # 1 .bash_history   8269 2017-02-20 11:31:00
    # 2  .bash_logout    220 2016-04-23 05:36:00
    # 3       .bashrc   3771 2016-04-23 05:36:00
    

    Finally, the augmented listing:

    unzip2("~/bashdotfiles.zip", list = TRUE, list.verbose = TRUE, unzip = Sys.which("unzip"))
    #            Name Length                Date Method Size Cmpr   CRC.32
    # 1 .bash_history   8269 2017-02-20 11:31:00 Defl:N 2717  67% 99c8d736
    # 2  .bash_logout    220 2016-04-23 05:36:00 Defl:N  158  28% 6ce3189b
    # 3       .bashrc   3771 2016-04-23 05:36:00 Defl:N 1740  54% ab254644
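
    The Cmpr column comes back as text such as "67%". If a numeric compression ratio is needed, it can be derived from that column or recomputed from Length and Size. The sketch below hard-codes the values from the listing above instead of calling unzip2(), so it runs without an archive; the column names CmprPct and Saved are made up for this example.

    ```r
    # Listing values copied from the verbose output above.
    z <- data.frame(
      Name   = c(".bash_history", ".bash_logout", ".bashrc"),
      Length = c(8269, 220, 3771),   # uncompressed bytes
      Size   = c(2717, 158, 1740),   # compressed bytes
      Cmpr   = c("67%", "28%", "54%"),
      stringsAsFactors = FALSE
    )

    # Numeric version of the percentage column printed by unzip.
    z$CmprPct <- as.numeric(sub("%", "", z$Cmpr, fixed = TRUE))

    # Ratio recomputed from the sizes: fraction of bytes saved by compression.
    # Zero-length entries (e.g. directories) are set to 0 to avoid dividing by zero.
    z$Saved <- ifelse(z$Length > 0, 1 - z$Size / z$Length, 0)

    z[, c("Name", "Length", "Size", "CmprPct", "Saved")]
    ```

    Recomputing from Length and Size gives more precision than the rounded percentage that unzip prints.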