Failure of unz() to unzip from a zip file offset of more than 2^31 bytes


I have been obtaining .zip archives of genome annotation from NCBI (mainly gff files). To save disk space I prefer not to unzip the archives, but to read the files directly into R using unz(). However, it seems that unz() is unable to extract files located towards the end of 'large' zip files:

ncbi.zip <- "file_location/name.zip"
files <- unzip(ncbi.zip, list=TRUE)
gff.files <- files$Name[ grep("gff$", files$Name) ]

## this works
gff.128 <- readLines( unz(ncbi.zip, gff.files[128]) )

## this gives an empty data structure (read.table() stops
## with an error saying no lines, or similar)
gff.129 <- readLines( unz(ncbi.zip, gff.files[129]) )

## there are 31 more gff files after the 129th one.
## no lines are read from any of these.

The zip file itself seems to be fine; I can unzip the specific files using unzip on the command line and unzip -t does not report any errors.

I've tried this with R versions 3.5 (openSUSE Leap 15.1), 3.6, and 4.2 (CentOS 7), and with more than one zip file, and I get exactly the same result.

I attached strace to R whilst reading in the 128th and 129th files. In both cases I first get a lot of lseek calls towards the end of the file (offset 2845892608, larger than 2^31); I assume this is where the zip directory is found. For the 128th file (the one that can be read), I eventually get an lseek to an offset slightly below 2^31, followed by a series of lseeks and reads (which extend beyond 2^31).

For the 129th file, I get the same reads towards the end of the file, but then rather than finding a position within the file I get:

lseek(3, 2845933568, SEEK_SET)          = 2845933568
lseek(3, 4294963200, SEEK_SET)          = 4294963200
read(3, "", 4096)                       = 0
lseek(3, 4095, SEEK_CUR)                = 4294967295
read(3, "", 4096)                       = 0

Which is a bit weird, since the file itself is only about 2.8 GB. 4294967295 is, of course, 2^32 - 1.

To me this feels like an integer overflow bug, and I am considering posting a bug report, but I am wondering whether anyone has seen something similar before or whether I am doing something stupid.


Solution

  • Having done what I should have started with (reading the specification for the zip64 format), it's actually clear that this is not an integer overflow error.

    Zip files contain a central directory at the end of the archive; this contains, amongst other things, the names of the compressed files and the offset of the compressed data in the zip archive. The offset and file size fields are only given 4 bytes each in the standard directory entry; when a value is too large for this, it should instead be given in the extra field section, and the value in the standard field should be set to 0xFFFFFFFF. Since 0xFFFFFFFF is the offset that gets used when reading the file, it seems clear that the problem lies in the parsing of the extra field.
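
    As a rough illustration of the rule above, here is a minimal C sketch (not the R or minizip source) of how a reader might resolve the local header offset of one central directory entry. The function name zip64_resolve_offset and its parameters are made up for this example. It assumes the 32-bit size and offset fields and the raw extra field bytes have already been read from the entry, and it relies on the zip64 extra block (header ID 0x0001) storing only those values whose standard field is 0xFFFFFFFF, in the order uncompressed size, compressed size, offset.

```c
#include <stddef.h>
#include <stdint.h>

#define ZIP32_SENTINEL 0xFFFFFFFFu  /* standard field value meaning "see zip64 extra block" */
#define ZIP64_EXTRA_ID 0x0001u      /* header ID of the zip64 extended information block */

/* Read little-endian integers from a byte buffer. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint64_t le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Hypothetical helper, for illustration only.
 * usize32, csize32, offset32: the 4-byte fields of the central directory entry.
 * extra, extra_len: the raw extra field bytes of the same entry.
 * Returns 0 and stores the real offset in *offset_out, or -1 if the
 * entry needs a zip64 value that cannot be found.
 */
int zip64_resolve_offset(uint32_t usize32, uint32_t csize32, uint32_t offset32,
                         const unsigned char *extra, size_t extra_len,
                         uint64_t *offset_out)
{
    if (offset32 != ZIP32_SENTINEL) {
        *offset_out = offset32;          /* fits in 32 bits: use it directly */
        return 0;
    }

    /* Walk the extra field: a sequence of (id, size, data) blocks. */
    size_t pos = 0;
    while (pos + 4 <= extra_len) {
        uint16_t id = le16(extra + pos);
        uint16_t size = le16(extra + pos + 2);
        const unsigned char *data = extra + pos + 4;

        if ((size_t)size > extra_len - pos - 4)
            return -1;                   /* block runs past the buffer */

        if (id == ZIP64_EXTRA_ID) {
            /* Skip the 8-byte sizes that precede the offset, if present. */
            size_t skip = 0;
            if (usize32 == ZIP32_SENTINEL) skip += 8;
            if (csize32 == ZIP32_SENTINEL) skip += 8;
            if (skip + 8 > size)
                return -1;               /* offset missing from the block */
            *offset_out = le64(data + skip);
            return 0;
        }
        pos += 4 + (size_t)size;
    }
    return -1;                           /* sentinel set but no zip64 block */
}
```

    For an entry whose standard offset field is 0xFFFFFFFF and whose extra field holds a zip64 block with a single 8-byte value, the function writes that value to *offset_out and returns 0. An entry whose offset fits in 32 bits is returned unchanged, without looking at the extra field.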

    I had a look at the source code for R 4.2.1 and it seems that the problem is due to the way the offset specified in the standard offset field is tested:

    if(file_info.uncompressed_size == (ZPOS64_T)(unsigned long)-1)
    

    Changing the right-hand side of this comparison to 0xFFFFFFFF seems to fix the problem.
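
    A small C sketch of why the two forms of the comparison can behave differently (one plausible reading, not taken from the R source): on a platform where unsigned long is 64 bits wide, as on typical 64-bit Linux, (unsigned long)-1 is 2^64 - 1, so a field holding the 32-bit sentinel 0xFFFFFFFF never compares equal to it. The typedef below is an assumed stand-in for ZPOS64_T, and both function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: a 64-bit unsigned stand-in for the ZPOS64_T type in the quoted line. */
typedef uint64_t ZPOS64_T;

/* The comparison as quoted: the sentinel is as wide as unsigned long,
 * so its value depends on the platform. */
int is_sentinel_quoted(ZPOS64_T v)
{
    return v == (ZPOS64_T)(unsigned long)-1;
}

/* The changed comparison: the sentinel is always the 32-bit all-ones value. */
int is_sentinel_changed(ZPOS64_T v)
{
    return v == (ZPOS64_T)0xFFFFFFFFu;
}
```

    With a 64-bit unsigned long, is_sentinel_quoted(0xFFFFFFFF) returns 0 while is_sentinel_changed(0xFFFFFFFF) returns 1; where unsigned long is 32 bits wide, both return 1.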

    I've submitted a bug report to R. Hopefully changing the check will not have any unintended consequences and the issue will be fixed.

    Still, I'm curious as to whether anyone else has come across the same issue. It seems a bit unlikely that my experience is unique.