R: Read single file from within a tar.gz directory


Consider a tar.gz file of a directory containing many individual files.

From within R I can easily extract the names of the individual files with this command:

fileList <- untar("my_tar_dir.tar.gz", list = TRUE)

Using only R, is it possible to directly read/load a single one of those files into R (i.e. without first unpacking and writing the file to disk)?


Solution

  • It is possible, but I don't know of a clean, ready-made implementation (one may exist). Below is some very basic R code that should work in many cases (for instance, it assumes file names, including their full path inside the archive, are shorter than 100 characters). In a way it just re-implements "untar" in an extremely crude way, but one that lets us locate the desired file inside the gzipped archive.

    The first problem is that a gzipped file can really only be read sequentially from the start: using "seek()" to re-position the file pointer to the desired entry is, unfortunately, unreliable on a gzipped connection.
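
    To make that concrete: "skipping" ahead in a gzipped connection simply means reading and discarding the intervening bytes, for example with a small helper like this (just a sketch of the idea; the parser below uses the same read-and-discard pattern inline):

    # Sketch: skip n bytes of an open binary connection by reading and
    # discarding them with readBin(), instead of calling seek()
    skipBytes <- function(con, n) {
      if (n > 0) invisible(readBin(con, what = "raw", n = n))
    }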

    ParseTGZ <- function(archname){
      # open tgz archive
      tf <- gzfile(archname, open = 'rb')
      on.exit(close(tf))
      fnames <- list()
      offset <- 0
      nfile <- 0
      while (TRUE) {
        # go to the beginning of the entry
        # never use "seek" to re-locate in a gzipped file!
        if (seek(tf) != offset) readBin(tf, what = "raw", n = offset - seek(tf))
        # read the file name (a 100-byte, nul-padded field in the tar header)
        nameRaw <- readBin(tf, what = "raw", n = 100)
        fName <- rawToChar(nameRaw[nameRaw != as.raw(0)])
        if (nchar(fName) == 0) break
        nfile <- nfile + 1
        fnames <- c(fnames, fName)
        attr(fnames[[nfile]], "offset") <- offset + 512
        # read the size; first skip 24 bytes (file mode, uid, gid)
        # again, we only use readBin, not seek()
        readBin(tf, what = "raw", n = 24)
        # the file size is encoded as a length-12 octal string,
        # with the last character being '\0' (so 11 actual characters)
        sz <- readChar(tf, nchars = 11)
        # convert the octal string to a number of bytes
        sz <- sum(as.numeric(strsplit(sz, '')[[1]]) * 8^(10:0))
        attr(fnames[[nfile]], "size") <- sz
    #    cat(sprintf('entry %s, %.0f bytes\n', fName, sz))
        # go to the next entry: skip the data (rounded up to whole
        # 512-byte blocks) and don't forget the entry header itself (= 512)
        offset <- offset + 512 * (ceiling(sz / 512) + 1)
      }
      # return a named list of character strings with attributes
      names(fnames) <- fnames
      return(fnames)
    }
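
    A quick illustration of what ParseTGZ() returns (the entry name below is just a placeholder):

    toc <- ParseTGZ("my_tar_dir.tar.gz")
    names(toc)                              # file names inside the archive
    attr(toc[["some/file.txt"]], "offset")  # byte offset of that entry's data
    attr(toc[["some/file.txt"]], "size")    # its size in bytes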
    

    ParseTGZ() thus gives you the exact position and length of every file in the tar.gz archive. The next step is to actually extract a single file. You may be able to do this by using a "gzfile" connection directly, but here I will use a rawConnection(); this presumes the file fits into memory.

    ExtractTGZ <- function(archfile, filename) {
      # this function returns a raw vector
      # containing the desired file
      fp <- ParseTGZ(archfile)
      offset <- attributes(fp[[filename]])$offset
      fsize <- attributes(fp[[filename]])$size
      gzf <- gzfile(archfile, open = "rb")
      on.exit(close(gzf))
      # skip to the byte position by reading (and discarding), not with seek();
      # this may be a bad idea on really large archives...
      readBin(gzf, what = "raw", n = offset)
      # now read the file's data into a raw vector
      result <- readBin(gzf, what = "raw", n = fsize)
      result
    }
    

    Now, finally:

    ff <- rawConnection(ExtractTGZ("myarchive", "myfile"))
    

    Now you can treat ff as if it were (a connection pointing to) your file. But it only exists in memory.
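
    For example, if the extracted entry happens to be a plain-text file, you could read it straight from this in-memory connection and close it afterwards (a sketch, assuming the ff created above and a text entry):

    txt <- readLines(ff)   # or e.g. read.csv(ff) for a CSV entry
    close(ff)              # release the in-memory connection when done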