Tags: ruby, gzip, archive, tar, zlib

Read the file names or the number of files in tar.gz


I have a tar.gz file that contains multiple CSV files. I need to read the list of file names, or at least the number of files.

This is what I tried:

require 'zlib'

file = Zlib::GzipReader.open('test/data/file_name.tar.gz')
file.each_line do |line|
  p line
end

But this only prints each line of the CSV files, not the file names. I also tried this:

require 'zlib'

Zlib::GzipReader.open('test/data/file_name.tar.gz') { | f |
  p f.read
}

which produces similar output, but as one long string instead of line by line.

Any idea how I could get the list of file names or at least the number of files within the archive?


Solution

  • You need to use a tar reader on the decompressed output.

    ".tar.gz" means that two steps were applied to produce the file. First, a set of files was "tarred" into a ".tar" file, which contains a sequence of (file header block, uncompressed file data) units. Then that file was gzipped as a single stream of bytes to make the ".tar.gz". In practice, the .tar file was very likely never stored anywhere; it was generated as a stream of bytes and gzipped on the fly to write out the .tar.gz file directly.

    To get the contents, you reverse the process: gunzip the file, then feed the result to a tar reader, which interprets the file header blocks and extracts the data. Again, you can decompress and read the tarred contents on the fly, with no need to store the intermediate .tar file.
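
A minimal sketch of that in Ruby, assuming the tar reader that ships with RubyGems (Gem::Package::TarReader, loaded via rubygems/package) is acceptable for your use. The helper name tar_gz_file_names is made up for illustration. It streams the gzip output straight into the tar reader, so no intermediate .tar file is written:

```ruby
require 'zlib'
require 'rubygems/package'

# Hypothetical helper: returns the names of the regular files
# stored in a .tar.gz archive, in archive order.
def tar_gz_file_names(path)
  names = []
  # Step 1: undo the gzip layer as a stream.
  Zlib::GzipReader.open(path) do |gz|
    # Step 2: let a tar reader interpret the header blocks.
    Gem::Package::TarReader.new(gz) do |tar|
      tar.each do |entry|
        # Skip directories and other non-file entries.
        names << entry.full_name if entry.file?
      end
    end
  end
  names
end
```

With the path from the question, tar_gz_file_names('test/data/file_name.tar.gz') should give the list of names, and calling .size on the result gives the number of files. If directory entries should be listed too, drop the entry.file? check.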