
read the files one by one in a zip file using bash


I want to open the files inside a .zip file and read them. In this zip file, I have numerous .gz files, like a.dat.gz, b.dat.gz, and so on.

My code so far:

for i in $(unzip -p sample.zip)
do
    for line in $(zcat "$i")
    do
        # do some stuff here
    done
done

Solution

  • You are correct in needing two loops. First, you need a list of files inside the archive. Then, you need to iterate within each of those files.

    unzip -l sample.zip |sed '
      /^ *[0-9][0-9]* *2[0-9-]*  *[0-9][0-9]:[0-9][0-9]  */!d; s///
    ' |while IFS= read -r file; do   # -r keeps backslashes literal
      unzip -p sample.zip "$file" |gunzip -c |while IFS= read -r line; do
        # do stuff to "$line" here
        printf '%s\n' "$line"
      done
    done
    

    This assumes that each file in the zip archive is itself a gzip archive. You'll otherwise get an error from gunzip.

    Code walk

    unzip -l archive.zip will list the contents. Its raw output looks like this:

    Archive:  test.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
            9  2017-08-24 13:45   1.txt
            9  2017-08-24 13:45   2.txt
    ---------                     -------
           18                     2 files
    

    We therefore need to parse it. I've chosen to parse with sed because it's fast, simple, and preserves whitespace inside file names (what if a file name contains a tab?). Note that this will not work if file names contain line breaks. Don't do that.

    The sed script uses a regex to match everything on a file-entry line except the file name itself. The first command, /regex/!d, deletes every line that does not match (such as the header, separator, and totals lines). The second command, s///, reuses that same regex (an empty pattern means "the last regex used") and replaces the matched text with an empty string, leaving one file name per line. These names are piped into a while loop, which reads each one into $file. (The IFS= before read keeps whitespace at either end of the line from being stripped.)
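
To make the sed step concrete, here is a minimal sketch that runs the same script against the sample listing shown above instead of a real archive. The helper names sample_listing and extract_names are made up for this illustration, and the third entry ("my notes.txt") is a hypothetical file name added to show that an embedded space survives.

```shell
#!/usr/bin/env bash
# Reproduce the sample `unzip -l` output from the walk-through.
# "my notes.txt" is an invented entry, not part of the original listing.
sample_listing() {
  printf '%s\n' \
    'Archive:  test.zip' \
    '  Length      Date    Time    Name' \
    '---------  ---------- -----   ----' \
    '        9  2017-08-24 13:45   1.txt' \
    '        9  2017-08-24 13:45   2.txt' \
    '        9  2017-08-24 13:45   my notes.txt' \
    '---------                     -------' \
    '       27                     3 files'
}

# Delete every line that is not a file entry, then strip the
# length/date/time prefix from the lines that remain.
extract_names() {
  sed '/^ *[0-9][0-9]* *2[0-9-]*  *[0-9][0-9]:[0-9][0-9]  */!d; s///'
}

# Read the names exactly as sed printed them and bracket each one
# so any surrounding whitespace would be visible.
sample_listing | extract_names | while IFS= read -r file; do
  printf '[%s]\n' "$file"
done
```

This prints [1.txt], [2.txt], and [my notes.txt], one per line. Because the regex ends in a run of spaces, any leading spaces in a file name would be consumed along with the prefix.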

    We can then extract just the file we're iterating on, again using unzip -p to write it to standard output, where gunzip -c decompresses it and the inner while loop reads the result one line at a time into $line.
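
The inner loop can be tried without a zip archive. In this sketch, the hypothetical helper fake_entry stands in for unzip -p sample.zip "$file" by emitting a small gzip stream, and read_gz_lines (also a made-up name) is the decompress-and-read part of the pipeline. It assumes gzip and gunzip are installed.

```shell
#!/usr/bin/env bash
# Stand-in for `unzip -p sample.zip "$file"`: writes gzip-compressed
# text to standard output. The two lines of content are invented.
fake_entry() {
  printf '  indented line\nback\\slash\n' | gzip -c
}

# Decompress a gzip stream from standard input and process it line by line.
# IFS= preserves leading/trailing whitespace; -r keeps backslashes literal.
read_gz_lines() {
  gunzip -c | while IFS= read -r line; do
    printf '<%s>\n' "$line"
  done
}

fake_entry | read_gz_lines
```

The angle brackets in the output make it easy to see that the leading spaces and the backslash arrive in $line unchanged.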

    Experimental simplification

    I'm not sure how reliable this would be, but you might be able to do this more simply as:

    unzip -p sample.zip |gunzip -c |while IFS= read -r line; do
      # do stuff to "$line"
      printf '%s\n' "$line"
    done
    

    This should work because unzip -p archive writes the contents of every file in the archive to standard output, concatenated without delimiters or metadata (such as file names), and because the gzip format allows compressed members to be concatenated. gunzip -c therefore sees one valid multi-member gzip stream, decompresses it to standard output, and passes the result to the shell's while loop. You lose file boundaries and names with this approach, but it should be faster, since unzip runs once rather than once per file.
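
The concatenation property can be checked on its own, without unzip. This sketch assumes gzip and gunzip are installed; concat_members is a hypothetical helper that writes two independently compressed members back to back, standing in for what unzip -p would emit for a.dat.gz and b.dat.gz.

```shell
#!/usr/bin/env bash
# Emit two separate gzip members, one after the other, on standard output.
# The contents ("from a", "from b") are invented for this illustration.
concat_members() {
  printf 'from a\n' | gzip -c
  printf 'from b\n' | gzip -c
}

# gunzip treats the concatenated members as a single stream.
concat_members | gunzip -c | while IFS= read -r line; do
  printf '%s\n' "$line"
done
```

The loop prints "from a" followed by "from b", with nothing in the output marking where one member ended and the next began.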