Unzip to pipe and then run PDF info on the files in the stream

I want to unzip a LOT of files, run pdfinfo to get the page count for each file, and then sum those page counts.

I came across a command that will sum the page counts of all the PDFs in a directory.

find . -name \*.pdf -exec pdfinfo {} \; | grep Pages | sed -e "s/Pages:\s*//g" | awk '{ sum += $1;} END { print sum; }'

I then thought I could pipe the output of unzip -p into it:

unzip -p '*.zip' | find . -name \*.pdf -exec pdfinfo {} \; | grep Pages | sed -e "s/Pages:\s*//g" | awk '{ sum += $1;} END { print sum; }'

However, it's not working as I expected. I suspect that my unzip stream is interacting poorly with the find.

Any thoughts?

Solution

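The pipeline in the question fails because find never reads its standard input: it just walks the current directory again, so the stream from unzip -p is never read. Here is a way to do it that doesn't write anything to the filesystem. This code should work even if some of the filenames in the zip files contain embedded whitespace. The code assumes that filenames ending in "pdf" are valid PDF files.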

This is the test zip file I'm going to use. Note that the first filename in the zip archive, "zlib 3.pdf", contains a space character.

$ unzip -l aaa.zip 
Archive:  aaa.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    19318  2018-02-19 22:49   zlib 3.pdf
   442780  2018-02-28 15:32   file2.pdf
      757  2018-02-28 15:22   try.sh
---------                     -------
   462855                     3 files

It turns out that pdfinfo can read from stdin when it is given - as the filename, so the commands below show how to get the number of pages from a PDF stored in a zip without writing anything to disk.

$ unzip -p aaa.zip file2.pdf | pdfinfo - | grep Pages
Pages:          94

$ unzip -p aaa.zip "zlib 3.pdf" | pdfinfo - | grep Pages
Pages:          2

For this to work though, you need to know the names of the PDF files stored in the zip archive.
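
As an aside, the grep step can also extract the number itself in one go. This is only a minimal sketch, assuming pdfinfo prints the count on a line of the form "Pages:  N" as in the output above; the function name page_count is made up for illustration.

```shell
# page_count is a hypothetical helper name, not part of pdfinfo or unzip.
# It reads pdfinfo-style output on stdin and prints only the number
# that follows "Pages:".
page_count() {
    awk '/^Pages:/ { print $2 }'
}

# Intended use (requires unzip and pdfinfo):
#   unzip -p aaa.zip file2.pdf | pdfinfo - | page_count
```

Anchoring the match at the start of the line avoids picking up other lines that happen to contain the word "Pages".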

The next step, then, is to get a list of the PDF files and the names of the zip files they are stored in. That's what this code does:

for zip in *.zip ; do
    echo "$zip"
    # IFS= and -r keep leading/trailing whitespace and backslashes in names intact
    zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf
    do
        echo "  '$pdf'"
    done
done

That outputs this for me:

aaa.zip
  'zlib 3.pdf'
  'file2.pdf'

Finally, add the call to pdfinfo and the awk snippet from the question to work out the total number of pages.

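for zip in *.zip ; do
    # IFS= and -r keep whitespace and backslashes in names intact
    zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf
    do
        # [[:space:]] is the portable spelling of the GNU-only \s
        unzip -p "$zip" "$pdf" | pdfinfo - | grep Pages | sed -e 's/Pages:[[:space:]]*//'
    done
done | awk '{ sum += $1 } END { print sum }'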

That outputs 96 for my test zip file.
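
If this is needed more than once, the loop could be wrapped in a function. The following is only a sketch under the same assumptions as above (zipinfo, unzip and pdfinfo are installed, and names ending in "pdf" are valid PDFs); the name total_pages is hypothetical, and the grep and sed steps are folded into the final awk.

```shell
# total_pages is a hypothetical wrapper name. It prints the combined page
# count of every PDF inside the zip files passed as arguments.
total_pages() {
    for zip in "$@"; do
        # Skip arguments that are not files, e.g. a literal *.zip
        # left over when the glob matched nothing.
        [ -f "$zip" ] || continue
        zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf; do
            unzip -p "$zip" "$pdf" | pdfinfo -
        done
    done | awk '/^Pages:/ { sum += $2 } END { print sum + 0 }'
}

# Intended use:
#   total_pages *.zip
```

The "sum + 0" in the END block makes the function print 0 rather than an empty line when no PDFs are found.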