I want to unzip a LOT of files, run pdfinfo to get the page count for each file, and then sum those page counts.
I came across a command that will sum the page counts of all PDFs in a directory.
find . -name \*.pdf -exec pdfinfo {} \; | grep Pages | sed -e "s/Pages:\s*//g" | awk '{ sum += $1;} END { print sum; }'
I then thought to pipe the output of unzip -p into that:
unzip -p '*.zip' | find . -name \*.pdf -exec pdfinfo {} \; | grep Pages | sed -e "s/Pages:\s*//g" | awk '{ sum += $1;} END { print sum; }'
However it's not working as I expect it to. I suspect that my unzip stream is interacting poorly with the find.
Any Thoughts?
Here is a way to do it that doesn't write anything to the filesystem. This code should work even if some of the filenames in the zip files contain embedded whitespace. The code assumes that filenames ending in "pdf" are valid PDF files.
This is the test zip file I'm going to use. Note that the first filename in the zip archive, "zlib 3.pdf", contains a space character.
$ unzip -l aaa.zip
Archive:  aaa.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    19318  2018-02-19 22:49   zlib 3.pdf
   442780  2018-02-28 15:32   file2.pdf
      757  2018-02-28 15:22   try.sh
---------                     -------
   462855                     3 files
It turns out that pdfinfo can read from stdin, so the commands below show how to get the number of pages from a PDF stored in a zip without writing anything to disk.
$ unzip -p aaa.zip file2.pdf | pdfinfo - | grep Pages
Pages: 94
$ unzip -p aaa.zip "zlib 3.pdf" | pdfinfo - | grep Pages
Pages: 2
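One caveat with `grep Pages` is that it matches any line containing that word, for example a Title field. Here is a minimal sketch of a stricter filter, assuming pdfinfo prints the count on a line that starts with `Pages:` as in the output above. The function name `pages_from_pdfinfo` is made up for illustration.

```shell
# Hypothetical helper: read pdfinfo output on stdin, print only the page count.
# Anchoring on ^Pages: skips other fields that merely contain the word "Pages".
pages_from_pdfinfo() {
    awk '/^Pages:/ { print $2 }'
}

# Example use (assumes aaa.zip from above is in the current directory):
#   unzip -p aaa.zip "zlib 3.pdf" | pdfinfo - | pages_from_pdfinfo
```

This also replaces the grep and sed pair with a single awk call, which may be easier to read.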
For this to work though, you need to know the names of the PDF files stored in the zip archive.
The next step, then, is to get a list of the PDF files and the names of the zip files they are stored in. That's what this code does:
for zip in *.zip ; do
    echo "$zip"
    # IFS= and -r keep leading/trailing whitespace and backslashes in names intact
    zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf
    do
        echo " '$pdf'"
    done
done
That outputs this for me
aaa.zip
'zlib 3.pdf'
'file2.pdf'
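The `grep 'pdf$'` filter is case-sensitive and would also accept a name such as "notapdf". If that matters for your archives, a slightly stricter filter could look like the sketch below. The function name `pdf_names` is hypothetical, and it still assumes that the file extension is a reliable indicator of a PDF.

```shell
# Hypothetical helper: keep only names that end in ".pdf", ignoring case.
# Reads one filename per line on stdin, as produced by `zipinfo -1`.
pdf_names() {
    grep -i '\.pdf$'
}

# Example use (assumes aaa.zip from above is in the current directory):
#   zipinfo -1 aaa.zip | pdf_names
```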
Finally, add the code to call pdfinfo and the awk snippet to work out the total number of pages.
for zip in *.zip ; do
    zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf
    do
        # stream one PDF to pdfinfo via stdin and keep only the page number
        unzip -p "$zip" "$pdf" | pdfinfo - | grep Pages | sed -e 's/Pages:[[:space:]]*//'
    done
done | awk '{ sum += $1;} END { print sum; }'
That outputs 96 for my test zip file.
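One small edge case: if no PDFs are found, the awk snippet prints an empty line rather than 0, because `sum` is never assigned. A minimal sketch of a summing step that always prints a number is below. The function name `sum_pages` is made up for illustration.

```shell
# Hypothetical helper: sum page counts read from stdin, one number per line.
# "sum + 0" forces a numeric result, so empty input prints 0 instead of a blank line.
sum_pages() {
    awk '{ sum += $1 } END { print sum + 0 }'
}

# Example use: replace the final awk stage of the loop above, e.g.
#   for zip in *.zip ; do ... ; done | sum_pages
```

Note also that unzip treats the member name as a wildcard pattern, so names containing characters such as `[` or `*` may need extra care.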