Tags: bash, pdf, command-line, find, unzip

Unzip to pipe and then run PDF info on the files in the stream


I want to unzip a LOT of files, run pdfinfo to get the page count for each file, and then sum those page counts.

I came across a command that sums the page counts of all PDFs in a directory.

find . -name \*.pdf -exec pdfinfo {} \; | grep Pages | sed -e "s/Pages:\s*//g" | awk '{ sum += $1;} END { print sum; }'

I then thought to pipe the output of unzip -p into it:

unzip -p '*.zip' | find . -name \*.pdf -exec pdfinfo {} \; | grep Pages | sed -e "s/Pages:\s*//g" | awk '{ sum += $1;} END { print sum; }'

However, it's not working as I expect. I suspect that the unzip stream is interacting poorly with find.

Any thoughts?


Solution

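  • Here is a way to do it that doesn't write anything to the filesystem. The pipeline in the question fails because find ignores its standard input, and unzip -p writes the contents of every archived file to stdout as one concatenated stream. The code below should work even if filenames in the zip files contain embedded whitespace. It assumes that filenames ending in "pdf" are valid PDF files.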

    This is the test zip file I'm going to use. Note that the first filename in the zip archive, "zlib 3.pdf", contains a space character.

    $ unzip -l aaa.zip 
    Archive:  aaa.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
        19318  2018-02-19 22:49   zlib 3.pdf
       442780  2018-02-28 15:32   file2.pdf
          757  2018-02-28 15:22   try.sh
    ---------                     -------
       462855                     3 files
    

    It turns out that pdfinfo can read from stdin when given - as the filename, so the commands below show how to get the number of pages from a PDF stored in a zip without writing anything to disk.

    $ unzip -p aaa.zip file2.pdf | pdfinfo - | grep Pages
    Pages:          94
    
    $ unzip -p aaa.zip "zlib 3.pdf" | pdfinfo - | grep Pages
    Pages:          2
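
    One caveat with grep Pages: it keeps any line that contains the word, so a Title line such as "Yellow Pages" would slip through. Here is a minimal sketch of a stricter filter. It assumes pdfinfo prints the count on a line whose first field is "Pages:", as in the output above, and pages_from_info is just a name I made up for illustration.

```shell
# pages_from_info: hypothetical helper. Reads pdfinfo-style text on stdin
# and prints the second field of every line whose first field is "Pages:".
pages_from_info() {
    awk '$1 == "Pages:" { print $2 }'
}

# Usage (needs unzip and pdfinfo installed):
#   unzip -p aaa.zip file2.pdf | pdfinfo - | pages_from_info
```

    Because awk splits on runs of whitespace, this also replaces the sed step that strips the "Pages:" label.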
    

    For this to work, though, you need to know the names of the PDF files stored in the zip archive.

    The next step, then, is to get a list of the PDF files and the names of the zip files they are stored in. That's what this code does:

    for zip in *.zip ; do
        echo "$zip"
        # IFS= and -r keep leading spaces and backslashes in names intact
        zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf
        do
            echo "  '$pdf'"
        done
    done
    

    That outputs this for me:

    aaa.zip
      'zlib 3.pdf'
      'file2.pdf'
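
    Note that grep 'pdf$' also matches a name such as "notapdf" and skips "REPORT.PDF". If that matters, here is a small sketch of a stricter filter. pdf_names is a made-up helper name, and the only assumption is that member names arrive one per line, which is how zipinfo -1 prints them.

```shell
# pdf_names: hypothetical helper. Reads archive member names on stdin,
# one per line, and keeps only those with a .pdf extension (any case).
pdf_names() {
    grep -i '\.pdf$'
}

# Usage (needs zipinfo installed):
#   zipinfo -1 aaa.zip | pdf_names
```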
    

    Finally, add the call to pdfinfo and the awk snippet that totals the page counts.

    for zip in *.zip ; do
        zipinfo -1 "$zip" | grep 'pdf$' | while IFS= read -r pdf
        do
            # Anchor on ^Pages: so a Title line containing "Pages" is not counted
            unzip -p "$zip" "$pdf" | pdfinfo - | grep '^Pages:' | sed -e 's/Pages:[[:space:]]*//'
        done
    done | awk '{ sum += $1 } END { print sum }'
    

    That outputs 96 for my test zip file.
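
    If you want to reuse this, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: count_zip_pdf_pages is a hypothetical name, the zip files are passed as arguments, and only members ending in .pdf (any case) are counted. It prints 0 rather than an empty line when nothing is found.

```shell
# count_zip_pdf_pages: hypothetical wrapper around the loop above.
# Prints the total page count of all .pdf members in the given zip files.
count_zip_pdf_pages() {
    local zip pdf
    for zip in "$@"; do
        # Skip arguments that are not regular files, e.g. an unmatched *.zip glob
        [ -f "$zip" ] || continue
        # IFS= and -r keep leading spaces and backslashes in names intact
        zipinfo -1 "$zip" | grep -i '\.pdf$' | while IFS= read -r pdf; do
            # </dev/null stops unzip from reading the name list on the loop's stdin
            unzip -p "$zip" "$pdf" </dev/null | pdfinfo - |
                awk '$1 == "Pages:" { print $2 }'
        done
    done | awk '{ sum += $1 } END { print sum + 0 }'
}

# Usage (needs zipinfo, unzip and pdfinfo installed):
#   count_zip_pdf_pages *.zip
```

    One thing this does not handle: unzip treats the member name as a wildcard pattern, so names containing characters like [ or * may need escaping.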