Tags: linux, unix, scripting, xargs, pdftotext

How to combine xargs with the pdftotext converter to search inside multiple PDF files


I am writing a script that is supposed to search inside all the PDF files in a directory. I have found a converter named "pdftotext" which lets me use grep on PDF files, but I am only able to run it on one file. When I try to run it over all the files in a directory, it fails. Any suggestions?

This works for a single file:

pdftotext my_file.pdf - | grep 'hot'

This fails when finding the PDF files, converting them to text, and grepping the result:

SHELL PROMPT>find ~/.personal/tips -type f -iname "*" | grep -i "*.pdf" | xargs pdftotext |grep admin
pdftotext version 3.00
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -cfg <string>     : configuration file to use in place of .xpdfrc
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
SHELL PROMPT 139>

Solution

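  • xargs is the wrong tool for this job: find does everything you need built in. (The pipeline in the question has two problems: grep -i "*.pdf" treats its argument as a regular expression rather than a shell glob, and xargs without -n passes many filenames to one pdftotext invocation, which accepts only a single PDF file and an optional output file.)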

    find ~/.personal/tips \
        -type f \
        -iname "*.pdf" \
        -exec pdftotext '{}' - ';' \
      | grep hot
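
One side effect of piping everything into a single grep is that the output no longer says which PDF a match came from. A minimal sketch of one way around that, assuming pdftotext is on PATH and that grep supports --label (GNU grep does). The helper name search_pdfs is hypothetical and only used for illustration:

```shell
# search_pdfs DIR PATTERN
# Hypothetical helper: prints "file:matching line" for each PDF under DIR.
# Assumes pdftotext is on PATH and grep supports --label (GNU grep does).
search_pdfs() {
    find "$1" -type f -iname '*.pdf' -exec sh -c '
        pattern=$1; shift
        for pdf in "$@"; do
            # Convert to stdout, then tag each match with the PDF name.
            pdftotext "$pdf" - | grep -H --label="$pdf" -e "$pattern"
        done
    ' sh "$2" {} +
}
```

Called as search_pdfs ~/.personal/tips hot, this runs grep once per file, so each match can be labelled. The exit status reflects the last grep, so it is not a reliable "found anything" signal.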
    

    That said, if you did want to use xargs for some reason, correct usage would look something like...

    find ~/.personal/tips \
        -type f \
        -iname "*.pdf" \
        -print0 \
      | xargs -0 -J % -n 1 pdftotext % - \
      | grep hot
    

    Note that:

    • The find command uses -print0 to NUL-delimit its output
    • The xargs command uses -0 to read NUL-delimited input (which also turns off some behavior that would otherwise mishandle filenames containing whitespace, literal quote characters, etc.).
    • The xargs command uses -n 1 to call pdftotext once per file
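    • The xargs command uses -J % to specify a sigil for where the replacement should happen, and uses that % in the pdftotext command line appropriately. Note that -J is a BSD xargs option (macOS, FreeBSD); GNU xargs does not provide it, and -I % in place of -J % -n 1 fills the same role there.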
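
For systems with GNU xargs, the same pipeline can be sketched with -I instead of -J. This is an adaptation rather than the command given above; the helper name search_pdfs_xargs is hypothetical, and pdftotext is assumed to be on PATH:

```shell
# search_pdfs_xargs DIR PATTERN
# Hypothetical helper: the find | xargs | grep pipeline using -I.
# -print0 / -0 : pass filenames NUL-delimited, so spaces and quotes survive.
# -I %         : run pdftotext once per filename, substituting it for %.
search_pdfs_xargs() {
    find "$1" -type f -iname '*.pdf' -print0 |
        xargs -0 -I % pdftotext % - |
        grep -e "$2"
}
```

Usage would be search_pdfs_xargs ~/.personal/tips hot. As with the single-grep find version, matches are not labelled with the file they came from.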