linux pdf full-text-search grep debian

How to search contents of multiple pdf files?


How could I search the contents of PDF files in a directory/subdirectory? I am looking for some command line tools. It seems that grep can't search PDF files.


Solution

  • Your distribution should provide a utility called pdftotext:

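    find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;
    

    The "-" is necessary to make pdftotext write to stdout instead of to a file. The --with-filename and --label= options put the file name in the output of grep; without them, grep would report every match as coming from standard input. The optional --color flag tells grep to highlight matches on the terminal. The file name is passed to the inner shell as "$1" rather than pasted into the script as "{}", so names containing quotes, "$" or backticks are not re-interpreted as shell code.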

    (In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)

    This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep 1.3.x supports the -C option for printing lines of context.
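
If you run this search often, the same pdftotext and grep pipeline could be wrapped in a small shell function, so that any extra GNU grep options can be passed through. This is only a sketch: the name pdfsearch and its argument order are made up for illustration, and it assumes pdftotext and GNU grep are installed.

```shell
#!/bin/sh
# pdfsearch DIR GREP_ARGS...
# Hypothetical helper (not a standard tool): searches the text of every PDF
# below DIR and hands all remaining arguments to grep unchanged.
# Limitation: find itself interprets a lone ";" argument and the string "{}",
# so avoid those in GREP_ARGS.
pdfsearch() {
    dir=$1
    shift
    # One inner shell per PDF: "$1" is the file name, the rest are grep arguments.
    find "$dir" -type f -name '*.pdf' -exec sh -c '
        pdf=$1
        shift
        # "-" makes pdftotext write to stdout; --label tells grep which file
        # the piped text came from.
        pdftotext "$pdf" - |
            grep --with-filename --label="$pdf" "$@"
    ' sh {} "$@" \;
}
```

For example, pdfsearch /path -i -C 2 -e 'your pattern' would run a case-insensitive search and print two lines of context around each match.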