regex, pdfgrep, multiline

pdfgrep pattern to include/exclude linebreak


pdfgrep works like grep except that it acts on pages instead of lines. How can I craft a regular expression with a newline character?

I want to look for a, followed by any number of characters except linebreaks, followed by b, but pdfgrep 'a[^\n]*b' doesn't work, whereas pdfgrep 'a.*b' returns results that span multiple lines. (I've examined the output with xxd to confirm that these newlines are indeed \x0A.)


Solution

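  • By default, pdfgrep uses a POSIX-compliant regex flavor in which . matches any character, including line break characters. In POSIX bracket expressions a backslash is a literal character, so [^\n] most likely excludes a backslash and the letter n rather than a newline, which would explain why that pattern does not behave as expected.

    Fortunately, pdfgrep also supports the PCRE regex flavor via the -P (--perl-regexp) flag. In PCRE, . matches any character except line break characters by default.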

    Thus, you can use

    pdfgrep -P 'a.*b'
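
The difference between the two dot behaviors can be illustrated without a PDF. The following is only a rough sketch in Python, not pdfgrep itself: Python's re module is a separate engine, the re.DOTALL flag is used here to imitate the default behavior described above, and the page string is made-up sample text standing in for one PDF page.

```python
import re

# Made-up stand-in for the text of a single PDF page (assumption for illustration).
page = "a first line\nsecond line b\na same line b\n"

# With DOTALL, "." also matches "\n", imitating the default behavior
# described above: the match runs across line breaks.
spanning = re.findall(r"a.*b", page, flags=re.DOTALL)

# Without DOTALL, "." stops at "\n", as it does in PCRE (pdfgrep -P):
# only an a...b pair on one line matches.
single_line = re.findall(r"a.*b", page)

print(spanning)
print(single_line)
```

Here the same pattern yields one match covering all three lines in the first case, and only the third line in the second case.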