regex pcre pcregrep

Why does this multi-line regular expression include the following line?


I have the following input and I'd like to write a regular expression which would match every line except the first and last.

2019-03-13 00:33:44,846 [INFO] -:  foo
2019-03-13 00:33:45,096 [INFO] -:  Exception sending email
To:
[foo@bar.com, bar@bar.com]
CC:
[baz@bar.com]
Subject:
some subject
Body:
some

body
2019-03-13 00:33:45,190 [INFO] -:  bar

I thought the following would work, but it doesn't produce the output I expect:

pcregrep -M ".+Exception sending email[\S\s]+?(?=\d{4}(-\d\d){2})" ~/test.log

In plain English I would describe this as: match the line containing the exception text, then lazily match any characters (including newlines) up to the point where a positive lookahead finds a date.

For some reason this also includes the final line, even though it doesn't on regex101. What am I missing here?


Normally I would just use grep -A for something like this, but the problem is that the body can span an arbitrary number of lines.


Solution

  • It almost certainly has to do with the tool rather than the pattern. The pcregrep changelog states under "Version 8.12 15-Jan-2011":

    1. In pcregrep, when a pattern that ended with a literal newline sequence was matched in multiline mode, the following line was shown as part of the match. This seems wrong, so I have changed it.

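    Your match ends right after the newline that precedes the date, so it falls into the same case. A simple fix is to put a newline character inside the lookahead expression. The newline is then no longer consumed, the match ends on the last body line instead of at the start of the following line, and the final line is not shown: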

    pcregrep -M ".+Exception sending email[\S\s]+?(?=[\r\n]\d{4}(-\d\d){2})" ~/test.log
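
To see where each pattern stops, here is a minimal sketch that runs both patterns over the sample log. It uses Python's `re` module as a stand-in, on the assumption that it treats these particular constructs (lazy quantifier, `[\S\s]`, lookahead) the same way PCRE does. It does not reproduce how pcregrep prints lines; it only shows the match boundaries. The names `LOG`, `ORIGINAL`, `FIXED` and `matched_text` are made up for this illustration.

```python
import re

# Sample input from the question, one log record per line.
LOG = (
    "2019-03-13 00:33:44,846 [INFO] -:  foo\n"
    "2019-03-13 00:33:45,096 [INFO] -:  Exception sending email\n"
    "To:\n"
    "[foo@bar.com, bar@bar.com]\n"
    "CC:\n"
    "[baz@bar.com]\n"
    "Subject:\n"
    "some subject\n"
    "Body:\n"
    "some\n"
    "\n"
    "body\n"
    "2019-03-13 00:33:45,190 [INFO] -:  bar\n"
)

# Pattern from the question: the lookahead starts at the date itself,
# so the newline before the date is consumed by the match.
ORIGINAL = r".+Exception sending email[\S\s]+?(?=\d{4}(-\d\d){2})"

# Pattern from the answer: the newline is moved into the lookahead,
# so it is no longer part of the match.
FIXED = r".+Exception sending email[\S\s]+?(?=[\r\n]\d{4}(-\d\d){2})"


def matched_text(pattern, text=LOG):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


original = matched_text(ORIGINAL)
fixed = matched_text(FIXED)

# Show the tail of each match so the trailing newline is visible.
print(repr(original[-6:]))
print(repr(fixed[-6:]))
```

The only difference between the two results is the trailing newline: the original match ends at the start of the final log line, while the fixed match ends on the `body` line. That is consistent with the changelog entry about matches ending in a newline.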