regex, logging, regex-lookarounds, flat-file

multiline regex with lookahead


I'm currently trying to parse a log file with a regex. Each log entry begins with a timestamp followed by an arbitrary multiline message, which can contain multiple newlines, carriage returns, and any other characters.

The regex should capture the timestamp and then the log message, up to the next timestamp. At the moment I do this with a positive lookahead for the next timestamp.

On the website regex101 the pattern works more or less, but in our security event manager the same regex doesn't work. I need to save every event with the timestamp as the first capturing group and the log message as the second capturing group.

(\w{3}\s{1}\w{3}\s{1}\d{2}\s{1}\d{2}\:\d{2}\:\d{2}\s{1}\d{4})((\r||.|\n)*)(?=(\w{3}\s{1}\w{3}\s{1}\d{2}\s{1}\d{2}\:\d{2}\:\d{2}\s{1}\d{4}))

Example log:

Tue Sep 14 08:57:47 2021
Thread 1 advanced to log sequence 186 (LGWR switch)
Current log# 2 seq# 186 mem# 0: D:\ORADB\DV1\REDO02A.LOG
Current log# 2 seq# 186 mem# 1: H:\ORADB\DV1\REDO02B.LOG
Tue Sep 14 09:07:40 2021
Thread 1 advanced to log sequence 187 (LGWR switch)
Current log# 3 seq# 187 mem# 0: D:\ORADB\DV1\REDO03A.LOG
Current log# 3 seq# 187 mem# 1: H:\ORADB\DV1\REDO03B.LOG
Tue Sep 14 09:22:09 2021
Thread 1 advanced to log sequence 188 (LGWR switch)
Current log# 4 seq# 188 mem# 0: D:\ORADB\DV1\REDO04A.LOG
Current log# 4 seq# 188 mem# 1: H:\ORADB\DV1\REDO04B.LOG

regex101

Btw, the regex only works when I include the \r||.|\n "or null" part, which I don't understand at all.


Solution

  • You can use [\s\S]* to match any character, including newlines: \s matches whitespace (which covers \r and \n) and \S matches everything else, so together they cover every character. To stop the match from spanning the whole text, make it non-greedy by appending ?, which gives [\s\S]*?. Try this pattern:

    (\w{3}\s\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2}\s\d{4})([\s\S]*?)(?=\w{3}\s\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2}\s\d{4}|\Z)
    


    Where:

    • ( - Start of 1st capturing group
      • \w{3}\s\w{3}\s\d{2}\s - Match Tue Sep 14
      • \d{2}:\d{2}:\d{2}\s\d{4} - Match 08:57:47 2021
    • ) - End of 1st capturing group
    • ( - Start of 2nd capturing group
      • [\s\S]*? - Match any characters, including newlines. The match is non-greedy, so it consumes as few characters as possible.
    • ) - End of 2nd capturing group
    • (?= - Start of look ahead assertion
      • \w{3}\s\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2}\s\d{4} - The next part must be a timestamp (the same pattern that matches the timestamp at the start of this regex).
      • | - Or
      • \Z - The next part must be the end of the string
    • ) - End of lookahead assertion. Because the pattern before it is non-greedy, the match stops at the closest timestamp, which is always the next one.
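
As a rough illustration of the pattern above, here is a minimal Python sketch that splits a log into (timestamp, message) pairs. The names TIMESTAMP, EVENT and split_events are made up for this example, and the sample text is a shortened version of the log from the question. It assumes Python's re module; the regex engine in your security event manager may behave differently (for instance, \Z also matches before a trailing newline in some engines), so treat this as a way to check the idea rather than a drop-in fix.

```python
import re

# Timestamp pattern from the answer, e.g. "Tue Sep 14 08:57:47 2021".
TIMESTAMP = r"\w{3}\s\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2}\s\d{4}"

# Group 1: timestamp. Group 2: lazy "anything, including newlines".
# The lookahead stops group 2 at the next timestamp or at end of string.
EVENT = re.compile("(" + TIMESTAMP + r")([\s\S]*?)(?=" + TIMESTAMP + r"|\Z)")

# Shortened sample based on the log in the question (assumed line layout).
sample_log = (
    "Tue Sep 14 08:57:47 2021\n"
    "Thread 1 advanced to log sequence 186 (LGWR switch)\n"
    "Current log# 2 seq# 186 mem# 0: D:\\ORADB\\DV1\\REDO02A.LOG\n"
    "Tue Sep 14 09:07:40 2021\n"
    "Thread 1 advanced to log sequence 187 (LGWR switch)\n"
    "Current log# 3 seq# 187 mem# 0: D:\\ORADB\\DV1\\REDO03A.LOG\n"
)


def split_events(text):
    """Return a list of (timestamp, message) tuples, one per log event."""
    return [(m.group(1), m.group(2).strip()) for m in EVENT.finditer(text)]


events = split_events(sample_log)
for timestamp, message in events:
    # Print the timestamp and the first line of each message.
    print(timestamp, "->", message.splitlines()[0])
```

Each tuple holds the first capturing group (the timestamp) and the second capturing group (the message), which is the layout the question asks for. The strip() call only trims the surrounding newlines and can be dropped if the raw text is needed.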