
Parsing a log into record blocks with PowerShell RegEx DOTALL and a lookbehind assertion


This question is about parsing a very large unstructured log file in PowerShell 4.0, using a regular expression with a lookbehind assertion and the DOTALL (Singleline) modifier.

A single record in the log spans several lines, each documenting a transaction attempt within one process. I want to split the log into discrete records, using the success message as the boundary. The success message marks the end of a record being processed. The line that follows is always the start of a new record.

Once the log is broken up into an array of discrete records, I will then more confidently be able to grab critical pieces of data from each record. That's the current logic, anyway - but I'm not concerned with this part of the process for now. I'll do that later.

A highly-simplified chunk of the log looks like this:

20151120 11:10:31 UPDATE ARI has value [].
20151120 11:10:31 ERROR returning from process_updid with invalid NICS query - no ARI code: []..
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 UPDATE Tag SSN has value [].
20151120 11:10:31 UPDATE Tag SOC has value [].

20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 ONE This is some random text that I just made up.
20151120 11:10:31 TWO This is more random text that I just made up.

20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 THREE This is additional random text that I just made up.
20151120 11:10:31 FOUR This is still more random text that I just made up.

The line that signals the end of a process and the start of a new record looks like this:

20151120 11:10:31 INFO transaction processed successfully.

Everything after that line, up to the next success message, is a complete record.

The regular expression pattern that I have so far is:

(?<=\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.)(?s)(.+)

This pattern correctly identifies the first success message, but then includes the subsequent success messages in that first record, and repeats the same record for a second match. The (.+) expression is grabbing too much: with (?s) in effect, the dot also matches newlines, so the greedy quantifier runs on to the end of the text. I tried an ungreedy quantifier (+?), with no match, as well as a lookahead assertion to identify a stopping point at the next success message. Again, no joy.

The full PowerShell code is:

Clear-Host

$s = @"
20151120 11:10:31 UPDATE ARI has value [].
20151120 11:10:31 ERROR returning from process_updid with invalid NICS query - no ARI code: []..
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 UPDATE Tag SSN has value [].
20151120 11:10:31 UPDATE Tag SOC has value [].

20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 ONE This is some random text that I just made up.
20151120 11:10:31 TWO This is more random text that I just made up.

20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 THREE This is additional random text that I just made up.
20151120 11:10:31 FOUR This is still more random text that I just made up.
"@

$p = "(?<=\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.)(?s)(.+)"

$s | Select-String $p -AllMatches | Foreach {$_.Matches}

Thank you for any guidance.


Solution

  • Never mind the lookbehind; just use this:

    (?:\d{8}\s\d{2}:\d{2}:\d{2}\s(?!INFO transaction processed successfully\.).+\n?)+
    


    It matches one or more consecutive lines that start with a timestamp but are not success messages. If you're not sure how to approach a problem, lookbehind should never be the first tool you reach for; usually it just makes the job more difficult. DOTALL/Singleline mode does too, to a lesser extent, and it makes you more vulnerable to matches that never stop where you intended, like the (.+) in the question.
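
As a rough illustration of how that pattern might be applied, here is a minimal sketch using the .NET [regex]::Matches method. The sample text is a shortened version of the log in the question, and the variable names ($log, $pattern, $records) are just placeholders. Note that any lines before the first success message come back as a match too, so you may want to discard the first item.

```powershell
# Shortened sample of the log from the question.
$log = @"
20151120 11:10:31 UPDATE ARI has value [].
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 UPDATE Tag SSN has value [].
20151120 11:10:31 UPDATE Tag SOC has value [].

20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 ONE This is some random text that I just made up.
"@

# Single-quoted so PowerShell passes the pattern to the regex engine untouched.
# Each repetition of the group consumes one timestamped line that is NOT a
# success message; the negative lookahead rejects the success lines.
$pattern = '(?:\d{8}\s\d{2}:\d{2}:\d{2}\s(?!INFO transaction processed successfully\.).+\n?)+'

# Every match is one run of consecutive non-success lines.
# Trim() removes the trailing line break captured with the last line.
$records = @([regex]::Matches($log, $pattern) | ForEach-Object { $_.Value.Trim() })

# Show each record with an index so the boundaries are easy to see.
for ($i = 0; $i -lt $records.Count; $i++) {
    "--- record $i ---"
    $records[$i]
}
```

Because the dot never crosses a line feed here, no Singleline option is needed, and a blank line simply ends the current run.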

    Another option is to split on a pattern that does match a success message:

    \s*\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.\s*
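
The split approach could look something like the sketch below, which uses the .NET [regex]::Split method on a shortened copy of the question's sample. The variable names ($log, $delimiter, $records) are illustrative only, and filtering out empty pieces is my own addition to cover a log that begins or ends with a success message.

```powershell
# Shortened sample of the log from the question.
$log = @"
20151120 11:10:31 UPDATE ARI has value [].
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 UPDATE Tag SSN has value [].
20151120 11:10:31 UPDATE Tag SOC has value [].

20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 ONE This is some random text that I just made up.
"@

# The success message, plus any surrounding whitespace (including the line
# breaks and blank lines around it), acts as the delimiter.
$delimiter = '\s*\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.\s*'

# Split returns the text between delimiter matches; drop empty pieces that
# appear when the text starts or ends with a success message.
$records = @([regex]::Split($log, $delimiter) | Where-Object { $_ -ne '' })

# Show each record with an index so the boundaries are easy to see.
for ($i = 0; $i -lt $records.Count; $i++) {
    "--- record $i ---"
    $records[$i]
}
```

For a real log, Get-Content with the -Raw switch (available since PowerShell 3.0) would supply the file as one string, though that reads the whole file into memory, which may matter for a very large log.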