This question is about parsing a very large unstructured log file with PowerShell 4.0, using a regular expression with a lookbehind assertion and the dotall (single-line) modifier.
A single record in the log spans several lines, each documenting one of various transaction attempts within a process. I want to split the log into discrete records, using a success message to identify where each record starts and ends. The success message marks the end of a record being processed. The line that follows is always the start of a new record.
Once the log is broken up into an array of discrete records, I will then more confidently be able to grab critical pieces of data from each record. That's the current logic, anyway - but I'm not concerned with this part of the process for now. I'll do that later.
A highly-simplified chunk of the log looks like this:
20151120 11:10:31 UPDATE ARI has value [].
20151120 11:10:31 ERROR returning from process_updid with invalid NICS query - no ARI code: []..
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 UPDATE Tag SSN has value [].
20151120 11:10:31 UPDATE Tag SOC has value [].
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 ONE This is some random text that I just made up.
20151120 11:10:31 TWO This is more random text that I just made up.
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 THREE This is additional random text that I just made up.
20151120 11:10:31 FOUR This is still more random text that I just made up.
The message line that signals the end of a process and the start of a new record looks like this:
20151120 11:10:31 INFO transaction processed successfully.
Everything after that line, up to the next success message, is a complete record.
The regular expression pattern that I have so far is:
(?<=\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.)(?s)(.+)
This pattern correctly identifies the first success message, but then includes subsequent success messages in that first record, and repeats the same record for a second match. The (.+) expression is grabbing too much. I tried an ungreedy (+?) quantifier, which gave no match, and a lookahead assertion to identify a stopping point at the next success message, again with no luck.
The full PowerShell code is:
Clear-Host
$s = @"
20151120 11:10:31 UPDATE ARI has value [].
20151120 11:10:31 ERROR returning from process_updid with invalid NICS query - no ARI code: []..
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 UPDATE Tag SSN has value [].
20151120 11:10:31 UPDATE Tag SOC has value [].
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 ONE This is some random text that I just made up.
20151120 11:10:31 TWO This is more random text that I just made up.
20151120 11:10:31 INFO transaction processed successfully.
20151120 11:10:31 THREE This is additional random text that I just made up.
20151120 11:10:31 FOUR This is still more random text that I just made up.
"@
$p = "(?<=\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.)(?s)(.+)"
$s | Select-String $p -AllMatches | Foreach {$_.Matches}
Thank you for any guidance.
Never mind the lookbehind, just use this:
(?:\d{8}\s\d{2}:\d{2}:\d{2}\s(?!INFO transaction processed successfully\.).+\n?)+
It matches one or more lines that don't match the pattern of a success message. If you're not sure how to approach a problem, lookbehind should never be the first tool you reach for. Usually it just makes the job more difficult. DOTALL/Singleline mode does too, to a lesser extent, plus it makes you more vulnerable to never-ending matches.
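As a rough sketch of how that pattern could be applied, the snippet below runs it over the sample from the question with [regex]::Matches. The variable names ($log, $pattern, $records) are only illustrative, and the sample is joined with "`n" so the line endings are predictable; this is one possible way to call it, not the only one.

```powershell
# Sample log from the question, joined with LF so the line endings are known.
$log = @(
    '20151120 11:10:31 UPDATE ARI has value [].'
    '20151120 11:10:31 ERROR returning from process_updid with invalid NICS query - no ARI code: []..'
    '20151120 11:10:31 INFO transaction processed successfully.'
    '20151120 11:10:31 UPDATE Tag SSN has value [].'
    '20151120 11:10:31 UPDATE Tag SOC has value [].'
    '20151120 11:10:31 INFO transaction processed successfully.'
    '20151120 11:10:31 ONE This is some random text that I just made up.'
    '20151120 11:10:31 TWO This is more random text that I just made up.'
    '20151120 11:10:31 INFO transaction processed successfully.'
    '20151120 11:10:31 THREE This is additional random text that I just made up.'
    '20151120 11:10:31 FOUR This is still more random text that I just made up.'
) -join "`n"

# Single quotes keep PowerShell from touching the regex.
# Each match is a run of consecutive lines that are NOT success messages.
$pattern = '(?:\d{8}\s\d{2}:\d{2}:\d{2}\s(?!INFO transaction processed successfully\.).+\n?)+'

# Collect each run as one record, trimming the trailing line break.
$records = @([regex]::Matches($log, $pattern) | ForEach-Object { $_.Value.TrimEnd() })

$records.Count   # 4 for this sample
$records[1]      # the two "UPDATE Tag" lines
```

Note that the lines before the first success message come back as a record too; whether to keep or discard that leading chunk depends on how the real log begins. In .NET the dot also matches a carriage return, so the same pattern should cope with CRLF line endings.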
Another option is to Split on a pattern that does match a success message:
\s*\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.\s*
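The split approach might look something like this with PowerShell's -split operator, which treats its right-hand side as a regular expression. Again the variable names are illustrative, and the Where-Object filter is an added precaution against empty pieces when the text begins or ends with a success message.

```powershell
# Sample log from the question, joined with LF so the line endings are known.
$log = @(
    '20151120 11:10:31 UPDATE ARI has value [].'
    '20151120 11:10:31 ERROR returning from process_updid with invalid NICS query - no ARI code: []..'
    '20151120 11:10:31 INFO transaction processed successfully.'
    '20151120 11:10:31 UPDATE Tag SSN has value [].'
    '20151120 11:10:31 UPDATE Tag SOC has value [].'
    '20151120 11:10:31 INFO transaction processed successfully.'
    '20151120 11:10:31 ONE This is some random text that I just made up.'
    '20151120 11:10:31 TWO This is more random text that I just made up.'
    '20151120 11:10:31 INFO transaction processed successfully.'
    '20151120 11:10:31 THREE This is additional random text that I just made up.'
    '20151120 11:10:31 FOUR This is still more random text that I just made up.'
) -join "`n"

# The success line, plus surrounding whitespace, acts as the separator.
$separator = '\s*\d{8}\s\d{2}:\d{2}:\d{2}\sINFO transaction processed successfully\.\s*'

# Split into records and drop any empty pieces.
$records = @($log -split $separator | Where-Object { $_ })

$records.Count   # 4 for this sample
$records[2]      # the "ONE" and "TWO" lines
```

For a real file, the text would need to be read as one string first, for example with Get-Content -Raw on a path of your own (available from PowerShell 3.0). That loads the whole file into memory, which may or may not be acceptable for a very large log.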