
PowerShell Extract text between two strings with -Tail and -Wait


I have a text file with a large number of log messages. I want to extract the messages between two string patterns. I want the extracted message to appear as it is in the text file.

I tried the following approach. It works, but it doesn't support Get-Content's -Wait and -Tail options. Also, the extracted result is displayed on a single line rather than with the line breaks it has in the text file. Input is welcome :-)

Sample Code

function GetTextBetweenTwoStrings($startPattern, $endPattern, $filePath){

    # Get content from the input file
    $fileContent = Get-Content $filePath

    # Regular expression (Regex) of the given start and end patterns
    $pattern = "$startPattern(.*?)$endPattern"

    # Perform the Regex operation
    $result = [regex]::Match($fileContent,$pattern).Value

    # Finally return the result to the caller
    return $result
}

# Clear the screen
Clear-Host

$input = "THE-LOG-FILE.log"
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'

# Call the function
GetTextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $input

Improved script based on Theo's answer. The following points need to be improved:

  1. The beginning and end of the output are somehow trimmed, even though I adjusted the buffer size in the script.
  2. How can I wrap each matched result in the START and END strings?
  3. I still could not figure out how to use the -Wait and -Tail options.

Updated Script

# Clear the screen
Clear-Host

# Adjust the buffer size of the window
$bw = 10000
$bh = 300000
if ($host.name -eq 'ConsoleHost') # or -notmatch 'ISE'
{
  [console]::bufferwidth = $bw
  [console]::bufferheight = $bh
}
else
{
    $pshost = get-host
    $pswindow = $pshost.ui.rawui
    $newsize = $pswindow.buffersize
    $newsize.height = $bh
    $newsize.width = $bw
    $pswindow.buffersize = $newsize
}


function Get-TextBetweenTwoStrings ([string]$startPattern, [string]$endPattern, [string]$filePath){
    # Get content from the input file
    $fileContent = Get-Content -Path $filePath -Raw
    # Regular expression (Regex) of the given start and end patterns
    $pattern = '(?is){0}(.*?){1}' -f [regex]::Escape($startPattern), [regex]::Escape($endPattern)
    # Perform the Regex operation and output
    [regex]::Match($fileContent,$pattern).Groups[1].Value
}

# Input file path
$inputFile = "THE-LOG-FILE.log"

# The patterns
$startPattern = 'START-OF-PATTERN'
$endPattern = 'END-OF-PATTERN'


Get-TextBetweenTwoStrings -startPattern $startPattern -endPattern $endPattern -filePath $inputFile

Solution

    • You need to perform streaming processing of your Get-Content call, in a pipeline, such as with ForEach-Object, if you want to process lines as they're being read.

      • This is a must if you're using Get-Content -Wait, because such a call doesn't terminate by itself (it keeps waiting for new lines to be added to the file, indefinitely), but inside a pipeline its output can be processed as it is being received, even before the command terminates.
    • You're trying to match across multiple lines, which with Get-Content output would only work if you used the -Raw switch - by default, Get-Content reads its input file(s) line by line.

      • However, -Raw is incompatible with -Wait.
      • Therefore, you must stick with line-by-line processing, which requires that you match the start and end patterns separately, and keep track of when you're processing lines between those two patterns.
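To see why the first attempt in the question collapsed everything onto one line, it may help to compare what Get-Content returns with and without -Raw. The following is only a minimal sketch: the temporary file and its sample contents are made up for illustration.

```powershell
# Create a throwaway sample file (hypothetical contents, for illustration only).
$tmp = New-TemporaryFile
Set-Content -LiteralPath $tmp.FullName -Value 'noise', 'START-OF-PATTERN', 'payload', 'END-OF-PATTERN'

# Default behavior: one string per line, so the result is an array.
$lines = Get-Content -LiteralPath $tmp.FullName
$lineCount = $lines.Count

# Converting that array to a single [string] (which is what happens when it
# is passed to [regex]::Match()) joins the elements with $OFS, a space by
# default, so the original line breaks are lost.
$joined = [string] $lines

# With -Raw: the whole file as one string, line breaks preserved.
$raw = Get-Content -LiteralPath $tmp.FullName -Raw
$rawHasNewline = $raw.Contains("`n")

# Clean up the sample file.
Remove-Item -LiteralPath $tmp.FullName
```

Since -Raw cannot be combined with -Wait, the array-of-lines form is the one a tailing solution has to work with.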

    Here's a proof of concept, but note the following:

    • -Tail 100 is hard-coded - adjust as needed or make it another parameter.

    • The use of -Wait means that the function will run indefinitely - waiting for new lines to be added to $filePath - so you'll need to use Ctrl-C to stop it.

      • While you can use a Get-TextBetweenTwoStrings call itself in a pipeline for object-by-object processing, assigning its result to a variable ($result = ...) won't work when terminating with Ctrl-C, because this method of termination also aborts the assignment operation.

      • To work around this limitation, the function below is defined as an advanced function, which automatically enables support for the common -OutVariable parameter, which is populated even in the event of termination with Ctrl-C; your sample call would then look as follows (as Theo notes, don't use the automatic $input variable as a custom variable):

        # Look for blocks of interest in the input file, indefinitely,
        # and output them as they're being found.
        # After termination with Ctrl-C, $result will also contain the blocks
        # found, if any.
        Get-TextBetweenTwoStrings -OutVariable result -startPattern $startPattern -endPattern $endPattern -filePath $inputFile
        
    • Per your feedback, you want the block of lines to encompass the full lines on which the start and end patterns match, so the regexes below are enclosed in .* on both sides.

    • The word pattern in your $startPattern and $endPattern parameters is a bit ambiguous in that it suggests that they themselves are regexes that can therefore be used as-is or embedded as-is in a larger regex on the RHS of the -match operator.
      However, in the solution below I am assuming that they are to be treated as literal strings, which is why they are escaped with [regex]::Escape(). If these parameters are in fact regexes themselves, simply omit these calls; i.e.:

      $startRegex = '.*' + $startPattern + '.*'
      $endRegex = '.*' + $endPattern + '.*'
      
    • The solution assumes there is no overlap between blocks and that, in a given block, the start and end patterns are on separate lines.

    • Each block found is output as a single, multi-line string, using LF ("`n") as the newline character; if you want CRLF newline sequences instead, use "`r`n"; for the platform-native newline format (CRLF on Windows, LF on Unix-like platforms), use [Environment]::NewLine.
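The -OutVariable mechanism mentioned in the notes above can be tried in isolation, without a log file or -Wait. This is a small sketch only; Get-SampleBlock is a hypothetical stand-in that emits three fixed strings.

```powershell
# Hypothetical stand-in for the real function: it just emits three strings.
function Get-SampleBlock {
  # [CmdletBinding()] makes this an advanced function, which is what
  # enables common parameters such as -OutVariable.
  [CmdletBinding()]
  param()

  'block 1'
  'block 2'
  'block 3'
}

# Output flows through the pipeline as usual, and a copy of every output
# object is also collected in $result (note: no "$" after -OutVariable).
$upper = Get-SampleBlock -OutVariable result | ForEach-Object { $_.ToUpper() }

# Number of objects collected via -OutVariable.
$collectedCount = $result.Count
```

The variable name is passed without the $ sigil, as in the sample call above.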

    # Note the use of "-" after "Get", to adhere to PowerShell's
    # "<Verb>-<Noun>" naming convention.
    function Get-TextBetweenTwoStrings {
    
      # Make the function an advanced one, so that it supports the 
      # -OutVariable common parameter.
      [CmdletBinding()]
      param(
        $startPattern, 
        $endPattern, 
        $filePath
      )
    
      # Note: If $startPattern and $endPattern are themselves
      #       regexes, omit the [regex]::Escape() calls.
      $startRegex = '.*' + [regex]::Escape($startPattern) + '.*'
      $endRegex = '.*' + [regex]::Escape($endPattern) + '.*'
    
      $inBlock = $false
      $block = [System.Collections.Generic.List[string]]::new()
    
      Get-Content -Tail 100 -Wait $filePath | ForEach-Object {
        if ($inBlock) {
          if ($_ -match $endRegex) {
            $block.Add($Matches[0])
            # Output the block of lines as a single, multi-line string
            $block -join "`n"
            $inBlock = $false; $block.Clear()       
          }
          else {
            $block.Add($_)
          }
        }
        elseif ($_ -match $startRegex) {
          $inBlock = $true
          $block.Add($Matches[0])
        }
      }
    
    }
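Because the function above never returns on its own, its block-tracking logic is awkward to test. One possible way to exercise the same logic is a variant that takes its lines from the pipeline instead of calling Get-Content itself. This is a sketch under assumptions, not part of the solution above: the name Select-TextBlock, its parameter names, and the sample lines are all hypothetical.

```powershell
# Hypothetical helper: the same start/end tracking as above, but fed from
# pipeline input rather than from Get-Content -Wait, so that it terminates.
function Select-TextBlock {
  [CmdletBinding()]
  param(
    [Parameter(Mandatory)] [string] $StartPattern,
    [Parameter(Mandatory)] [string] $EndPattern,
    [Parameter(ValueFromPipeline)] [string] $Line
  )

  begin {
    # Patterns are treated as literal strings, as in the function above.
    $startRegex = [regex]::Escape($StartPattern)
    $endRegex = [regex]::Escape($EndPattern)
    $inBlock = $false
    $block = [System.Collections.Generic.List[string]]::new()
  }

  process {
    if ($inBlock) {
      # Inside a block, every line is kept, including the end line itself.
      $block.Add($Line)
      if ($Line -match $endRegex) {
        # Output the block as a single, multi-line string, then reset.
        $block -join "`n"
        $inBlock = $false
        $block.Clear()
      }
    }
    elseif ($Line -match $startRegex) {
      # The full start line is kept, not just the matched part.
      $inBlock = $true
      $block.Add($Line)
    }
  }
}

# Made-up sample lines containing two blocks.
$sampleLines = 'noise', 'x START-OF-PATTERN', 'payload 1', 'END-OF-PATTERN y', 'more noise',
               'START-OF-PATTERN', 'payload 2', 'END-OF-PATTERN'

$blocks = @($sampleLines | Select-TextBlock -StartPattern 'START-OF-PATTERN' -EndPattern 'END-OF-PATTERN')
```

With this shape, the tailing behavior would come from the caller, for example by piping Get-Content -Tail 100 -Wait $inputFile into the helper, which should then emit blocks as they complete.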