
PowerShell 5: Modify a string in an ASCII text file with the line number the string is on. switch and the .NET Framework, or cmdlets and the pipeline: which is faster?


How can I modify a string (LINE2 "line number LINE2 is on") in a Windows ASCII text file, using search strings that are easy to read and easy to add, modify, or delete, in PowerShell 5? The scripts below parse a 2500-line file, find 139 instances of the strings, replace them, and overwrite the original; the average times measured ranged from 106ms to 215ms depending on the method. Which method is faster? Which method makes it easiest to add, modify, or delete the search strings?

Search for the strings "AROUND LINE {1-9999}" and "LINE2 {1-9999}" and replace {1-9999} with the number of the line the string is on. The tests were run against a 2500-line file, not the two-line sample.bat.

sample.bat contains two lines:

ECHO AROUND LINE 5936
TITLE %TIME%   DISPLAY TCP-IP SETTINGS   LINE2 5937

Method One: Using Get-Content + -replace + Set-Content:

Measure-command {
copy-item $env:temp\sample9.bat -d $env:temp\sample.bat -force
(gc $env:temp\sample.bat) | foreach -Begin {$lc = 1} -Process {
  $_ -replace 'AROUND LINE \d+', "AROUND LINE $lc" -replace 'LINE2 \d+', "LINE2 $lc"
  ++$lc
} | sc -Encoding Ascii $env:temp\sample.bat}

Results: 175ms-387ms in ten runs for an average of 215ms.

To change the search strings, add, remove, or edit a -replace operation, for example:

-replace 'AROUND LINE \d+', "AROUND LINE $lc" -replace 'LINE2 \d+', "LINE2 $lc" -replace 'PLACEMARK \d+', "PLACEMARK $lc"
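
If the set of search strings changes often, one option is to keep the labels in an array and apply -replace in a loop, so that adding or removing a label means editing one array element. This is a minimal sketch that works on in-memory sample lines rather than a file; $labels and $sampleLines are illustrative names, not part of the original script.

```powershell
# Hypothetical label list; edit this array to add, modify, or delete search strings.
$labels = 'AROUND LINE', 'LINE2', 'PLACEMARK'

# Sample input standing in for the lines of sample.bat.
$sampleLines = @(
  'ECHO AROUND LINE 5936'
  'TITLE %TIME%   DISPLAY TCP-IP SETTINGS   LINE2 5937'
  'REM PLACEMARK 1'
)

$lc = 0
$result = foreach ($line in $sampleLines) {
  ++$lc   # 1-based number of the current line
  foreach ($label in $labels) {
    # Escape the label so it is matched literally, then swap in the line number.
    $line = $line -replace "$([regex]::Escape($label)) \d+", "$label $lc"
  }
  $line   # emit the (possibly updated) line
}

$result
```

Each label costs one extra regex pass per line, so this form is likely slower than a single alternation; it mainly helps readability.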

Script version (sample.ps1), run as powershell $env:temp\sample.ps1 $env:temp\sample.bat:

(gc $args[0]) | foreach -Begin {$lc = 1} -Process {
  $_ -replace 'AROUND LINE \d+', "AROUND LINE $lc" -replace 'LINE2 \d+', "LINE2 $lc"
  ++$lc
} | sc -Encoding Ascii $args[0]

Method Two: Using switch and .NET Framework methods:

Measure-command {
    copy-item $env:temp\sample9.bat -d $env:temp\sample.bat -force
    $file = "$env:temp\sample.bat"
    $lc = 0
    $updatedLines = switch -Regex ([IO.File]::ReadAllLines($file)) {
      '^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$' { $Matches[1] + ++$lc + $Matches[2] }
      default { ++$lc; $_ }
    }
    [IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)}

Results: 73ms-816ms in ten runs for an average of 175ms.

Method Three: Using a foreach loop and .NET Framework methods, an optimized version based on a precompiled regex:

Measure-command {
copy-item $env:temp\sample9.bat -d $env:temp\sample.bat -force
$file = "$env:temp\sample.bat"
$regex = [Regex]::new('^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$', 'Compiled, IgnoreCase, CultureInvariant')
$lc = 0
$updatedLines = & {foreach ($line in [IO.File]::ReadLines($file)) {
    $lc++
    $m = $regex.Match($line)
    if ($m.Success) {
        $g = $m.Groups
        $g[1].Value + $lc + $g[2].Value
    } else { $line }
}}
[IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)}

Results: 71ms-236ms in ten runs for an average of 106ms.

To add, modify, or delete search strings, edit the alternation inside the (?:...) group of the regex, for example:

AROUND LINE|LINE2|PLACEMARK
AROUND LINE|LINE3
LINE4
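
The alternation can also be generated from a list, so the regex itself never needs hand-editing. This is a sketch under the assumption that the labels are plain text; $labels is a hypothetical variable, and the sample line and the number 42 are made up for illustration.

```powershell
# Hypothetical label list; edit this array instead of the regex itself.
$labels = 'AROUND LINE', 'LINE2', 'PLACEMARK'

# Escape each label so that any regex metacharacters in it are matched literally,
# then join the labels into an alternation.
$alternation = ($labels | ForEach-Object { [regex]::Escape($_) }) -join '|'
$pattern = '^(.*? (?:' + $alternation + ') )\d+(.*)$'

# Same options as in the precompiled-regex method.
$regex = [regex]::new($pattern, 'Compiled, IgnoreCase, CultureInvariant')

# Try it on one made-up line, replacing the number with 42.
$m = $regex.Match('REM checkpoint PLACEMARK 17 here')
$rewritten = $m.Groups[1].Value + 42 + $m.Groups[2].Value
$rewritten
```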

Script version (sample.ps1), run as powershell $env:temp\sample.ps1 $env:temp\sample.bat:

$file = $args[0]
$regex = [Regex]::new('^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$', 'Compiled, IgnoreCase, CultureInvariant')
$lc = 0
$updatedLines = & {foreach ($line in [IO.File]::ReadLines($file)) {
    $lc++
    $m = $regex.Match($line)
    if ($m.Success) {
        $g = $m.Groups
        $g[1].Value + $lc + $g[2].Value
    } else { $line }
}}
[IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)

Editor's note: This is a follow-up question to Iterate a backed up ascii text file, find all instances of {LINE2 1-9999} replace with {LINE2 "line number the code is on"}. Overwrite. Faster?

The evolution of this question from youngest to oldest: 1. 54757890 2. 54737787 3. 54712715 4. 54682186

Update: I've used @mklement0's regex solution.


Solution

    switch -Regex -File $file {
      '^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$' { $Matches[1] + ++$lc + $Matches[2] }
      default { ++$lc; $_ }
    }
    
    • Given that regex ^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$ contains only 2 capture groups - the part of the line before the number to replace (\d+) and the part of the line after, you must reference these groups with indices 1 and 2 into the automatic $Matches variable in the output (not 2 and 3).

      • Note that (?:...) is a non-capturing group, so by design it isn't reflected in $Matches.
    • Instead of reading the file with [IO.File]::ReadAllLines($file), I'm using the -File option with switch, which directly reads the lines from file $file.

    • The ++$lc inside default { ++$lc; $_ } ensures that the line counter is also incremented for non-matching lines before passing the line at hand through ($_).
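
To make the group numbering concrete, here is a small sketch that applies the same regex to a single line with -match and inspects $Matches. The sample line (including the trailing "REM tail") and the replacement number 2 are made up for illustration.

```powershell
$line = 'TITLE %TIME%   DISPLAY TCP-IP SETTINGS   LINE2 5937 REM tail'
$pattern = '^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$'

if ($line -match $pattern) {
  $before = $Matches[1]                     # everything up to and including 'LINE2 '
  $after = $Matches[2]                      # everything after the number
  $hasThirdGroup = $Matches.ContainsKey(3)  # (?:...) does not capture, so there is no group 3
  $rebuilt = $before + 2 + $after           # pretend the line sits on line 2
}

$rebuilt
```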


    Performance notes

    • You can improve the performance slightly with the following obscure optimization:

      # Enclose the switch statement in & { ... } to speed it up slightly.
      $updatedLines = & { switch -Regex -File ... }
      
    • With high iteration counts (a large number of lines), using a precompiled [regex] instance rather than a string literal that PowerShell converts to a regex behind the scenes can speed things up further - see benchmarks below.

    • Additionally, if case-sensitive matching is sufficient, you can squeeze out a little more performance by adding the -CaseSensitive option to the switch statement.

    • At a high level, what makes the solution fast is the use of switch -File to process the lines and, generally, the use of .NET types for file I/O rather than cmdlets ([IO.File]::WriteAllLines() in this case, as shown in the question) - see also this related answer.

      • That said, marsze's answer offers a highly optimized foreach loop approach based on a precompiled regex that is faster with higher iteration counts - it is, however, more verbose.
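
A small sketch of two of the options above combined: the & { ... } wrapper and -CaseSensitive. It runs switch over an in-memory array instead of -File so that it is self-contained; the two sample lines are made up, and the second is lowercase on purpose to show that case-sensitive matching skips it.

```powershell
# Made-up input; the second line differs from the pattern only in case.
$lines = 'ECHO AROUND LINE 5936', 'echo around line 7'

$lc = 0
$out = & {
  # -CaseSensitive makes the regex match case-sensitively.
  switch -Regex -CaseSensitive ($lines) {
    '^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$' { $Matches[1] + ++$lc + $Matches[2] }
    default { ++$lc; $_ }   # still count the line, then pass it through unchanged
  }
}

$out
```

Whether either option gives a measurable gain depends on the file size and machine, so it is worth timing both variants on real input.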

    Benchmarks

    • The following code compares the performance of this answer's switch approach with marsze's foreach approach.

    • Note that in order to make the two solutions fully equivalent, the following tweaks were made:

      • The & { ... } optimization was added to the switch command as well.
      • The IgnoreCase and CultureInvariant options were added to the foreach approach to match the options PS regexes implicitly use.

    Instead of the 6-line sample file, performance is tested with a 600-line, a 3,000-line, and a 30,000-line file, so as to show the effect of the iteration count on performance.

    100 runs are being averaged.

    Sample results from my Windows 10 machine running Windows PowerShell v5.1 - the absolute times aren't important, but hopefully the relative performance shown in the Factor column is generally representative:

    VERBOSE: Averaging 100 runs with a 600-line file of size 0.03 MB...
    
    Factor Secs (100-run avg.) Command
    ------ ------------------- -------
    1.00   0.023               # switch -Regex -File with regex string literal...
    1.16   0.027               # foreach with precompiled regex and [regex].Match...
    1.23   0.028               # switch -Regex -File with precompiled regex...
    
    
    VERBOSE: Averaging 100 runs with a 3000-line file of size 0.15 MB...
    
    Factor Secs (100-run avg.) Command
    ------ ------------------- -------
    1.00   0.063               # foreach with precompiled regex and [regex].Match...
    1.11   0.070               # switch -Regex -File with precompiled regex...
    1.15   0.073               # switch -Regex -File with regex string literal...
    
    
    VERBOSE: Averaging 100 runs with a 30000-line file of size 1.47 MB...
    
    Factor Secs (100-run avg.) Command
    ------ ------------------- -------
    1.00   0.252               # foreach with precompiled regex and [regex].Match...
    1.24   0.313               # switch -Regex -File with precompiled regex...
    1.53   0.386               # switch -Regex -File with regex string literal...
    

    Note how at lower iteration counts switch -regex with a string literal is fastest, but at around 1,500 lines the foreach solution with a precompiled [regex] instance starts to get faster; using a precompiled [regex] instance with switch -regex pays off to a lesser degree, only with higher iteration counts.

    Benchmark code, using the Time-Command function:

    # Sample file content (6 lines)
    $fileContent = @'
    TITLE %TIME%   NO "%zmyapps1%\*.*" ARCHIVE ATTRIBUTE   LINE2 1243
    TITLE %TIME%   DOC/SET YQJ8   LINE2 1887
    SET ztitle=%TIME%: WINFOLD   LINE2 2557
    TITLE %TIME%   _*.* IN WINFOLD   LINE2 2597
    TITLE %TIME%   %%ZDATE1%% YQJ25   LINE2 3672
    TITLE %TIME%   FINISHED. PRESS ANY KEY TO SHUTDOWN ... LINE2 4922
    
    '@
    
    # Determine the full path to a sample file.
    # NOTE: Using the *full* path is a *must* when calling .NET methods, because
    #       the latter generally don't see the same working dir. as PowerShell.
    $file = "$PWD/test.bat"
    
    # Note: input is the number of 6-line blocks to write to the sample file,
    #       which amounts to 600 vs. 3,000 vs. 30,000 lines.
    100, 500, 5000 | % { 
    
      # Create the sample file with the sample content repeated N times.
      $repeatCount = $_ 
      [IO.File]::WriteAllText($file, $fileContent * $repeatCount)
    
      # Warm up the file cache and count the lines.
      $lineCount = [IO.File]::ReadAllLines($file).Count
    
      # Define the commands to compare as an array of scriptblocks.
      $commands =
        { # switch -Regex -File with regex string literal
          & { 
            $i = 0
            $updatedLines = switch -Regex -File $file {
              '^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$' { $Matches[1] + ++$i + $Matches[2] }
              default { ++$i; $_ }
            } 
            [IO.File]::WriteAllLines($file, $updatedLines, [text.encoding]::ASCII)
          }
        }, { # switch -Regex -File with precompiled regex
          & {
            $i = 0
            $regex = [Regex]::new('^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$', 'Compiled, IgnoreCase, CultureInvariant')
            $updatedLines = switch -Regex -File $file {
              $regex { $Matches[1] + ++$i + $Matches[2] }
              default { ++$i; $_ }
            } 
            [IO.File]::WriteAllLines($file, $updatedLines, [text.encoding]::ASCII)
          }
        }, { # foreach with precompiled regex and [regex].Match
          & {
            $regex = [Regex]::new('^(.*? (?:AROUND LINE|LINE2) )\d+(.*)$', 'Compiled, IgnoreCase, CultureInvariant')
            $i = 0
            $updatedLines = foreach ($line in [IO.File]::ReadLines($file)) {
                $i++
                $m = $regex.Match($line)
                if ($m.Success) {
                    $g = $m.Groups
                    $g[1].Value + $i + $g[2].Value
                } else { $line }
            }
            [IO.File]::WriteAllLines($file, $updatedLines, [Text.Encoding]::ASCII)    
          }
        }
    
      # How many runs to average.
      $runs = 100
    
      Write-Verbose -vb "Averaging $runs runs with a $lineCount-line file of size $('{0:N2} MB' -f ((Get-Item $file).Length / 1mb))..."
    
      Time-Command -Count $runs -ScriptBlock $commands | Out-Host
    
    }
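
Time-Command is an external helper that is not defined in this post. If it isn't available, a rough stand-in can be sketched with a stopwatch. Measure-AverageSeconds below is a hypothetical function name; it only averages wall-clock time over a number of runs and does not reproduce Time-Command's Factor column or output format.

```powershell
# Hypothetical helper: average wall-clock seconds per run of a script block.
function Measure-AverageSeconds {
  param(
    [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
    [int] $Count = 10
  )
  $sw = [Diagnostics.Stopwatch]::StartNew()
  for ($n = 0; $n -lt $Count; $n++) {
    $null = & $ScriptBlock   # discard output so only execution time is measured
  }
  $sw.Stop()
  $sw.Elapsed.TotalSeconds / $Count
}

# Usage with a throwaway workload; substitute one of the benchmark script blocks.
$avg = Measure-AverageSeconds -Count 5 -ScriptBlock { 1..1000 | ForEach-Object { $_ * 2 } }
'{0:N4} s per run' -f $avg
```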