Search code examples
powershellsumsubstring

powershell get the sum of a specific substring position


How can I get the sum of a file from a substring and placing the sum on a specific position (different line) using powershell if have the following conditions:

Get the sum of the numbers from position 3 to 13 of a line that is starting with a character D. Place the sum on position 10 to 14 on the line that starts with the S

So for example, if i have this file:

F123trial   text
DA00000038.95==xxx11
DA00000018.95==yyy11
DA00000018.95==zzzyy
S        xxxxx

I want to get the sum of 38.95, 18.95 and 18.95 and then place the sum on position xxxxx under the line that starts with the S.


Solution

  • PowerShell's switch statement has powerful, but little-known features that allow you to iterate over the lines of a file (-file) and match lines by regular expressions (-regex).

    Not only is switch -file convenient, it is also much faster than using cmdlets in a pipeline (see bottom section).

    [double] $sum = 0
    
    switch -regex -file file.txt {
    
      # Note: The string to the left of each script block below ({ ... }), 
      #       e.g., '^D', is the regex to match each line against.
      #       Inside the script blocks, $_ refers to the input line at hand.
    
      # Extract number, add to sum, output the line.
      '^D' { $sum += $_.Substring(2, 11); $_; continue }
    
      # Summary line: place sum at character position 10, with 0-padding
      # Note: `-replace ',', '.'` is only needed if your culture uses "," as the
      #       decimal mark.
      '^S' { $_.Substring(0, 9) + '{0:000000000000000.00}' -f $sum -replace ',', '.'; continue }
      
      # All other lines: pass them through.
      default { $_ }
    
    }
    

    Note:

    • continue in the script blocks short-circuits further matching for the line at hand; by contrast, if you used break, no further lines would be processed.
    • Based on a later comment, I'm assuming you want an 18-character 0-left-padded number on the S line at character position 10.

    With your sample file, the above yields:

    F123trial   text
    DA00000038.95==xxx11
    DA00000018.95==yyy11
    DA00000018.95==zzzyy
    S        000000000000076.85
    

    Optional reading: Comparing the performance of switch -file ... to Get-Content ... | ForEach-Object ...

    Running the following test script:

    & {
      # Create a sample file with 100K lines.
      1..1e5 > ($tmpFile = [IO.Path]::GetTempFileName())
      (Measure-Command { switch -file ($tmpFile) { default { $_ } } }).TotalSeconds, 
      (Measure-Command { get-content $tmpFile | % { $_ }  }).TotalSeconds
      Remove-Item $tmpFile
    }
    

    yields the following timings on my machine, for instance (the absolute numbers aren't important, but their ratio should give you a sense):

    0.0578924   # switch -file
    6.0417638   # Get-Content | ForEach-Object
    

    That is, the pipeline-based solution is about 100 (!) times slower than the switch -file solution.


    Digging deeper:

    Frode F. points out that Get-Content is slow with large files - though its convenience makes it a popular choice - and mentions using the .NET Framework directly as an alternative:

    • Using [System.IO.File]::ReadAllLines(); however, given that it reads the entire file into memory, that is only an option with smallish files.

    • Using [System.IO.StreamReader]'s ReadLine() method in a loop.

    However, use of the pipeline in itself, irrespective of the specific cmdlets used, introduces overhead. When performance matters - but only then - you should avoid it.

    Here's an updated test that includes commands that use the .NET Framework methods, with and without the pipeline (the use of intrinsic .ForEach() method requires PSv4+):

    & {
      # Create a sample file with 100K lines.
      1..1e5 > ($tmpFile = [IO.Path]::GetTempFileName())
      
      (Measure-Command { switch -file ($tmpFile) { default { $_ } } }).TotalSeconds
      (Measure-Command { foreach ($line in [IO.File]::ReadLines((Convert-Path $tmpFile))) { $line } }).TotalSeconds
      (Measure-Command { 
        $sr = [IO.StreamReader] (Convert-Path $tmpFile)
        while(-not $sr.EndOfStream) { $sr.ReadLine() }
        $sr.Close() 
      }).TotalSeconds
      (Measure-Command { [IO.File]::ReadAllLines((Convert-Path $tmpFile)).ForEach({ $_ }) }).TotalSeconds
      (Measure-Command { [IO.File]::ReadAllLines((Convert-Path $tmpFile)) | % { $_ } }).TotalSeconds
      (Measure-Command { Get-Content $tmpFile | % { $_ }  }).TotalSeconds
      
      Remove-Item $tmpFile
    }
    

    Sample results, from fastest to slowest:

    0.0124441 # switch -file
    0.0365348 # [System.IO.File]::ReadLine() in foreach loop
    0.0481214 # [System.IO.StreamReader] in a loop
    0.1614621 # [System.IO.File]::ReadAllText() with .ForEach() method
    0.2745749 # (pipeline) [System.IO.File]::ReadAllText() with ForEach-Object
    0.5925222 # (pipeline) Get-Content with ForEach-Object
    

    switch -file is the fastest by a factor of around 3, followed by the no-pipeline .NET solutions; using .ForEach() adds another factor of 3. Simply introducing the pipeline (ForEach-Object instead of .ForEach()) adds another factor of 2; finally, using the pipeline with Get-Content and ForEach-Object adds another factor of 2.