How can I get the sum of a file from a substring and placing the sum on a specific position (different line) using powershell if have the following conditions:
Get the sum of the numbers from position 3 to 13 of a line that is starting with a character D. Place the sum on position 10 to 14 on the line that starts with the S
So for example, if i have this file:
F123trial text
DA00000038.95==xxx11
DA00000018.95==yyy11
DA00000018.95==zzzyy
S xxxxx
I want to get the sum of 38.95, 18.95 and 18.95 and then place the sum on position xxxxx under the line that starts with the S.
PowerShell's switch
statement has powerful, but little-known features that allow you to iterate over the lines of a file (-file
) and match lines by regular expressions (-regex
).
Not only is switch -file
convenient, it is also much faster than using cmdlets in a pipeline (see bottom section).
[double] $sum = 0
switch -regex -file file.txt {
# Note: The string to the left of each script block below ({ ... }),
# e.g., '^D', is the regex to match each line against.
# Inside the script blocks, $_ refers to the input line at hand.
# Extract number, add to sum, output the line.
'^D' { $sum += $_.Substring(2, 11); $_; continue }
# Summary line: place sum at character position 10, with 0-padding
# Note: `-replace ',', '.'` is only needed if your culture uses "," as the
# decimal mark.
'^S' { $_.Substring(0, 9) + '{0:000000000000000.00}' -f $sum -replace ',', '.'; continue }
# All other lines: pass them through.
default { $_ }
}
Note:
continue
in the script blocks short-circuits further matching for the line at hand; by contrast, if you used break
, no further lines would be processed.0
-left-padded number on the S
line at character position 10
.With your sample file, the above yields:
F123trial text
DA00000038.95==xxx11
DA00000018.95==yyy11
DA00000018.95==zzzyy
S 000000000000076.85
switch -file ...
to Get-Content ... | ForEach-Object ...
Running the following test script:
& {
# Create a sample file with 100K lines.
1..1e5 > ($tmpFile = [IO.Path]::GetTempFileName())
(Measure-Command { switch -file ($tmpFile) { default { $_ } } }).TotalSeconds,
(Measure-Command { get-content $tmpFile | % { $_ } }).TotalSeconds
Remove-Item $tmpFile
}
yields the following timings on my machine, for instance (the absolute numbers aren't important, but their ratio should give you a sense):
0.0578924 # switch -file
6.0417638 # Get-Content | ForEach-Object
That is, the pipeline-based solution is about 100 (!) times slower than the switch -file
solution.
Digging deeper:
Frode F. points out that Get-Content
is slow with large files - though its convenience makes it a popular choice - and mentions using the .NET Framework directly as an alternative:
Using [System.IO.File]::ReadAllLines()
; however, given that it reads the entire file into memory, that is only an option with smallish files.
Using [System.IO.StreamReader]
's ReadLine()
method in a loop.
However, use of the pipeline in itself, irrespective of the specific cmdlets used, introduces overhead. When performance matters - but only then - you should avoid it.
Here's an updated test that includes commands that use the .NET Framework methods, with and without the pipeline (the use of intrinsic .ForEach()
method requires PSv4+):
& {
# Create a sample file with 100K lines.
1..1e5 > ($tmpFile = [IO.Path]::GetTempFileName())
(Measure-Command { switch -file ($tmpFile) { default { $_ } } }).TotalSeconds
(Measure-Command { foreach ($line in [IO.File]::ReadLines((Convert-Path $tmpFile))) { $line } }).TotalSeconds
(Measure-Command {
$sr = [IO.StreamReader] (Convert-Path $tmpFile)
while(-not $sr.EndOfStream) { $sr.ReadLine() }
$sr.Close()
}).TotalSeconds
(Measure-Command { [IO.File]::ReadAllLines((Convert-Path $tmpFile)).ForEach({ $_ }) }).TotalSeconds
(Measure-Command { [IO.File]::ReadAllLines((Convert-Path $tmpFile)) | % { $_ } }).TotalSeconds
(Measure-Command { Get-Content $tmpFile | % { $_ } }).TotalSeconds
Remove-Item $tmpFile
}
Sample results, from fastest to slowest:
0.0124441 # switch -file
0.0365348 # [System.IO.File]::ReadLine() in foreach loop
0.0481214 # [System.IO.StreamReader] in a loop
0.1614621 # [System.IO.File]::ReadAllText() with .ForEach() method
0.2745749 # (pipeline) [System.IO.File]::ReadAllText() with ForEach-Object
0.5925222 # (pipeline) Get-Content with ForEach-Object
switch -file
is the fastest by a factor of around 3, followed by the no-pipeline .NET solutions; using .ForEach()
adds another factor of 3.
Simply introducing the pipeline (ForEach-Object
instead of .ForEach()
) adds another factor of 2; finally, using the pipeline with Get-Content
and ForEach-Object
adds another factor of 2.