Search code examples
powershellfiledelimitedgroup-object

Powershell Performance tuning for aggregation operation on big delimited files


I have a delimited file with 350 columns. The delimiter is \034(Field separator). I have to extract a particular column value and find out the count of each distinct value of that column in the file. If the count of distinct value is greater or equal to 2, I need to output it to a file. The source file is 1GB. I have written the following command. It is very slow.

    Get-Content E:\Test\test.txt | Foreach {($_ -split '\034')[117]} | Group-Object -Property { $_ } | %{ if($_.Count -ge 2) { Select-Object -InputObject $_ -Property Name,Count} } | Export-csv -Path "E:\Test\test2.csv" -NoTypeInformation

Please help!


Solution

  • I suggest using a switch statement to process the input file quickly (by PowerShell standards):

    # Get an array of all the column values of interest.
    $allColValues = switch -File E:\Test\test.txt {
      default {  # each input line
        # For better performance with *literal* separators, 
        # use the .Split() *method*.
        # Generally, however, use of the *regex*-based -split *operator* is preferable.
        $_.Split([char] 0x1c)[117] # hex 0x1c is octal 034
      }
    }
    
    # Group the column values, and only output those that occur at least 
    # twice.
    $allColValues | Group-Object -NoElement | Where-Object Count -ge 2 |
      Select-Object Name, Count | Export-Csv E:\Test\test2.csv -NoTypeInformation
    

    Tip of the hat to Mathias R. Jessen for suggesting the -NoElement switch, which streamlines the Group-Object call by only maintaining abstract group information; that is, only the grouping criteria (as reflected in .Name, not also the individual objects that make up the group (as normally reflected in .Group) are returned via the output objects.


    As for what you tried:

    • Get-Content with line-by-line streaming in the pipeline is slow, both generally (the object-by-object passing introduces overhead) and, specifically, because Get-Content decorates each line it outputs with ETS (Extended Type System) metadata.

      • GitHub issue #7537 proposes adding a way to opt-out of this decoration.
      • At the expense of memory consumption and potentially additional work for line-splitting, the -Raw switch reads the entire file as a single, multi-line string, which is much faster.
    • Passing -Property { $_ } to Group-Object isn't necessary - just omit it. Without a -Property argument, the input objects are grouped as a whole.

    • Chaining Where-Object and Select-Object - rather than filtering via an if statement in a ForEach-Object call combined with multiple Select-Object calls - is not only conceptually clearer, but performs better.