
Enumerating a large PowerShell object variable (1 million plus members)


I'm processing large amounts of data. After pulling and manipulating it, I have the results stored in memory in a variable.

I now need to split this data into separate variables. This was easy to do by piping to Where-Object, but it has slowed down now that I have much more data (1 million plus members): it takes 5+ minutes.

$DCEntries = $DNSQueries | ? {$_.ClientIP -in $DCs.ipv4address -Or $_.ClientIP -eq '127.0.0.1'}
$NonDCEntries = $DNSQueries | ? {$_.ClientIP -notin $DCs.ipv4address -And $_.ClientIP -ne '127.0.0.1'} 

#Note: 
#$DCs is an array of 60 objects of type Microsoft.ActiveDirectory.Management.ADDomainController, with two properties:  Name, ipv4address
#$DNSQueries is a collection of pscustomobjects that has 6 properties, all strings.

I immediately realized I'm enumerating $DNSQueries (the large object) twice, which is obviously costing me some time. So I decided to enumerate it once and use a switch statement instead, but this made it take dramatically LONGER, which is not what I was going for.

$DNSQueries | ForEach-Object {
    Switch ($_) {
        {$_.ClientIP -in $DCs.ipv4address -Or $_.ClientIP -eq '127.0.0.1'} {
            # Query is from a DC
            $DCEntries += $_
        }
        default {
            # Query is not from DC
            $NonDCEntries += $_
        }
    }
}

I'm wondering if someone can explain to me why the second code takes so much more time. Further, perhaps offer a better way to accomplish what I want.

Is ForEach-Object and/or appending to the sub-variables costing that much time?


Solution

  • ForEach-Object is actually the slowest way to enumerate a collection, and on top of that the switch with a script block condition adds even more overhead.

    If the collection is already in memory, nothing can beat a foreach loop for linear enumeration.
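
    A rough way to check this on your own machine is to time both forms over the same in-memory collection. This is only a sketch: $data and the doubling operation are made-up stand-ins, and the absolute timings will vary with the PowerShell version and hardware.

    ```powershell
    # Hypothetical sample data standing in for a large in-memory collection
    $data = 1..100000

    # Pipeline form: every item is passed through ForEach-Object's script block
    $pipelineTime = Measure-Command {
        $null = $data | ForEach-Object { $_ * 2 }
    }

    # Language statement form: the loop body runs inline, with no pipeline involved
    $statementTime = Measure-Command {
        $null = foreach ($n in $data) { $n * 2 }
    }

    [pscustomobject]@{
        ForEachObjectMs    = [math]::Round($pipelineTime.TotalMilliseconds, 2)
        ForEachStatementMs = [math]::Round($statementTime.TotalMilliseconds, 2)
    }
    ```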

    Your biggest problem, though, is the use of += to add items to an array. An array is a collection of a fixed size, so PowerShell has to create a new array and copy all existing items each time a new item is added, which is very inefficient. See this answer as well as this awesome documentation for more details.
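
    The copy can be observed directly. In this small sketch (the variable names are only for illustration), a second variable keeps pointing at the original array, while += leaves the first variable pointing at a brand new one:

    ```powershell
    $a = 1, 2, 3
    $b = $a                  # $b references the same array instance as $a

    $a.IsFixedSize           # True: arrays cannot grow in place

    $a += 4                  # builds a new 4-element array and assigns it to $a

    [object]::ReferenceEquals($a, $b)   # False: $a is now a different array
    $a.Count                 # 4
    $b.Count                 # 3: the original array was left untouched
    ```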

    In this case you can combine a List<T> with PowerShell's explicit assignment, that is, assigning the output of the foreach loop directly to a variable.

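    # Non-DC entries are collected in a resizable list
    $NonDCEntries = [Collections.Generic.List[object]]::new()
    
    # Whatever the loop outputs is captured in $DCEntries in a single assignment
    $DCEntries = foreach($item in $DNSQueries) {
        if($item.ClientIP -eq '127.0.0.1' -or $item.ClientIP -in $DCs.IPv4Address) {
            # Query is from a DC: output it and move on to the next item
            $item
            continue
        }
    
        # Query is not from a DC
        $NonDCEntries.Add($item)
    }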
    }
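
    A possible further refinement, which is not part of the approach above and is untested against the real data: the -in $DCs.IPv4Address check re-reads the property from all 60 objects and scans the result for every one of the million entries. Building a HashSet<T> of the addresses once turns that into a single lookup per entry. The sample $DCs and $DNSQueries below are made-up stand-ins so the sketch runs on its own.

    ```powershell
    # Hypothetical stand-ins for the real $DCs and $DNSQueries
    $DCs = @(
        [pscustomobject]@{ Name = 'DC01'; IPv4Address = '10.0.0.1' }
        [pscustomobject]@{ Name = 'DC02'; IPv4Address = '10.0.0.2' }
    )
    $DNSQueries = '10.0.0.1', '10.0.0.9', '127.0.0.1', '10.0.0.2', '192.168.1.5' |
        ForEach-Object { [pscustomobject]@{ ClientIP = $_ } }

    # Build the lookup set once, including the loopback address
    $dcAddresses = [Collections.Generic.HashSet[string]]::new()
    foreach ($ip in $DCs.IPv4Address) { $null = $dcAddresses.Add($ip) }
    $null = $dcAddresses.Add('127.0.0.1')

    $NonDCEntries = [Collections.Generic.List[object]]::new()
    $DCEntries = foreach ($item in $DNSQueries) {
        if ($dcAddresses.Contains($item.ClientIP)) {
            # Query is from a DC
            $item
            continue
        }

        # Query is not from a DC
        $NonDCEntries.Add($item)
    }
    ```

    With the sample data, three entries land in $DCEntries and two in $NonDCEntries. Whether this helps noticeably with only 60 addresses is something to measure rather than assume.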
    

    To put into perspective how badly += to an array scales (the total copying work grows quadratically with the number of items, not linearly), this is a performance test comparing PowerShell explicit assignment from a loop and adding to a List<T> versus adding to an array.

    $tests = @{
        'PowerShell Explicit Assignment' = {
            param($count)
    
            $result = foreach($i in 1..$count) {
                $i
            }
        }
        '.Add(..) to List<T>' = {
            param($count)
    
            $result = [Collections.Generic.List[int]]::new()
            foreach($i in 1..$count) {
                $result.Add($i)
            }
        }
        '+= Operator to Array' = {
            param($count)
    
            $result = @()
            foreach($i in 1..$count) {
                $result += $i
            }
        }
    }
    
    5000, 10000, 25000, 50000, 75000, 100000 | ForEach-Object {
        $groupresult = foreach($test in $tests.GetEnumerator()) {
            $totalms = (Measure-Command { & $test.Value -Count $_ }).TotalMilliseconds
    
            [pscustomobject]@{
                CollectionSize    = $_
                Test              = $test.Key
                TotalMilliseconds = [math]::Round($totalms, 2)
            }
    
            [GC]::Collect()
            [GC]::WaitForPendingFinalizers()
        }
    
        $groupresult = $groupresult | Sort-Object TotalMilliseconds
        $groupresult | Select-Object *, @{
            Name       = 'RelativeSpeed'
            Expression = {
                $relativespeed = $_.TotalMilliseconds / $groupresult[0].TotalMilliseconds
                [math]::Round($relativespeed, 2).ToString() + 'x'
            }
        }
    }
    

    The test results are below:

    CollectionSize Test                           TotalMilliseconds RelativeSpeed
    -------------- ----                           ----------------- -------------
              5000 PowerShell Explicit Assignment              0.56 1x
              5000 .Add(..) to List<T>                         7.56 13.5x
              5000 += Operator to Array                     1357.74 2424.54x
             10000 PowerShell Explicit Assignment              0.77 1x
             10000 .Add(..) to List<T>                        18.20 23.64x
             10000 += Operator to Array                     5411.23 7027.57x
             25000 PowerShell Explicit Assignment              1.39 1x
             25000 .Add(..) to List<T>                        47.14 33.91x
             25000 += Operator to Array                    26168.67 18826.38x
             50000 PowerShell Explicit Assignment              3.49 1x
             50000 .Add(..) to List<T>                        97.38 27.9x
             50000 += Operator to Array                   129537.09 37116.64x
             75000 PowerShell Explicit Assignment             14.59 1x
             75000 .Add(..) to List<T>                       243.47 16.69x
             75000 += Operator to Array                   247419.68 16958.17x
            100000 PowerShell Explicit Assignment             14.85 1x
            100000 .Add(..) to List<T>                       177.13 11.93x
            100000 += Operator to Array                   473824.71 31907.39x
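
    The += timings are consistent with quadratic growth: going from 5,000 to 100,000 items (20 times more) took roughly 350 times longer, while the List<T> time grew by roughly 23 times. A minimal sketch of the arithmetic, assuming each += copies every existing element into a new array (Get-AppendCopyCount is a hypothetical helper, not a built-in command):

    ```powershell
    function Get-AppendCopyCount {
        param([long] $Count)

        # Appending item k copies the k - 1 items already in the array,
        # so the total is 0 + 1 + ... + (Count - 1) = Count * (Count - 1) / 2
        [long]($Count * ($Count - 1) / 2)
    }

    $small = Get-AppendCopyCount 5000      # 12497500 element copies
    $large = Get-AppendCopyCount 100000    # 4999950000 element copies
    $ratio = $large / $small               # about 400, for only 20 times more items
    $ratio
    ```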