Search code examples
powershellperformancehashtable

Hashtable of Hashtable - Performance issue


I'm looking for some performance advise, the code below works fine and it serves its purpose however it's terribly slow.

The result of the code is a hashtable of hashtables that I need use as lookup table later on.

Background:

  • $ccDb is an array consisting of around 200k items, the properties are companyCd, costCenterNbr, costCenterShortNm, costCenterLongDescr.
  • Each property has to be trimmed (please don't ask me to trim my Db, sadly I can't).
  • costCenterNbr is contained on companyCd, meaning, each companyCd can have multiple costCenterNbr.
  • companyCd can contain X amount of costCenterNbr.
  • costCenterNbr have unique value, same for companyCd.
  • costCenterShortNm and costCenterLongDescr are correlated with costCenterNbr

The Issue:

This map has to be constructed on each run of my script because the information is taken from SQL tables (which changes all the time). Building this map takes sometimes up to 15 minutes on a pretty good server.

The question:

Do you see a way this code could be improved for a faster / more efficient execution?

$ccMap = @{}
foreach ($line in $ccDb) {
    $companyCd = $line.companyCd.trim()
    $costCenterNbr = $line.costCenterNbr.trim()
    $costCenterShortNm = $line.CostCenterShortNm.trim()
    $costCenterLongDescr = $line.CostCenterLongDescr.trim()

    $coceMap = @{
        $costCenterNbr = @{
            shortDesc = $costCenterShortNm
            longDesc  = $costCenterLongDescr
        }
    }

    if ($ccMap.ContainsKey($companyCd)) {
        $ccMap[$companyCd] += $coceMap
    }
    else {
        $ccMap.Add($companyCd, $coceMap)
    }
}

I'm sorry for the long explanation, but I feel like it's better to give the most information up front. Any help is very much appreciated.


Edit 1

Adding measurements for reference:

measure1

And total key sum:

measure2


Edit 2 (Solved)

Thanks a lot Mathias R. Jessen, here are the measurements of his excellent code. From 7 minutes to 5 seconds!

measure1


Solution

  • Don't use += in tight loops

    Here's your biggest sink:

        $ccMap[$companyCd] += $coceMap
    

    When you add one hashtable to another using + (or += for that matter), PowerShell creates a whole new hashtable:

    # Create two different hashtables
    $A = @{ Key1 = 'Value1' }
    $B = @{ Key2 = 'Value2' }
    
    # Let's save a second reference to the first table
    $remember = $A
    
    # Now let's use += to merge the two:
    $A += $B
    

    Run this and you'll find $B and $remember are unchanged, but $A has both keys - and must therefore be a new one.

    To get around this performance penalty, skip the construction of $coceMap completely, and reverse the order (construct hashtable first if not present, then assign):

    $ccMap=@{}
    
    foreach($line in $ccDb)
    {
        $companyCd=$line.companyCd.trim()
        $costCenterNbr=$line.costCenterNbr.trim()
        $costCenterShortNm=$line.CostCenterShortNm.trim()
        $costCenterLongDescr=$line.CostCenterLongDescr.trim()
    
        # Create new hashtable if none exist, otherwise retrieve the existing one
        if($ccMap.ContainsKey($companyCd))
        {
            $coceMap = $ccMap[$companyCd]
        }
        else
        {
            $coceMap = $ccMap[$companyCd] = @{}
        }
        
        $coceMap[$costCenterNbr] = @{
            shortDesc=$costCenterShortNm
            longDesc=$costCenterLongDescr
        }
    }
    

    Benchmarking +=

    Here's a simplified example of the difference against 10000 items with 50 disctinct keys:

    $data = @(
        1..10000 |Select-Object @{Name='Company';Expression={Get-Random -Maximum 50}},@{Name='CostCenter';Expression={Get-Random}}
    )
    
    @(
        Measure-Command {
            $map = @{}
    
            foreach($line in $data){
                $entry = @{
                    $line.CostCenter = @{
                        Value = 123
                    }
                }
    
                if($map.ContainsKey($line.Company)){
                    $map[$line.Company] += $entry
                }
                else {
                    $map[$line.Company] = $entry
                }
            }
        }
    
        Measure-Command {
            $map = @{}
    
            foreach($line in $data){
                if($map.ContainsKey($line.Company)){
                    $entry = $map[$line.Company]
                }
                else {
                    $entry = $map[$line.Company] = @{}
                }
    
                $entry[$line.CostCenter] = @{
                    Value = 123
                }
            }
        }
    ) |select TotalMilliseconds
    

    Which on my laptop gives:

    TotalMilliseconds
    -----------------
             306.4218
              47.8164
    

    How to identify time sinks like this in general?

    There are a number of ways to profile the runtime behavior of PowerShell, but here's my personal first choice:

    1. Install PSProfiler (Disclaimer: I'm the maintainer of PSProfiler):
      • Install-Module PSProfiler -Scope CurrentUser
    2. Use Measure-Script the same way you would Measure-Command:
    Measure-Script {
        $map = @{}
    
        foreach($line in $data){
            $entry = @{
                $line.CostCenter = @{
                    Value = 123
                }
            }
    
            if($map.ContainsKey($line.Company)){
                $map[$line.Company] += $entry
            }
            else {
                $map[$line.Company] = $entry
            }
        }
    }
    
    1. Wait for the code to finish
    2. Review the output:
    
        Anonymous ScriptBlock
    
    
          Count  Line       Time Taken Statement
          -----  ----       ---------- ---------
              0     1    00:00.0000000 {
              1     2    00:00.0000187     $map = @{}
              0     3    00:00.0000000
              0     4    00:00.0000000     foreach($line in $data){
          10000     5    00:00.0635585         $entry = @{
              0     6    00:00.0000000             $line.CostCenter = @{
              0     7    00:00.0000000                 Value = 123
              0     8    00:00.0000000             }
              0     9    00:00.0000000         }
              0    10    00:00.0000000
              0    11    00:00.0000000         if($map.ContainsKey($line.Company)){
           9950    12    00:00.3965227             $map[$line.Company] += $entry
              0    13    00:00.0000000         }
              0    14    00:00.0000000         else {
             50    15    00:00.0002810             $map[$line.Company] = $entry
              0    16    00:00.0000000         }
              0    17    00:00.0000000     }
              0    18    00:00.0000000 }
    

    Observe that line 12 took the up the most total execution time - significantly more than any other:

           9950    12    00:00.3965227             $map[$line.Company] += $entry