Tags: arrays, powershell, comparison-operators

PowerShell comparison operator -contains vs large array


Would like to know if there is a way to improve the performance of the following array search. At the moment it takes 16 seconds.

Measure-Command -Expression {

$a = @()
$b = @()
1..10000 | %{$a += $_}
1..10000 | %{$b += $_}

#Tried to resize, but it still runs for 16 seconds
[array]::Resize([ref]$a,10000) 
[array]::Resize([ref]$b,10000)

foreach ($i in $a){
    if ($b -contains $i) {
        #write-host $i
    }
}

}


Solution

    1. += in the case of arrays recreates the entire array: it copies the old contents into a new array along with the new element. It's arguably the worst possible way to populate a large array in PowerShell. Use += only if it's a one-time operation outside a loop, or if the array is small and the time doesn't matter.

      Populate the array directly:

      $a = 1..10000
      

      Or collect the output of the foreach statement directly:

      $a = foreach ($i in 1..10000) { $i }
      

      Or use an ArrayList:

      $a = [Collections.ArrayList]@()
      foreach ($i in 1..10000) { $a.Add($i) >$null }
      

      Note that in this particular case it's faster to convert an existing array to an ArrayList:

      $a = [Collections.ArrayList]@(1..10000)
      

       

    2. Pipelining is a complex operation; it's several to many times slower than flow-control statements like foreach (the statement, not the cmdlet), while, and do.
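
       For illustration (this comparison is not from the original answer, and the exact numbers vary by machine), the two styles can be timed side by side:

       $pipeline  = Measure-Command { $r = 1..100000 | ForEach-Object { $_ * 2 } }
       $statement = Measure-Command { $r = foreach ($i in 1..100000) { $i * 2 } }
       "pipeline:  $($pipeline.TotalMilliseconds) ms"
       "statement: $($statement.TotalMilliseconds) ms"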

       

    3. A ScriptBlock (the code in curly braces { } passed to the ForEach-Object cmdlet, aliased as %) takes a lot of time to create an execution context for each element, compared to simple inline code.

       

    4. -contains checks each element until a match is found, so the total number of comparisons is $a.count * $b.count in the worst case, or half as much on average. With two 10,000-element arrays that's up to 100,000,000 comparisons.

       

    The most efficient approach is to build a lookup table:

    • as a hashtable:

      $a = 1..10000
      $b = 1..10000
      
      $bIndex = @{}
      foreach ($i in $b) { $bIndex[$i] = $true }
      
      foreach ($i in $a) {
          if ($bIndex[$i]) {
              #write-host $i
          }
      }
      

      15 milliseconds on an i7 CPU

    • as a HashSet (.NET 3.5 and newer; built in since Windows 7, installable on XP):

      $a = 1..10000
      $b = 1..10000
      
      $bIndex = [Collections.Generic.HashSet[int]]$b
      
      foreach ($i in $a) {
          if ($bIndex.Contains($i)) {
              #write-host $i
          }
      }
      

      7 milliseconds on an i7 CPU

    • intersect the arrays using a HashSet, then iterate the result:

      $a = 1..10000
      $b = 1..10000
      
      $inBoth = [Collections.Generic.HashSet[int]]$a
      $inBoth.IntersectWith([Collections.Generic.HashSet[int]]$b)
      foreach ($i in $inBoth) {
          #write-host $i
      }
      

      7 milliseconds on an i7 CPU

      It'll be many times faster when the contents of the arrays differ, that is, when the intersection is much smaller than the arrays themselves.

    It's also possible to build a real index that maps each value to the list of its original positions in the array - useful when there are duplicate values.
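
    A minimal sketch of such an index (the variable names are illustrative, not from the original answer): a hashtable mapping each value to a list of the positions where it occurs:

      $b = 1,2,2,3,5,2

      # Build the index: value -> list of positions
      $bIndex = @{}
      for ($n = 0; $n -lt $b.Count; $n++) {
          if (-not $bIndex.ContainsKey($b[$n])) {
              $bIndex[$b[$n]] = [Collections.Generic.List[int]]@()
          }
          $bIndex[$b[$n]].Add($n)
      }

      $bIndex[2]   # every position where the value 2 occurs: 1, 2 and 5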