Search code examples
powershellperformanceloopsjoin

Efficiently merge large object datasets having mulitple matching keys


In a Powershell script, I have two data sets that have multiple columns. Not all these columns are shared.

For example, data set 1:

A B    XY   ZY  
- -    --   --  
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6

and data set 2:

A B    ABC  GH  
- -    ---  --  
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8

I want to merge these two dataset, specifying which columns act as key (A and B in my simple case). The expected result is:

A B    XY   ZY   ABC  GH  
- -    --   --   ---  --  
1 val1 foo1 bar1          
2 val2 foo2 bar2          
3 val3 foo3 bar3 foo3 bar3
4 val4 foo4 bar4 foo4 bar4
5 val5 foo5 bar5 foo5 bar5
6 val6 foo6 bar6 foo6 bar6
7 val7           foo7 bar7
8 val8           foo8 bar8

The concept is very similar to a SQL cross join query.

I have been able to successfully write a function that merge objects. Unfortunately, the duration of the computation is exponential.

If I generate my data sets using :

$dsLength = 10
$dataset1 = 0..$dsLength | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}

I get these results:

  • $dsLength = 10 ==> 33ms (fine)
  • $dsLength = 100 ==> 89ms (fine)
  • $dsLength = 1000 ==> 1563ms (acceptable)
  • $dsLength = 5000 ==> 35764ms (too much)
  • $dsLength = 10000 ==> 138047ms (too much)
  • $dsLength = 20000 ==> 573614ms (far too much)

How can I merge datasets efficiently when data sets are large (my target is around 20K items) ?

Right now, I have these function defined :

function Merge-Objects{
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )

    $result = @()

    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }

    foreach($row1 in $Dataset1){
        $result += $row1
        $ds2propsNotInDs1Props | % {
            $row1 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
    }

    foreach($row2 in $Dataset2){
        $existing = foreach($candidate in $result){
            $match = $true
            foreach($prop in $Properties){
                if(-not ($row2.$prop -eq $candidate.$prop)){
                    $match = $false                   
                    break                  
                }
            }
            if($match){
                $candidate
                break
            }
        }
        if(!$existing){
            $ds1propsNotInDs2Props | % {
                $row2 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
            $result += $row2
        }else{
            $ds2propsNotInDs1Props | % {
                $existing.$($_.Name) = $row2.$($_.Name)
            }

        }
    }

    $result
}

I call these function like this :

Measure-Command -Expression {

    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 

}

My feeling is that the slowness is due to the second loop, where I try to match an existing row in each iteration

[Edit] Second approach using a hash as index. Surprisingly, it's event slower than first try

$dsLength = 1000
$dataset1 = 0..$dsLength | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}

function Get-Hash{
    param(
        [Parameter(Mandatory=$true)]
        [object]$InputObject,
        [Parameter()]
        [string[]]$Properties    
    )

    $InputObject | Select-object $properties | Out-String
}


function Merge-Objects{
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )

    $result = @()
    $index = @{}

    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $allProps = $ds1props + $ds2props | select -Unique

    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }

    $ds1index = @{}

    foreach($row1 in $Dataset1){
        $tempObject = new-object psobject
        $result += $tempObject
        $ds2propsNotInDs1Props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
        $ds1props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row1.$($_.Name)
        }

        $hash1 = Get-Hash -InputObject $row1 -Properties $Properties
        $ds1index.Add($hash1, $tempObject)

    }

    foreach($row2 in $Dataset2){
        $hash2 = Get-Hash -InputObject $row2 -Properties $Properties

        if($ds1index.ContainsKey($hash2)){
            # merge object
            $existing = $ds1index[$hash2]
            $ds2propsNotInDs1Props | % {
                $existing.$($_.Name) = $row2.$($_.Name)
            }
            $ds1index.Remove($hash2)

        }else{
            # add object
            $tempObject = new-object psobject
            $ds1propsNotInDs2Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
            $ds2props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row2.$($_.Name)
            }
            $result += $tempObject
        }
    }

    $result
}

Measure-Command -Expression {

    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 

}

[Edit2] Putting Measure-Commands around the two loops show that event the first loop is yet slow. Actually the first loop take more than 50% of the total time


Solution

  • I agree with @Matt. Use a hashtable -- something like the below. This should run in m + 2n rather than mn time.

    Timings on my system

    original Solution above

    #10    TotalSeconds      :   0.07788
    #100   TotalSeconds      :   0.37937
    #1000  TotalSeconds      :   5.25092
    #10000 TotalSeconds      : 242.82018
    #20000 TotalSeconds      : 906.01584
    

    This definitely looks O(n^2)

    Solution Below

    #10    TotalSeconds      :  0.094
    #100   TotalSeconds      :  0.425
    #1000  TotalSeconds      :  3.757
    #10000 TotalSeconds      : 45.652
    #20000 TotalSeconds      : 92.918
    

    This looks linear.

    Solution

    I used three techniques to increase the speed:

    1. Change over to a hashtable. This allows constant time lookups so that you don't have to have nested loops. This is the only change really needed to go from O(n^2) to linear time. It does have the disadvantage that there is more setup work done. So, the advantage of linear time won't be seen until the loop count is large enough to pay for the setup.
    2. Use ArrayList instead of a native array. Adding an item to a native array requires that the array be reallocated and all the items to be copied. So this is also an O(n^2) operation. Since this operation is being done at the engine level, the constant is very small, so it really won't make a difference until much later.
    3. Use PsObject.Copy to create the new object. This is a small optimization compared to the other two, but it cut the run time in half for me.

    --

    function Get-Hash{
        param(
            [Parameter(Mandatory=$true)]
            [object]$InputObject,
            [Parameter()]
            [string[]]$Properties    
        )
    
        $arr = [System.Collections.ArrayList]::new()
    
        foreach($p in $Properties) { $arr += $InputObject.$($p) }
    
        return ( $arr -join ':' )
    }
    
    function Merge-Objects{
        param(
            [Parameter(Mandatory=$true)]
            [object[]]$Dataset1,
            [Parameter(Mandatory=$true)]
            [object[]]$Dataset2,
            [Parameter()]
            [string[]]$Properties
        )
    
        $results = [System.Collections.ArrayList]::new()
    
        $ds1props = $Dataset1 | gm -MemberType Properties
        $ds2props = $Dataset2 | gm -MemberType Properties
        $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
        $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }
    
    
        $hash = @{}
        $Dataset2 | % { $hash.Add( (Get-Hash $_ $Properties), $_) }
    
        foreach ($row in $dataset1) {
    
            $key = Get-Hash $row $Properties
    
            $tempObject = $row.PSObject.Copy()
    
            if ($hash.containskey($key)) {
                $r2 = $hash[$key]
    
                $hash.remove($key)
                $ds2propsNotInDs1Props | % {
                    $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $r2.$($_.Name)
                }
    
            } else {
                $ds2propsNotInDs1Props | % {
                    $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
                }
            }
            [void]$results.Add($tempObject)
        }
    
        foreach ($row in $hash.values ) {
            # add missing dataset2 objects and extend
            $tempObject = $row.PSObject.Copy()
    
            $ds1propsNotInDs2Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
    
            [void]$results.Add($tempObject)
        }
    
        $results
    }
    
    ########
    
    $dsLength = 10000
    $dataset1 = 0..$dsLength | %{
        New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
    }
    $dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
        New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
    }
    
    Measure-Command -Expression {
    
        $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 
    
    }