Search code examples
powershellsortingscriptingmergesortsort-object

Custom Powershell sort function


I have a giant array of 1M+ names in it, some are alpha numeric some are just alpha.

CSV:
id,firstname,lastname,email,email2,profession
100,Andeee,Michella,Andeee.Michella@yopmail.com,Andeee.Michella@gmail.com,police officer
101,Tybie,1Grobe,Tybie.Grobe@yopmail.com,Tybie.Grobe@gmail.com,worker
102,Fernande,Azeria,Fernande.Azeria@yopmail.com,Fernande.Azeria@gmail.com,developer
103,Lenna,Schenck,Lenna.Schenck@yopmail.com,Lenna.Schenck@gmail.com,police officer
104,4Marti,Brittani,Marti.Brittani@yopmail.com,Marti.Brittani@gmail.com,worker
105,Riannon,Aldric,Riannon.Aldric@yopmail.com,Riannon.Aldric@gmail.com,doctor
106,Corry,Nikaniki,Corry.Nikaniki@yopmail.com,Corry.Nikaniki@gmail.com,worker
107,Correy,Shama,Correy.Shama@yopmail.com,Correy.Shama@gmail.com,police officer
108,Marcy,Drus,Marcy.Drus@yopmail.com,Marcy.Drus@gmail.com,worker
109,Bill,Valerio,Bill.Valerio@yopmail.com,Bill.Valerio@gmail.com,worker

I don't want to use Sort-Oject or Sort for the entire array for it's taking way too long. This needs to be done in Powershell because of restrictions in my environment.

The array comes from a csv that I exported from another powershell job. ( I don't have access to the jobs code, just the results).

Here is the example that I created from the Java method I found. It failed with the following error: The script failed due to call depth overflow.

$array = @("Ryan", "Kelly", "Alex", "Kyle", "Riley")
mergeSort $array

write-host $array

function mergeSort
{

   param([string[]] $arr)

      if($arr.length -ge 2){
         
         #find mid-point
         $left_len = [math]::floor([int32]$arr.length/2)                                              
         
         #declare array size of left of mid-point
         $left = [string[]]::new($left_len);                                                                 

         #find mid-point
         $right_len = [math]::ceiling([int32]($arr.Length - $arr.Length /2))                
     
         #declare array size right of mid-point
         $right = [string[]]::new($right_len);                                                               

         for ($i = 0; $i -lt $left_len.Length; $i++){
            $left= $arr[$i]
         }#for loop

         for ($i = 0; $i -lt $right_len; $i++){
            $right = $arr[$i +$arr.Length/2]
         }
     }#if($arr.length -ge 2)

   mergeSort $left

   mergeSort $right

   merge ($arr, $left, $right)
}

function merge
{
    param ([string[]] $result,[string[]] $left, [string[]] $right)

    $i1 = 0

    $12 = 0

    for ($i = 0; $i -le $result.Length; $i++) {
      if($i2 -gt $right.Length -or ($i1 -lt $left.Length -and $left[$i1].CompareTo($right[$i2]) -lt 0)) {
         $result[$i] = $left[$i1]
          $i1++
       }
       else {
          $result[$i] = $right[$i2]
          $i2++
       }

   }

   $result.legnth

 }

This is latest solution I came up with following everyone's suggestions: I want to make this work in paralles but it throws errors:

$array = @('Ryan', 'Kelly', 'Alex', 'Kyle', 'Riley', '4test', 'test4', 'why', 'you', 'me', 'where', 'hello', 'jose', 'test', 
'Jelly', 'Plex', 'Cyle', 'Miley', '5test', '3test4', 'who', 'Bou', 'We', 'There', 'Yellow', 'Pose', 'West')

$type = ("System.Collections.Generic.SortedSet"+'`'+"2") -as "Type"
$type = $type.MakeGenericType( @( ("System.string" -as "Type"), ("system.string" -as "Type") ) )
$sortedArray = [Activator]::CreateInstance($type, 10000)

$a, $b = ($array | Split-Collection -Count ([int]$array.length/2) | %{ $_ -join ',' })

$firstCollection = $a.Split(",")
$secondCollection = $b.Split(",")

$counter = 0
$counterHalf = $array.Length/2

1..$counterHalf| ForEach {
    try {   
        $col1 = $firstCollection[$counter]
        $sortedArray.Add($col1, $counter)
    }
    catch { "Out of bound col 1" }

    try {    
        $col2 = $secondCollection[$counter]
        $sortedArray.Add($col2, $counter)
    }
    catch { "Out of bound col 2" }
    
    $counter++
}

$sortedArray


function Split-Collection {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline=$true)] $Collection,
        [Parameter(Mandatory=$true)][ValidateRange(1, 247483647)][int] $Count)
    begin {
        $Ctr = 0
        $Arrays = @()
        $TempArray = @()
    }
    process {
        if (++$Ctr -eq $Count) {
            $Ctr = 0
            $Arrays += , @($TempArray + $_)
            $TempArray = @()
            return
        }
        $TempArray += $_
    }
    end {
        if ($TempArray) { $Arrays += , $TempArray }
        $Arrays
    }
}

Solution

  • FWIW, this is an answer to the original question about getting your Merge Sort code working. Unfortunately it's not very performant, so I don't know if it will actually help you with the wider problem of sorting your 1M+ rows...

    The Good News

    I made some tweaks to your original mergeSort that appear to fix it, at least for the sample array at the top of your question.

    The fixes were mostly typos - see a screenshot from BeyondCompare to see the changes I made:

    enter image description here

    The Bad News

    It's sloooooow.

    PS> $array = [string[]] (1..10000 | % { $_.ToString() }) 
    PS> measure-command {
        mergeSort $array
    }
    
    ...
    TotalMilliseconds : 11511.74
    

    compared to

    PS> $array = [string[]] (1..10000 | % { $_.ToString() }) 
    PS> measure-command {
        $array = $array | sort-object
    }
    
    ...
    TotalMilliseconds : 36.8607
    

    Maybe it performs better with the scale of data you're talking about, but I didn't have the patience to test it :-)

    The Ugly

    I also tweaked the code slightly so it sorts the left side before doing anything with the right side, which means it doesn't need to use as much memory.

    Here's the updated sample, for posterity.

    $ErrorActionPreference = "Stop";
    Set-StrictMode -Version "Latest";
    
    function mergeSort
    {
    
        param([string[]] $arr)
    
        if( $arr.length -gt 1 )
        {
    
            # sort the left side
            $left_len = [Math]::Floor([int32]$arr.length / 2)
            $left = [string[]]::new($left_len);                                                                 
            for( $i = 0; $i -lt $left_len; $i++ )
            {
                $left[$i] = $arr[$i]
            }
            mergeSort -arr $left
    
            # sort the right side
            $right_len = $arr.Length - $left_len
            $right = [string[]]::new($right_len);
            for( $i = 0; $i -lt $right_len; $i++ )
            {
                $right[$i] = $arr[$left_len + $i]
            }
            mergeSort -arr $right
    
            # merge the two sides
            merge -result $arr -left $left -right $right
    
        }
    
    }
    
    function merge
    {
        param ([string[]] $result,[string[]] $left, [string[]] $right)
    
        $i1 = 0
        $i2 = 0
    
        for ($i = 0; $i -lt $result.Length; $i++)
        {
            if( ($i1 -lt $left.Length) -and (($i2 -ge $right.Length) -or $left[$i1].CompareTo($right[$i2]) -lt 0) )
            {
                $result[$i] = $left[$i1]
                $i1++
            }
            else
            {
                $result[$i] = $right[$i2]
                $i2++
            }
        }
    
    }
    
    $array = [string[]] @("Ryan", "Kelly", "Alex", "Kyle", "Riley")
    mergeSort $array
    
    write-host $array
    

    One thing to specifically call out is casting the input array to a string:

    $array = [string[]] @("Ryan", "Kelly", "Alex", "Kyle", "Riley")
    

    Without the cast, $array is of type [System.Object[]] and what happens is PowerShell will create a new temporary [string[]] array internally, copy the values into that and then sort the internal array, but it won't assign the internal array back to $array.

    Try it without the cast to see that in action.