PowerShell, test the performance/efficiency of asynchronous tasks with Start-Job and Start-Process


I'm curious to test out the performance/usefulness of asynchronous tasks in PowerShell with Start-ThreadJob, Start-Job and Start-Process. I have a folder with about 100 zip files and so came up with the following test:

New-Item "000" -ItemType Directory -Force   # Move the old zip files in here
foreach ($i in $zipfiles) {
    $name = $i -split ".zip"
    Start-Job -scriptblock {
        7z.exe x -o"$name" .\$name
        Move-Item $i 000\ -Force
        7z.exe a $i .\$name\*.*
    }
}

The problem with this is that it would start jobs for all 100 zips at once, which would probably be too much. So I want to set a value $numjobs, say 5, which I can change, such that only $numjobs jobs are started at the same time; the script then waits for all 5 of those jobs to end before the next block of 5 starts. I'd then like to watch the CPU and memory usage depending on the value of $numjobs.

How would I tell a loop only to run 5 times, then wait for the Jobs to finish before continuing?

I see that it's easy to wait for jobs to finish:

$jobs = $commands | Foreach-Object { Start-ThreadJob $_ }
$jobs | Receive-Job -Wait -AutoRemoveJob

but how might I wait for Start-Process tasks to end?

Although I would like to use ForEach-Object -Parallel, the enterprises that I work in will be solidly tied to PowerShell 5.1 for the next 3-4 years, I expect, with no chance to install PowerShell 7.x (although I would be curious to test with ForEach-Object -Parallel on my home system to compare all approaches).


Solution

  • ForEach-Object -Parallel and Start-ThreadJob both have built-in functionality to limit the number of threads that can run at the same time. The same applies to runspaces when they are run through a RunspacePool; both cmdlets run their script blocks in runspaces behind the scenes.

    Start-Job does not offer such functionality because each job runs in a separate process, as opposed to the cmdlets mentioned above, which run in different threads all in the same process. I would also personally not consider it a parallelism alternative: it is pretty slow, and in most cases a linear loop will be faster. Serialization and deserialization of the data passed between processes can be a problem in some cases too.
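
    Because Start-Job has no throttling parameter, the batching described in the question has to be done by hand. The following is a minimal sketch of that idea, not a tuned solution. It assumes placeholder work (a short sleep) instead of the 7-Zip calls, and $items is a hypothetical stand-in for $zipfiles:

    ```powershell
    # Hypothetical stand-ins: $items replaces $zipfiles, and each job only sleeps.
    $items   = 1..6
    $numjobs = 3          # jobs allowed per batch

    $results = for ($start = 0; $start -lt $items.Count; $start += $numjobs) {
        # Slice out the next batch (the last one may be shorter).
        $end   = [Math]::Min($start + $numjobs, $items.Count) - 1
        $batch = $items[$start..$end]

        # Start one job per item. The item is passed in explicitly because a
        # job runs in a separate process and cannot see the caller's variables.
        $jobs = foreach ($item in $batch) {
            Start-Job -ScriptBlock {
                param($value)
                Start-Sleep -Milliseconds 200
                "done $value"
            } -ArgumentList $item
        }

        # Block until every job in this batch has finished, then collect output.
        $jobs | Receive-Job -Wait -AutoRemoveJob
    }
    $results
    ```

    Each batch only starts once the previous one has fully drained, so one slow job holds up its whole batch; the throttled approaches below avoid that.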

    How to limit the number of running threads?

    Both cmdlets offer the -ThrottleLimit parameter for this.

    How would the code look?

    $dir = (New-Item "000" -ItemType Directory -Force).FullName
    
    # ForEach-Object -Parallel
    $zipfiles | ForEach-Object -Parallel {
        $name = [IO.Path]::GetFileNameWithoutExtension($_)
        7z.exe x "-o$name" $_   # 7-Zip expects no space after -o; extract the zip itself
        Move-Item $_ $using:dir -Force
        7z.exe a $_ .\$name\*.*
    } -ThrottleLimit 5
    
    # Start-ThreadJob
    $jobs = foreach ($i in $zipfiles) {
        Start-ThreadJob {
            $name = [IO.Path]::GetFileNameWithoutExtension($using:i)
            7z.exe x "-o$name" $using:i   # 7-Zip expects no space after -o; extract the zip itself
            Move-Item $using:i $using:dir -Force
            7z.exe a $using:i .\$name\*.*
        } -ThrottleLimit 5
    }
    $jobs | Receive-Job -Wait -AutoRemoveJob
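
    To compare different limits, as the question intends, the throttled run can be wrapped in Measure-Command. This is only a rough sketch: it assumes Start-ThreadJob is available (it ships with PowerShell 7 and is an installable module on 5.1), $items is a hypothetical stand-in for $zipfiles, and the work is a placeholder sleep rather than the 7-Zip calls:

    ```powershell
    # Assumption: Start-ThreadJob is available in this session.
    $items = 1..4   # hypothetical stand-in for $zipfiles

    $timings = foreach ($limit in 1, 2, 4) {
        # Time one full run of all items at this throttle limit.
        $elapsed = Measure-Command {
            $jobs = foreach ($i in $items) {
                Start-ThreadJob { Start-Sleep -Milliseconds 300 } -ThrottleLimit $limit
            }
            $jobs | Receive-Job -Wait -AutoRemoveJob
        }
        [pscustomobject]@{
            ThrottleLimit = $limit
            Seconds       = [Math]::Round($elapsed.TotalSeconds, 2)
        }
    }
    $timings | Format-Table
    ```

    Since the work here is only a sleep, the numbers mostly reflect scheduling overhead; meaningful results need the real workload in place of the sleep.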
    

    How to achieve the same having only PowerShell 5.1 available and no ability to install new modules?

    The RunspacePool offers this same functionality, either with its .SetMaxRunspaces(Int32) method or by targeting one of the RunspaceFactory.CreateRunspacePool overloads that take a maxRunspaces limit as an argument.

    How would the code look?

    $dir   = (New-Item "000" -ItemType Directory -Force).FullName
    $limit = 5
    $iss   = [initialsessionstate]::CreateDefault2()
    $pool  = [runspacefactory]::CreateRunspacePool(1, $limit, $iss, $Host)
    $pool.ThreadOptions = [Management.Automation.Runspaces.PSThreadOptions]::ReuseThread
    $pool.Open()
    
    $tasks  = foreach ($i in $zipfiles) {
        $ps = [powershell]::Create().AddScript({
            param($path, $dir)
    
            $name = [IO.Path]::GetFileNameWithoutExtension($path)
            7z.exe x "-o$name" $path   # 7-Zip expects no space after -o; extract the zip itself
            Move-Item $path $dir -Force
            7z.exe a $path .\$name\*.*
        }).AddParameters(@{ path = $i; dir = $dir })
        $ps.RunspacePool = $pool
    
        @{ Instance = $ps; AsyncResult = $ps.BeginInvoke() }
    }
    
    foreach($task in $tasks) {
        $task['Instance'].EndInvoke($task['AsyncResult'])
        $task['Instance'].Dispose()
    }
    $pool.Dispose()
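
    The other route mentioned above, .SetMaxRunspaces(Int32), could look roughly like the sketch below. It assumes a pool created with the parameterless CreateRunspacePool() overload and uses placeholder work (a short sleep and a multiplication) instead of the 7-Zip calls:

    ```powershell
    $limit = 5
    # Create a pool with default settings, then raise its upper limit.
    $pool = [runspacefactory]::CreateRunspacePool()
    # SetMaxRunspaces returns $true when the new limit was accepted.
    $accepted = $pool.SetMaxRunspaces($limit)
    $pool.Open()

    # Queue more work than the limit allows; extra items wait for a free runspace.
    $tasks = foreach ($i in 1..8) {
        $ps = [powershell]::Create().AddScript({
            param($n)
            Start-Sleep -Milliseconds 200
            $n * 2
        }).AddArgument($i)
        $ps.RunspacePool = $pool
        @{ Instance = $ps; AsyncResult = $ps.BeginInvoke() }
    }

    # Collect the output of each task in the order the tasks were queued.
    $output = foreach ($task in $tasks) {
        $task['Instance'].EndInvoke($task['AsyncResult'])
        $task['Instance'].Dispose()
    }
    $max = $pool.GetMaxRunspaces()
    $pool.Dispose()
    ```

    Passing the limit to CreateRunspacePool, as in the example above, is usually simpler; SetMaxRunspaces is mainly useful when the limit is only known after the pool has been created.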
    

    Note that, for all examples, it's unclear whether the 7zip code is correct. This answer attempts to demonstrate how async is done in PowerShell, not how to zip files / folders.
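
    The question also asks how to wait for Start-Process tasks. One common pattern is to start each process with -PassThru and pipe the returned Process objects to Wait-Process. A minimal sketch, assuming the current PowerShell executable as a hypothetical stand-in for 7z.exe and $items as a stand-in for $zipfiles:

    ```powershell
    # Hypothetical stand-in for 7z.exe: the current PowerShell executable,
    # asked to sleep briefly, so that the sketch has something to run.
    $exe     = (Get-Process -Id $PID).Path
    $items   = 1..4      # stand-in for $zipfiles
    $numjobs = 2         # processes allowed per batch

    $finished = for ($start = 0; $start -lt $items.Count; $start += $numjobs) {
        $end = [Math]::Min($start + $numjobs, $items.Count) - 1

        # -PassThru makes Start-Process return a Process object per process.
        $procs = foreach ($item in $items[$start..$end]) {
            Start-Process -FilePath $exe -PassThru -ArgumentList @(
                '-NoProfile', '-Command', 'Start-Sleep -Milliseconds 200'
            )
        }

        # Block until the whole batch has exited before starting the next one.
        $procs | Wait-Process
        $procs
    }
    ```

    For a single process, Start-Process -Wait does the same without the extra bookkeeping.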


    Below is a helper function that can simplify the process of parallel invocations. It tries to emulate ForEach-Object -Parallel and is compatible with PowerShell 5.1, though it shouldn't be taken as a robust solution:

    NOTE: This Q&A offers a much better and more robust alternative to the function below.

    using namespace System.Management.Automation
    using namespace System.Management.Automation.Runspaces
    using namespace System.Collections.Generic
    
    function Invoke-Parallel {
        [CmdletBinding()]
        param(
            [Parameter(Mandatory, ValueFromPipeline, DontShow)]
            [object] $InputObject,
    
            [Parameter(Mandatory, Position = 0)]
            [scriptblock] $ScriptBlock,
    
            [Parameter()]
            [int] $ThrottleLimit = 5,
    
            [Parameter()]
            [hashtable] $ArgumentList
        )
    
        begin {
            $iss = [initialsessionstate]::CreateDefault2()
            if($PSBoundParameters.ContainsKey('ArgumentList')) {
                foreach($argument in $ArgumentList.GetEnumerator()) {
                    $iss.Variables.Add([SessionStateVariableEntry]::new($argument.Key, $argument.Value, ''))
                }
            }
            $pool  = [runspacefactory]::CreateRunspacePool(1, $ThrottleLimit, $iss, $Host)
            $tasks = [List[hashtable]]::new()
            $pool.ThreadOptions = [PSThreadOptions]::ReuseThread
            $pool.Open()
        }
        process {
            try {
                $ps = [powershell]::Create().AddScript({
                    $args[0].InvokeWithContext($null, [psvariable]::new("_", $args[1]))
                }).AddArgument($ScriptBlock.Ast.GetScriptBlock()).AddArgument($InputObject)
    
                $ps.RunspacePool = $pool
                $invocationInput = [PSDataCollection[object]]::new(1)
                $invocationInput.Add($InputObject)
    
                $tasks.Add(@{
                    Instance    = $ps
                    AsyncResult = $ps.BeginInvoke($invocationInput)
                })
            }
            catch {
                $PSCmdlet.WriteError($_)
            }
        }
        end {
            try {
                foreach($task in $tasks) {
                    $task['Instance'].EndInvoke($task['AsyncResult'])
                    if($task['Instance'].HadErrors) {
                        $task['Instance'].Streams.Error
                    }
                    $task['Instance'].Dispose()
                }
            }
            catch {
                $PSCmdlet.WriteError($_)
            }
            finally {
                if($pool) { $pool.Dispose() }
            }
        }
    }
    

    An example of how it works:

    # Hashtable Key becomes the Variable Name inside the Runspace!
    $outsideVariables = @{ Message = 'Hello from {0}' }
    0..10 | Invoke-Parallel {
        "[Item $_] - " + $message -f [runspace]::DefaultRunspace.InstanceId
        Start-Sleep 5
    } -ArgumentList $outsideVariables -ThrottleLimit 3