Copy-item using Start-ThreadJob in Powershell


Following on from this thread (Copy-item using invoke-async in Powershell), I have the following.

@mklement0's method (copied and amended from here) works, but because it creates a thread per file it is exceptionally slow, and on my test system, working with ~14,000 files, it consumed more than 4 GB of memory:

# This works but is INCREDIBLY SLOW because it creates a thread per file
# Create sample CSV file with 10 rows.
# (The closing '@ of the here-string must start at the beginning of its line.)
$FileList = Join-Path ([IO.Path]::GetTempPath()) "tmp.$PID.csv"
@'
Foo,SrcFileName,DestFileName,Bar
1,c:\tmp\a,\\server\share\a,baz
2,c:\tmp\b,\\server\share\b,baz
3,c:\tmp\c,\\server\share\c,baz
4,c:\tmp\d,\\server\share\d,baz
5,c:\tmp\e,\\server\share\e,baz
6,c:\tmp\f,\\server\share\f,baz
7,c:\tmp\g,\\server\share\g,baz
8,c:\tmp\h,\\server\share\h,baz
9,c:\tmp\i,\\server\share\i,baz
10,c:\tmp\j,\\server\share\j,baz
'@ | Set-Content $FileList

# How many threads at most to run concurrently.
$NumCopyThreads = 8

Write-Host 'Creating jobs...'
$dtStart = [datetime]::UtcNow

# Import the CSV data and transform it to [pscustomobject] instances
# with only .SrcFileName and .DestFileName properties - they take
# the place of your original [fileToCopy] instances.
$jobs = Import-Csv $FileList | Select-Object SrcFileName, DestFileName | 
  ForEach-Object {
    # Start the thread job for the file pair at hand.
    Start-ThreadJob -ThrottleLimit $NumCopyThreads -ArgumentList $_ { 
        param($f) 
        [System.IO.Fileinfo]$DestinationFilePath = $f.DestFileName
        [String]$DestinationDir = $DestinationFilePath.DirectoryName
        if (-not (Test-path([Management.Automation.WildcardPattern]::Escape($DestinationDir)))) {
            new-item -Path $DestinationDir -ItemType Directory #-Verbose
        }
        copy-item -path $f.srcFileName -Destination $f.destFilename
        "Copied $($f.SrcFileName) to $($f.DestFileName)"
    }
  }

Write-Host "Waiting for $($jobs.Count) jobs to complete..."

# Synchronously wait for all jobs (threads) to finish and output their results
# *as they become available*, then remove the jobs.
# NOTE: Output will typically NOT be in input order.
Receive-Job -Job $jobs -Wait -AutoRemoveJob
Write-Host "Total time lapsed: $([datetime]::UtcNow - $dtStart)"

# Clean up the temp. file
Remove-Item $FileList

This article (the PowerShell Jobs section in particular) gave me the idea of splitting the complete list into batches of 1,000 files. When I run it against my test case I get 15 threads (as I have ~14,500 files), but each thread only processes the first file in its "chunk" and then stops:

<#
.SYNOPSIS
<Brief description>
For examples type:
Get-Help .\<filename>.ps1 -examples
.DESCRIPTION
Copies files from one path to another
.PARAMETER FileList
e.g. C:\path\to\list\of\files\to\copy.txt
.PARAMETER NumCopyThreads
default is 8 (but can be 100 if you want to stress the machine to maximum!)
.PARAMETER LogName
default is output.csv located in the same path as the Filelist
.EXAMPLE
to run using defaults just call this file:
.\CopyFilesToBackup
to run using anything else use this syntax:
.\CopyFilesToBackup -FileList C:\path\to\list\of\files\to\copy.txt -NumCopyThreads 20 -LogName C:\temp\backup.log
.\CopyFilesToBackup -FileList .\copytest.csv -NumCopyThreads 30 -Verbose
.NOTES
#>

[CmdletBinding()] 
Param( 
    [String] $FileList = "C:\temp\copytest.csv", 
    [int] $NumCopyThreads = 8,
    [String] $LogName
) 

$filesPerBatch = 1000

$files = Import-Csv $FileList | Select-Object SrcFileName, DestFileName

$i = 0
$j = $filesPerBatch - 1
$batch = 1

Write-Host 'Creating jobs...'
$dtStart = [datetime]::UtcNow

$jobs = while ($i -lt $files.Count) {
    $fileBatch = $files[$i..$j]

    $jobName = "Batch$batch"
    Start-ThreadJob -Name $jobName -ThrottleLimit $NumCopyThreads -ArgumentList ($fileBatch) -ScriptBlock {
        param($filesInBatch)
        foreach ($f in $filesInBatch) {
            [System.IO.Fileinfo]$DestinationFilePath = $f.DestFileName
            [String]$DestinationDir = $DestinationFilePath.DirectoryName
            if (-not (Test-path([Management.Automation.WildcardPattern]::Escape($DestinationDir)))) {
                new-item -Path $DestinationDir -ItemType Directory -Verbose
            }
            copy-item -path $f.srcFileName -Destination $f.DestFileName -Verbose
        }
    } 

    $batch += 1
    $i = $j + 1
    $j += $filesPerBatch

    if ($i -gt $files.Count) {$i = $files.Count}
    if ($j -gt $files.Count) {$j = $files.Count}
}

Write-Host "Waiting for $($jobs.Count) jobs to complete..."

Receive-Job -Job $jobs -Wait -AutoRemoveJob
Write-Host "Total time lapsed: $([datetime]::UtcNow - $dtStart)"

I feel like I'm missing something obvious but I don't know what.

Can anyone help?


Solution

  • Change:

    Start-ThreadJob -Name $jobName -ThrottleLimit $NumCopyThreads -ArgumentList ($fileBatch) -ScriptBlock {
    

    to

    Start-ThreadJob -Name $jobName -ThrottleLimit $NumCopyThreads -ArgumentList (,$fileBatch) -ScriptBlock {
    

    Note the comma before $fileBatch in the argument list.

    This fixes it because -ArgumentList expects an array and binds each element to a separate parameter of the script block. You want the entire batch to go to the first parameter, so you have to wrap your array inside another array, which is what the unary comma operator does.

    Apparently (this is news to me), PowerShell will happily treat a single object as a one-item collection in the foreach loop, which is why only the first item in each batch is processed.
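
To see the difference in isolation, here is a minimal sketch rather than the asker's script. It assumes Start-ThreadJob is available (it ships with PowerShell 7; on Windows PowerShell it comes from the ThreadJob module). The names $batch and $countItems and the sample paths are hypothetical stand-ins for one file batch and the copy loop:

```powershell
# Hypothetical stand-in for one batch of CSV rows: three objects with a SrcFileName property.
$batch = 'a', 'b', 'c' | ForEach-Object { [pscustomobject]@{ SrcFileName = "c:\tmp\$_" } }

# Counts how many times foreach iterates over whatever is bound to the first parameter.
$countItems = {
    param($filesInBatch)
    $n = 0
    foreach ($f in $filesInBatch) { $n++ }
    $n
}

# Without the comma: each array element binds to a separate positional parameter,
# so $filesInBatch receives only the first object and the loop runs once.
$withoutComma = Start-ThreadJob -ArgumentList ($batch) -ScriptBlock $countItems |
    Receive-Job -Wait -AutoRemoveJob

# With the comma: the batch is wrapped in a one-element outer array,
# so the whole batch binds to $filesInBatch and the loop visits every object.
$withComma = Start-ThreadJob -ArgumentList (,$batch) -ScriptBlock $countItems |
    Receive-Job -Wait -AutoRemoveJob

"Without comma: $withoutComma iteration(s)"
"With comma:    $withComma iteration(s)"
```

Run as written, this should report one iteration without the comma and three with it, which matches the "only the first file in each batch" symptom described in the question.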