Tags: powershell, export-to-csv, pipeline

How to use the PowerShell Pipeline to Avoid Large Objects?


I'm using a custom function to essentially do a DIR command (recursive file listing) on an 8TB drive (thousands of files).

My first iteration was:

$results = $PATHS | % {Get-FolderItem -Path "$($_)" } | Select Name,DirectoryName,Length,LastWriteTime 
$results | Export-Csv -Path $csvfile -Force -Encoding UTF8 -NoTypeInformation -Delimiter "|"

This resulted in a huge $results variable and slowed the system to a crawl, with the PowerShell process spiking to 99%-100% CPU as the processing went on.

I decided to use the power of the pipeline to WRITE to the CSV file directly (presumably freeing up the memory) instead of saving to an intermediate variable, and came up with this:

$PATHS | % {Get-FolderItem -Path "$($_)" } | Select Name,DirectoryName,Length,LastWriteTime | ConvertTo-CSV -NoTypeInformation -Delimiter "|" | Out-File -FilePath $csvfile -Force -Encoding UTF8

This seemed to be working fine (the CSV file was growing and CPU usage seemed stable), but then it abruptly stopped when the CSV file size hit ~200MB, and the error written to the console was "The pipeline has been stopped".

I'm not sure the CSV file size had anything to do with the error message, but I'm unable to process this large directory with either method! Any suggestions on how to allow this process to complete successfully?


Solution

  • Get-FolderItem runs robocopy to list the files and converts its output into a PSObject array. This is a slow operation which, strictly speaking, isn't required for the actual task. Pipelining also adds big overhead compared to the foreach statement. With thousands or hundreds of thousands of repetitions, that becomes noticeable.

    The approach below goes beyond anything pipelining and standard PowerShell cmdlets can offer: it writes the info for 400,000 files on an SSD drive in about 10 seconds. It uses:

    1. .NET Framework 4 or newer (included since Windows 8, installable on Windows 7/XP): IO.DirectoryInfo's EnumerateFileSystemInfos enumerates the files lazily, in a non-blocking, pipeline-like fashion;
    2. PowerShell 3 or newer, as it's faster than PS2 overall;
    3. the foreach statement, which doesn't need to create a ScriptBlock context for each item and is therefore much faster than the ForEach-Object cmdlet;
    4. IO.StreamWriter to write each file's info immediately, in a non-blocking, pipeline-like fashion;
    5. the \\?\ prefix trick to lift the 260-character path length restriction;
    6. manual queuing of directories to process, to get past "access denied" errors that would otherwise stop a naive IO.DirectoryInfo enumeration;
    7. progress reporting.
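
    The overhead claim behind points 2-3 can be spot-checked with Measure-Command. This is a minimal sketch; exact timings vary by machine and PowerShell version, but the foreach statement is consistently faster:

        $items = 1..100000
        # ForEach-Object invokes a script block per item through the pipeline
        (Measure-Command { $items | ForEach-Object { $_ * 2 } }).TotalMilliseconds
        # the foreach statement iterates without per-item script block invocation
        (Measure-Command { foreach ($i in $items) { $i * 2 } }).TotalMilliseconds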

    function List-PathsInCsv([string[]]$PATHS, [string]$destination) {
        $prefix = '\\?\' # extended-length path prefix lifts the 260-character limit
        $writer = [IO.StreamWriter]::new($destination, $false, [Text.Encoding]::UTF8, 1MB)
        $writer.WriteLine('Name|Directory|Length|LastWriteTime')
        $queue = [Collections.Generic.Queue[string]]($PATHS -replace '^', $prefix)
        $numFiles = 0
    
        while ($queue.Count) {
            $dirInfo = [IO.DirectoryInfo]$queue.Dequeue()
            try {
                $dirEnumerator = $dirInfo.EnumerateFileSystemInfos()
            } catch {
                Write-Warning ("$_".replace($prefix, '') -replace '^.+?: "(.+?)"$', '$1')
                continue
            }
            $dirName = $dirInfo.FullName.replace($prefix, '')
    
            foreach ($entry in $dirEnumerator) {
                if ($entry -is [IO.FileInfo]) {
                    $writer.WriteLine([string]::Join('|', @(
                        $entry.Name
                        $dirName
                        $entry.Length
                        $entry.LastWriteTime
                    )))
                } else {
                    $queue.Enqueue($entry.FullName)
                }
                if (++$numFiles % 1000 -eq 0) {
                    Write-Progress -activity Digging -status "$numFiles files, $dirName"
                }
            }
        }
        $writer.Close()
        Write-Progress -activity Digging -Completed
    }
    

    Usage:

    List-PathsInCsv 'c:\windows', 'd:\foo\bar' 'r:\output.csv'
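
    To sanity-check the result, the pipe-delimited file can be read back with Import-Csv (the path here is taken from the usage example above; Import-Csv reads every column as a string, hence the [long] cast when sorting by size):

        $rows = Import-Csv -Path 'r:\output.csv' -Delimiter '|'
        # ten largest files by size
        $rows | Sort-Object { [long]$_.Length } -Descending | Select-Object -First 10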