powershell, t-sql, chunking, batching

Batching data before passing through the pipeline


I have a bunch of file shares with many millions of files/folders on them. I am using gci -Recurse to get a full list of directories/files on each share, and I need to load several properties from that gci output into SQL Server for additional analysis. The command I'm using to get the data is:

gci $SharePath -Recurse | select FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc

Now I could just pipe that to Write-SqlTableData using the syntax recommended to force bulk inserts, as suggested in Option 3 on Microsoft's Write-SqlTableData documentation page, like this:

$Params = @{
    ServerInstance = 'sqlservername'
    DatabaseName = 'databasename'
    SchemaName = 'dbo'
}
,(gci $SharePath -Recurse | select FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc) | Write-SqlTableData @Params -TableName 'Table1'

The result of this, however, is that the gci takes several hours to complete with no feedback, consumes many GB of memory, and slows my machine to a crawl before finally dumping all of the data to SQL. If I leave off the ,( and the matching ), the data moves to SQL as it is generated, but the SQL server is pounded with millions of individual inserts.
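In other words, the streaming variant, which trades the memory problem for millions of single-row inserts, is just the same command without the array wrapper:

gci $SharePath -Recurse | select FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc | Write-SqlTableData @Params -TableName 'Table1'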

What I'm looking for is an intermediate answer that uses the pipeline. I know I can store the gci results in a variable $gciresults and then pass 1000 rows at a time to SQL using $gciresults[0..999] and so on (sketched below), but I am trying to use the pipeline so I don't consume too much memory. Ideally, there would be some cmdlet, which I'll call batching-cmdlet, that would split my incoming data into bite-sized chunks without storing it all in memory first, like this:

gci ... | select FullName, ... | batching-cmdlet -batchsize 1000 | Write-SqlTableData @Params -TableName 'Table1'
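For contrast, the buffering workaround I described above would look something like this rough sketch; it works, but only after holding the entire listing in memory:

$gciresults = gci $SharePath -Recurse | select FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc

for ($i = 0; $i -lt $gciresults.Count; $i += 1000) {
    $end = [Math]::Min($i + 999, $gciresults.Count - 1)
    # Unary comma sends each slice as a single object to force a bulk insert.
    , $gciresults[$i..$end] | Write-SqlTableData @Params -TableName 'Table1'
}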

Searches for such a cmdlet were unsuccessful. Does anyone have thoughts on how I might accomplish this?


Solution

  • As of PowerShell 7.0, there is no batching (partitioning) mechanism, unfortunately.

    Therefore, you'll have to implement batching yourself for now:

    # Create an aux. queue for batching the objects.
    $batchSize = 1000
    $batch = [System.Collections.Generic.Queue[pscustomobject]]::new($batchSize)
    
    Get-ChildItem $SharePath -Recurse |
      Select-Object FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc |
        ForEach-Object {
          $batch.Enqueue($_) # add object to the current batch
          if ($batch.Count -eq $batchSize) { # batch is full -> write to the table
            # The unary comma sends the queue through the pipeline as a
            # single object, which is what triggers the bulk insert.
            , $batch | Write-SqlTableData @Params -TableName Table1
            $batch.Clear() # start the next batch (safe: the pipeline above has fully run)
          }
        }
    
    # Process any remaining items.
    if ($batch.Count) {
      , $batch | Write-SqlTableData @Params -TableName Table1
    }
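
    If you need this in more than one place, the same logic can be wrapped in a small advanced function approximating the batching-cmdlet the question envisions. Below is a minimal sketch; Select-Batch is an ad-hoc name, not a built-in cmdlet:

    function Select-Batch {
      [CmdletBinding()]
      param(
        [Parameter(ValueFromPipeline)] $InputObject,
        [int] $BatchSize = 1000
      )
      begin { $batch = [System.Collections.Generic.List[object]]::new($BatchSize) }
      process {
        $batch.Add($InputObject)
        if ($batch.Count -eq $BatchSize) {
          # Emit a *copy* of the batch as a single pipeline object (unary
          # comma); copying via ToArray() makes the subsequent Clear() safe
          # even if the downstream command holds on to the reference.
          , $batch.ToArray()
          $batch.Clear()
        }
      }
      end {
        if ($batch.Count) { , $batch.ToArray() } # emit any remaining partial batch
      }
    }

    With that defined, the pipeline envisioned in the question works as written:

    Get-ChildItem $SharePath -Recurse |
      Select-Object FullName, Attributes, Length, CreationTimeUtc, LastAccessTimeUtc, LastWriteTimeUtc |
        Select-Batch -BatchSize 1000 |
          Write-SqlTableData @Params -TableName Table1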