
How to sort 30 million CSV records in PowerShell


I am using an OleDbConnection to sort the first column of a CSV file. The OLEDB query completes successfully for up to 9 million records, in about 6 minutes. But when I run it against 10 million records, I get the following error message.

Exception calling "ExecuteReader" with "0" argument(s): "The query cannot be completed. Either the size of the query result is larger than the maximum size of a database (2 GB), or there is not enough temporary storage space on the disk to store the query result."

Is there any other solution to sort 30 million records using PowerShell?

Here is my script:

$OutputFile = "D:\Performance_test_data\output1.csv"
$stream = [System.IO.StreamWriter]::new( $OutputFile )

$sb = [System.Text.StringBuilder]::new()
$sw = [Diagnostics.Stopwatch]::StartNew()

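# The ACE OLEDB Text driver treats the Data Source folder as a database and each CSV file in it as a table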
$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='D:\Performance_test_data\';Extended Properties='Text;HDR=Yes;CharacterSet=65001;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from 1crores.csv order by col6"

$conn.open()

$data = $cmd.ExecuteReader()

echo "Query has been completed!"
$stream.WriteLine( "col1,col2,col3,col4,col5,col6")

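# Stream each result row straight to the output file instead of collecting everything in memory first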
while ($data.read()) 
{ 
  $stream.WriteLine( $data.GetValue(0) +',' + $data.GetValue(1)+',' + $data.GetValue(2)+',' + $data.GetValue(3)+',' + $data.GetValue(4)+',' + $data.GetValue(5))

}
echo "data written successfully!!!"

$stream.close()
$sw.Stop()
$sw.Elapsed

$cmd.Dispose()
$conn.Dispose()

Solution

  • I have added a new answer as this is a completely different approach to tackle this issue.
    Instead of creating temporary files (which presumably causes a lot of file opens and closures), you might consider creating an ordered list of indices and then going over the input file (-InputFile) multiple times, each time processing a selected number of lines (-BufferSize, default 1Gb; you might have to tweak this "memory usage vs. performance" parameter):

    Function Sort-Csv {
        [CmdletBinding()] param(
            [string]$InputFile,
            [String]$Property,
            [string]$OutputFile,
            [Char]$Delimiter = ',',
            [System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default,
            [Int]$BufferSize = 1Gb
        )
        Begin {
            if ($InputFile.StartsWith('.\')) { $InputFile = Join-Path (Get-Location) $InputFile }
            $Index = 0
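            # SortedDictionary keeps its keys sorted, so enumerating its values yields the original row indices ordered by $Property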
            $Dictionary = [System.Collections.Generic.SortedDictionary[string, [Collections.Generic.List[Int]]]]::new()
            Import-Csv $InputFile -Delimiter $Delimiter -Encoding $Encoding | Foreach-Object { 
                if (!$Dictionary.ContainsKey($_.$Property)) { $Dictionary[$_.$Property] = [Collections.Generic.List[Int]]::new() }
                $Dictionary[$_.$Property].Add($Index++)
            }
            $Indices = [int[]]($Dictionary.Values | ForEach-Object { $_ })
            $Dictionary = $Null                                     # we only need the sorted index list
        }
        Process {
            $Start = 0
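            # Rough estimate of how many rows fit in $BufferSize: total rows scaled by the buffer-to-file-size ratio, divided by a safety factor of 2.2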
            $ChunkSize = [int]($BufferSize / (Get-Item $InputFile).Length * $Indices.Count / 2.2)
            While ($Start -lt $Indices.Count) {
                [System.GC]::Collect()
                $End = $Start + $ChunkSize - 1
                if ($End -ge $Indices.Count) { $End = $Indices.Count - 1 }
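                # Record which original file line numbers fall into this slice of the sorted output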
                $Chunk = @{}
                For ($i = $Start; $i -le $End; $i++) { $Chunk[$Indices[$i]] = $i }
                $Reader = [System.IO.StreamReader]::new($InputFile, $Encoding)
                $Header = $Reader.ReadLine()
                # Scan the file once and keep only the lines needed for this chunk,
                # stopping early once $ChunkSize lines have been collected
                $Count = 0
                For ($i = 0; ($Line = $Reader.ReadLine()) -and $Count -lt $ChunkSize; $i++) {
                    if ($Chunk.Contains($i)) { $Chunk[$i] = $Line; $Count++ }
                }
                $Reader.Dispose()
                if ($OutputFile) {
                    if ($OutputFile.StartsWith('.\')) { $OutputFile = Join-Path (Get-Location) $OutputFile }
                    $Writer = [System.IO.StreamWriter]::new($OutputFile, ($Start -ne 0), $Encoding)
                    if ($Start -eq 0) { $Writer.WriteLine($Header) }
                    For ($i = $Start; $i -le $End; $i++) { $Writer.WriteLine($Chunk[$Indices[$i]]) }
                    $Writer.Dispose()
                } else {
                    $Start..$End | ForEach-Object { $Header } { $Chunk[$Indices[$_]] } | ConvertFrom-Csv -Delimiter $Delimiter
                }
                $Chunk = $Null
                $Start = $End + 1
            }
        }
    }
    

    Basic usage

    Sort-Csv .\Input.csv <PropertyName> -Output .\Output.csv
    Sort-Csv .\Input.csv <PropertyName> | ... | Export-Csv .\Output.csv
    

    Note that for 1crores.csv it will probably just export the full file in one go unless you set -BufferSize to a lower amount, e.g. 500Kb.
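
    For example, with the file and sort column from the question (1crores.csv, ordered by col6), forcing a smaller buffer should make it write the output in several passes; this is just a sketch of such an invocation:

    Sort-Csv .\1crores.csv col6 -OutputFile .\output1.csv -BufferSize 500Kb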