Search code examples
powershellssis-2012

Out of memory exception on [System.IO.File]::ReadAllText with large CSV


I have a simple PowerShell script that replaces "false" or "true" with "0" or "1":

$InputFolder = $args[0];
if($InputFolder.Length -lt 3)
{
    Write-Host "Enter a path name as your first argument" -foregroundcolor Red
    return
}
if(-not (Test-Path $InputFolder)) {
    Write-Host "File path does not appear to be valid" -foregroundcolor Red
    return
}
Get-ChildItem $InputFolder
$content = [System.IO.File]::ReadAllText($InputFolder).Replace("`"false`"", "`"0`"").Replace("`"true`"", "`"1`"").Replace("`"FALSE`"", "`"0`"").Replace("`"TRUE`"", "`"1`"")
[System.IO.File]::WriteAllText($InputFolder, $content)
[GC]::Collect()

This works fine for almost all files I have to amend, with the exception of one 808MB CSV. I have no idea how many lines are in this CSV, as nothing I have will open it properly.

Interestingly, the PowerShell script will complete successfully when invoked manually via either PowerShell directly or via command prompt. When this is launched as part of the SSIS package it's required for, that's when the error happens.

Sample data for the file:

"RowIdentifier","DateProfileCreated","IdProfileCreatedBy","IDStaffMemberProfileRole","StaffRole","DateEmploymentStart","DateEmploymentEnd","PPAID","GPLocalCode","IDStaffMember","IDOrganisation","GmpID","RemovedData"     
"134","09/07/1999 00:00","-1","98","GP Partner","09/07/1999 00:00","14/08/2009 15:29","341159","BRA 871","141","B83067","G3411591","0"

Error message thrown:

PowerShell Error Message

I'm not tied to PowerShell - I'm open to other options. I had a cribbed together C# script previously, but that died on small files than this - I'm no C# developer, so was unable to debug it at all.

Any suggestions or help gratefully received.


Solution

    • Generally, avoiding read large files all at once, as you can run out of memory, as you've experienced.

    • Instead, process text-based files line by line - both reading and writing.

      • While PowerShell generally excels at line-by-line (object-by-object) processing, it it is slow with files with many lines.

      • Using the .NET Framework directly - while more complex - offers much better performance.

    • If you process the input file line by line, you cannot directly write back to it and must instead write to a temporary output file, which you can replace the input file with on success.

    Here's a solution that uses .NET types directly for performance reasons:

    # Be sure to use a *full* path, because .NET typically doesn't have the same working dir. as PS.
    $inFile = Convert-Path $Args[0]      
    $tmpOutFile = [io.path]::GetTempFileName()
    
    $tmpOutFileWriter = [IO.File]::CreateText($tmpOutFile)
    foreach ($line in [IO.File]::ReadLines($inFile)) {
      $tmpOutFileWriter.WriteLine(
        $line.Replace('"false"', '"0"').Replace('"true"', '"1"').Replace('"FALSE"', '"0"').Replace('"TRUE"', '"1"')
      )
    }
    $tmpOutFileWriter.Dispose()
    
    # Replace the input file with the temporary file.
    # !! BE SURE TO MAKE A BACKUP COPY FIRST.
    # -WhatIf *previews* the move operation; remove it to perform the actual move.
    Move-Item -Force -LiteralPath $tmpOutFile $inFile -WhatIf
    

    Note:

    • UTF-8 encoding is assumed, and the rewritten file will not have a BOM. You can change this by specifying the desired encoding to the .NET methods.

    • As an aside: Your chain of .Replace() calls on each input line can be simplified as follows, using PowerShell's -replace operator, which is case-insensitive, so only 2 replacements are needed:
      $line -replace '"false"', '"0"' -replace '"true"', '"1"'
      However, while that is shorter to write, it is actually slower than the .Replace() call chain, presumably because -replace is regex-based, which incurs extra processing.