Search code examples
c#out-of-memorytext-filesstreamreaderstreamwriter

C# StreamReader throws OutOfMemoryException on last line when splitting large delimited text file


I am writing a function in C# to split a large delimited file into smaller delimited files. I am writing this function because a 2.7 GB file was taking hours to ETL and was causing a bottleneck to the ETL batch. I assign a max number of lines per files and start a new file once that limit is reached. If the file has a header, then I write the header to each file.

Initially, I tried using StreamReader and StreamWriter as recommended in many C# posts, as they are supposed to handle large files by reading one line at a time and not storing everything in memory. Additionally, I explored using File.ReadLines and File.WriteAllLines, as they are also recommended as memory-efficient options.

Both of these approaches work on smaller files; however, when processing the 2.7 GB with 1238951 rows, both methods failed with out-of-memory exceptions. Surprisingly, they managed to process 1238950 rows before throwing the exception, leaving me just one line short of completion. My PC has 32GB of RAM is is nowhere near pegged out on memory. What's going on here? Is there something I am missing?

internal void SplitFileWithStream(string localFilePath, bool hasHeader)
{
    try
    {
        var targetDirectory = Path.GetDirectoryName(localFilePath);
        var fileNameWithoutExtension = Path.GetFileNameWithoutExtension(localFilePath);
        var extension = Path.GetExtension(localFilePath);
        var fileSuffix = 0;
        var maxLinesPerFile = 100000;
        string? header = null;
        using (var sr = new StreamReader(localFilePath))
        {
            while (!sr.EndOfStream)
            {
                var lineNumber = 0;
                var newFileName = $"{fileNameWithoutExtension}__split_{(++fileSuffix)}{extension}";
                var newFilePath = Path.Combine(targetDirectory, newFileName);
                if (File.Exists(newFilePath))
                    File.Delete(newFilePath);
                using (var sw = new StreamWriter(newFilePath))
                {
                    if (!sr.EndOfStream && hasHeader)
                    {
                        if (header == null)
                        {
                            header = sr.ReadLine();
                            maxLinesPerFile++; //add one to max for header
                        }
                        sw.WriteLine(header);
                        lineNumber++;
                    }
                    while (!sr.EndOfStream && lineNumber < maxLinesPerFile)
                    {
                        sw.WriteLine(sr.ReadLine());
                        lineNumber++;
                    }
                    sw.Close();
                }
            }
            sr.Close();
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
}
System.OutOfMemoryException: Exception of type
'System.OutOfMemoryException' was thrown.    at
System.Text.StringBuilder.ToString()    at
System.IO.StreamReader.ReadLine()    at
MyFileUtilities.FileSplitter.SplitFileWithStream(String localFilePath, Boolean hasHeader)
in C:\Repos\MySolution\MyFileUtilitites\FileSplitter.cs:line 341
internal void SplitFileWithReadLines(string localFilePath, bool hasHeader)
{
    try
    {
        var targetDirectory = Path.GetDirectoryName(localFilePath);
        var fileNameWithoutExtension = Path.GetFileNameWithoutExtension(localFilePath);
        var fileName = Path.GetFileName(localFilePath);
        var extension = Path.GetExtension(localFilePath);
        var fileSuffix = 0;
        var maxLinesPerFile = 100000;
        long position = 0;
        string? header = null;
        int i = 0;
        int skip = 0;
        string workingFilePath = localFilePath;
        while (true)
        {
            int take = maxLinesPerFile;
            if (hasHeader && header == null)
            {
                header = File.ReadLines(localFilePath).Take(1).First();
                skip = 1;
            }
            var linesToSplice = File.ReadLines(workingFilePath).Skip(skip).Take(take);
            if (!linesToSplice.Any())
                return;
            linesToSplice.Append(Environment.NewLine); //added just to see if it would try to write the actual last line, but did not work.
            var newFileName = $"{fileNameWithoutExtension}__split_{(++i)}{extension}";
            var newFilePath = Path.Combine(targetDirectory, newFileName);
            File.WriteAllLines(newFilePath, linesToSplice);
            skip += take;
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
}
System.OutOfMemoryException: Exception of type
'System.OutOfMemoryException' was thrown.    at
System.Text.StringBuilder.ToString()    at
System.IO.StreamReader.ReadLine()    at
System.IO.ReadLinesIterator.MoveNext()    at
System.Linq.Enumerable.EnumerablePartition`1.MoveNext()    at
System.IO.File.InternalWriteAllLines(TextWriter writer, IEnumerable`1
contents)    at System.IO.File.WriteAllLines(String path,
IEnumerable`1 contents)    at
MyFileUtilities.FileSplitter.SplitFileWithReadLines(String
localFilePath, Boolean hasHeader) in
C:\Repos\MySolution\MyFileUtilitites\FileSplitter.cs\FileSplitter.cs:line
402

I realize there's linux commands, and other tools to do this, but if it's possible, I'd like a C# solution to add this to a C# utility library I have.

Edit: I tried File.ReadLines(workingFilePath).Skip(skip).Take(take).ToList(); and got another OutOfMemoryException.

System.OutOfMemoryException: Exception of type
'System.OutOfMemoryException' was thrown.    at
System.Text.StringBuilder.ToString()    at
System.IO.StreamReader.ReadLine()    at
System.IO.ReadLinesIterator.MoveNext()    at
System.Linq.Enumerable.EnumerablePartition`1.ToList()    at
System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)    at
MyFileUtilities.FileSplitter.SplitFileWithReadLines(String
localFilePath, Boolean hasHeader) in ...

Solution

  • I have concluded that the last line of the file was corrupt. I was unable to open it in most text editors, and in the few text editors I was able to open the file with, it was erratic or unresponsive. However, I downloaded a trial version of Ultra Edit, which was very fast and smooth. Ultra Edit opened the file easily and revealed what I believe are the hidden/corrupt characters at the end of the last line. They are represented by little squares that would not display in other text editors. 1,779,974,601 little squares to be exact! I recommend using Ultra Edit to search for corrupt characters in your file effectively.

    As seen in Ultra Edit