Tags: c#, csv, ienumerable, csvhelper

How to determine size of IEnumerable in bytes in order to get a good batch size?


I'm writing records in CSV format in order to upload them as a file through an external API that has a file size limit on uploads. My method for writing records to memory looks like this:

using System.Globalization;
using CsvHelper;

public async Task<byte[]> WriteToMemoryAsync<T>(IEnumerable<T> recordsToWrite) where T : class
{
    using (var memoryStream = new MemoryStream())
    {
        using (var writer = new StreamWriter(memoryStream))
        using (var csv = new CsvWriter(writer, new CultureInfo("sv-SE")))
        {
            await csv.WriteRecordsAsync(recordsToWrite);

        }
        return memoryStream.ToArray();
    }
}

My current approach for batching looks like this:

public async Task<Dictionary<int, byte[]>> BatchWriteToMemoryAsync<T>(IEnumerable<T> recordsToWrite) where T : class
{
    var maxBatchSize = 50_000;

    var nrOfBatches = (int)Math.Ceiling((double)recordsToWrite.Count() / maxBatchSize);

    Dictionary<int, byte[]> records = new();
    for (int batchNr = 0; batchNr < nrOfBatches; batchNr++)
    {
        records.Add(batchNr, await WriteToMemoryAsync<T>(recordsToWrite.Skip(batchNr * maxBatchSize).Take(maxBatchSize)));
    }

    return records;
}

The problem is that the batch size is somewhat arbitrary. Depending on how large the objects of type T are, a batch of 50,000 records might still exceed the API's file size limit.

Is there any way to get the size in bytes of the IEnumerable<T> in order to get an approximation of the file size and then determine the number of batches based on this?

Is there any other way to approach this?
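
For example, the kind of approximation I have in mind would look roughly like this (a rough sketch only, assuming the records are fairly uniform in size and reusing WriteToMemoryAsync from above; EstimateBatchSizeAsync is just an illustrative name):

public async Task<int> EstimateBatchSizeAsync<T>(IEnumerable<T> recordsToWrite, long fileSizeLimit) where T : class
{
    //Serialize a small sample to estimate the average CSV size of one record.
    var sample = recordsToWrite.Take(100).ToList();
    if (sample.Count == 0)
        return 1;

    var sampleBytes = await WriteToMemoryAsync(sample);
    var bytesPerRecord = (double)sampleBytes.Length / sample.Count;

    //Leave some headroom, since record sizes vary and every batch file repeats the header.
    return Math.Max(1, (int)(fileSizeLimit * 0.9 / bytesPerRecord));
}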

Edit


I've now implemented the first solution that Magnus suggested. However, there is a problem: each record is written and flushed before the length of the stream is checked, so by the time the check runs the stream may already have passed the size limit.

I created a test to simulate the problem, and it fails because the first batch reaches 1009 B against the 1000 B limit.

[Fact]
public void WhenWritingToMemoryInBatches_ShouldEnsureFileSizeDoesNotExceedLimit()
{
    //Arrange
    var records = GenerateTestRecords(100);

    var fileSizeLimit = 1_000;  //1000B limit

    var csvHandler = new CsvHandler();

    //Act
    var file = csvHandler.BatchWriteToMemory(records, fileSizeLimit);

    //Assert
    Assert.All(file, f => Assert.True(f.Length < fileSizeLimit, $"Expected fileSize to be less than {fileSizeLimit}. Actual fileSize was {f.Length}."));
}

private IEnumerable<TestRecord> GenerateTestRecords(int amountOfRecords)
{
    List<TestRecord> records = new();
    for (int i = 0; i < amountOfRecords; i++)
    {
        records.Add(new TestRecord
        {
            StringType = $"String {i}",
            IntType = 1,
        });
    }
    return records;
}

private class TestRecord
{
    public string? StringType { get; set; }
    public int IntType { get; set; }
}
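
One adjustment I'm considering, though I haven't verified it, is to cut the batch before writing the next record instead of after the limit has already been exceeded, reserving some headroom for the size of a single record (a rough sketch based on the WriteToMemory method from the solution below; assumedMaxRecordSize is a guess, and any single record larger than that could still break the limit):

public IEnumerable<byte[]> BatchWriteToMemory<T>(IEnumerable<T> recordsToWrite, long fileSizeLimit, int assumedMaxRecordSize = 100)
{
    var memoryStream = new MemoryStream();
    var writer = new StreamWriter(memoryStream);
    var csv = new CsvWriter(writer, new CultureInfo("sv-SE"));

    try
    {
        foreach (var r in recordsToWrite)
        {
            //Start a new batch *before* writing a record that could push the
            //flushed length over the limit, instead of checking afterwards.
            if (memoryStream.Length > 0 && memoryStream.Length + assumedMaxRecordSize > fileSizeLimit)
            {
                csv.Dispose();
                writer.Dispose();
                memoryStream.Dispose();

                yield return memoryStream.ToArray(); //ToArray is valid on a disposed MemoryStream.

                memoryStream = new MemoryStream();
                writer = new StreamWriter(memoryStream);
                csv = new CsvWriter(writer, new CultureInfo("sv-SE"));
            }

            csv.WriteRecord(r);
            csv.NextRecord();
            csv.Flush(); //Flushing after every record keeps memoryStream.Length accurate for the check above.
        }

        if (memoryStream.Length > 0)
            yield return memoryStream.ToArray();
    }
    finally
    {
        csv.Dispose();
        writer.Dispose();
        memoryStream.Dispose();
    }
}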


Solution

  • What if you instead send all your items to WriteToMemory, check the size of the stream after each record is written, and once it reaches the desired size return the bytes written so far and start a new stream for the next batch? WriteToMemory then returns a set of batches instead of a single array. You might not want to flush after every written record; find an appropriate flush interval.

    public static IEnumerable<byte[]> WriteToMemory<T>(IEnumerable<T> recordsToWrite)
    {
        var memoryStream = new MemoryStream();
        var writer = new StreamWriter(memoryStream);
        var csv = new CsvWriter(writer, new CultureInfo("sv-SE"));
    
        try
        {
            foreach (var r in recordsToWrite)
            {
                csv.WriteRecord(r);
                csv.NextRecord(); //WriteRecord writes the fields but does not end the row.
                csv.Flush(); //might want to flush after every x items instead of each.
                if (memoryStream.Length >= 1024)
                {
                    csv.Dispose();
                    writer.Dispose();
                    memoryStream.Dispose();

                    //ToArray is still valid after the MemoryStream has been disposed.
                    yield return memoryStream.ToArray();
                    
                    memoryStream = new MemoryStream();
                    writer = new StreamWriter(memoryStream);
                    csv = new CsvWriter(writer, new CultureInfo("sv-SE"));
                }
            }
            
            csv.Flush();
            if (memoryStream.Length > 0)
                yield return memoryStream.ToArray();
        }
        finally
        {
            csv.Dispose();
            writer.Dispose();
            memoryStream.Dispose();
        }
    }
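
    A caller could then collect the batches in a dictionary, much like the original approach (illustrative usage only):

    var batches = WriteToMemory(recordsToWrite)
        .Select((bytes, index) => (index, bytes))
        .ToDictionary(x => x.index, x => x.bytes);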
    

    To avoid holding a large number of byte arrays in memory, you can instead pass a delegate into the method that creates the stream (for example a file on disk).

    using System;
    using System.Globalization;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;
    using CsvHelper;

    public class Program
    {
        private static int count = 0;
        public static async Task Main()
        {
            await WriteToStreamAsync(Enumerable.Range(0, 10_000), () => File.Create($"C:\\temp\\files\\file{count++}.csv"));
        }
    
        public static async Task WriteToStreamAsync<T>(IEnumerable<T> recordsToWrite, Func<Stream> createFile)
        {
            var stream = createFile();
            var writer = new StreamWriter(stream);
            var csv = new CsvWriter(writer, new CultureInfo("sv-SE"));
    
            try
            {
                var i = 0;
                foreach (var r in recordsToWrite)
                {
                    csv.WriteRecord(r);
                    csv.NextRecord(); //WriteRecord writes the fields but does not end the row.
                    if (++i % 100 == 0) //Find some good interval
                        await csv.FlushAsync();
                    if (stream.Length >= 1024)
                    {
                        await csv.DisposeAsync();
                        await writer.DisposeAsync();
                        await stream.DisposeAsync();
                        
                        stream = createFile();
                        writer = new StreamWriter(stream);
                        csv = new CsvWriter(writer, new CultureInfo("sv-SE"));
                    }
                }
            }
            finally
            {
                await csv.DisposeAsync();
                await writer.DisposeAsync();
                await stream.DisposeAsync();
            }
        }
    }
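
    Because the destination is hidden behind the delegate, the same method can also write batches to in-memory streams, for example to inspect batch sizes in a test (a sketch; MemoryStream.ToArray still works after the stream has been disposed):

    var batches = new List<MemoryStream>();
    await WriteToStreamAsync(Enumerable.Range(0, 10_000), () =>
    {
        var ms = new MemoryStream();
        batches.Add(ms);
        return ms;
    });
    var files = batches.Select(ms => ms.ToArray()).ToList();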