Tags: c#, csv, stream, large-files

How to efficiently download, read and process CSV in C#


I'm working on a service that will fetch a large CSV file from an online resource and, as it downloads, read the lines (preferably in batches) and send them to a database. It should never use more than 256 MB of RAM at any time, and it must not save the file to disk.

This is for a service that will run once every 7 days and collect all the companies in the Norwegian Company Register (a nifty 250 MB, 1.1-million-line CSV found here: http://hotell.difi.no/download/brreg/enhetsregisteret ).

My application can easily download the file, add it to a List<>, and process it, but it uses 3.3 GB of RAM.

public async Task<bool> CollectAndUpdateNorwegianCompanyRegistry()
{
    var request = await _httpClient.GetAsync(_options.Value.Urls["BrregCsv"]);

    request.EnsureSuccessStatusCode();

    using (var stream = await request.Content.ReadAsStreamAsync())
    using (var streamReader = new StreamReader(stream))
    {
        while (!streamReader.EndOfStream)
        {
            using (var csv = new CsvReader(streamReader)) // CsvReader is from the CsvHelper NuGet package
            {
                csv.Configuration.Delimiter = ";";
                csv.Configuration.BadDataFound = null;
                csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();

                await _sqlRepository.UpdateNorwegianCompaniesTable(csv.GetRecords<NorwegianCompany>().ToList());
            }
        }
    }

    return true;
}

A small note on the SqlRepository: while debugging, I've replaced it with a simple "destroyer" method that just clears the data, so it doesn't use any extra resources.

What I'd expect is that the garbage collector would free the resources as the lines of the file are processed, but it doesn't.

Put simply, I want the following to happen: as the CSV downloads, a few lines are read, sent off to a method, and then flushed from memory.
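The pattern being described can be sketched as a generic batching helper over a lazy sequence. This is only an illustration of the idea; the `Batch` extension name and the batch size are our own, not part of CsvHelper:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch only: splits any lazy IEnumerable<T> into fixed-size batches
// without materializing the whole source at once.
public static class EnumerableExtensions
{
    public static IEnumerable<List<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        var bucket = new List<T>(size);
        foreach (var item in source)
        {
            bucket.Add(item);
            if (bucket.Count == size)
            {
                yield return bucket;         // hand off one full batch
                bucket = new List<T>(size);  // start a fresh one
            }
        }
        if (bucket.Count > 0)
            yield return bucket;             // final partial batch
    }
}
```

With something like this, each batch could be sent to the database and then dropped, so only one batch's worth of records is held in memory at a time.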

I'm definitely inexperienced at working with large datasets, so I'm working from other people's examples and not getting the results I expect.

Thank you for your time and assistance.


Solution

  • Some pointers from Sami Kuhmonen (@sami-kuhmonen) helped, and here's what I came up with:

    public async Task<bool> CollectAndUpdateNorwegianCompanyRegistry()
    {
        using (var stream = await _httpClient.GetStreamAsync(_options.Value.Urls["BrregCsv"]))
        using (var streamReader = new StreamReader(stream))
        using (var csv = new CsvReader(streamReader))
        {
            csv.Configuration.Delimiter = ";";
            csv.Configuration.BadDataFound = null;
            csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();
    
            await _sqlRepository.UpdateNorwegianCompaniesTable(csv.GetRecords<NorwegianCompany>());
        }
    
        return true;
    }
    

    It downloads the entire file and sends it to the SqlRepository in 20 seconds, never exceeding 15% CPU or 30 MB of RAM.
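    The key change is passing `csv.GetRecords<NorwegianCompany>()` along without calling `.ToList()`: `GetRecords<T>()` returns a lazily evaluated `IEnumerable<T>`, so records are parsed one at a time as the consumer iterates, instead of all 1.1 million being materialized up front. A toy illustration of that deferred execution, with no CsvHelper involved:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    class DeferredDemo
    {
        // Yields values on demand; nothing is buffered up front.
        static IEnumerable<int> Numbers()
        {
            for (int i = 0; i < 1_000_000; i++)
                yield return i;
        }

        static void Main()
        {
            // Only three elements are ever produced, despite the million-item source.
            var firstThree = Numbers().Take(3).ToList();
            Console.WriteLine(string.Join(",", firstThree)); // prints "0,1,2"
        }
    }
    ```

    The same principle is why the streaming version stays under 30 MB: the consumer pulls records through the pipeline instead of the producer pushing a complete list into memory.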

    Now my next challenge is the SqlRepository, but this issue is solved.
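    For the repository side, one common approach is `SqlBulkCopy` with `EnableStreaming`, consuming the lazy record stream in batches so at most one batch is in memory at a time. The sketch below is hedged: the table name, column names, and `NorwegianCompany` properties shown are placeholders, not the actual mapping from the question:

    ```csharp
    // Sketch only: bulk-inserts a lazy record stream in batches via SqlBulkCopy.
    // Assumes a hypothetical table dbo.NorwegianCompanies whose columns match
    // the DataTable below; adjust names to the real NorwegianCompany class.
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.Threading.Tasks;

    public class SqlRepositorySketch
    {
        private readonly string _connectionString;

        public SqlRepositorySketch(string connectionString) =>
            _connectionString = connectionString;

        public async Task UpdateNorwegianCompaniesTable(IEnumerable<NorwegianCompany> records)
        {
            using (var bulk = new SqlBulkCopy(_connectionString))
            {
                bulk.DestinationTableName = "dbo.NorwegianCompanies";
                bulk.EnableStreaming = true; // don't buffer the whole source server-side
                bulk.BatchSize = 5000;

                // Placeholder columns; the real schema may differ.
                var table = new DataTable();
                table.Columns.Add("OrganizationNumber", typeof(string));
                table.Columns.Add("Name", typeof(string));

                foreach (var r in records)
                {
                    table.Rows.Add(r.OrganizationNumber, r.Name);
                    if (table.Rows.Count == 5000)
                    {
                        await bulk.WriteToServerAsync(table);
                        table.Clear(); // release the batch before reading more records
                    }
                }
                if (table.Rows.Count > 0)
                    await bulk.WriteToServerAsync(table);
            }
        }
    }
    ```

    Because `records` is the lazy enumerable from `GetRecords<T>()`, iterating it here pulls rows through the download-and-parse pipeline on demand, keeping the end-to-end memory footprint bounded by the batch size.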