Search code examples
c#csvcsvhelper

Adding fields midway while creating a CSV file


I need to create a CSV file from a database table that holds JSON data in one of the table's columns. Each line in the table can hold a JSON with different variables.

The CSV I create should have fields for all variables in all lines.

Example:

Line 1 in db table : {"height":100,"weight":50}

Line 2 in db table: {"color":"red"}

Line 3 in db table: {"color":"blue","height":75}

CSV to be created:

height,weight,color
100,50,
,,red
,75,blue

Since the table may include hundreds of thousands of lines, I would like to know if there is a way to add fields midway while reading lines from the table and creating the CSV without reading the entire table once to get all fields and then again to write the CSV.

I've used CSVHelper https://joshclose.github.io/CsvHelper/ for handling CSV operations in the past but I saw no option for my scenario.


Solution

  • The following code generate the required output from a list of strings containing json data. You could complete it reading data from database:

    public string GetCsv(string[] jsonValues) {
    
        var objectsAsDictionary = jsonValues.Select(_ =>
            JsonSerializer.Deserialize<Dictionary<string,object>>(_)).ToArray();
    
        using var textWriter =    new StringWriter();
        using var csv = new CsvWriter(textWriter, CultureInfo.InvariantCulture);
        var properties = objectsAsDictionary.SelectMany(_ => _.Keys)
            .Distinct().ToArray();
    
        foreach (var property in properties) {
            csv.WriteField(property);
        }
        csv.NextRecord();
        foreach (var item in objectsAsDictionary) {
            foreach (var property in properties) {
                object value = null;
                item.TryGetValue(property, out value);
                csv.WriteField(value);
            }
            csv.NextRecord();
        }
        return textWriter.ToString();
    }
    

    This is a test that check the output:

    var class1 = new Class1();
    var rows = new[] {
        "{\"height\":100,\"weight\":50}",
        "{\"color\":\"red\"}",
        "{\"color\":\"blue\",\"height\":75}"
    };
    
    var result = class1.GetCsv(rows);
    var expected = "height,weight,color" + Environment.NewLine +
    "100,50," + Environment.NewLine +
    ",,red" + Environment.NewLine +
    "75,,blue" + Environment.NewLine;
    Assert.Equal(expected, result);
    

    Version with an incremental approach

    The following code could be used if the requirement can change regarding the leading delimiter of first rows.

    In this case each value could be processed in an incremental way and the header is added only at the end.

    For simplicity I don't include DB access, but the enumerable parameter could be replaced by a data reader.

    public string GetCsv(IEnumerable<string> jsonValues) {
        var properties = new List<string>();        
    
        using var textWriter = new StringWriter();
        using var csv = new CsvWriter(textWriter, CultureInfo.InvariantCulture);      
    
        foreach (var jsonValue in jsonValues) {
            var item = JsonSerializer.Deserialize<Dictionary<string, object>>(jsonValue);            
            foreach (var property in item.Keys) {
                if (!properties.Contains(property)) {
                    properties.Add(property);
                }
            }
            foreach (var property in properties) {
                object value = null;
                item.TryGetValue(property, out value);
                csv.WriteField(value);
            }
            csv.NextRecord();
        }
        
        return string.Join(csv.Configuration.Delimiter, properties) + Environment.NewLine +
            textWriter.ToString();
    }
    

    In this case I don't need to read all data at once, but I added columns each time that I found a new property.

    The result would be the following (note that the second row doesn't include the final comma):

    height,weight,color
    100,50
    ,,red
    75,,blue
    

    Eventually a further processing could add missing commas to first rows.