Search code examples
c#linqcsvcomparechoetl

How to use ChoETL to compare two CSV files for ADD, CHANGED or DELETED records (Master vs Detail)?


I've been playing with @Cinchoo's fantastic ETL system for C#. I need to compare two CSV files, where one CSV file is defined as a dynamically growing master table and the other is a feeder "detail" table.

The detail table may have differences in terms of NEW records, CHANGED records, or a record no longer (DELETED) existing in the master CSV file.

The output should be a 3rd table that replaces or updates the master table - so it's a growing CSV file.

Both tables have unique ID columns and a header row.

MASTER CSV

ID,name
1,Danny
2,Fred
3,Sam

DETAIL

ID,name
1,Danny
          <-- record no longer exists
3,Pamela <-- name change
4,Fernando   <-- new record

So far I've been referring to this fiddle, and the code below:

using System;
using ChoETL;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var input1 = ChoCSVReader.LoadText(csv1).WithFirstLineHeader().ToArray();
        var input2 = ChoCSVReader.LoadText(csv2).WithFirstLineHeader().ToArray();

        Console.WriteLine("NEW records\n");
        using (var output = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
        {
            output.Write(input2.OfType<ChoDynamicObject>().Except(input1.OfType<ChoDynamicObject>(), 
                                                                  new ChoDynamicObjectEqualityComparer(new string[] { "id" })));
        }
        
        Console.WriteLine("\n\nDELETED records\n");
        using (var output = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
        {
            output.Write(input1.OfType<ChoDynamicObject>().Except(input2.OfType<ChoDynamicObject>(), 
                                                                  new ChoDynamicObjectEqualityComparer(new string[] { "id" })));
        }
        
        Console.WriteLine("\n\nCHANGED records\n");
        using (var output = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
        {
            output.Write(input1.OfType<ChoDynamicObject>().Except(input2.OfType<ChoDynamicObject>(), 
                                                                  new ChoDynamicObjectEqualityComparer(new string[] { "id", "name" })));
        }
    }
    
    static string csv1 = @"
ID,name
1,Danny
2,Fred
3,Sam";
    
    static string csv2 = @"
ID,name
1,Danny
3,Pamela
4,Fernando";
}

OUTPUT

NEW records

ID,name
4,Fernando

DELETED records

ID,name
2,Fred

CHANGED records

ID,name
2,Fred
3,Sam

The CHANGED records is not working. As an added extra, I need a status so I want it to look like this:

CHANGED records
    
ID,name,status
1,Danny,NOCHANGE
2,Fred,DELETED
3,Pamela,CHANGED
4,Fernando,NEW

Thanks


Solution

  • Here is how you can do with Cinchoo ETL

                string csv1 = @"ID,name
    1,Danny
    2,Fred
    3,Sam";
    
                string csv2 = @"ID,name
    1,Danny
    3,Pamela
    4,Fernando";
    
                var r1 = ChoCSVReader.LoadText(csv1).WithFirstLineHeader().ToArray();
                var r2 = ChoCSVReader.LoadText(csv2).WithFirstLineHeader().ToArray();
    
                using (var w = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
                {
                    var newItems = r2.OfType<ChoDynamicObject>().Except(r1.OfType<ChoDynamicObject>(), new ChoDynamicObjectEqualityComparer(new string[] { "ID" }))
                        .Select(r => 
                        {
                            var dict = r.AsDictionary();
                            dict["Status"] = "NEW"; 
                            return new ChoDynamicObject(dict); 
                        }).ToArray();
    
                    var deletedItems = r1.OfType<ChoDynamicObject>().Except(r2.OfType<ChoDynamicObject>(), new ChoDynamicObjectEqualityComparer(new string[] { "ID" }))
                        .Select(r =>
                        {
                            var dict = r.AsDictionary();
                            dict["Status"] = "DELETED";
                            return new ChoDynamicObject(dict);
                        }).ToArray();
    
                    var changedItems = r2.OfType<ChoDynamicObject>().Except(r1.OfType<ChoDynamicObject>(), ChoDynamicObjectEqualityComparer.Default)
                        .Except(newItems.OfType<ChoDynamicObject>(), new ChoDynamicObjectEqualityComparer(new string[] { "ID" }))
                        .Select(r =>
                        {
                            var dict = r.AsDictionary();
                            dict["Status"] = "CHANGED";
                            return new ChoDynamicObject(dict);
                        }).ToArray();
    
                    var noChangeItems = r1.OfType<ChoDynamicObject>().Intersect(r2.OfType<ChoDynamicObject>(), ChoDynamicObjectEqualityComparer.Default)
                        .Select(r =>
                        {
                            var dict = r.AsDictionary();
                            dict["Status"] = "NOCHANGE";
                            return new ChoDynamicObject(dict);
                        }).ToArray();
    
                    var finalResult = Enumerable.Concat(newItems, deletedItems).Concat(changedItems).Concat(noChangeItems).OfType<dynamic>().OrderBy(r => r.ID);
                    w.Write(finalResult);
                }
    
                Console.WriteLine();
    

    Output:

    ID,name,Status
    1,Danny,NOCHANGE
    2,Fred,DELETED
    3,Pamela,CHANGED
    4,Fernando,NEW
    

    Sample fiddle: https://dotnetfiddle.net/mrHpFx

    UPDATE #1:

    Above approach will work for small CSV files. For large CSV files, you must avoid it. Rather approach it in stream manner. Sample fiddle shows how (Not fully tested, but it gives direction to do it.)

    Sample fiddle: https://dotnetfiddle.net/mh6w44

    UPDATE #2:

    Now Cinchoo ETL (v1.2.1.33) comes with built-in API to compare the CSV files in simplified manner

    var r1 = ChoCSVReader.LoadText(csv1).WithFirstLineHeader().WithMaxScanRows(1).OfType<ChoDynamicObject>();
    var r2 = ChoCSVReader.LoadText(csv2).WithFirstLineHeader().WithMaxScanRows(1).OfType<ChoDynamicObject>();
    
    using (var w = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
    {
        foreach (var t in r1.Compare(r2, "ID", "name" ))
        {
            dynamic v1 = t.MasterRecord as dynamic;
            dynamic v2 = t.DetailRecord as dynamic;
            if (t.Status == CompareStatus.Unchanged || t.Status == CompareStatus.Deleted)
            {
                v1.Status = t.Status.ToString();
                w.Write(v1);
            }
            else 
            {
                v2.Status = t.Status.ToString();
                w.Write(v2);
            }
        }
    }
    

    Sample fiddle: https://dotnetfiddle.net/uPR5Sq