I've been playing with @Cinchoo's fantastic ETL system for C#. I need to compare two CSV files, where one CSV file is defined as a dynamically growing master table and the other is a feeder "detail" table.
The detail table may have differences in terms of NEW records, CHANGED records, or a record no longer (DELETED) existing in the master CSV file.
The output should be a 3rd table that replaces or updates the master table - so it's a growing CSV file.
Both tables have unique ID columns and a header row.
MASTER CSV
ID,name
1,Danny
2,Fred
3,Sam
DETAIL
ID,name
1,Danny
<-- record no longer exists
3,Pamela <-- name change
4,Fernando <-- new record
So far I've been referring to this fiddle, and the code below:
using System;
using ChoETL;
using System.Linq;
public class Program
{
public static void Main()
{
var input1 = ChoCSVReader.LoadText(csv1).WithFirstLineHeader().ToArray();
var input2 = ChoCSVReader.LoadText(csv2).WithFirstLineHeader().ToArray();
Console.WriteLine("NEW records\n");
using (var output = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
{
output.Write(input2.OfType<ChoDynamicObject>().Except(input1.OfType<ChoDynamicObject>(),
new ChoDynamicObjectEqualityComparer(new string[] { "id" })));
}
Console.WriteLine("\n\nDELETED records\n");
using (var output = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
{
output.Write(input1.OfType<ChoDynamicObject>().Except(input2.OfType<ChoDynamicObject>(),
new ChoDynamicObjectEqualityComparer(new string[] { "id" })));
}
Console.WriteLine("\n\nCHANGED records\n");
using (var output = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
{
output.Write(input1.OfType<ChoDynamicObject>().Except(input2.OfType<ChoDynamicObject>(),
new ChoDynamicObjectEqualityComparer(new string[] { "id", "name" })));
}
}
static string csv1 = @"
ID,name
1,Danny
2,Fred
3,Sam";
static string csv2 = @"
ID,name
1,Danny
3,Pamela
4,Fernando";
}
OUTPUT
NEW records
ID,name
4,Fernando
DELETED records
ID,name
2,Fred
CHANGED records
ID,name
2,Fred
3,Sam
The CHANGED records is not working. As an added extra, I need a status so I want it to look like this:
CHANGED records
ID,name,status
1,Danny,NOCHANGE
2,Fred,DELETED
3,Pamela,CHANGED
4,Fernando,NEW
Thanks
Here is how you can do with Cinchoo ETL
string csv1 = @"ID,name
1,Danny
2,Fred
3,Sam";
string csv2 = @"ID,name
1,Danny
3,Pamela
4,Fernando";
var r1 = ChoCSVReader.LoadText(csv1).WithFirstLineHeader().ToArray();
var r2 = ChoCSVReader.LoadText(csv2).WithFirstLineHeader().ToArray();
using (var w = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
{
var newItems = r2.OfType<ChoDynamicObject>().Except(r1.OfType<ChoDynamicObject>(), new ChoDynamicObjectEqualityComparer(new string[] { "ID" }))
.Select(r =>
{
var dict = r.AsDictionary();
dict["Status"] = "NEW";
return new ChoDynamicObject(dict);
}).ToArray();
var deletedItems = r1.OfType<ChoDynamicObject>().Except(r2.OfType<ChoDynamicObject>(), new ChoDynamicObjectEqualityComparer(new string[] { "ID" }))
.Select(r =>
{
var dict = r.AsDictionary();
dict["Status"] = "DELETED";
return new ChoDynamicObject(dict);
}).ToArray();
var changedItems = r2.OfType<ChoDynamicObject>().Except(r1.OfType<ChoDynamicObject>(), ChoDynamicObjectEqualityComparer.Default)
.Except(newItems.OfType<ChoDynamicObject>(), new ChoDynamicObjectEqualityComparer(new string[] { "ID" }))
.Select(r =>
{
var dict = r.AsDictionary();
dict["Status"] = "CHANGED";
return new ChoDynamicObject(dict);
}).ToArray();
var noChangeItems = r1.OfType<ChoDynamicObject>().Intersect(r2.OfType<ChoDynamicObject>(), ChoDynamicObjectEqualityComparer.Default)
.Select(r =>
{
var dict = r.AsDictionary();
dict["Status"] = "NOCHANGE";
return new ChoDynamicObject(dict);
}).ToArray();
var finalResult = Enumerable.Concat(newItems, deletedItems).Concat(changedItems).Concat(noChangeItems).OfType<dynamic>().OrderBy(r => r.ID);
w.Write(finalResult);
}
Console.WriteLine();
Output:
ID,name,Status
1,Danny,NOCHANGE
2,Fred,DELETED
3,Pamela,CHANGED
4,Fernando,NEW
Sample fiddle: https://dotnetfiddle.net/mrHpFx
UPDATE #1:
Above approach will work for small CSV files. For large CSV files, you must avoid it. Rather approach it in stream manner. Sample fiddle shows how (Not fully tested, but it gives direction to do it.)
Sample fiddle: https://dotnetfiddle.net/mh6w44
UPDATE #2:
Now Cinchoo ETL (v1.2.1.33) comes with built-in API to compare the CSV files in simplified manner
var r1 = ChoCSVReader.LoadText(csv1).WithFirstLineHeader().WithMaxScanRows(1).OfType<ChoDynamicObject>();
var r2 = ChoCSVReader.LoadText(csv2).WithFirstLineHeader().WithMaxScanRows(1).OfType<ChoDynamicObject>();
using (var w = new ChoCSVWriter(Console.Out).WithFirstLineHeader())
{
foreach (var t in r1.Compare(r2, "ID", "name" ))
{
dynamic v1 = t.MasterRecord as dynamic;
dynamic v2 = t.DetailRecord as dynamic;
if (t.Status == CompareStatus.Unchanged || t.Status == CompareStatus.Deleted)
{
v1.Status = t.Status.ToString();
w.Write(v1);
}
else
{
v2.Status = t.Status.ToString();
w.Write(v2);
}
}
}
Sample fiddle: https://dotnetfiddle.net/uPR5Sq