I'm writing a comparer script for two CSV files that has columns and corresponding values for each column on every new line. Columns are specified on the first line of the file. Each line after that contains data for each column. The file is semicolon-delimited.
I'm trying to create a program that can handle files with differing line numbers and number of columns and one that could also display the line number of values that differed and then create a new text file that displays the line number, column name and value of File 1 and File 2.
The comparison should be done based on some identifier instead of line by line. If some column data was missing specified in the column row, then it could display the number of columns which data was missing.
So for example:
worker1.csv:
Name | Age | Height | Gender |
---|---|---|---|
Bob | 21 | 190 | Male |
John | 35 | 182 | Male |
Rose | |||
Mary | 20 | 175 | Female |
worker2.csv
Name | Age | Height | Gender |
---|---|---|---|
Bob | 21 | 185 | Male |
John | 30 | 186 | Male |
Mary |
output.csv
Differences found in Mary:
File 2, Line number 3, Missing three values
Differences found in Bob:
File 1, Line number 1, Height: 190
File 2, Line number 1, Height: 185
Differences found in John:
File 1, Line number 2, Age: 35, Height: 182
File 2, Line number 3, Age: 30, Height: 186
How can I do this? I did look into LINQ's Except on both files lines, but how can I get the line numbers?
This is a little more complicated that it first appears. But if you approach it step-by-step, it's doable.
I'm going to assume that you have enough memory to load one of the file's records into a dictionary. If your files are very large, things get a lot more complicated.
The first thing you want to do is load one of the files into a dictionary, indexed by the ID. In my example, I'll assume that the ID is the name. Each record will be recorded in a FileLine
instance:
class FileLine
{
public int LineNumber;
public string Name;
public int Age;
public int Height;
public string Gender;
}
And your dictionary:
Dictionary<string, FileLine> File1Lines = new Dictionary<string, FileLine>();
Now, read the file into that dictionary:
int lineNumber = 0;
foreach (var line in File.ReadLines("worker1.csv")
{
// split the line and assign the fields.
// End up with name, age, height, and gender variables.
++lineNumber;
var theLine = new FileLine(
LineNumber = lineNumber,
Name = name,
Age = age,
Height = height,
Gender = gender);
File1Lines.Add(theLine.Name, theLine);
}
Now, you can read the second file, look up the item in the dictionary, and report any differences:
lineNumber = 0;
foreach (var line in File.ReadLines("worker2.csv"))
{
// split the line and create a FileLine instance.
// we'll call it line2
// Then, look to see if that line is in the File1Lines dictionary.
FileLine line1;
if (!File1Lines.TryGetValue(line2.Name, out line1))
{
// the line didn't exist in the first file
}
else
{
// Now compare individual fields
if (line2.Age != line1.Age)
{
// report that fields are different
}
// Do the same with other fields
}
}
Now, if you want to keep track of lines that were in the first file but not in the second file, then create a HashSet<string>
, and whenever you find a record in the second file, add the name to the hash set. When you're done with the second file, you can compare your hash set with the keys in the dictionary. So if your hash set is called FoundRecords
, then you'd have:
var recordsNotFound = File1Lines.Keys.Except(FoundRecords);
foreach (var name in recordsNotFound)
{
// look up item in the dictionary to report not found
}