c# linq ssis

Identify duplicates in a large file based on one field and remove the duplicate records


I have a relatively large CSV file with a lot of columns. I need to read that file, determine whether there are multiple records with the same "Test File Name" field, keep only the first record (by date) for each name, and copy those records to a new file, essentially removing the "duplicates". The records are not true duplicates: each record has different data but the same "Test File Name", so the general "remove duplicates" methods are not working for me. The duplicates are few and far between, so I need to loop through all of the records and grab only the first one entered, which is determined by the "Date Time" field in the record.

I need only one record for each "Test File Name".

Duplicate Records

Identifying fields

I tried a group by and an order by, but I am not sure I am doing it correctly, because it is not removing the second record.

Update: Let me clarify: the file is not THAT large. It's under a MB, but there are thousands of records. I am attempting to do this in a Script Task of an SSIS package. I apologize for my ignorance in posting here, and of the subject matter in general. I am new to C#, as I mostly work in SQL.


Solution

  • Thank you for all your input. I was finally able to use LINQ, similar to @Dmitry's comment but with a different variation, and it worked. I figured it might help someone else in the future, so I wanted to post the solution.

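                if (File.Exists(databasedatacsv))
                {
                    // Read the original file and keep one row per "Test File Name" (column 1):
                    // the row with the earliest "Date Time" (column 2).
                    // Note: Split(',') assumes no field contains a quoted comma.
                    var data = File.ReadAllLines(databasedatacsv)
                                        .Select(x => x.Split(','))
                                        // Skip rows too short to index, and rows with an empty name
                                        .Where(x => x.Length > 2 && x[1] != "")
                                        .GroupBy(x => x[1])
                                        // y[2] is compared as a string, so "earliest" is only correct
                                        // if the date text sorts chronologically (e.g. yyyy-MM-dd HH:mm:ss)
                                        .Select(x => x.OrderBy(y => y[2]).First());

                    foreach (var item in data)
                    {
                        string s = string.Join(",", item);
                        // Dump the data into a new CSV so we aren't modifying the original file
                        AppendPSFile(psdatabasedatacsv, s);
                    }
                }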
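
The snippet above orders by the raw text of the "Date Time" column, and text does not necessarily sort in chronological order. Here is a minimal sketch of the same GroupBy/OrderBy idea that parses the dates first and keeps the header row. The `CsvDedup` class, the `KeepEarliestPerKey` method, and the date format passed in are hypothetical and not part of the original solution. Column indexes 1 and 2 match the code above. The plain `Split(',')` still assumes that no field contains a quoted comma.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class CsvDedup
{
    // Keeps the header plus one row per key: the row with the earliest timestamp.
    // keyIndex, dateIndex and dateFormat are assumptions about the file layout.
    public static List<string> KeepEarliestPerKey(
        IEnumerable<string> lines, int keyIndex, int dateIndex, string dateFormat)
    {
        var all = lines.ToList();
        if (all.Count == 0) return new List<string>();

        var result = new List<string> { all[0] };   // first line is treated as the header

        var rows = all.Skip(1)
            .Select(line => line.Split(','))        // naive split: no quoted commas
            .Where(f => f.Length > Math.Max(keyIndex, dateIndex) && f[keyIndex] != "")
            .GroupBy(f => f[keyIndex])
            .Select(g => g
                // Parse the date so rows are ordered chronologically, not as text
                .OrderBy(f => DateTime.ParseExact(
                    f[dateIndex], dateFormat, CultureInfo.InvariantCulture))
                .First())
            .Select(f => string.Join(",", f));

        result.AddRange(rows);
        return result;
    }
}
```

A call might look like `CsvDedup.KeepEarliestPerKey(File.ReadAllLines(databasedatacsv), 1, 2, "M/d/yyyy H:mm")`, with the result written out in one step using `File.WriteAllLines`. `DateTime.ParseExact` throws a `FormatException` on a value that does not match the format, so the format string needs to match the actual file.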