Search code examples
c#csvsortingfilehelpers

Sorting a List array and delete duplicate rows based on column values


I have a CSV file with about 40 columns. I use FileHelpers to read the file and extract a single column to perform additional operations later in the code.

The extracted list of string contains duplicate string elements and I get rid of them with the Distinct() method.

A new feature request requires me to extract additional columns, which are doubles and not string. At the moment, I am converting these element to string and I am building a list array.

After including these additional columns, eliminating the duplicates is more difficult. Before, I did not care which of the element in a duplicate series was retained. Now I do and I have two selection criteria based on the value in the two additional columns

I prepared a table to exemplify the issue at hand

╔════════════╦════════════╦════════════╗
║ Property 1 ║ Property 2 ║ Property 3 ║
╠════════════╬════════════╬════════════╣
║ AAA        ║        100 ║       1000 ║
║ AAA        ║         50 ║        500 ║
║ AAA        ║         10 ║        800 ║
║ BBB        ║          5 ║         70 ║
║ BBB        ║         20 ║         20 ║
║ BBB        ║         18 ║         11 ║
║ CCC        ║         10 ║         13 ║
║ CCC        ║         10 ║        445 ║
║ CCC        ║          5 ║       1000 ║
║ DDD        ║          0 ║        100 ║
║ DDD        ║          0 ║        100 ║
║ DDD        ║          0 ║        100 ║
╚════════════╩════════════╩════════════╝

The first sorting criteria is property 1 and this is the column I originally extracted from the CSV file. The second criteria is property 2. For example, the element "AAA" has 3 property 2 values: 100, 50 and 10. In this case, I would retain the one with 100 and eliminate the other two. For element "BBB", I would retain the row with property 2 = 20

If there is more than one max value for property 2 (e.g., element "CCC"), I would look at Property 3 and select the row with the highest value.

Finally, if I have a case like "DDD", I would chose any of the row for they are all identical.

I would like to know:

  1. where would it be better to perform the parsing operation? On the FileHelper engine output or on the List array?
  2. Is there a method that already does what I would like to do?
  3. If not, could you provide a general overview of the strategy you would recommend?

For exemplification, below you can find the table I posted above with the expected resulta dummy data set

╔════════════╦════════════╦════════════╗
║ Property 1 ║ Property 2 ║ Property 3 ║
╠════════════╬════════════╬════════════╣
║ AAA        ║        100 ║       1000 ║
║ AAA        ║         50 ║        500 ║
║ AAA        ║         10 ║        800 ║
║ BBB        ║          5 ║         70 ║
║ BBB        ║         20 ║         20 ║
║ BBB        ║         18 ║         11 ║
║ CCC        ║         10 ║         13 ║
║ CCC        ║         10 ║        445 ║
║ CCC        ║          5 ║       1000 ║
║ DDD        ║          0 ║        100 ║
║ DDD        ║          0 ║        100 ║
║ DDD        ║          0 ║        100 ║
╚════════════╩════════════╩════════════╝

The desired output would be

╔════════════╦════════════╦════════════╗
║ Property 1 ║ Property 2 ║ Property 3 ║
╠════════════╬════════════╬════════════╣
║ AAA        ║        100 ║       1000 ║
║ BBB        ║         20 ║         20 ║
║ CCC        ║         10 ║        445 ║
║ DDD        ║          0 ║        100 ║
╚════════════╩════════════╩════════════╝

EDIT 1 I am including code snippet for the FileHelper class and a list that represent what the FileHelper engine would look like

The FileHelper class would be something like this (only the relevant keys are reported)

[DelimitedRecord(",")]
    [IgnoreFirst(1)]
    class Test
    {
        public string property_1;
        public double property_2;
        public double property_3;
    }

After reading a CSV file, you would have a list that look like this (using the tables above as an example)

static List<Test> test = new List<Test>
        {
                new Test {property_1 = "AAA", property_2 = 100, property_3 = 1000},
                new Test {property_1 = "AAA", property_2 = 50, property_3 = 500},
                new Test {property_1 = "AAA", property_2 = 10, property_3 = 800},
                new Test {property_1 = "BBB", property_2 = 5, property_3 = 70},
                new Test {property_1 = "BBB", property_2 = 20, property_3 = 20},
                new Test {property_1 = "BBB", property_2 = 18, property_3 = 11},
                new Test {property_1 = "CCC", property_2 = 10, property_3 = 13},
                new Test {property_1 = "CCC", property_2 = 10, property_3 = 445},
                new Test {property_1 = "CCC", property_2 = 5, property_3 = 1000},
                new Test {property_1 = "DDD", property_2 = 0, property_3 = 100},
                new Test {property_1 = "DDD", property_2 = 0, property_3 = 100},
                new Test {property_1 = "DDD", property_2 = 0, property_3 = 100},
        };

CURRENT SOLUTION

Based on jdweng code, I wrote the following

List<Test> sorted = test.AsEnumerable()
                .OrderByDescending(x => x.property_2).ThenByDescending(x => x.property_3).GroupBy(x => x.property_1).Select(x => x.First()).ToList();  

If you want to see the result on console, you can print them with the following

foreach (Test t in test)
{
    Console.WriteLine(t.property_1 + "\t" + t.property_2.ToString() + "\t" + t.property_3.ToString());
}

foreach (Test t in sorted)
{
    Console.WriteLine(t.property_1 + "\t" + t.property_2.ToString() + "\t" + t.property_3.ToString());
}

Solution

  • Here is a way using a helper method :

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Data;
    
    
    namespace ConsoleApplication123
    {
        class Program
        {
            static void Main(string[] args)
            {
                DataTable dt = new DataTable();
                dt.Columns.Add("Property 1", typeof(string));
                dt.Columns.Add("Property 2", typeof(int));
                dt.Columns.Add("Property 3", typeof(int));
    
                dt.Rows.Add(new object[] { "AAA", 100, 1000 });
                dt.Rows.Add(new object[] { "AAA", 50, 500 });
                dt.Rows.Add(new object[] { "AAA", 10, 800 });
                dt.Rows.Add(new object[] { "BBB", 5, 70 });
                dt.Rows.Add(new object[] { "BBB", 20, 20 });
                dt.Rows.Add(new object[] { "BBB", 18, 11 });
                dt.Rows.Add(new object[] { "CCC", 10, 13 });
                dt.Rows.Add(new object[] { "CCC", 10, 445 });
                dt.Rows.Add(new object[] { "CCC", 5, 1000 });
                dt.Rows.Add(new object[] { "DDD", 0, 100 });
                dt.Rows.Add(new object[] { "DDD", 0, 100 });
                dt.Rows.Add(new object[] { "DDD", 0, 100 });
    
    
                DataTable largest = dt.AsEnumerable()
                    .OrderByDescending(x => x.Field<int>("Property 2"))
                    .ThenByDescending(x => x.Field<int>("Property 3"))
                    .GroupBy(x => x.Field<string>("Property 1"))
                    .Select(x => x.First())
                    .CopyToDataTable();
    
                DataTable results = dt.AsEnumerable()
                    .GroupBy(x => x.Field<string>("Property 1")).Select(x => AddRowData(x.First(), x.ToArray())).CopyToDataTable();
    
            }
            static DataRow AddRowData(DataRow firstRow, DataRow[] allRows)
            {
                for (int col = 1; col < firstRow.ItemArray.Count(); col++)
                {
                    firstRow[col] = allRows.Sum(x => x.Field<int>(col));
                }
                return firstRow;
            }
        }
    
    }