I have a CSV file with about 40 columns. I use FileHelpers to read the file and extract a single column to perform additional operations later in the code.
The extracted list of string contains duplicate string elements and I get rid of them with the Distinct() method.
A new feature request requires me to extract additional columns, which are doubles and not string. At the moment, I am converting these element to string and I am building a list array.
After including these additional columns, eliminating the duplicates is more difficult. Before, I did not care which of the element in a duplicate series was retained. Now I do and I have two selection criteria based on the value in the two additional columns
I prepared a table to exemplify the issue at hand
╔════════════╦════════════╦════════════╗
║ Property 1 ║ Property 2 ║ Property 3 ║
╠════════════╬════════════╬════════════╣
║ AAA ║ 100 ║ 1000 ║
║ AAA ║ 50 ║ 500 ║
║ AAA ║ 10 ║ 800 ║
║ BBB ║ 5 ║ 70 ║
║ BBB ║ 20 ║ 20 ║
║ BBB ║ 18 ║ 11 ║
║ CCC ║ 10 ║ 13 ║
║ CCC ║ 10 ║ 445 ║
║ CCC ║ 5 ║ 1000 ║
║ DDD ║ 0 ║ 100 ║
║ DDD ║ 0 ║ 100 ║
║ DDD ║ 0 ║ 100 ║
╚════════════╩════════════╩════════════╝
The first sorting criteria is property 1 and this is the column I originally extracted from the CSV file. The second criteria is property 2. For example, the element "AAA" has 3 property 2 values: 100, 50 and 10. In this case, I would retain the one with 100 and eliminate the other two. For element "BBB", I would retain the row with property 2 = 20
If there is more than one max value for property 2 (e.g., element "CCC"), I would look at Property 3 and select the row with the highest value.
Finally, if I have a case like "DDD", I would chose any of the row for they are all identical.
I would like to know:
FileHelper
engine output or on the List array?For exemplification, below you can find the table I posted above with the expected resulta dummy data set
╔════════════╦════════════╦════════════╗
║ Property 1 ║ Property 2 ║ Property 3 ║
╠════════════╬════════════╬════════════╣
║ AAA ║ 100 ║ 1000 ║
║ AAA ║ 50 ║ 500 ║
║ AAA ║ 10 ║ 800 ║
║ BBB ║ 5 ║ 70 ║
║ BBB ║ 20 ║ 20 ║
║ BBB ║ 18 ║ 11 ║
║ CCC ║ 10 ║ 13 ║
║ CCC ║ 10 ║ 445 ║
║ CCC ║ 5 ║ 1000 ║
║ DDD ║ 0 ║ 100 ║
║ DDD ║ 0 ║ 100 ║
║ DDD ║ 0 ║ 100 ║
╚════════════╩════════════╩════════════╝
The desired output would be
╔════════════╦════════════╦════════════╗
║ Property 1 ║ Property 2 ║ Property 3 ║
╠════════════╬════════════╬════════════╣
║ AAA ║ 100 ║ 1000 ║
║ BBB ║ 20 ║ 20 ║
║ CCC ║ 10 ║ 445 ║
║ DDD ║ 0 ║ 100 ║
╚════════════╩════════════╩════════════╝
EDIT 1 I am including code snippet for the FileHelper class and a list that represent what the FileHelper engine would look like
The FileHelper class would be something like this (only the relevant keys are reported)
[DelimitedRecord(",")]
[IgnoreFirst(1)]
class Test
{
public string property_1;
public double property_2;
public double property_3;
}
After reading a CSV file, you would have a list that look like this (using the tables above as an example)
static List<Test> test = new List<Test>
{
new Test {property_1 = "AAA", property_2 = 100, property_3 = 1000},
new Test {property_1 = "AAA", property_2 = 50, property_3 = 500},
new Test {property_1 = "AAA", property_2 = 10, property_3 = 800},
new Test {property_1 = "BBB", property_2 = 5, property_3 = 70},
new Test {property_1 = "BBB", property_2 = 20, property_3 = 20},
new Test {property_1 = "BBB", property_2 = 18, property_3 = 11},
new Test {property_1 = "CCC", property_2 = 10, property_3 = 13},
new Test {property_1 = "CCC", property_2 = 10, property_3 = 445},
new Test {property_1 = "CCC", property_2 = 5, property_3 = 1000},
new Test {property_1 = "DDD", property_2 = 0, property_3 = 100},
new Test {property_1 = "DDD", property_2 = 0, property_3 = 100},
new Test {property_1 = "DDD", property_2 = 0, property_3 = 100},
};
CURRENT SOLUTION
Based on jdweng code, I wrote the following
List<Test> sorted = test.AsEnumerable()
.OrderByDescending(x => x.property_2).ThenByDescending(x => x.property_3).GroupBy(x => x.property_1).Select(x => x.First()).ToList();
If you want to see the result on console, you can print them with the following
foreach (Test t in test)
{
Console.WriteLine(t.property_1 + "\t" + t.property_2.ToString() + "\t" + t.property_3.ToString());
}
foreach (Test t in sorted)
{
Console.WriteLine(t.property_1 + "\t" + t.property_2.ToString() + "\t" + t.property_3.ToString());
}
Here is a way using a helper method :
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
namespace ConsoleApplication123
{
class Program
{
static void Main(string[] args)
{
DataTable dt = new DataTable();
dt.Columns.Add("Property 1", typeof(string));
dt.Columns.Add("Property 2", typeof(int));
dt.Columns.Add("Property 3", typeof(int));
dt.Rows.Add(new object[] { "AAA", 100, 1000 });
dt.Rows.Add(new object[] { "AAA", 50, 500 });
dt.Rows.Add(new object[] { "AAA", 10, 800 });
dt.Rows.Add(new object[] { "BBB", 5, 70 });
dt.Rows.Add(new object[] { "BBB", 20, 20 });
dt.Rows.Add(new object[] { "BBB", 18, 11 });
dt.Rows.Add(new object[] { "CCC", 10, 13 });
dt.Rows.Add(new object[] { "CCC", 10, 445 });
dt.Rows.Add(new object[] { "CCC", 5, 1000 });
dt.Rows.Add(new object[] { "DDD", 0, 100 });
dt.Rows.Add(new object[] { "DDD", 0, 100 });
dt.Rows.Add(new object[] { "DDD", 0, 100 });
DataTable largest = dt.AsEnumerable()
.OrderByDescending(x => x.Field<int>("Property 2"))
.ThenByDescending(x => x.Field<int>("Property 3"))
.GroupBy(x => x.Field<string>("Property 1"))
.Select(x => x.First())
.CopyToDataTable();
DataTable results = dt.AsEnumerable()
.GroupBy(x => x.Field<string>("Property 1")).Select(x => AddRowData(x.First(), x.ToArray())).CopyToDataTable();
}
static DataRow AddRowData(DataRow firstRow, DataRow[] allRows)
{
for (int col = 1; col < firstRow.ItemArray.Count(); col++)
{
firstRow[col] = allRows.Sum(x => x.Field<int>(col));
}
return firstRow;
}
}
}