I am looking for a way to improve the performance of C# code and would appreciate any help.
There are 2 tables: Table_1 and Table_2, and I want to collect data from Table_2 and save in Table_1 in the following form:
Table_1
Date | Some_Stat |
---|---|
2021-06-01 | 23.7 |
2021-06-02 | 12.6 |
2021-06-03 | 47.9 |
Table_2
Date | ID | A | B | C |
---|---|---|---|---|
2021-06-02 | 4 | 21 | 23 | 13 |
2021-06-02 | 3 | 67 | 31 | 25 |
2021-06-01 | 3 | 45 | 54 | 33 |
2021-06-03 | 3 | 71 | 28 | 51 |
2021-06-03 | 4 | 26 | 83 | 24 |
My goal: Table_1 after join
Date | Some_Stat | A_3 | B_3 | C_3 | A_4 | B_4 | C_4 |
---|---|---|---|---|---|---|---|
2021-06-01 | 23.7 | 45 | 54 | 33 | 0 | 0 | 0 |
2021-06-02 | 12.6 | 67 | 31 | 25 | 21 | 23 | 13 |
2021-06-03 | 47.9 | 71 | 28 | 51 | 26 | 83 | 24 |
In order to achieve this I used the code below, which works, but is very slow, especially when there are thousands of IDs. Basically what the code does is making first the required join and transferring LINQ join results to existing Table_1 (copies the columns). I have checked the performance time and the LINQ query is always very fast (time: 0ms), but the data transfer is the problem.
List<int> ids = Table_2.AsEnumerable().Select(s => s.Field<int>("ID").Distinct().ToList();
for (int i = 0; i < ids.Count; i++)
{
Table_1.Columns.Add($"A_{ids[i]}", typeof(double));
Table_1.Columns.Add($"B_{ids[i]}", typeof(double));
Table_1.Columns.Add($"C_{ids[i]}", typeof(double));
}
for (int i = 0; i < ids.Count; i++)
{
// LINQ join (fast)
var joinedTables = from T1 in Table_1.AsEnumerable()
join T2 in Table_2.Select($"ID = {ids[i]}").AsEnumerable()
on (String)T1["Date"] equals (String)T2["Date"]
into T1_and_T2
from TT in T1_and_T2.DefaultIfEmpty()
select new
{
Date = (String)T1["Date"],
A = TT != null ? (double)TT["A"] : 0.0,
B = TT != null ? (double)TT["B"] : 0.0,
C = TT != null ? (double)TT["C"] : 0.0,
};
// data transfer (very slow)
for (int day = 0; day < joinedTables.Count(); day++)
{
Table_1.Rows[day][$"A_{ids[i]}"] = joinedTables.ElementAt(day).A;
Table_1.Rows[day][$"B_{ids[i]}"] = joinedTables.ElementAt(day).B;
Table_1.Rows[day][$"C_{ids[i]}"] = joinedTables.ElementAt(day).C;
}
}
Also instead of the data transfer version above I have tried another way, but it is as slow as the previous:
int day = 0;
foreach( var row in joinedTables)
{
Table_1.Rows[day][$"A_{ids[i]}"] = row.A;
Table_1.Rows[day][$"B_{ids[i]}"] = row.B;
Table_1.Rows[day][$"C_{ids[i]}"] = row.C;
}
I am also open to new approaches on how to collect data from Table_2
in Table_1
. There could be a way to use build-in functions (written in C or C++) which will access Machine Code directly ( for example, a function which copies columns from one table to another table, like in python) in order to avoid looping through rows.
Note: In the original problem Table_1
may have up to 500 rows and 10 columns and Table_2
may have up to 5000 rows per day and populated 50 - 100%. The solutions above in the extreme case will take from 2 to 4 hours, which is very slow ( I believe this could be done within minutes).
My recommendation is to use a different approach. A DataTable
is not a particularly fast object, and looking up columns to set values is slow. Creating a new DataTable
to replace Table_1
can be much faster since you can use the DataRowCollection.Add()
method to quickly add rows.
Using a Dictionary
to convert Table_2
allows for much faster lookup than ElementAt
as well.
var joinDict = (from T2 in Table_2.AsEnumerable()
select new {
Date = T2.Field<string>("Date"),
ID = T2.Field<int>("ID"),
A = T2.Field<double>("A"),
B = T2.Field<double>("B"),
C = T2.Field<double>("C"),
})
.ToDictionary(t2 => (t2.Date, t2.ID));
List<int> ids = Table_2.AsEnumerable().Select(s => s.Field<int>("ID")).Distinct().OrderBy(x => x).ToList();
var ans = Table_1.Clone();
for (int i = 0; i < ids.Count; i++) {
ans.Columns.Add($"A_{ids[i]}", typeof(double));
ans.Columns.Add($"B_{ids[i]}", typeof(double));
ans.Columns.Add($"C_{ids[i]}", typeof(double));
}
foreach (DataRow row in Table_1.Rows) {
var newRow = new List<object> { row.Field<string>("Date") };
foreach (var id in ids) {
if (joinDict.TryGetValue((row.Field<string>("Date"), id), out var t2))
newRow.AddRange(new object[] { t2.A, t2.B, t2.C });
else
newRow.AddRange(new object[] { 0.0, 0.0, 0.0 });
}
ans.Rows.Add(newRow.ToArray());
}
Table_1 = ans;
In testing with 100 days in Table_1
and 500 rows per day in Table_2
75% populated, I get about 128x speedup.