Search code examples
c#performancelinqdatatable

Copy LINQ query results into existing DataTable with faster performance


I am looking for a way to improve the performance of C# code and would appreciate any help.

There are 2 tables: Table_1 and Table_2, and I want to collect data from Table_2 and save in Table_1 in the following form:

Table_1

Date Some_Stat
2021-06-01 23.7
2021-06-02 12.6
2021-06-03 47.9

Table_2

Date ID A B C
2021-06-02 4 21 23 13
2021-06-02 3 67 31 25
2021-06-01 3 45 54 33
2021-06-03 3 71 28 51
2021-06-03 4 26 83 24

My goal: Table_1 after join

Date Some_Stat A_3 B_3 C_3 A_4 B_4 C_4
2021-06-01 23.7 45 54 33 0 0 0
2021-06-02 12.6 67 31 25 21 23 13
2021-06-03 47.9 71 28 51 26 83 24

In order to achieve this I used the code below, which works, but is very slow, especially when there are thousands of IDs. Basically what the code does is making first the required join and transferring LINQ join results to existing Table_1 (copies the columns). I have checked the performance time and the LINQ query is always very fast (time: 0ms), but the data transfer is the problem.

List<int> ids = Table_2.AsEnumerable().Select(s => s.Field<int>("ID").Distinct().ToList();

for (int i = 0; i < ids.Count; i++)
    {
       Table_1.Columns.Add($"A_{ids[i]}", typeof(double));
       Table_1.Columns.Add($"B_{ids[i]}", typeof(double));
       Table_1.Columns.Add($"C_{ids[i]}", typeof(double));
    }

for (int i = 0; i < ids.Count; i++)
{
   // LINQ join (fast)
   var joinedTables = from T1 in Table_1.AsEnumerable()
            join T2 in Table_2.Select($"ID = {ids[i]}").AsEnumerable()
            on (String)T1["Date"] equals (String)T2["Date"]
            into T1_and_T2
            from TT in T1_and_T2.DefaultIfEmpty()
            select new
              {
                 Date = (String)T1["Date"],
                 A = TT != null ? (double)TT["A"] : 0.0,
                 B = TT != null ? (double)TT["B"] : 0.0,
                 C = TT != null ? (double)TT["C"] : 0.0,
              };
   // data transfer (very slow)
   for (int day = 0; day < joinedTables.Count(); day++)
   {
     Table_1.Rows[day][$"A_{ids[i]}"] = joinedTables.ElementAt(day).A;
     Table_1.Rows[day][$"B_{ids[i]}"] = joinedTables.ElementAt(day).B;
     Table_1.Rows[day][$"C_{ids[i]}"] = joinedTables.ElementAt(day).C;
   }
}

Also instead of the data transfer version above I have tried another way, but it is as slow as the previous:

int day = 0;
foreach( var row in joinedTables)
{
  Table_1.Rows[day][$"A_{ids[i]}"] = row.A;
  Table_1.Rows[day][$"B_{ids[i]}"] = row.B;
  Table_1.Rows[day][$"C_{ids[i]}"] = row.C;
}

I am also open to new approaches on how to collect data from Table_2 in Table_1. There could be a way to use build-in functions (written in C or C++) which will access Machine Code directly ( for example, a function which copies columns from one table to another table, like in python) in order to avoid looping through rows.

Note: In the original problem Table_1 may have up to 500 rows and 10 columns and Table_2 may have up to 5000 rows per day and populated 50 - 100%. The solutions above in the extreme case will take from 2 to 4 hours, which is very slow ( I believe this could be done within minutes).


Solution

  • My recommendation is to use a different approach. A DataTable is not a particularly fast object, and looking up columns to set values is slow. Creating a new DataTable to replace Table_1 can be much faster since you can use the DataRowCollection.Add() method to quickly add rows.

    Using a Dictionary to convert Table_2 allows for much faster lookup than ElementAt as well.

    var joinDict = (from T2 in Table_2.AsEnumerable()
                    select new {
                        Date = T2.Field<string>("Date"),
                        ID = T2.Field<int>("ID"),
                        A = T2.Field<double>("A"),
                        B = T2.Field<double>("B"),
                        C = T2.Field<double>("C"),
                    })
                    .ToDictionary(t2 => (t2.Date, t2.ID));
    
    List<int> ids = Table_2.AsEnumerable().Select(s => s.Field<int>("ID")).Distinct().OrderBy(x => x).ToList();
    
    var ans = Table_1.Clone();
    for (int i = 0; i < ids.Count; i++) {
        ans.Columns.Add($"A_{ids[i]}", typeof(double));
        ans.Columns.Add($"B_{ids[i]}", typeof(double));
        ans.Columns.Add($"C_{ids[i]}", typeof(double));
    }
    
    foreach (DataRow row in Table_1.Rows) {
        var newRow = new List<object> { row.Field<string>("Date") };
        foreach (var id in ids) {
            if (joinDict.TryGetValue((row.Field<string>("Date"), id), out var t2))
                newRow.AddRange(new object[] { t2.A, t2.B, t2.C });
            else
                newRow.AddRange(new object[] { 0.0, 0.0, 0.0 });
        }
        ans.Rows.Add(newRow.ToArray());
    }
    Table_1 = ans;
    

    In testing with 100 days in Table_1 and 500 rows per day in Table_2 75% populated, I get about 128x speedup.