Search code examples
c#linqdatatableexpression-trees

Join 2 DataTables on dynamic number of columns


I'm trying to join two DataTables on a dynamic number of columns. I've gotten as far as the code below. The problem is the ON statement of the join. How can I make this dynamic based on how many column names are in the list "joinColumnNames".

I was thinking I will need to build some sort of expression tree, but I can't find any examples of how to do this with multiple join columns and with the DataRow object which doesn't have properties for each column.

private DataTable Join(List<string> joinColumnNames, DataTable pullX, DataTable pullY)
{
    DataTable joinedTable = new DataTable();

    // Add all the columns from pullX
    foreach (string colName in joinColumnNames)
    {
        joinedTable.Columns.Add(pullX.Columns[colName]);
    }
    // Add unique columns from PullY
    foreach (DataColumn col in pullY.Columns)
    {
        if (!joinedTable.Columns.Contains((col.ColumnName)))
        {
            joinedTable.Columns.Add(col);
        }
    }

    var Join = (from PX in pullX.AsEnumerable()
                join PY in pullY.AsEnumerable() on 
                // This must be dynamic and join on every column mentioned in joinColumnNames
                new { A = PX[joinColumnNames[0]], B = PX[joinColumnNames[1]] } equals new { A = PY[joinColumnNames[0]], B = PY[joinColumnNames[1]] } 
                into Outer
                from PY in Outer.DefaultIfEmpty<DataRow>(pullY.NewRow())
                select new { PX, PY });

    foreach (var item in Join)
    {
        DataRow newRow = joinedTable.NewRow();
        foreach (DataColumn col in joinedTable.Columns)
        {
            var pullXValue = item.PX.Table.Columns.Contains(col.ColumnName) ? item.PX[col.ColumnName] : string.Empty;
            var pullYValue = item.PY.Table.Columns.Contains(col.ColumnName) ? item.PY[col.ColumnName] : string.Empty;
            newRow[col.ColumnName] = (pullXValue == null || string.IsNullOrEmpty(pullXValue.ToString())) ? pullYValue : pullXValue;
        }
        joinedTable.Rows.Add(newRow);
    }

    return joinedTable;
}

Adding a specific example to show input/output using 3 join columns (Country, Company, and DateId):

Pull X:

Country        Company       DateId    Sales
United States  Test1 Ltd    20160722    $25 

Canada         Test3 Ltd    20160723    $30 

Italy          Test4 Ltd    20160724    $40 

India          Test2 Ltd    20160725    $35 

Pull Y:

Country        Company       DateId    Downloads
United States  Test1 Ltd    20160722    500 

Mexico         Test2 Ltd    20160723    300 

Italy          Test4 Ltd    20160724    900 

Result:

Country        Company       DateId    Sales    Downloads
United States  Test1 Ltd    20160722    $25      500 

Canada         Test3 Ltd    20160723    $30  

Mexico         Test2 Ltd    20160723                300 

Italy          Test4 Ltd    20160724    $40      900 

India          Test2 Ltd    20160725    $35      

Solution

  • var Join = 
        from PX in pullX.AsEnumerable()
        join PY in pullY.AsEnumerable()
        on     string.Join("\0", joinColumnNames.Select(c => PX[c]))
        equals string.Join("\0", joinColumnNames.Select(c => PY[c]))
        into Outer
        from PY in Outer.DefaultIfEmpty<DataRow>(pullY.NewRow())
        select new { PX, PY };
    

    Another way is to have both DataTable in a DataSet and use DataRelation
    How To: Use DataRelation to perform a join on two DataTables in a DataSet?