Search code examples
c#f#deedle

Merge two dataframes by 2 columns - Deedle


I am trying to combine two deedle dataframes in a way similar to merge in pandas. I will have one bigger frame where there is a year and quarter column with other data beside it, and then a smaller frame with a year, quarter, and ad partner column. I want to add the ad partner to every instance of the same year and quarter in the larger table.

There is some sample data at the link below: https://datascienceparichay.com/article/pandas-merge-dataframes-on-multiple-columns/

Any help would be greatly appreciated!

Thanks!

EDIT

I now have this code, although it throws a keyNotFound exception if there is a key in df1 that is not in df2, and does not return a table.

var df2ByYQ = df2.IndexRowsUsing(row => (row.GetAs<string>("Year".Trim()), row.GetAs<string>("Quarter".Trim())));
                
var bCol = df1.Rows.SelectOptional(kvp => df2ByYQ.Rows[(kvp.Value.Value.GetAs<string>("Year".Trim()), kvp.Value.Value.GetAs<string>("Quarter".Trim()))].TryGetAs<int>("b"));`

Solution

  • You can do this by reindexing the two data frames so that they both have the same row index formed by a tuple consisting of the year and quarter. The easiest way to do this is to use the Frame.indexRowsUsing function. For example, given two frames with a column in one and b in the other:

    let df1 =
      Frame.ofRecords [ 
        {|year=1999; quarter="q1"; a=1|} 
        {|year=1999; quarter="q2"; a=2|} ]
    
    let df2 =
      Frame.ofRecords [ 
        {|year=1999; quarter="q1"; b=3|} 
        {|year=1999; quarter="q2"; b=4|} ]
    
    let df1ByYQ = 
      df1 |> Frame.indexRowsUsing (fun row ->
        row.GetAs<int>("year"), row.GetAs<string>("quarter"))
    
    let df2ByYQ = 
      df2 |> Frame.indexRowsUsing (fun row ->
        row.GetAs<int>("year"), row.GetAs<string>("quarter"))
    

    Now you can use join to join the two frames - you just have to drop the columns that are shared in both of the frames (the operation does not allow overlapping columns):

    let res = 
      df1ByYQ
      |> Frame.dropCol "year"
      |> Frame.dropCol "quarter"
      |> Frame.join JoinKind.Inner df2ByYQ
    

    EDIT: In the scenario discussed in the comments, where df1 has rows with multiple records per year/quarter and we want to add information from a unique row in another data frame, I would reindex the other data frame and then use the map operation to find the corresponding matching row in the other data frame (this will be quite fast, because lookup by index is done via a hashtable):

    let df1 =
      Frame.ofRecords [ 
        {|year=1999; quarter="q1"; month="Jan"; a=1|} 
        {|year=1999; quarter="q1"; month="Feb"; a=2|} 
        {|year=1999; quarter="q2"; month="Jan"; a=3|} ]
    
    let df2 =
      Frame.ofRecords [ 
        {|year=1999; quarter="q1"; b=3|} 
        {|year=1999; quarter="q2"; b=4|} ]
    
    let df2ByYQ = 
      df2 |> Frame.indexRowsUsing (fun row ->
        row.GetAs<int>("year"), row.GetAs<string>("quarter"))
    
    let bColumn = df1 |> Frame.mapRows (fun _ row ->
      df2ByYQ.Rows.[(row.GetAs "year", row.GetAs "quarter")].GetAs<int>("b") )
    
    df1.AddColumn("b", bColumn)