Search code examples
joinmergepython-polars

Polars `join_asof` with duplicate 'by' values


Here are two data frames

df1 = pl.DataFrame({'title1': ['abc', 'abc', 'def'], 
'year1': [2016, 2000, 1934]})

df2 = pl.DataFrame({'title2': ['abc', 'abc', 'def'], 
'director2': ['john', 'paul', 'peter'], 
'year2': [2001, 2018, 2016], 
'val2': ['A', 'B', 'C']})

Doing the following, I would expect row 3 of df1 and df2 to match, which it does. But I would also expect row 1 and 2 to match with rows 2 and 1 of df1 and df2 respectively, since 2016 is closer to 2018 and 2000 is closer to 2001.

df1.join_asof(df2, left_on = 'year1', right_on = 'year2', by_left = ['title1'], by_right = ['title2'], strategy = 'nearest')

To be clear, this is what I would expect the output to be:

pl.DataFrame({'title1': ["abc", "abc", "def"],
 'year1': [2016, 2000, 1934],
 'director2': ["paul", "john", "peter"],
 'year2': [2018, 2001, 2016],
 'val2': ["B", "A", "C"]})

But rows 1 and 2 of df1 are both matching with row 2 of df2 instead. The above line of code actually gives me:

pl.DataFrame({'title1': ["abc", "abc", "def"],
 'year1': [2016, 2000, 1934],
 'director2': ["paul", "paul", "peter"],
 'year2': [2018, 2018, 2016],
 'val2': ["B", "B", "C"]})

My point of confusion is in bold - why is this not happening? Does it have something to do with the 'abc' group being duplicated?


Solution

  • I think the problem is that your DataFrames are not sorted while documentation says

    Both DataFrames must be sorted by the asof_join key.

    df1 = df1.sort('year1')
    df2 = df2.sort('year2')
    
    df1.join_asof(
        df2,
        left_on = 'year1',
        right_on = 'year2',
        by_left = ['title1'],
        by_right = ['title2'],
        strategy = 'nearest'
    )
    
    ┌────────┬───────┬───────────┬───────┬──────┐
    │ title1 ┆ year1 ┆ director2 ┆ year2 ┆ val2 │
    │ ---    ┆ ---   ┆ ---       ┆ ---   ┆ ---  │
    │ str    ┆ i64   ┆ str       ┆ i64   ┆ str  │
    ╞════════╪═══════╪═══════════╪═══════╪══════╡
    │ def    ┆ 1934  ┆ peter     ┆ 2016  ┆ C    │
    │ abc    ┆ 2000  ┆ john      ┆ 2001  ┆ A    │
    │ abc    ┆ 2016  ┆ paul      ┆ 2018  ┆ B    │
    └────────┴───────┴───────────┴───────┴──────┘
    

    interestingly enough, if I create reduced example of your use case:

    df1 = pl.DataFrame({'year1': [2016, 2000]})
    df2 = pl.DataFrame({'year2': [2001, 2018]})
    
    ┌───────┐
    │ year1 │
    │ ---   │
    │ i64   │
    ╞═══════╡
    │ 2016  │
    │ 2000  │
    └───────┘
    
    ┌───────┐
    │ year2 │
    │ ---   │
    │ i64   │
    ╞═══════╡
    │ 2001  │
    │ 2018  │
    └───────┘
    

    And try to join_asof without sorting then I get the error:

    df1.join_asof(
        df2,
        left_on = 'year1',
        right_on = 'year2',
        strategy = 'nearest'
    )
    
    ---------------------------------------------------------------------------
    ...
    InvalidOperationError: argument in operation 'asof_join' is not explicitly sorted
    
    - If your data is ALREADY sorted, set the sorted flag with: '.set_sorted()'.
    - If your data is NOT sorted, sort the 'expr/series/column' first.
    

    I guess polars doesn't check for (partial) sorting when by_left / by_right / by parameters are present.