Here are two data frames
df1 = pl.DataFrame({'title1': ['abc', 'abc', 'def'],
'year1': [2016, 2000, 1934]})
df2 = pl.DataFrame({'title2': ['abc', 'abc', 'def'],
'director2': ['john', 'paul', 'peter'],
'year2': [2001, 2018, 2016],
'val2': ['A', 'B', 'C']})
Doing the following, I would expect row 3 of df1
and df2
to match, which it does. But I would also expect row 1 and 2 to match with rows 2 and 1 of df1
and df2
respectively, since 2016 is closer to 2018 and 2000 is closer to 2001.
df1.join_asof(df2, left_on = 'year1', right_on = 'year2', by_left = ['title1'], by_right = ['title2'], strategy = 'nearest')
To be clear, this is what I would expect the output to be:
pl.DataFrame({'title1': ["abc", "abc", "def"],
'year1': [2016, 2000, 1934],
'director2': ["paul", "john", "peter"],
'year2': [2018, 2001, 2016],
'val2': ["B", "A", "C"]})
But rows 1 and 2 of df1
are both matching with row 2 of df2
instead. The above line of code actually gives me:
pl.DataFrame({'title1': ["abc", "abc", "def"],
'year1': [2016, 2000, 1934],
'director2': ["paul", "paul", "peter"],
'year2': [2018, 2018, 2016],
'val2': ["B", "B", "C"]})
My point of confusion is in bold - why is this not happening? Does it have something to do with the 'abc' group being duplicated?
I think the problem is that your DataFrames are not sorted while documentation says
Both DataFrames must be sorted by the asof_join key.
df1 = df1.sort('year1')
df2 = df2.sort('year2')
df1.join_asof(
df2,
left_on = 'year1',
right_on = 'year2',
by_left = ['title1'],
by_right = ['title2'],
strategy = 'nearest'
)
┌────────┬───────┬───────────┬───────┬──────┐
│ title1 ┆ year1 ┆ director2 ┆ year2 ┆ val2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 ┆ str │
╞════════╪═══════╪═══════════╪═══════╪══════╡
│ def ┆ 1934 ┆ peter ┆ 2016 ┆ C │
│ abc ┆ 2000 ┆ john ┆ 2001 ┆ A │
│ abc ┆ 2016 ┆ paul ┆ 2018 ┆ B │
└────────┴───────┴───────────┴───────┴──────┘
interestingly enough, if I create reduced example of your use case:
df1 = pl.DataFrame({'year1': [2016, 2000]})
df2 = pl.DataFrame({'year2': [2001, 2018]})
┌───────┐
│ year1 │
│ --- │
│ i64 │
╞═══════╡
│ 2016 │
│ 2000 │
└───────┘
┌───────┐
│ year2 │
│ --- │
│ i64 │
╞═══════╡
│ 2001 │
│ 2018 │
└───────┘
And try to join_asof
without sorting then I get the error:
df1.join_asof(
df2,
left_on = 'year1',
right_on = 'year2',
strategy = 'nearest'
)
---------------------------------------------------------------------------
...
InvalidOperationError: argument in operation 'asof_join' is not explicitly sorted
- If your data is ALREADY sorted, set the sorted flag with: '.set_sorted()'.
- If your data is NOT sorted, sort the 'expr/series/column' first.
I guess polars doesn't check for (partial) sorting when by_left
/ by_right
/ by
parameters are present.