Instead of getting actual data, I get a df of query plans objects.
import polars as pl
df1 = pl.DataFrame({
"df1_date": [20221011, 20221012, 20221013, 20221014, 20221016],
"df1_col1": ["foo", "bar", "foo", "bar", "foo"],
})
df2 = pl.DataFrame({
"df2_date": [20221012, 20221015, 20221018],
"df2_col1": ["1", "2", "3"],
})
print(
df1.lazy()
.with_context(df2.lazy())
.select(
pl.col("df1_date")
.map_elements(lambda s: pl.col("df2_date").filter(pl.col("df2_date") >= s).first())
.alias("release_date")
)
.collect()
)
shape: (5, 1)
┌─────────────────────────────────┐
│ release_date │
│ --- │
│ object │
╞═════════════════════════════════╡
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
└─────────────────────────────────┘
In pandas, I can get what I want by using:
df1 = df1.to_pandas().set_index("df1_date")
df2 = df2.to_pandas().set_index("df2_date")
df1["release_date"] = df1.index.map(
lambda x: df2[df2.index <= x].index[-1] if len(df2[df2.index <= x]) > 0 else 0
)
print(df1)
shape: (5, 3)
┌──────────┬──────────┬──────────────┐
│ df1_date ┆ df1_col1 ┆ release_date │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞══════════╪══════════╪══════════════╡
│ 20221011 ┆ foo ┆ 0 │
│ 20221012 ┆ bar ┆ 20221012 │
│ 20221013 ┆ foo ┆ 20221012 │
│ 20221014 ┆ bar ┆ 20221012 │
│ 20221016 ┆ foo ┆ 20221015 │
└──────────┴──────────┴──────────────┘
How can I get the expected result using Polars?
It looks like you're trying to do an asof join. In other words a join where you take the last value that matched rather than exact matches.
You can do
df1 = (df1.lazy().join_asof(df2.lazy(), left_on='df1_date', right_on='df2_date')) \
.select(['df1_date', 'df1_col1',
pl.col('df2_date').fill_null(0).alias('release_date')]).collect()
The first difference is that in polars you don't assign new columns, you assign the whole df so it's always just the name of the df on the left side of the equals. The join_asof
replaces your index/map/lambda thing. Then the last thing is to just replace the null value with 0 with fill_null
and then rename the column. There was a bug in an old version of polars preventing the collect
from working at the end. That is fixed in at least 0.15.1 (maybe an earlier version too but I'm just checking in with that version)