Search code examples
python-polars

How to get right data from other dataframe


Instead of getting actual data, I get a df of query plans objects.

import polars as pl

df1 = pl.DataFrame({
    "df1_date": [20221011, 20221012, 20221013, 20221014, 20221016],
    "df1_col1": ["foo", "bar", "foo", "bar", "foo"],
})

df2 = pl.DataFrame({
    "df2_date": [20221012, 20221015, 20221018],
    "df2_col1": ["1", "2", "3"],
})

print(
    df1.lazy()
    .with_context(df2.lazy())
    .select(
        pl.col("df1_date")
        .map_elements(lambda s: pl.col("df2_date").filter(pl.col("df2_date") >= s).first())
        .alias("release_date")
    )
    .collect()
)
shape: (5, 1)
┌─────────────────────────────────┐
│ release_date                    │
│ ---                             │
│ object                          │
╞═════════════════════════════════╡
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
└─────────────────────────────────┘

In pandas, I can get what I want by using:

df1 = df1.to_pandas().set_index("df1_date")
df2 = df2.to_pandas().set_index("df2_date")

df1["release_date"] = df1.index.map(
    lambda x: df2[df2.index <= x].index[-1] if len(df2[df2.index <= x]) > 0 else 0
)
print(df1)
shape: (5, 3)
┌──────────┬──────────┬──────────────┐
│ df1_date ┆ df1_col1 ┆ release_date │
│ ---      ┆ ---      ┆ ---          │
│ i64      ┆ str      ┆ i64          │
╞══════════╪══════════╪══════════════╡
│ 20221011 ┆ foo      ┆ 0            │
│ 20221012 ┆ bar      ┆ 20221012     │
│ 20221013 ┆ foo      ┆ 20221012     │
│ 20221014 ┆ bar      ┆ 20221012     │
│ 20221016 ┆ foo      ┆ 20221015     │
└──────────┴──────────┴──────────────┘

How can I get the expected result using Polars?


Solution

  • It looks like you're trying to do an asof join. In other words a join where you take the last value that matched rather than exact matches.

    You can do

    df1 = (df1.lazy().join_asof(df2.lazy(), left_on='df1_date', right_on='df2_date')) \
               .select(['df1_date', 'df1_col1',
                        pl.col('df2_date').fill_null(0).alias('release_date')]).collect()
    

    The first difference is that in polars you don't assign new columns, you assign the whole df so it's always just the name of the df on the left side of the equals. The join_asof replaces your index/map/lambda thing. Then the last thing is to just replace the null value with 0 with fill_null and then rename the column. There was a bug in an old version of polars preventing the collect from working at the end. That is fixed in at least 0.15.1 (maybe an earlier version too but I'm just checking in with that version)