Search code examples
python-polars

Join between Polars dataframes with inequality conditions


I would like to do a join between two dataframes, using as join condition an inequality condition, i.e. greater than.

Given two dataframes, I would like to get the result equivalent to the SQL written below.

stock_market_value = pl.DataFrame(
    {
        "date": [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1)],
        "price": [10.00, 12.00, 14.00]

    }
)

my_stock_orders = pl.DataFrame(
    {
        "date": [date(2022, 1, 15), date(2022, 2, 15)],
        "quantity": [2, 5] 
    }
)

I have read that Polars supports join of type asof, but I don't think it applies to my case (maybe putting tolerance equal to infinity?).

For sake of clarity, I wrote the join in form of SQL statement.

SELECT m.date, m.price * o.quantity AS portfolio_value
FROM stock_market_value m LEFT JOIN my_stock_orders o
  ON m.date >= o.date

Example query/output:

duckdb.sql("""
SELECT 
   m.date market_date, 
   o.date order_date, 
   price,
   quantity, 
   price * quantity AS portfolio_value
FROM stock_market_value m LEFT JOIN my_stock_orders o
  ON m.date >= o.date
""").pl()
shape: (4, 5)
┌─────────────┬────────────┬───────┬──────────┬─────────────────┐
│ market_date | order_date | price | quantity | portfolio_value │
│ ---         | ---        | ---   | ---      | ---             │
│ date        | date       | f64   | i64      | f64             │
╞═════════════╪════════════╪═══════╪══════════╪═════════════════╡
│ 2022-01-01  | null       | 10.0  | null     | null            │
│ 2022-02-01  | 2022-01-15 | 12.0  | 2        | 24.0            │
│ 2022-03-01  | 2022-01-15 | 14.0  | 2        | 28.0            │
│ 2022-03-01  | 2022-02-15 | 14.0  | 5        | 70.0            │
└─────────────┴────────────┴───────┴──────────┴─────────────────┘

Why asof() is not the solution

Comments were suggesting to use asof, but it actually does not work in the way I expect.

Forward asof

result_fwd = stock_market_value.join_asof(
    my_stock_orders, left_on="date", right_on="date", strategy="forward"
)

print(result_fwd)
shape: (3, 3)
┌────────────┬───────┬──────────┐
│ date       ┆ price ┆ quantity │
│ ---        ┆ ---   ┆ ---      │
│ date       ┆ f64   ┆ i64      │
╞════════════╪═══════╪══════════╡
│ 2022-01-01 ┆ 10.0  ┆ 2        │
│ 2022-02-01 ┆ 12.0  ┆ 5        │
│ 2022-03-01 ┆ 14.0  ┆ null     │
└────────────┴───────┴──────────┘

Backward asof

result_bwd = stock_market_value.join_asof(
    my_stock_orders, left_on="date", right_on="date", strategy="backward"
)

print(result_bwd)
shape: (3, 3)
┌────────────┬───────┬──────────┐
│ date       ┆ price ┆ quantity │
│ ---        ┆ ---   ┆ ---      │
│ date       ┆ f64   ┆ i64      │
╞════════════╪═══════╪══════════╡
│ 2022-01-01 ┆ 10.0  ┆ null     │
│ 2022-02-01 ┆ 12.0  ┆ 2        │
│ 2022-03-01 ┆ 14.0  ┆ 5        │
└────────────┴───────┴──────────┘

Thanks!


Solution

  • .join_where() was added in Polars 1.7.0

    (stock_market_value
      .join_where(my_stock_orders,
         pl.col.date >= pl.col.date_right
      )
    )
    
    shape: (3, 4)
    ┌────────────┬───────┬────────────┬──────────┐
    │ date       ┆ price ┆ date_right ┆ quantity │
    │ ---        ┆ ---   ┆ ---        ┆ ---      │
    │ date       ┆ f64   ┆ date       ┆ i64      │
    ╞════════════╪═══════╪════════════╪══════════╡
    │ 2022-02-01 ┆ 12.0  ┆ 2022-01-15 ┆ 2        │
    │ 2022-03-01 ┆ 14.0  ┆ 2022-01-15 ┆ 2        │
    │ 2022-03-01 ┆ 14.0  ┆ 2022-02-15 ┆ 5        │
    └────────────┴───────┴────────────┴──────────┘
    

    It currently supports INNER JOIN - so you would need an additional join to generate the full expected output here.

    Additional join types is being tracked here: