First of all, there seem to be some similar questions answered already. However, I couldn't find this specific case, where the conditional columns are also part of the join columns:
I have two dataframes:
df1 = pl.DataFrame({"timestamp": ['2023-01-01 00:00:00', '2023-05-01 00:00:00', '2023-10-01 00:00:00'], "value": [2, 5, 9]})
df1 = df1.with_columns(
pl.col("timestamp").str.to_datetime().alias("timestamp"),
)
┌─────────────────────┬───────┐
│ timestamp ┆ value │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╡
│ 2023-01-01 00:00:00 ┆ 2 │
│ 2023-05-01 00:00:00 ┆ 5 │
│ 2023-10-01 00:00:00 ┆ 9 │
└─────────────────────┴───────┘
df2 = pl.DataFrame({"date_start": ['2022-12-31 00:00:00', '2023-01-02 00:00:00'], "date_end": ['2023-04-30 00:00:00', '2023-05-05 00:00:00'], "label": [0, 1]})
df2 = df2.with_columns(
pl.col("date_start").str.to_datetime().alias("date_start"),
pl.col("date_end").str.to_datetime().alias("date_end"),
)
┌─────────────────────┬─────────────────────┬───────┐
│ date_start ┆ date_end ┆ label │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ i64 │
╞═════════════════════╪═════════════════════╪═══════╡
│ 2022-12-31 00:00:00 ┆ 2023-04-30 00:00:00 ┆ 0 │
│ 2023-01-02 00:00:00 ┆ 2023-05-05 00:00:00 ┆ 1 │
└─────────────────────┴─────────────────────┴───────┘
I want to join label
of the second polars.Dataframe
(df2) onto the first polars.Dataframe
(df1) - but only when the column value of timestamp
(polars.Datetime
) is within the date ranges given in date_start
and date_end
, respectively.
Since I basically want a left join
on df1, the column label
should be None
when the column value of timestamp
isn't at all covered by df2.
The tricky part for me is, that there isn't an actual on
for df2 since its a range of dates.
.join_where()
was added in Polars 1.7.0
(df1
.join_where(df2,
pl.col.timestamp >= pl.col.date_start,
pl.col.timestamp <= pl.col.date_end
)
)
shape: (2, 5)
┌─────────────────────┬───────┬─────────────────────┬─────────────────────┬───────┐
│ timestamp ┆ value ┆ date_start ┆ date_end ┆ label │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╪═════════════════════╪═════════════════════╪═══════╡
│ 2023-05-01 00:00:00 ┆ 5 ┆ 2023-01-02 00:00:00 ┆ 2023-05-05 00:00:00 ┆ 1 │
│ 2023-01-01 00:00:00 ┆ 2 ┆ 2022-12-31 00:00:00 ┆ 2023-04-30 00:00:00 ┆ 0 │
└─────────────────────┴───────┴─────────────────────┴─────────────────────┴───────┘
It currently supports INNER JOIN
- so an additional LEFT JOIN
is required in thie case.
(df1
.with_row_index()
.join(
df1
.with_row_index()
.join_where(df2,
pl.col.timestamp >= pl.col.date_start,
pl.col.timestamp <= pl.col.date_end
)
.select("index", "label"),
on = "index",
how = "left"
)
)
shape: (3, 4)
┌───────┬─────────────────────┬───────┬───────┐
│ index ┆ timestamp ┆ value ┆ label │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ datetime[μs] ┆ i64 ┆ i64 │
╞═══════╪═════════════════════╪═══════╪═══════╡
│ 0 ┆ 2023-01-01 00:00:00 ┆ 2 ┆ 0 │
│ 1 ┆ 2023-05-01 00:00:00 ┆ 5 ┆ 1 │
│ 2 ┆ 2023-10-01 00:00:00 ┆ 9 ┆ null │
└───────┴─────────────────────┴───────┴───────┘
Additional join types are being tracked here: