First of all, there seem to be some similar questions answered already. However, I couldn't find this specific case, where the conditional columns are also part of the join columns:
I have two dataframes:
df1 = pl.DataFrame({"timestamp": ['2023-01-01 00:00:00', '2023-05-01 00:00:00', '2023-10-01 00:00:00'], "value": [2, 5, 9]})
df1 = df1.with_columns(
pl.col("timestamp").str.to_datetime().alias("timestamp"),
)
┌─────────────────────┬───────┐
│ timestamp ┆ value │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╡
│ 2023-01-01 00:00:00 ┆ 2 │
│ 2023-05-01 00:00:00 ┆ 5 │
│ 2023-10-01 00:00:00 ┆ 9 │
└─────────────────────┴───────┘
df2 = pl.DataFrame({"date_start": ['2022-12-31 00:00:00', '2023-01-02 00:00:00'], "date_end": ['2023-04-30 00:00:00', '2023-05-05 00:00:00'], "label": [0, 1]})
df2 = df2.with_columns(
pl.col("date_start").str.to_datetime().alias("date_start"),
pl.col("date_end").str.to_datetime().alias("date_end"),
)
┌─────────────────────┬─────────────────────┬───────┐
│ date_start ┆ date_end ┆ label │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ i64 │
╞═════════════════════╪═════════════════════╪═══════╡
│ 2022-12-31 00:00:00 ┆ 2023-04-30 00:00:00 ┆ 0 │
│ 2023-01-02 00:00:00 ┆ 2023-05-05 00:00:00 ┆ 1 │
└─────────────────────┴─────────────────────┴───────┘
I want to join label
of the second polars.Dataframe
(df2) onto the first polars.Dataframe
(df1) - but only when the column value of timestamp
(polars.Datetime
) is within the date ranges given in date_start
and date_end
, respectively.
Since I basically want a left join
on df1, the column label
should be None
when the column value of timestamp
isn't at all covered by df2.
The tricky part for me is, that there isn't an actual on
for df2 since its a range of dates.
Assuming that your timestamp
is at the day level you can use pl.date_range
in a group_by
to generate every day between each start_date
and end_date
, then explode those out to join
on.
df_agg = (
df2.group_by("label")
.agg(
pl.date_range(pl.col("date_start"), pl.col("date_end"), "1d").alias("timestamp")
)
.explode("timestamp")
)
df_agg = df1.join(df_agg, how="left", on=["timestamp"])
df_agg
Result:
shape: (5, 3)
┌─────────────────────┬───────┬───────┐
│ timestamp ┆ value ┆ label │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2023-01-01 00:00:00 ┆ 2 ┆ 0 │
│ 2023-03-01 00:00:00 ┆ 4 ┆ 1 │
│ 2023-03-01 00:00:00 ┆ 4 ┆ 0 │
│ 2023-05-01 00:00:00 ┆ 5 ┆ 1 │
│ 2023-10-01 00:00:00 ┆ 9 ┆ null │
└─────────────────────┴───────┴───────┘
Seeing as you noted sqlite
to your answer, using duckdb
might a better solution if you need timestamps at a finer grain. This can pretty transparently convert polars' DataFrames in and out.
import duckdb
ddb = duckdb.connect()
ddf_df = ddb.query(
"select * from df1 left join df2 on df1.timestamp between df2.date_start and df2.date_end"
).pl()
ddf_df
Results:
shape: (5, 5)
┌─────────────────────┬───────┬─────────────────────┬─────────────────────┬───────┐
│ timestamp ┆ value ┆ date_start ┆ date_end ┆ label │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ datetime[μs] ┆ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╪═════════════════════╪═════════════════════╪═══════╡
│ 2023-01-01 00:00:00 ┆ 2 ┆ 2022-12-31 00:00:00 ┆ 2023-04-30 00:00:00 ┆ 0 │
│ 2023-03-01 00:00:00 ┆ 4 ┆ 2022-12-31 00:00:00 ┆ 2023-04-30 00:00:00 ┆ 0 │
│ 2023-03-01 00:00:00 ┆ 4 ┆ 2023-01-02 00:00:00 ┆ 2023-05-05 00:00:00 ┆ 1 │
│ 2023-05-01 00:00:00 ┆ 5 ┆ 2023-01-02 00:00:00 ┆ 2023-05-05 00:00:00 ┆ 1 │
│ 2023-10-01 00:00:00 ┆ 9 ┆ null ┆ null ┆ null │
└─────────────────────┴───────┴─────────────────────┴─────────────────────┴───────┘