Search code examples
pythonjoinpython-polars

Python Polars: Conditional Join by Date Range


First of all, there seem to be some similar questions answered already. However, I couldn't find this specific case, where the conditional columns are also part of the join columns:

I have two dataframes:

df1 = pl.DataFrame({"timestamp": ['2023-01-01 00:00:00', '2023-05-01 00:00:00', '2023-10-01 00:00:00'], "value": [2, 5, 9]})

df1 = df1.with_columns(
    pl.col("timestamp").str.to_datetime().alias("timestamp"),
)

┌─────────────────────┬───────┐
│ timestamp           ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2023-01-01 00:00:00 ┆ 2     │
│ 2023-05-01 00:00:00 ┆ 5     │
│ 2023-10-01 00:00:00 ┆ 9     │
└─────────────────────┴───────┘

df2 = pl.DataFrame({"date_start": ['2022-12-31 00:00:00', '2023-01-02 00:00:00'], "date_end": ['2023-04-30 00:00:00', '2023-05-05 00:00:00'], "label": [0, 1]})

df2 = df2.with_columns(
    pl.col("date_start").str.to_datetime().alias("date_start"),
    pl.col("date_end").str.to_datetime().alias("date_end"),
)

┌─────────────────────┬─────────────────────┬───────┐
│ date_start          ┆ date_end            ┆ label │
│ ---                 ┆ ---                 ┆ ---   │
│ datetime[μs]        ┆ datetime[μs]        ┆ i64   │
╞═════════════════════╪═════════════════════╪═══════╡
│ 2022-12-31 00:00:00 ┆ 2023-04-30 00:00:00 ┆ 0     │
│ 2023-01-02 00:00:00 ┆ 2023-05-05 00:00:00 ┆ 1     │
└─────────────────────┴─────────────────────┴───────┘

I want to join label of the second polars.Dataframe (df2) onto the first polars.Dataframe (df1) - but only when the column value of timestamp (polars.Datetime) is within the date ranges given in date_start and date_end, respectively.
Since I basically want a left join on df1, the column label should be None when the column value of timestamp isn't at all covered by df2.

The tricky part for me is, that there isn't an actual on for df2 since its a range of dates.


Solution

  • Pure polars Solution

    Assuming that your timestamp is at the day level you can use pl.date_range in a group_by to generate every day between each start_date and end_date, then explode those out to join on.

    df_agg = (
        df2.group_by("label")
        .agg(
            pl.date_range(pl.col("date_start"), pl.col("date_end"), "1d").alias("timestamp")
        )
        .explode("timestamp")
    )
    
    df_agg = df1.join(df_agg, how="left", on=["timestamp"])
    
    df_agg
    

    Result:

    shape: (5, 3)
    ┌─────────────────────┬───────┬───────┐
    │ timestamp           ┆ value ┆ label │
    │ ---                 ┆ ---   ┆ ---   │
    │ datetime[μs]        ┆ i64   ┆ i64   │
    ╞═════════════════════╪═══════╪═══════╡
    │ 2023-01-01 00:00:00 ┆ 2     ┆ 0     │
    │ 2023-03-01 00:00:00 ┆ 4     ┆ 1     │
    │ 2023-03-01 00:00:00 ┆ 4     ┆ 0     │
    │ 2023-05-01 00:00:00 ┆ 5     ┆ 1     │
    │ 2023-10-01 00:00:00 ┆ 9     ┆ null  │
    └─────────────────────┴───────┴───────┘
    

    Using duckdb

    Seeing as you noted sqlite to your answer, using duckdb might a better solution if you need timestamps at a finer grain. This can pretty transparently convert polars' DataFrames in and out.

    import duckdb
    
    ddb = duckdb.connect()
    ddf_df = ddb.query(
        "select * from df1 left join df2 on df1.timestamp between df2.date_start and df2.date_end"
    ).pl()
    
    ddf_df
    

    Results:

    shape: (5, 5)
    ┌─────────────────────┬───────┬─────────────────────┬─────────────────────┬───────┐
    │ timestamp           ┆ value ┆ date_start          ┆ date_end            ┆ label │
    │ ---                 ┆ ---   ┆ ---                 ┆ ---                 ┆ ---   │
    │ datetime[μs]        ┆ i64   ┆ datetime[μs]        ┆ datetime[μs]        ┆ i64   │
    ╞═════════════════════╪═══════╪═════════════════════╪═════════════════════╪═══════╡
    │ 2023-01-01 00:00:00 ┆ 2     ┆ 2022-12-31 00:00:00 ┆ 2023-04-30 00:00:00 ┆ 0     │
    │ 2023-03-01 00:00:00 ┆ 4     ┆ 2022-12-31 00:00:00 ┆ 2023-04-30 00:00:00 ┆ 0     │
    │ 2023-03-01 00:00:00 ┆ 4     ┆ 2023-01-02 00:00:00 ┆ 2023-05-05 00:00:00 ┆ 1     │
    │ 2023-05-01 00:00:00 ┆ 5     ┆ 2023-01-02 00:00:00 ┆ 2023-05-05 00:00:00 ┆ 1     │
    │ 2023-10-01 00:00:00 ┆ 9     ┆ null                ┆ null                ┆ null  │
    └─────────────────────┴───────┴─────────────────────┴─────────────────────┴───────┘