Search code examples
pythonjoinpython-polars

Python Polars: Conditional Join by Date Range


First of all, there seem to be some similar questions answered already. However, I couldn't find this specific case, where the conditional columns are also part of the join columns:

I have two dataframes:

df1 = pl.DataFrame({"timestamp": ['2023-01-01 00:00:00', '2023-05-01 00:00:00', '2023-10-01 00:00:00'], "value": [2, 5, 9]})

df1 = df1.with_columns(
    pl.col("timestamp").str.to_datetime().alias("timestamp"),
)

┌─────────────────────┬───────┐
│ timestamp           ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2023-01-01 00:00:00 ┆ 2     │
│ 2023-05-01 00:00:00 ┆ 5     │
│ 2023-10-01 00:00:00 ┆ 9     │
└─────────────────────┴───────┘

df2 = pl.DataFrame({"date_start": ['2022-12-31 00:00:00', '2023-01-02 00:00:00'], "date_end": ['2023-04-30 00:00:00', '2023-05-05 00:00:00'], "label": [0, 1]})

df2 = df2.with_columns(
    pl.col("date_start").str.to_datetime().alias("date_start"),
    pl.col("date_end").str.to_datetime().alias("date_end"),
)

┌─────────────────────┬─────────────────────┬───────┐
│ date_start          ┆ date_end            ┆ label │
│ ---                 ┆ ---                 ┆ ---   │
│ datetime[μs]        ┆ datetime[μs]        ┆ i64   │
╞═════════════════════╪═════════════════════╪═══════╡
│ 2022-12-31 00:00:00 ┆ 2023-04-30 00:00:00 ┆ 0     │
│ 2023-01-02 00:00:00 ┆ 2023-05-05 00:00:00 ┆ 1     │
└─────────────────────┴─────────────────────┴───────┘

I want to join label of the second polars.Dataframe (df2) onto the first polars.Dataframe (df1) - but only when the column value of timestamp (polars.Datetime) is within the date ranges given in date_start and date_end, respectively.
Since I basically want a left join on df1, the column label should be None when the column value of timestamp isn't at all covered by df2.

The tricky part for me is, that there isn't an actual on for df2 since its a range of dates.


Solution

  • .join_where() was added in Polars 1.7.0

    (df1
      .join_where(df2,
         pl.col.timestamp >= pl.col.date_start,
         pl.col.timestamp <= pl.col.date_end
      )
    )
    
    shape: (2, 5)
    ┌─────────────────────┬───────┬─────────────────────┬─────────────────────┬───────┐
    │ timestamp           ┆ value ┆ date_start          ┆ date_end            ┆ label │
    │ ---                 ┆ ---   ┆ ---                 ┆ ---                 ┆ ---   │
    │ datetime[μs]        ┆ i64   ┆ datetime[μs]        ┆ datetime[μs]        ┆ i64   │
    ╞═════════════════════╪═══════╪═════════════════════╪═════════════════════╪═══════╡
    │ 2023-05-01 00:00:00 ┆ 5     ┆ 2023-01-02 00:00:00 ┆ 2023-05-05 00:00:00 ┆ 1     │
    │ 2023-01-01 00:00:00 ┆ 2     ┆ 2022-12-31 00:00:00 ┆ 2023-04-30 00:00:00 ┆ 0     │
    └─────────────────────┴───────┴─────────────────────┴─────────────────────┴───────┘
    

    It currently supports INNER JOIN - so an additional LEFT JOIN is required in thie case.

    (df1
      .with_row_index()
      .join(
         df1
          .with_row_index()
          .join_where(df2,
             pl.col.timestamp >= pl.col.date_start,
             pl.col.timestamp <= pl.col.date_end
          )
          .select("index", "label"),
         on = "index",
         how = "left"
      )
    )
    
    shape: (3, 4)
    ┌───────┬─────────────────────┬───────┬───────┐
    │ index ┆ timestamp           ┆ value ┆ label │
    │ ---   ┆ ---                 ┆ ---   ┆ ---   │
    │ u32   ┆ datetime[μs]        ┆ i64   ┆ i64   │
    ╞═══════╪═════════════════════╪═══════╪═══════╡
    │ 0     ┆ 2023-01-01 00:00:00 ┆ 2     ┆ 0     │
    │ 1     ┆ 2023-05-01 00:00:00 ┆ 5     ┆ 1     │
    │ 2     ┆ 2023-10-01 00:00:00 ┆ 9     ┆ null  │
    └───────┴─────────────────────┴───────┴───────┘
    

    Additional join types are being tracked here: