Search code examples
filteraggregatepython-polars

Polars - How to find all the records in a dataframe which have at least one in-kind occurrence between two given datetimes?


I two dataframes: one for store operation dates and one for transactions.

The store hours dataframe is like so

┌───────────┬─────────────────────┬─────────────────────┐
│ vendorId  ┆ opening time        ┆ closing time        │
│ ---       ┆ ---                 ┆ ---                 │
│ str       ┆ datetime[μs]        ┆ datetime[μs]        │
╞═══════════╪═════════════════════╪═════════════════════╡
╞═══════════╪═════════════════════╪═════════════════════╡
│ 115       ┆ 2020-02-25 12:00:01 ┆ 2022-02-25 02:00:02 │
│ 1146      ┆ 2020-02-16 12:00:11 ┆ 2022-02-26 02:00:48 │
│ 18143     ┆ 2020-02-25 12:00:16 ┆ 2022-02-25 02:00:46 │
│ 19122     ┆ 2020-02-16 12:00:19 ┆ 2022-02-16 02:00:42 │
│ …         ┆ …                   ┆ …                   │
│ 7744      ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
│ 09092     ┆ 2020-02-16 12:00:31 ┆ 2022-02-24 02:00:57 │
│ 5801      ┆ 2020-11-01 12:00:41 ┆ 2022-02-19 02:00:36 │
│ 2378      ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
└───────────┴─────────────────────┴─────────────────────┘

The transactions df looks something like this:

┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod  ┆ saletime            ┆ value     ┆ count    ┆ quality   │
│ ---    ┆ ---   ┆ ---                 ┆ ---       ┆ ---      ┆ ---       │
│ str    ┆ str   ┆datetime             ┆ f64       ┆ i64      ┆ f64       │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 7744   ┆ 12    ┆ 2020-02-25 12:10:01 ┆ 10.0      ┆ 6        ┆ 13.7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744   ┆ 13    ┆ 2020-02-25 12:15:01 ┆ 12.0      ┆ 7        ┆ 21.9      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744   ┆ 11    ┆ 2023-02-25 12:17:01 ┆ 10.0      ┆ 4        ┆ 13.0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378   ┆ 20    ┆ 2020-02-16 12:18:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378   ┆ 20    ┆ 2022-02-25 12:20:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 115    ┆ 3     ┆ 2022-02-26 12:10:01 ┆ 12.0      ┆ 5        ┆ 19.0      │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘

My goal is for each vendor, I want to return all the product transactions which had at least one transactions within the stores hours. For example, for vendor 7744 it would be the following.

┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod  ┆ saletime            ┆ value     ┆ count    ┆ quality   │
│ ---    ┆ ---   ┆ ---                 ┆ ---       ┆ ---      ┆ ---       │
│ str    ┆ str   ┆datetime             ┆ f64       ┆ i64      ┆ f64       │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 7744   ┆ 12    ┆ 2020-02-25 12:10:01 ┆ 10.0      ┆ 6        ┆ 13.7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744   ┆ 13    ┆ 2020-02-25 12:15:01 ┆ 12.0      ┆ 7        ┆ 21.9      │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘

These products were sold within the opening and closing of the store while product 11 occurred after the closing date.

On the other hand for vendor 2378 it should return:

┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod  ┆ saletime            ┆ value     ┆ count    ┆ quality   │
│ ---    ┆ ---   ┆ ---                 ┆ ---       ┆ ---      ┆ ---       │
│ str    ┆ str   ┆datetime             ┆ f64       ┆ i64      ┆ f64       │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 2378   ┆ 20    ┆ 2020-02-16 12:18:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378   ┆ 20    ┆ 2022-02-25 12:20:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘

Even though the second transaction occurred after the closing date, it is considered an ongoing transaction because part of the transaction occurred during the stores operation dates.

The expected final df is

┌────────┬───────┬─────────────────────┬───────────┬──────────┬───────────┐
│ vendor ┆ prod  ┆ saletime            ┆ value     ┆ count    ┆ quality   │
│ ---    ┆ ---   ┆ ---                 ┆ ---       ┆ ---      ┆ ---       │
│ str    ┆ str   ┆datetime             ┆ f64       ┆ i64      ┆ f64       │
╞════════╪═══════╪═════════════════════╪═══════════╪══════════╪═══════════╡
│ 7744   ┆ 12    ┆ 2020-02-25 12:10:01 ┆ 10.0      ┆ 6        ┆ 13.7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7744   ┆ 13    ┆ 2020-02-25 12:15:01 ┆ 12.0      ┆ 7        ┆ 21.9      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378   ┆ 20    ┆ 2020-02-16 12:18:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2378   ┆ 20    ┆ 2022-02-25 12:20:01 ┆ 10.0      ┆ 2        ┆ 14.0      │
└────────┴───────┴─────────────────────┴───────────┴──────────┴───────────┘

Is there an efficient way to accomplish this? Any help is appreciated.


Solution

  • In your question you say:

    Even though the second transaction occurred after the closing date, it is considered an ongoing transaction because part of the transaction occurred during the stores operation dates.

    so I think you meant for your transactions to look something more like this:

    txns=pl.from_repr(
     """shape: (6, 6)
        ┌────────┬──────┬─────────────────────┬───────┬───────┬─────────┐
        │ vendor ┆ prod ┆ saletime            ┆ value ┆ count ┆ quality │
        │ ---    ┆ ---  ┆ ---                 ┆ ---   ┆ ---   ┆ ---     │
        │ str    ┆ str  ┆ datetime[μs]        ┆ f64   ┆ i64   ┆ f64     │
        ╞════════╪══════╪═════════════════════╪═══════╪═══════╪═════════╡
        │ 7744   ┆ 12   ┆ 2020-02-25 12:10:01 ┆ 10.0  ┆ 6     ┆ 13.7    │
        │ 7744   ┆ 13   ┆ 2020-02-25 12:15:01 ┆ 12.0  ┆ 7     ┆ 21.9    │
        │ 7744   ┆ 11   ┆ 2023-02-25 12:17:01 ┆ 10.0  ┆ 4     ┆ 13.0    │
        │ 2378   ┆ 20   ┆ 2020-02-16 12:18:01 ┆ 10.0  ┆ 2     ┆ 14.0    │
        │ 2378   ┆ 20   ┆ 2022-02-25 12:20:01 ┆ 10.0  ┆ 2     ┆ 14.0    │
        │ 115    ┆ 3    ┆ 2022-02-26 12:10:01 ┆ 12.0  ┆ 5     ┆ 19.0    │
        └────────┴──────┴─────────────────────┴───────┴───────┴─────────┘""")
    

    where the second 2378 row has a transaction time after the closing time from the hours df.

    hours=pl.from_repr(
         """┌───────────┬─────────────────────┬─────────────────────┐
            │ vendorId  ┆ opening time        ┆ closing time        │
            │ ---       ┆ ---                 ┆ ---                 │
            │ str       ┆ datetime[μs]        ┆ datetime[μs]        │
            ╞═══════════╪═════════════════════╪═════════════════════╡
            │ 115       ┆ 2020-02-25 12:00:01 ┆ 2022-02-25 02:00:02 │
            │ 1146      ┆ 2020-02-16 12:00:11 ┆ 2022-02-26 02:00:48 │
            │ 18143     ┆ 2020-02-25 12:00:16 ┆ 2022-02-25 02:00:46 │
            │ 19122     ┆ 2020-02-16 12:00:19 ┆ 2022-02-16 02:00:42 │
            │ …         ┆ …                   ┆ …                   │
            │ 7744      ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:57 │
            │ 09092     ┆ 2020-02-16 12:00:31 ┆ 2022-02-24 02:00:57 │
            │ 5801      ┆ 2020-11-01 12:00:41 ┆ 2022-02-19 02:00:36 │
            │ 2378      ┆ 2020-02-16 12:00:11 ┆ 2022-02-16 02:00:47 │
            └───────────┴─────────────────────┴─────────────────────┘""")
    

    With that, if you want to return all the transactions where at least one transaction/vendor pair are within the open/close time even if the particular row isn't then you would use a window function (ie. .over) in your filter.

    (
        txns
            .join(hours,  left_on="vendor", right_on="vendorId")
            .filter(
                (
                    pl.col('saletime')
                    .is_between(pl.col("opening time"), pl.col("closing time"))
                    .any()
                ).over(['vendor','prod']))
            .select(txns.columns)
    )
    
    shape: (4, 6)
    ┌────────┬──────┬─────────────────────┬───────┬───────┬─────────┐
    │ vendor ┆ prod ┆ saletime            ┆ value ┆ count ┆ quality │
    │ ---    ┆ ---  ┆ ---                 ┆ ---   ┆ ---   ┆ ---     │
    │ str    ┆ str  ┆ datetime[μs]        ┆ f64   ┆ i64   ┆ f64     │
    ╞════════╪══════╪═════════════════════╪═══════╪═══════╪═════════╡
    │ 7744   ┆ 12   ┆ 2020-02-25 12:10:01 ┆ 10.0  ┆ 6     ┆ 13.7    │
    │ 7744   ┆ 13   ┆ 2020-02-25 12:15:01 ┆ 12.0  ┆ 7     ┆ 21.9    │
    │ 2378   ┆ 20   ┆ 2020-02-16 12:18:01 ┆ 10.0  ┆ 2     ┆ 14.0    │
    │ 2378   ┆ 20   ┆ 2022-02-25 12:20:01 ┆ 10.0  ┆ 2     ┆ 14.0    │
    └────────┴──────┴─────────────────────┴───────┴───────┴─────────┘