Filter polars dataframe using given time intervals

Saying I have 2 dataframes:

discount_dates

product	discount_start_date	discount_end_date
Iphone	2012-12-31	2013-03-03
Iphone	2017-01-31	2019-12-03
Macbook	2013-02-22	2013-03-03
Iwatch	2012-12-31	2013-03-03
Iwatch	2013-12-31	2014-03-03
Iwatch	2014-12-31	2015-03-03

data_product

product	date	n_sales
Iphone	2013-03-01	1432
Iphone	2013-03-02	2314
Iphone	2013-03-03	3200
Iphone	2013-03-04	123
Iphone	2017-02-04	2230
Iwatch	2013-04-01	102
Iwatch	2013-04-02	12
Iwatch	2013-04-03	234

What I want is to compose a third dataframe based on data_product containing only rows where date is a discount date therefore contained between discount_start_date and discount_end_date of the discount_dates df.

result

product	date	n_sales
Iphone	2013-03-01	1432
Iphone	2013-03-02	2314
Iphone	2013-03-03	3200
Iphone	2017-02-04	2230

my approach in polars was the following:

import polars as pl

discount_dates = {
    "product": ["Iphone", "Iphone", "Macbook", "Iwatch", "Iwatch", "Iwatch"],
    "discount_start_date": ["2012-12-31", "2017-01-31", "2013-02-22", "2012-12-31", "2013-12-31", "2014-12-31"],
    "discount_end_date": ["2013-03-03", "2019-12-03", "2013-03-03", "2013-03-03", "2014-03-03", "2015-03-03"],
}

discount_dates = pl.DataFrame(discount_dates)

data_product = {
    "product": ["Iphone", "Iphone", "Iphone", "Iphone", "Iphone", "Iwatch", "Iwatch", "Iwatch"],
    "date": ["2013-03-01", "2013-03-02", "2013-03-03", "2013-03-04", "2017-02-04", "2013-04-01", "2013-04-02", "2013-04-03"],
    "n_sales": [1432, 2314, 3200, 123, 2230, 102, 12, 234],
}

data_product = pl.DataFrame(data_product)

discount_dates = discount_dates.group_by("product").agg(pl.col("discount_start_date").min(), pl.col("discount_end_date").max())
data_product = data_product.join(discount_dates, on="product")
promo_product = data_product.filter((pl.col("date").is_between(*["discount_start_date","discount_end_date"])))

Unfortunately though this doesn't take in consideration gaps among the discount intervals and in the example above promo_product ends up being equal to data_product. Is there a smart way to tackle this in polars?

Solution

You have the right approach, just you don't need to do any group_by at all in this operation. You want to join every possible discount window here.

I got the desired DataFrame with the code above by simply commenting out discount_dates = ... (and you can select just those three columns from there).