Search code examples
pythondataframepython-polars

Filter polars dataframe using given time intervals


Saying I have 2 dataframes:

discount_dates

product discount_start_date discount_end_date
Iphone 2012-12-31 2013-03-03
Iphone 2017-01-31 2019-12-03
Macbook 2013-02-22 2013-03-03
Iwatch 2012-12-31 2013-03-03
Iwatch 2013-12-31 2014-03-03
Iwatch 2014-12-31 2015-03-03

data_product

product date n_sales
Iphone 2013-03-01 1432
Iphone 2013-03-02 2314
Iphone 2013-03-03 3200
Iphone 2013-03-04 123
Iphone 2017-02-04 2230
Iwatch 2013-04-01 102
Iwatch 2013-04-02 12
Iwatch 2013-04-03 234

What I want is to compose a third dataframe based on data_product containing only rows where date is a discount date therefore contained between discount_start_date and discount_end_date of the discount_dates df.

result

product date n_sales
Iphone 2013-03-01 1432
Iphone 2013-03-02 2314
Iphone 2013-03-03 3200
Iphone 2017-02-04 2230

my approach in polars was the following:

import polars as pl

discount_dates = {
    "product": ["Iphone", "Iphone", "Macbook", "Iwatch", "Iwatch", "Iwatch"],
    "discount_start_date": ["2012-12-31", "2017-01-31", "2013-02-22", "2012-12-31", "2013-12-31", "2014-12-31"],
    "discount_end_date": ["2013-03-03", "2019-12-03", "2013-03-03", "2013-03-03", "2014-03-03", "2015-03-03"],
}

discount_dates = pl.DataFrame(discount_dates)

data_product = {
    "product": ["Iphone", "Iphone", "Iphone", "Iphone", "Iphone", "Iwatch", "Iwatch", "Iwatch"],
    "date": ["2013-03-01", "2013-03-02", "2013-03-03", "2013-03-04", "2017-02-04", "2013-04-01", "2013-04-02", "2013-04-03"],
    "n_sales": [1432, 2314, 3200, 123, 2230, 102, 12, 234],
}

data_product = pl.DataFrame(data_product)

discount_dates = discount_dates.groupby("product").agg(pl.col("discount_start_date").min(), pl.col("discount_end_date").max())
data_product = data_product.join(discount_dates, on="product")
promo_product = data_product.filter((pl.col("date").is_between(*["discount_start_date","discount_end_date"])))

Unfortunately though this doesn't take in consideration gaps among the discount intervals and in the example above promo_product ends up being equal to data_product. Is there a smart way to tackle this in polars?


Solution

  • You have the right approach, just you don't need to do any groupby at all in this operation. You want to join every possible discount window here.

    I got the desired DataFrame with the code above by simply commenting out discount_dates = ... (and you can select just those three columns from there).