Saying I have 2 dataframes:
discount_dates
product | discount_start_date | discount_end_date |
---|---|---|
Iphone | 2012-12-31 | 2013-03-03 |
Iphone | 2017-01-31 | 2019-12-03 |
Macbook | 2013-02-22 | 2013-03-03 |
Iwatch | 2012-12-31 | 2013-03-03 |
Iwatch | 2013-12-31 | 2014-03-03 |
Iwatch | 2014-12-31 | 2015-03-03 |
data_product
product | date | n_sales |
---|---|---|
Iphone | 2013-03-01 | 1432 |
Iphone | 2013-03-02 | 2314 |
Iphone | 2013-03-03 | 3200 |
Iphone | 2013-03-04 | 123 |
Iphone | 2017-02-04 | 2230 |
Iwatch | 2013-04-01 | 102 |
Iwatch | 2013-04-02 | 12 |
Iwatch | 2013-04-03 | 234 |
What I want is to compose a third dataframe based on data_product
containing only rows where date
is a discount date therefore contained between discount_start_date
and discount_end_date
of the discount_dates
df.
result
product | date | n_sales |
---|---|---|
Iphone | 2013-03-01 | 1432 |
Iphone | 2013-03-02 | 2314 |
Iphone | 2013-03-03 | 3200 |
Iphone | 2017-02-04 | 2230 |
my approach in polars was the following:
import polars as pl
discount_dates = {
"product": ["Iphone", "Iphone", "Macbook", "Iwatch", "Iwatch", "Iwatch"],
"discount_start_date": ["2012-12-31", "2017-01-31", "2013-02-22", "2012-12-31", "2013-12-31", "2014-12-31"],
"discount_end_date": ["2013-03-03", "2019-12-03", "2013-03-03", "2013-03-03", "2014-03-03", "2015-03-03"],
}
discount_dates = pl.DataFrame(discount_dates)
data_product = {
"product": ["Iphone", "Iphone", "Iphone", "Iphone", "Iphone", "Iwatch", "Iwatch", "Iwatch"],
"date": ["2013-03-01", "2013-03-02", "2013-03-03", "2013-03-04", "2017-02-04", "2013-04-01", "2013-04-02", "2013-04-03"],
"n_sales": [1432, 2314, 3200, 123, 2230, 102, 12, 234],
}
data_product = pl.DataFrame(data_product)
discount_dates = discount_dates.groupby("product").agg(pl.col("discount_start_date").min(), pl.col("discount_end_date").max())
data_product = data_product.join(discount_dates, on="product")
promo_product = data_product.filter((pl.col("date").is_between(*["discount_start_date","discount_end_date"])))
Unfortunately though this doesn't take in consideration gaps among the discount intervals and in the example above promo_product
ends up being equal to data_product
. Is there a smart way to tackle this in polars?
You have the right approach, just you don't need to do any groupby
at all in this operation. You want to join every possible discount window here.
I got the desired DataFrame with the code above by simply commenting out discount_dates = ...
(and you can select just those three columns from there).