Saying I have 2 dataframes:
discount_dates
product | discount_start_date | discount_end_date |
---|---|---|
Iphone | 2012-12-31 | 2013-03-03 |
Iphone | 2017-01-31 | 2019-12-03 |
Macbook | 2013-02-22 | 2013-03-03 |
Iwatch | 2012-12-31 | 2013-03-03 |
Iwatch | 2013-12-31 | 2014-03-03 |
Iwatch | 2014-12-31 | 2015-03-03 |
data_product
product | date | n_sales |
---|---|---|
Iphone | 2013-03-01 | 1432 |
Iphone | 2013-03-02 | 2314 |
Iphone | 2013-03-03 | 3200 |
Iphone | 2013-03-04 | 123 |
Iphone | 2017-02-04 | 2230 |
Iwatch | 2013-04-01 | 102 |
Iwatch | 2013-04-02 | 12 |
Iwatch | 2013-04-03 | 234 |
What I want is to compose a third dataframe based on data_product
containing only rows where date
is a discount date therefore contained between discount_start_date
and discount_end_date
of the discount_dates
df.
result
product | date | n_sales |
---|---|---|
Iphone | 2013-03-01 | 1432 |
Iphone | 2013-03-02 | 2314 |
Iphone | 2013-03-03 | 3200 |
Iphone | 2017-02-04 | 2230 |
my approach in polars was the following:
import polars as pl
discount_dates = {
"product": ["Iphone", "Iphone", "Macbook", "Iwatch", "Iwatch", "Iwatch"],
"discount_start_date": ["2012-12-31", "2017-01-31", "2013-02-22", "2012-12-31", "2013-12-31", "2014-12-31"],
"discount_end_date": ["2013-03-03", "2019-12-03", "2013-03-03", "2013-03-03", "2014-03-03", "2015-03-03"],
}
discount_dates = pl.DataFrame(discount_dates)
data_product = {
"product": ["Iphone", "Iphone", "Iphone", "Iphone", "Iphone", "Iwatch", "Iwatch", "Iwatch"],
"date": ["2013-03-01", "2013-03-02", "2013-03-03", "2013-03-04", "2017-02-04", "2013-04-01", "2013-04-02", "2013-04-03"],
"n_sales": [1432, 2314, 3200, 123, 2230, 102, 12, 234],
}
data_product = pl.DataFrame(data_product)
discount_dates = discount_dates.group_by("product").agg(pl.col("discount_start_date").min(), pl.col("discount_end_date").max())
data_product = data_product.join(discount_dates, on="product")
promo_product = data_product.filter((pl.col("date").is_between(*["discount_start_date","discount_end_date"])))
Unfortunately though this doesn't take in consideration gaps among the discount intervals and in the example above promo_product
ends up being equal to data_product
. Is there a smart way to tackle this in polars?
You have the right approach, just you don't need to do any group_by
at all in this operation. You want to join every possible discount window here.
I got the desired DataFrame with the code above by simply commenting out discount_dates = ...
(and you can select just those three columns from there).