I have a problem i'm trying to solve using preferably polars but pandas is also fine. Say we have the following dataset (sample):
{
"date" : [2022-01-01, 2022-01-02, 2022-01-03, 2022-01-04, 2022-01-05],
"customers" : [3, 4, 5, 3, 2],
"is_reporting_day?" : [True, False, False, False, True]
}
to make it a bit clearer, here's a table format
date | customers | is_reporting_day? |
---|---|---|
2022-01-01 | 3 | True |
2022-01-02 | 4 | False |
2022-01-03 | 5 | False |
2022-01-04 | 3 | False |
2022-01-05 | 2 | True |
What i want to get is: if reporting_day is True
keep the customers number as is, and if the reporting_day is False
I want to sum up all the customers (4, 5, 3 = 12 + 2 = 14) and add it to the next True value reporting day
so after applying the transformation it should look like this:
date | customers | is_reporting_day? | customers |
---|---|---|---|
2022-01-01 | 3 | True | 3 |
2022-01-05 | 2 | True | 14 |
I've tried to use the the cumsum()
function in polars by using the pl.when
statement but it's the incorrect logic since it sums up from the beginning, i.e., the first day (there are about 700 days).
Note: the solution should be dynamic, i.e., sometimes the gap between a reporting_day and non-reporting_day is 1 day, 2 days, etc.
Any ideas or input is highly appreciated! Thanks in advance!
One way to approach the problem is to create groups based on the is_reporting_day?
column.
If we take the date
when it is True
df.select(
pl.when("is_reporting_day?").then(pl.col("date"))
)
shape: (5, 1)
┌────────────┐
│ date │
│ --- │
│ date │
╞════════════╡
│ 2022-01-01 │
│ null │
│ null │
│ null │
│ 2022-01-05 │
└────────────┘
We can then .backward_fill()
the date
to include the previous False rows.
date = pl.when("is_reporting_day?").then(pl.col("date"))
(df.group_by(date.backward_fill(), maintain_order=True)
.agg(
pl.all().last(),
pl.sum("customers").suffix("_sum")
)
)
shape: (2, 4)
┌────────────┬───────────┬───────────────────┬───────────────┐
│ date ┆ customers ┆ is_reporting_day? ┆ customers_sum │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ bool ┆ i64 │
╞════════════╪═══════════╪═══════════════════╪═══════════════╡
│ 2022-01-01 ┆ 3 ┆ true ┆ 3 │
│ 2022-01-05 ┆ 2 ┆ true ┆ 14 │
└────────────┴───────────┴───────────────────┴───────────────┘