Search code examples
pandaspython-polars

How can I sum cumulatively values on condition in polars or pandas? [with example]


I have a problem i'm trying to solve using preferably polars but pandas is also fine. Say we have the following dataset (sample):

{
  "date" : [2022-01-01, 2022-01-02, 2022-01-03, 2022-01-04, 2022-01-05],
  "customers" : [3, 4, 5, 3, 2],
  "is_reporting_day?" : [True, False, False, False, True]
}

to make it a bit clearer, here's a table format

date customers is_reporting_day?
2022-01-01 3 True
2022-01-02 4 False
2022-01-03 5 False
2022-01-04 3 False
2022-01-05 2 True

What i want to get is: if reporting_day is True keep the customers number as is, and if the reporting_day is False I want to sum up all the customers (4, 5, 3 = 12 + 2 = 14) and add it to the next True value reporting day

so after applying the transformation it should look like this:

date customers is_reporting_day? customers
2022-01-01 3 True 3
2022-01-05 2 True 14

I've tried to use the the cumsum() function in polars by using the pl.when statement but it's the incorrect logic since it sums up from the beginning, i.e., the first day (there are about 700 days).

Note: the solution should be dynamic, i.e., sometimes the gap between a reporting_day and non-reporting_day is 1 day, 2 days, etc.

Any ideas or input is highly appreciated! Thanks in advance!

‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎‎


Solution

  • One way to approach the problem is to create groups based on the is_reporting_day? column.

    If we take the date when it is True

    df.select(
       pl.when("is_reporting_day?").then(pl.col("date"))
    )
    
    shape: (5, 1)
    ┌────────────┐
    │ date       │
    │ ---        │
    │ date       │
    ╞════════════╡
    │ 2022-01-01 │
    │ null       │
    │ null       │
    │ null       │
    │ 2022-01-05 │
    └────────────┘
    

    We can then .backward_fill() the date to include the previous False rows.

    date = pl.when("is_reporting_day?").then(pl.col("date"))
    
    (df.group_by(date.backward_fill(), maintain_order=True)
       .agg(
          pl.all().last(),
          pl.sum("customers").suffix("_sum")
       )
    )
    
    shape: (2, 4)
    ┌────────────┬───────────┬───────────────────┬───────────────┐
    │ date       ┆ customers ┆ is_reporting_day? ┆ customers_sum │
    │ ---        ┆ ---       ┆ ---               ┆ ---           │
    │ date       ┆ i64       ┆ bool              ┆ i64           │
    ╞════════════╪═══════════╪═══════════════════╪═══════════════╡
    │ 2022-01-01 ┆ 3         ┆ true              ┆ 3             │
    │ 2022-01-05 ┆ 2         ┆ true              ┆ 14            │
    └────────────┴───────────┴───────────────────┴───────────────┘