
Polars cumulative sum over consecutive groups


I have a DataFrame like so:

| Date       | Group | Value |
|------------|-------|-------|
| 2020-01-01 | 0     | 5     |
| 2020-01-02 | 0     | 8     |
| 2020-01-03 | 0     | 9     |
| 2020-01-01 | 1     | 5     |
| 2020-01-02 | 1     | -1    |
| 2020-01-03 | 1     | 2     |
| 2020-01-01 | 2     | -2    |
| 2020-01-02 | 2     | -1    |
| 2020-01-03 | 2     | 7     |

I want a cumulative sum per "Date", accumulated consecutively in the order of "Group", something like:

| Date       | Group | Value            |
|------------|-------|------------------|
| 2020-01-01 | 0     | 5                |
| 2020-01-02 | 0     | 8                |
| 2020-01-03 | 0     | 9                |
| 2020-01-01 | 1     | 10 (= 5 + 5)     |
| 2020-01-02 | 1     | 7  (= 8 - 1)     |
| 2020-01-03 | 1     | 11 (= 9 + 2)     |
| 2020-01-01 | 2     | 8  (= 5 + 5 - 2) |
| 2020-01-02 | 2     | 6  (= 8 - 1 - 1) |
| 2020-01-03 | 2     | 18 (= 9 + 2 + 7) |

The explanation for these values is as follows. Group 0 precedes group 1, and group 1 precedes group 2. For group 0 there is nothing to do: the cumulative sum up to this group is just the original values. For group 1, we add the values of group 0 for each date. Similarly, for group 2, we add the values of both group 1 and group 0.
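
To make the accumulation concrete, here is a minimal pure-Python sketch (not Polars, just the arithmetic): it walks the rows in group order and keeps a running total per date, reproducing the expected column above.

from collections import defaultdict

rows = [
    ("2020-01-01", 0, 5), ("2020-01-02", 0, 8), ("2020-01-03", 0, 9),
    ("2020-01-01", 1, 5), ("2020-01-02", 1, -1), ("2020-01-03", 1, 2),
    ("2020-01-01", 2, -2), ("2020-01-02", 2, -1), ("2020-01-03", 2, 7),
]

running = defaultdict(int)  # running total per date across groups
for date, group, value in rows:
    running[date] += value  # add this group's value to its date's total
    print(date, group, running[date])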

What I have tried is to do this via a helper pivot table. I loop over the groups, take a horizontal sum over a partial selection of the pivoted columns, and append the results to a list of new values. Then I attach these new values as a column to the original DataFrame.

from io import StringIO

import polars as pl


df = pl.read_csv(StringIO("""
Date,Group,Value
2020-01-01,0,5
2020-01-02,0,8
2020-01-03,0,9
2020-01-01,1,5
2020-01-02,1,-1
2020-01-03,1,2
2020-01-01,2,-2
2020-01-02,2,-1
2020-01-03,2,7
"""), try_parse_dates=True)

# Pivot to one column per Group, one row per Date.
ddf = df.pivot(on='Group', index='Date', values='Value')

# For each group i, sum the pivoted columns 0..i horizontally
# (i.e. accumulate all preceding groups) and collect the results.
new_vals = []
for i in range(df['Group'].max() + 1):
    new_vals.extend(
        ddf.select([pl.col(f'{j}') for j in range(i + 1)])
           .sum_horizontal()
           .to_list()
    )

df.with_columns(pl.Series(new_vals).alias('CumSumValue'))

Is there a way to do this without loops or all this "inelegance"?


Solution

  • Assuming the rows are already ordered by Group, you can just create an index over the groups and then take a cumulative sum over Date and Index:

    # Index = the row's position within its Group (1, 2, 3, ...),
    # which here corresponds to the date's position in that group.
    df = df.with_columns(pl.col("Date").cum_count().over("Group").alias("Index"))

    # Cumulative sum within each (Date, Index) window, i.e. across groups
    # for the same date, in the order the groups appear.
    df.select(
        pl.col("Date", "Group"),
        pl.col("Value").cum_sum().over("Date", "Index").alias("Value"),
    )
    
    
    shape: (9, 3)
    ┌────────────┬───────┬───────┐
    │ Date       ┆ Group ┆ Value │
    │ ---        ┆ ---   ┆ ---   │
    │ date       ┆ i64   ┆ i64   │
    ╞════════════╪═══════╪═══════╡
    │ 2020-01-01 ┆ 0     ┆ 5     │
    │ 2020-01-02 ┆ 0     ┆ 8     │
    │ 2020-01-03 ┆ 0     ┆ 9     │
    │ 2020-01-01 ┆ 1     ┆ 10    │
    │ 2020-01-02 ┆ 1     ┆ 7     │
    │ 2020-01-03 ┆ 1     ┆ 11    │
    │ 2020-01-01 ┆ 2     ┆ 8     │
    │ 2020-01-02 ┆ 2     ┆ 6     │
    │ 2020-01-03 ┆ 2     ┆ 18    │
    └────────────┴───────┴───────┘
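
    The two steps can also be chained into a single expression; this is the same logic, just without reassigning df first:

    out = (
        df.with_columns(pl.col("Date").cum_count().over("Group").alias("Index"))
          .select(
              pl.col("Date", "Group"),
              pl.col("Value").cum_sum().over("Date", "Index").alias("Value"),
          )
    )

    If every Group contributes exactly one row per Date and the rows are already sorted by Group, partitioning by Date alone (cum_sum().over("Date")) should give the same result here; keeping Index in the window key just makes the pairing by row position within each group explicit.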