Search code examples
dataframepython-polars

implementing cumcount() with subset option in polars


Consider this dummy dataset,

import pandas as pd
import numpy as np

np.random.seed(44)
num_rows = 8

data = {
    'item_id': np.random.choice(['A', 'B'], num_rows),
    'store_id': np.random.choice([1, 2], num_rows),
    'sold_quantity': np.random.randint(0, 5, num_rows),  
    'total_sku_count': np.random.choice([0, 1], num_rows)  
}

df = pd.DataFrame(data)

  item_id  store_id  sold_quantity  total_sku_count
0       A         2              3                1
1       B         2              3                1
2       B         1              3                1
3       B         1              4                0
4       B         2              1                0
5       B         2              1                0
6       A         1              0                1
7       A         2              4                1

I can calculate subset-wise cumulative sum like this in pandas,


subset = ["item_id",'store_id']
df['cum_count'] = df.groupby(subset).cumcount()+1

  item_id  store_id  sold_quantity  total_sku_count  count  cum_count
0       A         2              3                1      1          1
1       B         2              3                1      1          1
2       B         1              3                1      1          1
3       B         1              4                0      2          2
4       B         2              1                0      2          2
5       B         2              1                0      3          3
6       A         1              0                1      1          1
7       A         2              4                1      2          2

I'm trying to implement the same in Polars. Since I have relatively less experience with Polars, I'm having a hard time doing this. I've tried something like this, but it didn't work as expected,

subset = ["item_id",'store_id']
df = df.with_columns((pl.struct(subset).over(subset).cum_count()).alias("cum_counts"))

Please help me if there is a way to achieve this.Your support is much appreciated.


Solution

  • The position of the .over() is important.

    You want the .cum_count() for each group - so it must come before the .over()

    df.with_columns(
       pl.struct(subset).cum_count().over(subset).alias("cum_counts")
    )
    
    shape: (8, 5)
    ┌─────────┬──────────┬───────────────┬─────────────────┬────────────┐
    │ item_id ┆ store_id ┆ sold_quantity ┆ total_sku_count ┆ cum_counts │
    │ ---     ┆ ---      ┆ ---           ┆ ---             ┆ ---        │
    │ str     ┆ i64      ┆ i64           ┆ i64             ┆ u32        │
    ╞═════════╪══════════╪═══════════════╪═════════════════╪════════════╡
    │ A       ┆ 2        ┆ 3             ┆ 1               ┆ 0          │
    │ B       ┆ 2        ┆ 3             ┆ 1               ┆ 0          │
    │ B       ┆ 1        ┆ 3             ┆ 1               ┆ 0          │
    │ B       ┆ 1        ┆ 4             ┆ 0               ┆ 1          │
    │ B       ┆ 2        ┆ 1             ┆ 0               ┆ 1          │
    │ B       ┆ 2        ┆ 1             ┆ 0               ┆ 2          │
    │ A       ┆ 1        ┆ 0             ┆ 1               ┆ 0          │
    │ A       ┆ 2        ┆ 4             ┆ 1               ┆ 1          │
    └─────────┴──────────┴───────────────┴─────────────────┴────────────┘