I have a polars DataFrame with the columns DAY, OS and ID. For each day, I would like to count how many unique IDs have appeared in each OS up to and including that day.
import polars as pl

df = pl.DataFrame(
    {
        "DAY": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "OS": ["A", "B", "A", "B", "A", "B", "A", "B", "A"],
        "ID": ["X", "Y", "Z", "W", "X", "J", "K", "L", "X"],
    }
)
Desired Output:
shape: (3, 3)
┌─────┬─────┬─────┐
│ DAY ┆ A ┆ B │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 1 │
│ 2 ┆ 2 ┆ 3 │
│ 3 ┆ 3 ┆ 4 │
└─────┴─────┴─────┘
It should look like this because on day 1 there are 3 rows with 3 distinct IDs: 2 under OS "A" and 1 under "B". On day 2, the ID "X" is repeated with the same OS, so column A stays the same, while the other 2 IDs are new, adding 2 to B. On day 3, the ID "X" is repeated under A, and the other 2 IDs are new, so each column increases again.
I think it could be solved with an approach like the following:
(
    df
    .pivot(
        index="DAY",
        on="OS",
        aggregate_function=(pl.col("ID").cum_sum().unique())
    )
)
You can use Expr.is_first_distinct to mark the first distinct occurrence of each 'ID' within each 'OS'. Then you can pivot those boolean marks and take their cumulative sum.
import polars as pl

df = pl.DataFrame(
    {
        "DAY": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "OS": ["A", "B", "A", "B", "A", "B", "A", "B", "A"],
        "ID": ["X", "Y", "Z", "W", "X", "J", "K", "L", "X"],
    }
)

print(
    df
    # replace ID with True on its first occurrence per OS, False on repeats
    .with_columns(pl.col('ID').is_first_distinct().over('OS'))
    # one row per DAY, one column per OS, summing the boolean marks
    .pivot(
        index='DAY',
        on='OS',
        aggregate_function='sum'
    )
    # running total of newly seen IDs per OS column
    .with_columns(pl.exclude('DAY').cum_sum())
)
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ DAY ┆ A ┆ B │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ u32 ┆ u32 │
# ╞═════╪═════╪═════╡
# │ 1 ┆ 2 ┆ 1 │
# │ 2 ┆ 2 ┆ 3 │
# │ 3 ┆ 3 ┆ 4 │
# └─────┴─────┴─────┘