Search code examples
pythonpython-polars

Polars Pivot Dataframe an count the cumulative uniques ID


I have a polars dataframe that contains and ID, DATE and OS. For each day i would like to count how many uniques ID are until that day.

import polars as pl
df = (
    pl.DataFrame(
        {
            "DAY": [1,1,1,2,2,2,3,3,3],
            "OS" : ["A","B","A","B","A","B","A","B","A"],
            "ID": ["X","Y","Z","W","X","J","K","L","X"]
        }
    )
)

Desired Output:

shape: (3, 3)
┌─────┬─────┬─────┐
│ DAY ┆ A   ┆ B   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 1   │
│ 2   ┆ 2   ┆ 3   │
│ 3   ┆ 3   ┆ 4   │
└─────┴─────┴─────┘

It should looks like this, because on day 1, the are 3 values and 3 ID . On day 2 the ID "X" its reapeted with the same OS so, the columns A remains the same, and the other 2 are different so add 2 to B. On day 3, the ID X its reapeated with A, and the other 2 are different, so it sums again over each column.

I think it could be solved with an approach like the following:

(
    df
    .pivot(
        index="DAY",
        on="OS",
        aggregate_function=(pl.col("ID").cum_sum().unique())
    )
)

Solution

  • You can use Expr.is_first_distinct mark each of the first distinct entries of 'ID' within each 'OS'. Then you can pivot those results and take their cumulative sum.

    import polars as pl
    df = (
        pl.DataFrame(
            {
                "DAY": [1,1,1,2,2,2,3,3,3],
                "OS" : ["A","B","A","B","A","B","A","B","A"],
                "ID": ["X","Y","Z","W","X","J","K","L","X"]
            }
        )
    )
    
    print(
        df
        .with_columns(pl.col('ID').is_first_distinct().over('OS'))
        .pivot(
            index='DAY',
            on='OS',
            aggregate_function=pl.col('ID').sum()
        )
        .with_columns(pl.exclude('DAY').cum_sum())
    )
    # shape: (3, 3)
    # ┌─────┬─────┬─────┐
    # │ DAY ┆ A   ┆ B   │
    # │ --- ┆ --- ┆ --- │
    # │ i64 ┆ u32 ┆ u32 │
    # ╞═════╪═════╪═════╡
    # │ 1   ┆ 2   ┆ 1   │
    # │ 2   ┆ 2   ┆ 3   │
    # │ 3   ┆ 3   ┆ 4   │
    # └─────┴─────┴─────┘