I have a polars DataFrame with the columns DAY, OS and ID. For each day, I would like to count how many unique IDs have appeared in each OS up to and including that day.
import polars as pl

df = pl.DataFrame(
    {
        "DAY": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "OS": ["A", "B", "A", "B", "A", "B", "A", "B", "A"],
        "ID": ["X", "Y", "Z", "W", "X", "J", "K", "L", "X"],
    }
)
Desired Output:
shape: (3, 3)
┌─────┬─────┬─────┐
│ DAY ┆ A ┆ B │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 1 │
│ 2 ┆ 2 ┆ 3 │
│ 3 ┆ 3 ┆ 4 │
└─────┴─────┴─────┘
It should look like this because on day 1 there are 3 rows with 3 distinct IDs: 2 under OS "A" and 1 under "B". On day 2, the ID "X" is repeated with the same OS, so column A stays the same, while the other 2 IDs are new, adding 2 to B. On day 3, the ID "X" is repeated under A, and the other 2 IDs are new, so each column increases again.
I think it could be solved with an approach like the following:
(
    df
    .pivot(
        index="DAY",
        on="OS",
        aggregate_function=(pl.col("ID").cum_sum().unique())
    )
)
You can use Expr.is_first_distinct to mark the first distinct occurrence of each 'ID' within each 'OS'. Then you can pivot those boolean marks and take their cumulative sum.
import polars as pl

df = pl.DataFrame(
    {
        "DAY": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "OS": ["A", "B", "A", "B", "A", "B", "A", "B", "A"],
        "ID": ["X", "Y", "Z", "W", "X", "J", "K", "L", "X"],
    }
)

print(
    df
    # replace ID with True on its first occurrence per OS, False on repeats
    .with_columns(pl.col('ID').is_first_distinct().over('OS'))
    # one row per DAY, one column per OS, summing the boolean marks
    .pivot(
        index='DAY',
        on='OS',
        aggregate_function='sum'
    )
    # running total of newly seen IDs per OS column
    .with_columns(pl.exclude('DAY').cum_sum())
)
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ DAY ┆ A ┆ B │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ u32 ┆ u32 │
# ╞═════╪═════╪═════╡
# │ 1 ┆ 2 ┆ 1 │
# │ 2 ┆ 2 ┆ 3 │
# │ 3 ┆ 3 ┆ 4 │
# └─────┴─────┴─────┘