Say I have a DataFrame with an id column like this:
df = pl.from_repr("""
┌─────┐
│ id │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 1 │
│ 1 │
│ 2 │
│ 2 │
│ 3 │
│ 3 │
└─────┘
""")
I want to aggregate a running count over the id column, giving this result:
┌─────┬───────┐
│ id ┆ count │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═══════╡
│ 1 ┆ 1 │
│ 1 ┆ 2 │
│ 1 ┆ 3 │
│ 2 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ 1 │
│ 3 ┆ 2 │
└─────┴───────┘
My attempt involved creating a dummy column, which I think produced the desired result but seems a bit hacky.
(
df.with_columns(
pl.lit(1).alias("ones")
)
.with_columns(
(pl.col("ones").cum_sum().over("id")).alias("count")
)
.drop("ones")
)
However when I try this:
(
df.with_columns(
(pl.lit(1).cum_sum().over("id")).alias("count")
)
.drop("ones")
)
I get the error "ComputeError: the length of the window expression did not match that of the group".
Is there a better way to do this? What am I missing in my attempt above?
cum_count seems to do the job with a 1-shift adjustment (or any shift you want, of course).
There is also int_range:
df.with_columns(
count = pl.col('id').cum_count().over('id'),
count2 = pl.int_range(pl.len()).over('id') + 1
)
shape: (7, 3)
┌─────┬───────┬────────┐
│ id ┆ count ┆ count2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞═════╪═══════╪════════╡
│ 1 ┆ 1 ┆ 1 │
│ 1 ┆ 2 ┆ 2 │
│ 1 ┆ 3 ┆ 3 │
│ 2 ┆ 1 ┆ 1 │
│ 2 ┆ 2 ┆ 2 │
│ 3 ┆ 1 ┆ 1 │
│ 3 ┆ 2 ┆ 2 │
└─────┴───────┴────────┘