Is there a way to rewrite this:
import numpy
import polars

df = (polars
    .DataFrame(dict(
        j=numpy.random.randint(10, 99, 20),
    ))
    .with_row_index()
    .select(
        g=polars.col('index') // 3,
        j='j'
    )
    .with_columns(rn=1)
    .with_columns(
        rn=polars.col('rn').shift().fill_null(0).cum_sum().over('g')
    )
)
print(df)
g (u32) j (i64) rn (i32)
0 47 0
0 22 1
0 82 2
1 19 0
1 85 1
1 15 2
2 89 0
2 74 1
2 26 2
3 11 0
3 86 1
3 81 2
4 16 0
4 35 1
4 60 2
5 30 0
5 28 1
5 94 2
6 21 0
6 38 1
shape: (20, 3)
so that it adds the rn column without first having to add a column full of 1s? I.e. somehow rewrite this part:
    .with_columns(rn=1)
    .with_columns(
        rn=polars.col('rn').shift().fill_null(0).cum_sum().over('g')
    )
so that:
.with_columns(rn=1)
is not required? Basically reduce two expressions to one.
Or any other / better way to add a row count per group?
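This can be done by generating a pl.int_range() from the length of each group (pl.len()), evaluated per group with .over(). Here pl is the usual alias from import polars as pl; with the question's plain import polars, write polars.int_range and polars.len instead.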
df.with_columns(rn = pl.int_range(pl.len()).over("g"))
shape: (20, 3)
┌─────┬─────┬─────┐
│ g ┆ j ┆ rn │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0 ┆ 14 ┆ 0 │ # group_len = 3, range = [0, 1, 2]
│ 0 ┆ 81 ┆ 1 │
│ 0 ┆ 72 ┆ 2 │
│ 1 ┆ 34 ┆ 0 │
│ 1 ┆ 90 ┆ 1 │
│ … ┆ … ┆ … │
│ 5 ┆ 26 ┆ 0 │
│ 5 ┆ 44 ┆ 1 │
│ 5 ┆ 27 ┆ 2 │
│ 6 ┆ 70 ┆ 0 │
│ 6 ┆ 86 ┆ 1 │
└─────┴─────┴─────┘