I am currently trying to replicate ngroup behaviour in polars to get consecutive group indexes (the dataframe will be grouped over two columns). For the R crowd, this would be achieved in the dplyr world with dplyr::group_indices or the newer dplyr::cur_group_id.
As shown in the repro, I've tried a couple of avenues without much success: both approaches miss the sequential group numbering and merely return row counts per group.
Quick repro:
import polars as pl
import pandas as pd
df = pd.DataFrame(
    {
        "id": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "cat": [1, 1, 2, 2, 1, 1, 2, 2],
    }
)
df_pl = pl.from_pandas(df)
print(df.groupby(["id", "cat"]).ngroup())
# This is the desired behaviour
# 0 0
# 1 0
# 2 1
# 3 1
# 4 2
# 5 2
# 6 3
# 7 3
print(df_pl.select(pl.len().over("id", "cat")))
# This only counts the observations in each group
# ┌─────┐
# │ len │
# │ --- │
# │ u32 │
# ╞═════╡
# │ 2 │
# │ 2 │
# │ 2 │
# │ 2 │
# │ 2 │
# │ 2 │
# │ 2 │
# │ 2 │
# └─────┘
print(df_pl.group_by("id", "cat").agg(pl.len().alias("test")))
# shape: (4, 3)
# ┌─────┬─────┬──────┐
# │ id ┆ cat ┆ test │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ u32 │
# ╞═════╪═════╪══════╡
# │ a ┆ 1 ┆ 2 │
# │ a ┆ 2 ┆ 2 │
# │ b ┆ 1 ┆ 2 │
# │ b ┆ 2 ┆ 2 │
# └─────┴─────┴──────┘
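We can use rank for this: take the first row index of each group, then dense-rank those values so the groups are numbered consecutively.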
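(
    df_pl.with_row_index()
    .with_columns(
        # First row index of each (id, cat) group, dense-ranked and shifted to start at 0.
        # The result keeps the name "index", so it overwrites the row index column.
        pl.first("index").over("id", "cat").rank("dense") - 1
    )
)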
shape: (8, 3)
┌───────┬─────┬─────┐
│ index ┆ id ┆ cat │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ i64 │
╞═══════╪═════╪═════╡
│ 0 ┆ a ┆ 1 │
│ 0 ┆ a ┆ 1 │
│ 1 ┆ a ┆ 2 │
│ 1 ┆ a ┆ 2 │
│ 2 ┆ b ┆ 1 │
│ 2 ┆ b ┆ 1 │
│ 3 ┆ b ┆ 2 │
│ 3 ┆ b ┆ 2 │
└───────┴─────┴─────┘