Replicate pandas ngroup behaviour in polars


I am currently trying to replicate ngroup behaviour in polars to get consecutive group indexes (the dataframe will be grouped over two columns). For the R crowd, this would be achieved in the dplyr world with dplyr::group_indices or the newer dplyr::cur_group_id.

As shown in the repro below, I've tried a couple of avenues without much success: both approaches miss the sequential group numbering and merely return row counts per group.

Quick repro:

import polars as pl
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "cat": [1, 1, 2, 2, 1, 1, 2, 2],
    }
)

df_pl = pl.from_pandas(df)

print(df.groupby(["id", "cat"]).ngroup())
# This is the desired behaviour
# 0    0
# 1    0
# 2    1
# 3    1
# 4    2
# 5    2
# 6    3
# 7    3

print(df_pl.select(pl.len().over("id", "cat")))
# This only counts the observations in each group
# ┌─────┐
# │ len │
# │ --- │
# │ u32 │
# ╞═════╡
# │ 2   │
# │ 2   │
# │ 2   │
# │ 2   │
# │ 2   │
# │ 2   │
# │ 2   │
# │ 2   │
# └─────┘

print(df_pl.group_by("id", "cat").agg(pl.len().alias("test")))
# shape: (4, 3)
# ┌─────┬─────┬──────┐
# │ id  ┆ cat ┆ test │
# │ --- ┆ --- ┆ ---  │
# │ str ┆ i64 ┆ u32  │
# ╞═════╪═════╪══════╡
# │ a   ┆ 1   ┆ 2    │
# │ a   ┆ 2   ┆ 2    │
# │ b   ┆ 1   ┆ 2    │
# │ b   ┆ 2   ┆ 2    │
# └─────┴─────┴──────┘

Solution

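  • We can use a dense rank for this: take the first row index within each (id, cat) group, dense-rank those values, and subtract 1 so the numbering starts at 0. Note that the result replaces the index column with the group id:

    (df_pl.with_row_index()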
       .with_columns(
           pl.first("index").over("id", "cat").rank("dense") - 1
       )
    )
    
    shape: (8, 3)
    ┌───────┬─────┬─────┐
    │ index ┆ id  ┆ cat │
    │ ---   ┆ --- ┆ --- │
    │ u32   ┆ str ┆ i64 │
    ╞═══════╪═════╪═════╡
    │ 0     ┆ a   ┆ 1   │
    │ 0     ┆ a   ┆ 1   │
    │ 1     ┆ a   ┆ 2   │
    │ 1     ┆ a   ┆ 2   │
    │ 2     ┆ b   ┆ 1   │
    │ 2     ┆ b   ┆ 1   │
    │ 3     ┆ b   ┆ 2   │
    │ 3     ┆ b   ┆ 2   │
    └───────┴─────┴─────┘