I have a Polars dataframe like so:
c1 | c2 | c3 |
---|---|---|
a | a | 1 |
a | a | 1 |
a | b | 1 |
a | c | 1 |
d | a | 1 |
d | b | 1 |
I am trying to assign a number to each group of (c2, c3) within c1, so that would look like this:
c1 | c2 | c3 | rank |
---|---|---|---|
a | a | 1 | 0 |
a | a | 1 | 0 |
a | b | 1 | 1 |
a | c | 1 | 2 |
d | a | 1 | 0 |
d | b | 1 | 1 |
How do I accomplish this?
I see how to do a global ranking:
df.join(
    df.select(["c1", "c2", "c3"])
    .unique()
    .with_columns(rank=pl.int_range(1, pl.len() + 1)),
    on=["c1", "c2", "c3"],
)
but that ranks globally, not within each c1 group. I also wonder if it is possible to do this with over() instead of the groupby/join pattern.
Create a struct of columns c2 and c3 using pl.struct("c2", "c3"), compute the dense rank over c1, and then subtract 1 because the ranks start from 1 by default:
pl.struct("c2", "c3").rank("dense").over("c1") - 1
Full code:
import polars as pl

df = pl.DataFrame(
    {
        "c1": ["a", "a", "a", "a", "d", "d"],
        "c2": ["a", "a", "b", "c", "a", "b"],
        "c3": [1, 1, 1, 1, 1, 1],
    }
)

# Dense-rank each (c2, c3) pair within its c1 group; subtract 1 so numbering starts at 0.
df2 = df.with_columns(rank=pl.struct("c2", "c3").rank("dense").over("c1") - 1)
print(df2)
Output:
┌─────┬─────┬─────┬──────┐
│ c1  ┆ c2  ┆ c3  ┆ rank │
│ --- ┆ --- ┆ --- ┆ ---  │
│ str ┆ str ┆ i64 ┆ u32  │
╞═════╪═════╪═════╪══════╡
│ a   ┆ a   ┆ 1   ┆ 0    │
│ a   ┆ a   ┆ 1   ┆ 0    │
│ a   ┆ b   ┆ 1   ┆ 1    │
│ a   ┆ c   ┆ 1   ┆ 2    │
│ d   ┆ a   ┆ 1   ┆ 0    │
│ d   ┆ b   ┆ 1   ┆ 1    │
└─────┴─────┴─────┴──────┘