Given the following DataFrame:
import polars as pl

df = pl.DataFrame({
    'A': ['a0', 'a0', 'a1', 'a1'],
    'B': ['b1', 'b2', 'b1', 'b2'],
    'x': [0, 10, 5, 1]
})
I want to take the value of column B with the max value of column x within the same value of A (taken from this question).
I know there's a solution with pl.Expr.get() and pl.Expr.arg_max(), but I wanted to use pl.Expr.top_k_by() instead, and for some reason it doesn't work for me with k = 1:
df.with_columns(
    pl.col.B.top_k_by("x", 1).over("A").alias("y")
)
ComputeError: the length of the window expression did not match that of the group
Error originated in expression: 'col("B").top_k_by([dyn int: 1, col("x")]).over([col("A")])'
It does work for k = 2 though.
Do you think it's a bug?
The error message produced when running your code without the window function is a bit more explicit and hints at a solution.
df.with_columns(
    pl.col("B").top_k_by("x", 1)
)
InvalidOperationError: Series B, length 1 doesn't match the DataFrame height of 4
If you want expression: col("B").top_k_by([dyn int: 1, col("x")]) to be broadcasted, ensure it is a scalar (for instance by adding '.first()').
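So top_k_by with k = 1 returns a length-1 column rather than a scalar, and a length-1 column is not broadcast over a group of two rows. With k = 2 the result happens to have the same length as each group, which is why that case runs without error.
Specifically, pl.Expr.first can be used to reduce the result to a scalar and allow for proper broadcasting here.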
df.with_columns(
    pl.col("B").top_k_by("x", 1).first().over("A").alias("y")
)
shape: (4, 4)
┌─────┬─────┬─────┬─────┐
│ A ┆ B ┆ x ┆ y │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ str │
╞═════╪═════╪═════╪═════╡
│ a0 ┆ b1 ┆ 0 ┆ b2 │
│ a0 ┆ b2 ┆ 10 ┆ b2 │
│ a1 ┆ b1 ┆ 5 ┆ b1 │
│ a1 ┆ b2 ┆ 1 ┆ b1 │
└─────┴─────┴─────┴─────┘