python python-polars

Polars top_k_by with over, k = 1. Bug?


Given the following DataFrame:

import polars as pl

df = pl.DataFrame({
    'A': ['a0', 'a0', 'a1', 'a1'],
    'B': ['b1', 'b2', 'b1', 'b2'],
    'x': [0, 10, 5, 1]
})

I want to take the value of column B from the row with the maximum value of column x within each group of A (taken from this question).

I know there's a solution using pl.Expr.get() and pl.Expr.arg_max(), but I wanted to use pl.Expr.top_k_by() instead, and for some reason it doesn't work for me with k = 1:

df.with_columns(
    pl.col.B.top_k_by("x", 1).over("A").alias("y")
)
ComputeError: the length of the window expression did not match that of the group

Error originated in expression: 'col("B").top_k_by([dyn int: 1, col("x")]).over([col("A")])'

It does work for k = 2 though. Do you think it's a bug?


Solution

  • The error message produced when running your code without the window function is a bit more explicit and hints at a solution.

    df.with_columns(
        pl.col("B").top_k_by("x", 1)
    )
    
    InvalidOperationError: Series B, length 1 doesn't match the DataFrame height of 4
    
    If you want expression: col("B").top_k_by([dyn int: 1, col("x")]) to be broadcasted, ensure it is a scalar (for instance by adding '.first()').
    

    In particular, pl.Expr.first() reduces the length-1 result to a scalar, which can then be broadcast properly across each window.

    df.with_columns(
        pl.col("B").top_k_by("x", 1).first().over("A").alias("y")
    )
    
    shape: (4, 4)
    ┌─────┬─────┬─────┬─────┐
    │ A   ┆ B   ┆ x   ┆ y   │
    │ --- ┆ --- ┆ --- ┆ --- │
    │ str ┆ str ┆ i64 ┆ str │
    ╞═════╪═════╪═════╪═════╡
    │ a0  ┆ b1  ┆ 0   ┆ b2  │
    │ a0  ┆ b2  ┆ 10  ┆ b2  │
    │ a1  ┆ b1  ┆ 5   ┆ b1  │
    │ a1  ┆ b2  ┆ 1   ┆ b1  │
    └─────┴─────┴─────┴─────┘