Search code examples
dataframepython-polars

How to select a row with minimum value over some other column


How to select a rows from dataframe based on minimum value in one column aggregated over another column. For example

a = pl.DataFrame({'a': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3], 'b': [10, 5, 10, 6, 5, 4, 5, 30, 25, 24, 30, 30], 'c': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})
print(a)
-----------------
shape: (12, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 10  ┆ 1   │
│ 1   ┆ 5   ┆ 2   │
│ 1   ┆ 10  ┆ 3   │
│ 2   ┆ 6   ┆ 4   │
│ 2   ┆ 5   ┆ 5   │
│ 2   ┆ 4   ┆ 6   │
│ 2   ┆ 5   ┆ 7   │
│ 3   ┆ 30  ┆ 8   │
│ 3   ┆ 25  ┆ 9   │
│ 3   ┆ 24  ┆ 10  │
│ 3   ┆ 30  ┆ 11  │
│ 3   ┆ 30  ┆ 12  │
└─────┴─────┴─────┘

I need the following dataframe

shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 5   ┆ 2   │
│ 2   ┆ 4   ┆ 6   │
│ 3   ┆ 24  ┆ 10  │
└─────┴─────┴─────┘

so it's like groupby(by='a'), them min('b'), but I need to select entire row with minimum b.

Thanks.


Solution

  • You could group_by and .sort_by() + .first()

    (df.group_by("a", maintain_order=True)
       .agg(pl.all().sort_by("b").first())
    )
    
    shape: (3, 3)
    ┌─────┬─────┬─────┐
    │ a   ┆ b   ┆ c   │
    │ --- ┆ --- ┆ --- │
    │ i64 ┆ i64 ┆ i64 │
    ╞═════╪═════╪═════╡
    │ 1   ┆ 5   ┆ 2   │
    │ 2   ┆ 4   ┆ 6   │
    │ 3   ┆ 24  ┆ 10  │
    └─────┴─────┴─────┘