Best way of sorting a group_by expression when taking first row


I would like to group a dataframe by "foo", and take the first value of the sorted group.

I have one solution, but it sorts the entire dataframe, whereas I believe it would be much faster to sort within each group. There would be more sort calls, but the size n of each group is much smaller, and sorting costs O(n log n).

import polars as pl

df = pl.DataFrame({"foo": [1, 1, 1, 2, 2, 2, 3], "bar": [5, 7, 6, 4, 2, 3, 1]})

df_desired = pl.DataFrame({"foo": [1, 2, 3], "bar": [5, 2, 1]})

df_solution = df.sort("bar").group_by("foo", maintain_order=True).first().sort(by="foo")

assert df_desired.equals(df_solution)

My suggestion would be a method that would sort each group. Does this sort of thing exist?

df_suggestion = df.group_by("foo").<sort_groupby(by="bar")>.first()

Solution

  • Try sorting inside the aggregation, so that each group is sorted on its own:

    df.group_by("foo").agg(pl.col("bar").sort().first()).sort(by="foo")