I would like to group a dataframe by "foo", and take the first value of the sorted group.
I have one solution, but it involves sorting the entire dataframe, whereas I believe it would be much faster to sort within each group. There would be more sorts, but the size n of each group is much smaller, and sorting scales as n*log(n).
import polars as pl

df = pl.DataFrame({"foo": [1, 1, 1, 2, 2, 2, 3], "bar": [5, 7, 6, 4, 2, 3, 1]})
df_desired = pl.DataFrame({"foo": [1, 2, 3], "bar": [5, 2, 1]})
df_solution = df.sort("bar").group_by("foo", maintain_order=True).first().sort(by="foo")
assert df_desired.equals(df_solution)
My suggestion would be a method that would sort each group. Does this sort of thing exist?
df_suggestion = df.group_by("foo").<sort_groupby(by="bar")>.first()
Try:
df.group_by("foo").agg(pl.col("bar").sort().first()).sort(by="foo")