The task is to find the most frequent category in column "b" for each group defined by column "a". Below is an example with the desired output, computed using pandas and numpy.
import polars as pl
import pandas as pd  # needed for the pd.* type annotation below
import numpy as np
di = [
    {'a': 1, 'b': 'x'},
    {'a': 1, 'b': 't'},
    {'a': 1, 'b': 't'},
    {'a': 2, 'b': 'y'},
    {'a': 2, 'b': 'z'},
    {'a': 2, 'b': 'z'},
    {'a': 3, 'b': 'u'},
    {'a': 3, 'b': 'u'}
]
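def most_prevalent(group: pd.Series) -> str:
    # np.unique returns the sorted distinct values together with their counts
    values, counts = np.unique(group, return_counts=True)
    # pick the value with the highest count (first one wins on ties)
    return values[np.argmax(counts)]
# select column "b" explicitly so the grouping key "a" is not mixed into np.unique
print(pl.DataFrame(di).to_pandas().groupby("a")["b"].apply(most_prevalent).to_markdown(headers=["a", "b"]))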
This prints:
| a | b |
|----:|:----|
| 1 | t |
| 2 | z |
| 3 | u |
Any hints are appreciated. Thanks
After consulting a GPT-4 instance, it turns out polars does have a .mode() expression.
The answer would then be:
pl.DataFrame(di).group_by("a").agg(pl.col("b").mode()).sort("a")