How to get the most prevalent category in each group in a Polars DataFrame?


The task is to compute the category with the highest frequency from column "b", for each group determined by "a". Below is an example with the desired output, computed using pandas & numpy.

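import polars as pl
import pandas as pd
import numpy as np

di = [
 {'a': 1, 'b': 'x'},
 {'a': 1, 'b': 't'},
 {'a': 1, 'b': 't'},
 {'a': 2, 'b': 'y'},
 {'a': 2, 'b': 'z'},
 {'a': 2, 'b': 'z'},
 {'a': 3, 'b': 'u'},
 {'a': 3, 'b': 'u'}
]

def most_prevalent(group: pd.Series) -> str:
    # Count each distinct value and return the one with the highest count.
    values, counts = np.unique(group, return_counts=True)
    return values[np.argmax(counts)]

# Select column "b" so that only the categories are counted, not the group keys.
# to_pandas() needs pyarrow; to_markdown() needs tabulate.
print(pl.DataFrame(di).to_pandas().groupby("a")["b"].apply(most_prevalent).to_markdown(headers=["a", "b"]))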

This prints:

|   a | b   |
|----:|:----|
|   1 | t   |
|   2 | z   |
|   3 | u   |

Any hints are appreciated. Thanks


Solution

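  • After consulting a GPT-4 instance, it turns out Polars does have a .mode() expression. Because a group can have more than one mode, the aggregated "b" column comes back as a list column. The answer would then be: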

    pl.DataFrame(di).group_by("a").agg(pl.col("b").mode()).sort("a")