Tags: python, apache-spark, pyspark, mode

How to get the most frequent values of a DataFrame in PySpark


I am familiar with Pandas DataFrames, where I use the "mode" and "groupby" functions to get the most frequent values, like below:

df3 = df5.groupby(['band']).apply(lambda x: x.mode())

However, I am having difficulty doing the same in PySpark.

I have a Spark DataFrame as follows:

band         A3   A5   status
4G_band1800  12   18   TRUE
4G_band1800  12   18   FALSE
4G_band1800  10   18   TRUE
4G_band1800  12   12   TRUE
4g_band2300  6    24   FALSE
4g_band2300  6    22   FALSE
4g_band2300  6    24   FALSE
4g_band2300  3    24   TRUE


What I want is as follows:

band         A3   A5   status
4G_band1800  12   18   TRUE
4g_band2300  6    24   FALSE


I have tried all the combinations I could think of but haven't gotten any reasonable output. Please suggest a way.


Solution

  • Without defining your own UDAF, you can define a mode function as a UDF and apply it to the result of collect_list, as follows:

    from collections import Counter

    import pyspark.sql.functions as F

    # Note: @F.udf without an explicit return type defaults to StringType,
    # so the aggregated values come back as strings.
    @F.udf
    def mode(x):
        # Return the most common element of the collected list.
        return Counter(x).most_common(1)[0][0]

    cols = ['A3', 'A5', 'status']
    # For each column, collect all values per band and pick the mode.
    agg_expr = [mode(F.collect_list(col)).alias(col) for col in cols]
    df.groupBy('band').agg(*agg_expr).show()
    
    +-----------+---+---+------+
    |       band| A3| A5|status|
    +-----------+---+---+------+
    |4G_band1800| 12| 18|  true|
    |4g_band2300|  6| 24| false|
    +-----------+---+---+------+
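
  • Alternatively, the same result can be computed without a UDF by counting how often each value appears per group and keeping the most frequent one with a window function. This is a minimal sketch, assuming ties may be broken arbitrarily and that the per-column results can be joined back on band:

    from pyspark.sql import Window
    import pyspark.sql.functions as F

    cols = ['A3', 'A5', 'status']
    result = None
    for c in cols:
        # Count how often each value of column c appears within each band,
        # then keep the value with the highest count per band.
        w = Window.partitionBy('band').orderBy(F.desc('count'))
        per_col = (df.groupBy('band', c).count()
                     .withColumn('rn', F.row_number().over(w))
                     .filter('rn = 1')
                     .select('band', c))
        # Join the per-column winners back together on band.
        result = per_col if result is None else result.join(per_col, on='band')

    result.show()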