Search code examples
apache-sparkpysparkapache-spark-sqlmodesparkr

Get mode (most often) value in Spark column with groupBy


I have a SparkR DataFrame and I want to get the mode (most often) value for each unique name. How can I do this? There doesn't seem to be a built-in mode function. Either a SparkR or PySpark solution will do.

# Create DF
df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
  value = c(5, 5, 4, 3, 3, 7))
DF <- createDataFrame(df)

name   | value
-----------------
Thomas |  5
Thomas |  5
Thomas |  4
Bill   |  3
Bill   |  3
Bill   |  9

# What I want to get
name   | mode(value)
-----------------
Thomas |   5
Bill   |   3 

Solution

  • You could achieve that using combination of .groupBy() and window methods like this:

    grouped = df.groupBy('name', 'value').count()
    window = Window.partitionBy("name").orderBy(desc("count"))
    grouped\
        .withColumn('order', row_number().over(window))\
        .where(col('order') == 1)\
        .show()
    

    outputs:

    +------+-----+-----+-----+
    |  name|value|count|order|
    +------+-----+-----+-----+
    |  Bill|    3|    2|    1|
    |Thomas|    5|    2|    1|
    +------+-----+-----+-----+