Search code examples
apache-sparkdataframeaggregate-functionswindow-functions

Spark : Is there differences between agg function and a window function on a spark dataframe?


I want to apply a sum on a column in spark Dataframe (Spark 2.1). I have two ways to do that:

1- With a Window function:

val windowing = Window.partitionBy("id")
dataframe
.withColumn("sum", sum(col("column_1")) over windowing)

2- With the agg function:

dataframe
.groupBy("id")
.agg(sum(col("column_1")).alias("sum"))

What is the best way to do that in term of performances? And what's the differences between these two methods?


Solution

  • You can use aggregation functions both within a window (your first case), or when grouping (your second case). The difference is that with a window, each row will be associated with the result of the aggregation computed on its entire window. When grouping however, each group will be associated with the result of the aggregation on that group (a group of rows becomes only one row).

    In your situation, you would get this.

    val dataframe = spark.range(6).withColumn("key", 'id % 2)
    dataframe.show
    +---+---+
    | id|key|
    +---+---+
    |  0|  0|
    |  1|  1|
    |  2|  0|
    |  3|  1|
    |  4|  0|
    |  5|  1|
    +---+---+
    

    Case 1: windowing

    val windowing = Window.partitionBy("key")
    dataframe.withColumn("sum", sum(col("id")) over windowing).show
    +---+---+---+                                                                   
    | id|key|sum|
    +---+---+---+
    |  0|  0|  6|
    |  2|  0|  6|
    |  4|  0|  6|
    |  1|  1|  9|
    |  3|  1|  9|
    |  5|  1|  9|
    +---+---+---+
    

    Case 2: grouping

    dataframe.groupBy("key").agg(sum('id)).show
    +---+-------+
    |key|sum(id)|
    +---+-------+
    |  0|      6|
    |  1|      9|
    +---+-------+