Tags: scala, apache-spark, apache-spark-sql

How to compute the element-wise average of an array-type column across all rows in Scala Spark?


I have an array column with 512 double elements and want the element-wise average across all rows. Take an array column of length 3 as an example:

// Build a sample DataFrame and split each string into an array column
// (assumes import spark.implicits._ and org.apache.spark.sql.functions.split)
val x = Seq("2 4 6", "0 0 0").toDF("value").withColumn("value", split($"value", " "))
x.printSchema()
x.show()


root
 |-- value: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---------+
|    value|
+---------+
|[2, 4, 6]|
|[0, 0, 0]|
+---------+
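
Note that split yields array<string> here; avg implicitly casts the strings to doubles, which is why the averages in the solution below come out as 1.0, 2.0, 3.0. If you want the numeric schema up front, a minimal sketch ("array<double>" is a standard Spark DDL type string; xd is just an illustrative name):

// Optional: make the element type explicit by casting to array<double>
val xd = x.withColumn("value", $"value".cast("array<double>"))
xd.printSchema()  // value: array (element: double)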

The following result is desired:

x.select(..... as "avg_value").show()

+---------+
|avg_value|
+---------+
|[1, 2, 3]|
+---------+

Solution

  • Treat each array element as a column, compute the average of each, then assemble those averages back into an array (a size-agnostic variant follows the output below):

    val array_size = 3
    // One avg aggregation per array position
    val avgAgg = for (i <- 0 until array_size) yield avg($"value".getItem(i))
    x.select(array(avgAgg: _*).alias("avg_value")).show(false)
    

    Gives:

    +---------------+
    |avg_value      |
    +---------------+
    |[1.0, 2.0, 3.0]|
    +---------------+
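
    For the 512-element case in the question, hardcoding array_size is brittle. A sketch of one alternative, assuming every row holds an equal-length array: derive the size from the data, or skip per-index columns entirely by exploding with position (only standard Spark SQL functions are used; result is a name of my choosing):

    import org.apache.spark.sql.functions.{avg, collect_list, posexplode, size, sort_array, struct}

    // Derive the array length from the data instead of hardcoding it
    // (assumes all rows have equal-length arrays)
    val array_size = x.select(size($"value")).as[Int].first()

    // Alternative: explode with position, average per position, then
    // reassemble into a single array ordered by position
    val result = x
      .select(posexplode($"value"))               // columns: pos, col
      .groupBy($"pos")
      .agg(avg($"col").as("avg"))                 // avg casts the strings to double
      .agg(sort_array(collect_list(struct($"pos", $"avg"))).as("tmp"))
      .select($"tmp.avg".as("avg_value"))         // field access on array<struct> yields array<double>

    result.show(false)                            // [1.0, 2.0, 3.0] for the example above

    sort_array is needed because collect_list gives no ordering guarantee; sorting the (pos, avg) structs restores the original element order.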