scala apache-spark group-by aggregate quantile

How to find quantiles inside agg() function after groupBy in Scala SPARK

I have a dataframe, in which I want to groupBy column A then find different stats like mean, min, max, std dev and quantiles.

I am able to find min, max and mean using the following code: df.groupBy("A").agg(min("B"), max("B"), mean("B")).show(50, false)

But I am unable to find the quantiles(0.25, 0.5, 0.75). I tried approxQuantile and percentile but it gives the following error:

error: not found: value approxQuantile

Solution

if you have Hive in classpath, you can use many UDAF like percentile_approx and stddev_samp, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)

You can call these functions using callUDF:

import ss.implicits._
import org.apache.spark.sql.functions.callUDF

val df = Seq(1.0,2.0,3.0).toDF("x")

df.groupBy()
  .agg(
    callUDF("percentile_approx",$"x",lit(0.5)).as("median"),
    callUDF("stddev_samp",$"x").as("stdev")
  )
.show()