Search code examples
scalaapache-sparkgroup-byaggregatequantile

How to find quantiles inside agg() function after groupBy in Scala SPARK


I have a dataframe, in which I want to groupBy column A then find different stats like mean, min, max, std dev and quantiles.

I am able to find min, max and mean using the following code: df.groupBy("A").agg(min("B"), max("B"), mean("B")).show(50, false)

But I am unable to find the quantiles(0.25, 0.5, 0.75). I tried approxQuantile and percentile but it gives the following error:

error: not found: value approxQuantile


Solution

  • if you have Hive in classpath, you can use many UDAF like percentile_approx and stddev_samp, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)

    You can call these functions using callUDF:

    import ss.implicits._
    import org.apache.spark.sql.functions.callUDF
    
    val df = Seq(1.0,2.0,3.0).toDF("x")
    
    df.groupBy()
      .agg(
        callUDF("percentile_approx",$"x",lit(0.5)).as("median"),
        callUDF("stddev_samp",$"x").as("stdev")
      )
    .show()