I have a dataframe, in which I want to groupBy column A then find different stats like mean, min, max, std dev and quantiles.
I am able to find min, max and mean using the following code:
df.groupBy("A").agg(min("B"), max("B"), mean("B")).show(50, false)
But I am unable to find the quantiles(0.25, 0.5, 0.75). I tried approxQuantile and percentile but it gives the following error:
error: not found: value approxQuantile
if you have Hive in classpath, you can use many UDAF like percentile_approx and stddev_samp, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
You can call these functions using callUDF
:
import ss.implicits._
import org.apache.spark.sql.functions.callUDF
val df = Seq(1.0,2.0,3.0).toDF("x")
df.groupBy()
.agg(
callUDF("percentile_approx",$"x",lit(0.5)).as("median"),
callUDF("stddev_samp",$"x").as("stdev")
)
.show()