I'm trying to write my own UDF for standard deviation on Spark 1.5 and was hoping to see the implementation from 1.6. Thanks. If that's not possible, how would I go about writing a UDF that calculates the standard deviation of a column, given its column name (in Scala)?
def stddev(columnName: String): Column = {}
You can calculate the standard deviation by composing built-in aggregate
functions inside an aggregation, like so:
import sqlContext.implicits._ // needed for toDF

val df = sc.parallelize(Seq(1, 2, 3, 4)).toDF("myCol")
df.show
>+-----+
>|myCol|
>+-----+
>| 1|
>| 2|
>| 3|
>| 4|
>+-----+
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{avg, sqrt}

// Population standard deviation: sqrt(E[X^2] - E[X]^2)
def stddev(col: Column): Column = sqrt(avg(col * col) - avg(col) * avg(col))
df.agg(stddev($"myCol")).first
> [1.118033988749895]
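Note that this is the *population* standard deviation (dividing by n). If you want the *sample* standard deviation (dividing by n − 1), which is what Spark 1.6's `stddev` returns, a sketch built from the same built-in aggregates might look like this (the name `stddevSample` is my own):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{avg, count, lit, sqrt, sum}

// Sample standard deviation: sqrt((sum(x^2) - n * mean(x)^2) / (n - 1))
def stddevSample(col: Column): Column = {
  val n = count(col)
  sqrt((sum(col * col) - n * avg(col) * avg(col)) / (n - lit(1)))
}
```

For the example above, `df.agg(stddevSample($"myCol")).first` should give about 1.291 (= sqrt(5/3)), versus 1.118 for the population version.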