Not able to get the average and standard deviation across multiple columns per row in PySpark


root
 |-- cores: long (nullable = true)
 |-- time0: double (nullable = true)
 |-- time1: double (nullable = true)
 |-- time2: double (nullable = true)

+-----+------------------+------------------+-----------------+
|cores|time0             |time1             |time2            |
+-----+------------------+------------------+-----------------+
|1    |26.362340927124023|25.891045093536377|26.19786810874939|
|2    |28.445404767990112|32.81148290634155 |30.37511706352234|
|4    |29.17068886756897 |28.47817611694336 |29.78126311302185|
+-----+------------------+------------------+-----------------+

I want a resulting dataframe that also includes a mean column and a standard-deviation column.

df_mean_stddev = df_cores.withColumn('*', F.mean(array(df_cores.columns[1:])).alias('mean'))
.withColumn(stddev(array(df_cores.columns[1:])).alias('stddev'))
df_mean_stddev.printSchema()

df_cores.show(truncate=False)

I tried the above, but I am getting an error. None of the examples I found seem to work for me, even after looking at several posts on row-wise aggregation. I am new to PySpark.


Solution

  • mean and stddev can calculate the mean and standard deviation of a column, but these functions do not work across the values of a single row.

    One approach to calculating the values per row would be to create a UDF and then use plain Python; a sketch of that approach follows the output below. But as there are only three columns in the dataset, the formulas can also be written directly in SQL:

    df.withColumn("mean", F.expr("(time0 + time1 + time2)/3")) \
        .withColumn("stddev", F.expr("sqrt((pow((time0-mean),2)+pow((time1-mean),2)+pow((time2-mean),2))/2)")) \
        .show()
    

    prints

    +-----+------------------+------------------+-----------------+------------------+-------------------+
    |cores|             time0|             time1|            time2|              mean|             stddev|
    +-----+------------------+------------------+-----------------+------------------+-------------------+
    |    1|26.362340927124023|25.891045093536377|26.19786810874939|26.150418043136597|0.23920403891711864|
    |    2|28.445404767990112| 32.81148290634155|30.37511706352234|30.544001579284668| 2.1879330570873967|
    |    4| 29.17068886756897| 28.47817611694336|29.78126311302185|29.143376032511394| 0.6519727164969239|
    +-----+------------------+------------------+-----------------+------------------+-------------------+
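
    For a larger or varying number of columns, the UDF approach mentioned above is the more general option. Below is a minimal sketch of that idea (the names time_cols, mean_udf and stddev_udf are illustrative, not part of the original answer); statistics.stdev computes the sample standard deviation, so it matches the division by 2 in the SQL formula above:

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType
    import statistics

    # Every column except "cores" holds a timing value
    time_cols = df.columns[1:]

    # Wrap plain-Python aggregates as UDFs; the row's values arrive
    # as positional arguments
    mean_udf = F.udf(lambda *xs: statistics.mean(xs), DoubleType())
    stddev_udf = F.udf(lambda *xs: statistics.stdev(xs), DoubleType())

    df.withColumn("mean", mean_udf(*time_cols)) \
        .withColumn("stddev", stddev_udf(*time_cols)) \
        .show()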