Tags: math, pyspark, apache-spark-sql, calculated-columns

Mean calculation from DataFrame columns with PySpark


I'm looking for a way to do the following calculation without converting from a Spark DataFrame to a pandas DataFrame:

    mean = sum(df[A]*df[B])/sum(df[B])

A calculation based on selected columns of a Spark DataFrame can be done by splitting it into pieces, for example:

    new_col = df[A]*df[B]
    new_col = sum(new_col)
    new_col2 = sum(df[B])
    mean = new_col/new_col2

But I hope there is a more elegant way to do that. Perhaps by using Spark's .withColumn function?
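
To make the piecewise approach concrete, here is a minimal sketch of the same steps written against the Spark DataFrame API (the SparkSession variable spark, the example data, and the column names A and B are assumptions for illustration; no null values):

    from pyspark.sql import functions as F

    # Hypothetical example data; assumes an active SparkSession named `spark`
    df = spark.createDataFrame([(1, 2), (3, 4), (5, 6)], ["A", "B"])

    # The same pieces, expressed as Spark operations instead of pandas:
    df = df.withColumn("new_col", F.col("A") * F.col("B"))  # new_col = df[A]*df[B]
    sums = df.agg(
        F.sum("new_col").alias("sum_ab"),                   # sum(new_col)
        F.sum("B").alias("sum_b"),                           # sum(df[B])
    ).first()
    mean = sums["sum_ab"] / sums["sum_b"]                    # mean = new_col/new_col2

    print(mean)  # (1*2 + 3*4 + 5*6) / (2 + 4 + 6) = 44/12 ≈ 3.67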


Solution

  • You can create a new column with the product and then aggregate. Since the row count cancels out, the ratio of the two means equals the ratio of the two sums, i.e. sum(A*B)/sum(B):

    from pyspark.sql import functions as F
    
    df = spark.createDataFrame(data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]], schema=["A", "B", "C"])
    
    mean = (
        df
        .withColumn("AB", F.col("A") * F.col("B"))               # per-row product A*B
        .groupBy()                                                # single group: aggregate over the whole DataFrame
        .agg(F.mean("AB").alias("AB"), F.mean("B").alias("B"))    # mean(A*B) and mean(B)
        .withColumn("mean", F.col("AB") / F.col("B"))             # mean(A*B)/mean(B) == sum(A*B)/sum(B)
    )
    
    mean.show()
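
  • If you only need the scalar rather than a one-row DataFrame, the aggregation can also follow the original formula literally with F.sum. This is a sketch under the same assumptions as above (the example df, no nulls in A or B):

    result = (
        df
        .agg((F.sum(F.col("A") * F.col("B")) / F.sum("B")).alias("mean"))
        .first()["mean"]
    )
    
    print(result)  # with the example data: (1*2 + 1*2 + 1*2) / (2 + 2 + 2) = 6/6 = 1.0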