Tags: math, pyspark, apache-spark-sql, calculated-columns

Mean calculation from DataFrame columns with PySpark


I'm looking for a way to do the following calculation without converting from a Spark DataFrame to a pandas DataFrame:

    mean = sum(df[A]*df[B])/sum(df[B])

A calculation based on selected columns of a Spark DataFrame can be done by splitting it into pieces, for example:

    new_col = df[A]*df[B]
    new_col = sum(new_col)
    new_col2 = sum(df[B])
    mean = new_col/new_col2

But I hope there is a more elegant way to do that. Perhaps by using Spark's .withColumn function?
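
To make the piecewise approach concrete, here is a minimal sketch of the same steps written against the Spark DataFrame API (the SparkSession variable spark, the example data, and the column names A and B are assumptions for illustration; no null values):

    from pyspark.sql import functions as F

    # Hypothetical example data; assumes an active SparkSession named `spark`
    df = spark.createDataFrame([(1, 2), (3, 4), (5, 6)], ["A", "B"])

    # The same pieces, expressed as Spark operations instead of pandas:
    df = df.withColumn("new_col", F.col("A") * F.col("B"))  # new_col = df[A]*df[B]
    sums = df.agg(
        F.sum("new_col").alias("sum_ab"),                   # sum(new_col)
        F.sum("B").alias("sum_b"),                           # sum(df[B])
    ).first()
    mean = sums["sum_ab"] / sums["sum_b"]                    # mean = new_col/new_col2

    print(mean)  # (1*2 + 3*4 + 5*6) / (2 + 4 + 6) = 44/12 ≈ 3.67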


Solution

  • You can create a new column with the product and then aggregate. Since the row count cancels out, the ratio of the two means equals the ratio of the two sums, i.e. sum(A*B)/sum(B):

    from pyspark.sql import functions as F
    
    df = spark.createDataFrame(data=[[1, 2, 3], [1, 2, 3], [1, 2, 3]], schema=["A", "B", "C"])
    
    mean = (
        df
        .withColumn("AB", F.col("A") * F.col("B"))               # per-row product A*B
        .groupBy()                                                # single group: aggregate over the whole DataFrame
        .agg(F.mean("AB").alias("AB"), F.mean("B").alias("B"))    # mean(A*B) and mean(B)
        .withColumn("mean", F.col("AB") / F.col("B"))             # mean(A*B)/mean(B) == sum(A*B)/sum(B)
    )
    
    mean.show()
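
  • If you only need the scalar rather than a one-row DataFrame, the aggregation can also follow the original formula literally with F.sum. This is a sketch under the same assumptions as above (the example df, no nulls in A or B):

    result = (
        df
        .agg((F.sum(F.col("A") * F.col("B")) / F.sum("B")).alias("mean"))
        .first()["mean"]
    )
    
    print(result)  # with the example data: (1*2 + 1*2 + 1*2) / (2 + 2 + 2) = 6/6 = 1.0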