I'm trying to fill the null values in the Revenue column of a dataframe with the mean Revenue of each store group.
Original Dataframe:
df = spark.createDataFrame(
    [
        ('Store 1', 1, 448),
        ('Store 1', 2, None),
        ('Store 1', 3, 499),
        ('Store 1', 44, 432),
        (None, None, None),
        ('Store 2', 1, 355),
        ('Store 2', 1, 355),
        ('Store 2', None, 345),
        ('Store 2', 3, 387),
        ('Store 2', 4, 312),
        ('Store 2', None, 555),
        (None, 10, None)
    ],
    ['Store', 'WeekInMonth', 'Revenue']
)
I have created a second dataframe from the original that holds the mean Revenue per store:
from pyspark.sql.functions import col

df2 = df.groupBy('Store').avg('Revenue')
df2 = df2.select('Store', col('avg(Revenue)').alias('Revenue')).where("Store = 'Store 1' or Store = 'Store 2'")
display(df2)
What I want is to fill the null Revenue values in the first dataframe with the average value from df2 for the matching store.
To replace nulls with the mean revenue per store, use a window function.
df.show()
+-------+-----------+-------+
| Store|WeekInMonth|Revenue|
+-------+-----------+-------+
|Store 1| 1| 448|
|Store 1| 2| null|
|Store 1| 3| 499|
|Store 1| 44| 432|
|Store 2| null| 345|
|Store 2| null| 555|
|Store 2| 1| 355|
|Store 2| 1| 355|
|Store 2| 3| 387|
|Store 2| 4| 312|
| null| null| null|
| null| 10| null|
+-------+-----------+-------+
Compute the mean over each window, and replace Revenue with that mean whenever it is null.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("Store")
df.withColumn("mean", F.mean("Revenue").over(w))\
  .withColumn("Revenue", F.when(F.col("Revenue").isNull(), F.col("mean"))\
  .otherwise(F.col("Revenue"))).drop("mean").show()
+-------+-----------+-----------------+
| Store|WeekInMonth| Revenue|
+-------+-----------+-----------------+
|Store 1| 1| 448.0|
|Store 1| 2|459.6666666666667|
|Store 1| 3| 499.0|
|Store 1| 44| 432.0|
|Store 2| null| 345.0|
|Store 2| null| 555.0|
|Store 2| 1| 355.0|
|Store 2| 1| 355.0|
|Store 2| 3| 387.0|
|Store 2| 4| 312.0|
| null| null| null|
| null| 10| null|
+-------+-----------+-----------------+