Search code examples
dataframepysparkfillna

Pyspark fillna with value from another dataframe


So what I'm trying to do is fill the Null values of the Revenue column that I have on a dataframe with the mean Revenue of each store group.

Original Dataframe:

df = spark.createDataFrame(
[
  ('Store 1', 1, 448),
  ('Store 1', 2, None),
  ('Store 1', 3, 499),
  ('Store 1', 44, 432),
   (None,None,None),
  ('Store 2', 1, 355),
  ('Store 2', 1, 355),
  ('Store 2', None, 345),
  ('Store 2', 3, 387),
  ('Store 2', 4, 312),
  ('Store 2', None, 555),
  (None, 10, None)
],
   ['Store', 'WeekInMonth', 'Revenue']
)

I have created a second dataframe from the original except in this one I'm just trying to get the means of each one

df2 = df.groupBy('Store').avg('Revenue')
df2 = df2.select("Store", col("avg(Revenue)").alias("Revenue")).where("Store = 'Store 2' or Store = 'Store 1'")
display(df2)

What I'm trying to do is fill the Null values on the first dataframe with the average value from df2 per store.


Solution

  • To replace nulls with mean revenue per store, use a window function.

    df.show()
    +-------+-----------+-------+                                                   
    |  Store|WeekInMonth|Revenue|
    +-------+-----------+-------+
    |Store 1|          1|    448|
    |Store 1|          2|   null|
    |Store 1|          3|    499|
    |Store 1|         44|    432|
    |Store 2|       null|    345|
    |Store 2|       null|    555|
    |Store 2|          1|    355|
    |Store 2|          1|    355|
    |Store 2|          3|    387|
    |Store 2|          4|    312|
    |   null|       null|   null|
    |   null|         10|   null|
    +-------+-----------+-------+
    

    Compute mean per window, and replace revenue with mean whenever it is null.

    from pyspark.sql.functions import *
    from pyspark.sql.window import Window
    from pyspark.sql import functions as F
    
    w=Window().partitionBy("Store")
    
    df.withColumn("mean", F.mean("Revenue").over(w))\
        .withColumn("Revenue", F.when(col("Revenue").isNull(), col("mean"))\
        .otherwise(col("Revenue"))).drop("mean").show()
    
    +-------+-----------+-----------------+                                         
    |  Store|WeekInMonth|          Revenue|
    +-------+-----------+-----------------+
    |Store 1|          1|            448.0|
    |Store 1|          2|459.6666666666667|
    |Store 1|          3|            499.0|
    |Store 1|         44|            432.0|
    |Store 2|       null|            345.0|
    |Store 2|       null|            555.0|
    |Store 2|          1|            355.0|
    |Store 2|          1|            355.0|
    |Store 2|          3|            387.0|
    |Store 2|          4|            312.0|
    |   null|       null|             null|
    |   null|         10|             null|
    +-------+-----------+-----------------+