
I want to fill missing values with the last non-null row value in PySpark:


My df has multiple columns.

Query I tried:

    df = df.withColumn('Column_required', F.when(df.Column_present > 1, df.Column_present).otherwise(lag(df.Column_present)))

The otherwise part does not work. This is the column I want to operate on:

Column_present       Column_required
40000                 40000
Null                  40000
Null                  40000
500                   500
Null                  500
Null                  500

Solution

  • I think the solution is to use last instead of lag:

    from pyspark.sql import Window
    from pyspark.sql.functions import last, when

    df_new = spark.createDataFrame([
        (1, 40000), (2, None), (3, None), (4, None),
        (5, 500), (6, None), (7, None)
    ], ("id", "Col_present"))

    df_new.withColumn(
        'Column_required',
        when(df_new.Col_present > 1, df_new.Col_present)
        .otherwise(last(df_new.Col_present, ignorenulls=True).over(Window.orderBy("id")))
    ).show()
    

    This will produce your desired output:

    +---+-----------+---------------+
    | id|Col_present|Column_required|
    +---+-----------+---------------+
    |  1|      40000|          40000|
    |  2|       null|          40000|
    |  3|       null|          40000|
    |  4|       null|          40000|
    |  5|        500|            500|
    |  6|       null|            500|
    |  7|       null|            500|
    +---+-----------+---------------+
    

    But be aware that the window function requires a column to perform the sorting; that's why I used the id column in the example. If your dataframe does not contain a sortable column, you can create one yourself with monotonically_increasing_id().