
How to replace any null in pyspark df with value from the below row, same column


Let's say I have a pyspark DF:

| Column A | Column B |
| -------- | -------- |
| val1     | val1B    |
| null     | val2B    |
| val2     | null     |
| val3     | val3B    |

Can someone help me with replacing any null value in any column (for the whole df) with the value right below it? So the final table should look like this:

| Column A | Column B |
| -------- | -------- |
| val1     | val1B    |
| val2     | val2B    |
| val2     | val3B    |
| val3     | val3B    |

How could this be done? Can I get a code demo if possible? Thank you!

All I've really gotten through is numbering all the rows and filtering down to the rows that contain a null value. So I'm left with a table like this:

| Column A | Column B | row_num |
| -------- | -------- | ------- |
| null     | val2B    | 2       |
| val2     | null     | 3       |

But I don't think this step is needed. I'm stuck as to what to do.
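For illustration, the intermediate table above can be reproduced outside Spark with plain Python (the `data` literal mirrors the example DataFrame; this is just a sketch of the filtering step, not PySpark code):

```python
# Rows of the example DataFrame, with None standing in for null
data = [("val1", "val1B"), (None, "val2B"), ("val2", None), ("val3", "val3B")]

# Keep only rows containing a null, tagging each with its 1-based row number
null_rows = [(a, b, i + 1) for i, (a, b) in enumerate(data) if a is None or b is None]
# → [(None, 'val2B', 2), ('val2', None, 3)]
```

As the asker suspects, this filtering step isn't actually needed for the fill itself.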


Solution

  • Use a list comprehension to coalesce each column with the lead window function, ordering by monotonically_increasing_id so "the row below" is well defined. (Note that lead looks exactly one row down, so a null in the last row, or two consecutive nulls in the same column, would remain null.) Code below:

    from pyspark.sql.functions import coalesce, col, lead, monotonically_increasing_id
    from pyspark.sql.window import Window

    # For every column: keep its own value, or take the value from the row below when null
    w = Window.partitionBy().orderBy(monotonically_increasing_id())
    df.select(
        *[coalesce(col(x), lead(x).over(w)).alias(x) for x in df.columns]
    ).show()
    
    
    +--------+--------+
    |Column A|Column B|
    +--------+--------+
    |    val1|   val1B|
    |    val2|   val2B|
    |    val2|   val3B|
    |    val3|   val3B|
    +--------+--------+
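For readers without a Spark session handy, the per-column logic of `coalesce(col(x), lead(x).over(w))` can be sketched in plain Python (the `fill_from_below` helper is hypothetical, not part of PySpark):

```python
def fill_from_below(rows):
    """Replace None in any cell with the value from the next row, same column."""
    filled = []
    for i, row in enumerate(rows):
        new_row = []
        for j, value in enumerate(row):
            if value is None and i + 1 < len(rows):
                # Equivalent of coalesce(col, lead(col)): take the cell one row down
                new_row.append(rows[i + 1][j])
            else:
                new_row.append(value)
        filled.append(new_row)
    return filled

data = [["val1", "val1B"], [None, "val2B"], ["val2", None], ["val3", "val3B"]]
result = fill_from_below(data)
# → [['val1', 'val1B'], ['val2', 'val2B'], ['val2', 'val3B'], ['val3', 'val3B']]
```

Like the Spark version, this only looks one row down, so a null in the final row stays null.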