
How to replace any null in pyspark df with value from the below row, same column


Let's say I have a pyspark DF:

| Column A | Column B |
| -------- | -------- |
| val1     | val1B    |
| null     | val2B    |
| val2     | null     |
| val3     | val3B    |

Can someone help me with replacing any null value in any column (for the whole df) with the value right below it? So the final table should look like this:

| Column A | Column B |
| -------- | -------- |
| val1     | val1B    |
| val2     | val2B    |
| val2     | val3B    |
| val3     | val3B    |

How could this be done? Can I get a code demo if possible? Thank you!

All I've really gotten through is numbering all the rows and filtering down to the rows that contain a null value. So I'm left with a table like this:

| Column A | Column B | row_num |
| -------- | -------- | ------- |
| null     | val2B    | 2       |
| val2     | null     | 3       |

But I don't think this step is needed. I'm stuck as to what to do.
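For illustration, the intermediate table above can be reproduced outside Spark with plain Python (the `data` literal mirrors the example DataFrame; this is just a sketch of the filtering step, not PySpark code):

```python
# Rows of the example DataFrame, with None standing in for null
data = [("val1", "val1B"), (None, "val2B"), ("val2", None), ("val3", "val3B")]

# Keep only rows containing a null, tagging each with its 1-based row number
null_rows = [(a, b, i + 1) for i, (a, b) in enumerate(data) if a is None or b is None]
# → [(None, 'val2B', 2), ('val2', None, 3)]
```

As the asker suspects, this filtering step isn't actually needed for the fill itself.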


Solution

  • Use a list comprehension to coalesce each column with the lead window function, ordering by monotonically_increasing_id so "the row below" is well defined. (Note that lead looks exactly one row down, so a null in the last row, or two consecutive nulls in the same column, would remain null.) Code below:

    from pyspark.sql.functions import coalesce, col, lead, monotonically_increasing_id
    from pyspark.sql.window import Window

    # For every column: keep its own value, or take the value from the row below when null
    w = Window.partitionBy().orderBy(monotonically_increasing_id())
    df.select(
        *[coalesce(col(x), lead(x).over(w)).alias(x) for x in df.columns]
    ).show()
    
    
    +--------+--------+
    |Column A|Column B|
    +--------+--------+
    |    val1|   val1B|
    |    val2|   val2B|
    |    val2|   val3B|
    |    val3|   val3B|
    +--------+--------+
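For readers without a Spark session handy, the per-column logic of `coalesce(col(x), lead(x).over(w))` can be sketched in plain Python (the `fill_from_below` helper is hypothetical, not part of PySpark):

```python
def fill_from_below(rows):
    """Replace None in any cell with the value from the next row, same column."""
    filled = []
    for i, row in enumerate(rows):
        new_row = []
        for j, value in enumerate(row):
            if value is None and i + 1 < len(rows):
                # Equivalent of coalesce(col, lead(col)): take the cell one row down
                new_row.append(rows[i + 1][j])
            else:
                new_row.append(value)
        filled.append(new_row)
    return filled

data = [["val1", "val1B"], [None, "val2B"], ["val2", None], ["val3", "val3B"]]
result = fill_from_below(data)
# → [['val1', 'val1B'], ['val2', 'val2B'], ['val2', 'val3B'], ['val3', 'val3B']]
```

Like the Spark version, this only looks one row down, so a null in the final row stays null.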