Let's say I have a PySpark DataFrame:
| Column A | Column B |
| -------- | -------- |
| val1 | val1B |
| null | val2B |
| val2 | null |
| val3 | val3B |
Can someone help me with replacing any null value in any column (for the whole df) with the value right below it? So the final table should look like this:
| Column A | Column B |
| -------- | -------- |
| val1 | val1B |
| val2 | val2B |
| val2 | val3B |
| val3 | val3B |
How could this be done? Can I get a code demo if possible? Thank you!
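For reference, a minimal sketch to reproduce the sample DataFrame above (assuming an active SparkSession named spark):

sample_data = [("val1", "val1B"), (None, "val2B"), ("val2", None), ("val3", "val3B")]
df = spark.createDataFrame(sample_data, ["Column A", "Column B"])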
All I've really managed so far is numbering the rows and building a condition to find the row numbers that have null values, which leaves me with a table like this:
| Column A | Column B | row_num |
| -------- | -------- | ------- |
| null | val2B | 2 |
| val2 | null | 3 |
But I don't think this step is actually needed, and I'm stuck on what to do next.
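For context, here's roughly the code behind that table (a sketch assuming the same df as above; note that monotonically_increasing_id only guarantees increasing ids, so it works for ordering but the ids themselves aren't consecutive):

from pyspark.sql import Window
from pyspark.sql.functions import col, monotonically_increasing_id, row_number

# Number every row, then keep only rows with a null in either column
w = Window.orderBy(monotonically_increasing_id())
df.withColumn("row_num", row_number().over(w)) \
  .filter(col("Column A").isNull() | col("Column B").isNull()) \
  .show()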
Use a list comprehension to coalesce each column with the lead window function. Code below:
from pyspark.sql import Window
from pyspark.sql.functions import coalesce, col, lead, monotonically_increasing_id

# For each column, fall back to the value one row below whenever the current value is null
w = Window.orderBy(monotonically_increasing_id())
df.select(*[coalesce(col(x), lead(x).over(w)).alias(x) for x in df.columns]).show()
+--------+--------+
|Column A|Column B|
+--------+--------+
| val1| val1B|
| val2| val2B|
| val2| val3B|
| val3| val3B|
+--------+--------+
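One caveat worth noting: lead only looks one row down, so two consecutive nulls in the same column would leave the first of them unfilled. If that can happen in your data, a sketch of an alternative is to take the first non-null value at or below each row, using first with ignorenulls over an unbounded-following window:

from pyspark.sql import Window
from pyspark.sql.functions import first, monotonically_increasing_id

# First non-null value from the current row downwards, per column
w = (Window.orderBy(monotonically_increasing_id())
           .rowsBetween(Window.currentRow, Window.unboundedFollowing))
df.select(*[first(x, ignorenulls=True).over(w).alias(x) for x in df.columns]).show()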