For a sample Spark df:
sdf = ({'id_col': [25, 13, 15],
        'x2': ['b', None, '2'],
        'x3': [None, '0', '3']})
which I would like to become:
sdf = ({'id_col': [25, 13, 15],
        'x2': [1, 0, 1],
        'x3': [0, 1, 1]})
i.e. replace every null with 0 and every non-null value with 1.
Following the syntax from Filling not null values as 1 in pyspark dataframe, the replacement went through all the columns of the sdf. In my case, I need to keep the values of id_col as they are and apply the 1/0 replacement only to columns x2 and x3.
I initially tried the following. I put the x2 and x3 columns in a list:
cols = ['x2', 'x3']
then applied this:
col_selection = [when(col(c).isNull(), 0).otherwise(1).alias(c) for c in cols]
sdf.select(col_selection).show()
I get the intended 0s and 1s, but the result contains only the x2 and x3 columns, and when I call sdf afterwards the changes did not happen. If I have 17 columns to transform in a 25-column sdf, can you guide me on how to apply the for loop to only those 17 columns?
Thanks
You can do this one column at a time. Note that Spark DataFrames are immutable: select returns a new DataFrame containing only the columns you list (which is why id_col disappeared), and a transformation must be assigned back to sdf for the change to take effect. To write each column out explicitly:
from pyspark.sql import functions as F
import pandas as pd

# build the sample data as a pandas DataFrame, then convert to Spark
pdf = pd.DataFrame({'id_col': [25, 13, 15],
                    'x2': ['b', None, '2'],
                    'x3': [None, '0', '3']})
sdf = spark.createDataFrame(pdf)

# 0 where the value is null, 1 otherwise; id_col is left untouched
sdf.withColumns({"x2": F.when(F.col("x2").isNull(), 0).otherwise(1),
                 "x3": F.when(F.col("x3").isNull(), 0).otherwise(1)}).show()
#output
+------+---+---+
|id_col| x2| x3|
+------+---+---+
| 25| 1| 0|
| 13| 0| 1|
| 15| 1| 1|
+------+---+---+
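With many columns you don't have to spell out each dictionary entry by hand; here is a sketch of the same call built with a dict comprehension (cols_to_flag is an assumed name for the list of columns to convert):
# build the withColumns mapping programmatically
cols_to_flag = ['x2', 'x3']  # assumed: the columns to turn into 0/1 flags
sdf.withColumns({c: F.when(F.col(c).isNull(), 0).otherwise(1) for c in cols_to_flag}).show()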
Note: withColumns is only available in Spark >= 3.3.0.
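On older Spark versions you can get the same result with a single select that passes every other column through unchanged; a minimal sketch, assuming cols_to_flag from above:
# binarize the target columns, keep all other columns as-is
sdf.select(*[F.when(F.col(c).isNull(), 0).otherwise(1).alias(c)
             if c in cols_to_flag else F.col(c)
             for c in sdf.columns]).show()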
To do it in a loop:
cols = ['x2', 'x3']
for c in cols:
    sdf = sdf.withColumn(c, F.when(F.col(c).isNull(), 0).otherwise(1))
sdf.show()
#output
+------+---+---+
|id_col| x2| x3|
+------+---+---+
| 25| 1| 0|
| 13| 0| 1|
| 15| 1| 1|
+------+---+---+
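For the 17-of-25 case from your question, it is easier to derive the list from sdf.columns than to type 17 names; a sketch, assuming id_col is the only column to leave untouched:
keep_as_is = ['id_col']  # assumed: columns that should keep their values
cols_to_flag = [c for c in sdf.columns if c not in keep_as_is]
for c in cols_to_flag:
    sdf = sdf.withColumn(c, F.when(F.col(c).isNull(), 0).otherwise(1))
sdf.show()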