I want to apply .withColumn dynamically to my Spark DataFrame, using column names held in a list.
from pyspark.sql.functions import col
from pyspark.sql.types import BooleanType
def get_dtype(dataframe, colname):
    return [dtype for name, dtype in dataframe.dtypes if name == colname][0]
def get_matches(dataframe):
    return [x for x in dataframe.columns if get_dtype(dataframe, x) == 'tinyint']
matches = get_matches(srcpartyaddressDF)
matches
The code above gives me the list of columns whose data type is 'tinyint'.
Result:
Out[67]: ['verified_flag', 'standard_flag', 'overseas_flag', 'active']
Now I want to do the following dynamically for each column in the list matches:
partyaddressDF = (
    srcpartyaddressDF
    .withColumn("verified_flag", col("verified_flag").cast(BooleanType()))
    .withColumn("standard_flag", col("standard_flag").cast(BooleanType()))
    .withColumn("overseas_flag", col("overseas_flag").cast(BooleanType()))
    .withColumn("active", col("active").cast(BooleanType()))
)
How can this be achieved in Python 3?
You can do something like this:
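# reduce is a builtin in Python 2; in Python 3 it must be imported from functools
from functools import reduce
from pyspark.sql.functions import col
from pyspark.sql.types import BooleanType
def do_cast(df, cl):
    # replace column cl with a copy of itself cast to boolean
    return df.withColumn(cl, col(cl).cast(BooleanType()))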
matches = ['verified_flag', 'standard_flag', 'overseas_flag', 'active']
partyaddressDF = reduce(do_cast, matches, srcpartyaddressDF)
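Basically, reduce takes the initial value (srcpartyaddressDF) and calls do_cast on it with the first item from the list (a column name). It then calls do_cast again with the result of that first call and the second item from the list, then with that result and the third item, and so on. The result of the last call is assigned to partyaddressDF.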
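If the fold is hard to picture, here is a minimal sketch that runs without a Spark session. It assumes a plain dict of column name to dtype string as a hypothetical stand-in for the DataFrame, and its do_cast only relabels the dtype. It is meant to show that reduce does the same thing as an explicit loop, not how Spark casts columns.

```python
from functools import reduce

# Hypothetical stand-in for a DataFrame: a dict mapping column name -> dtype.
# It is used only so the example can run without Spark.
def do_cast(schema, cl):
    """Return a new schema in which column `cl` is marked as boolean."""
    updated = dict(schema)  # copy, so the input is left unchanged (like withColumn)
    updated[cl] = "boolean"
    return updated

src_schema = {"id": "int", "verified_flag": "tinyint", "active": "tinyint"}
matches = ["verified_flag", "active"]

# reduce(f, items, initial) evaluates f(f(initial, items[0]), items[1]), ...
folded = reduce(do_cast, matches, src_schema)

# The same computation written as an explicit loop.
looped = src_schema
for cl in matches:
    looped = do_cast(looped, cl)

print(folded == looped)
```

With a real DataFrame the loop form would presumably be the same three lines, with srcpartyaddressDF as the starting value and the withColumn-based do_cast from the answer.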