
Dynamic Columns .withColumn Python DataFrame


I want to apply .withColumn dynamically to my Spark DataFrame, using column names held in a list.

from pyspark.sql.functions import col 
from pyspark.sql.types import BooleanType

def get_dtype(dataframe, colname):
    return [dtype for name, dtype in dataframe.dtypes if name == colname][0]

def get_matches(dataframe):
    return [x for x in dataframe.columns if get_dtype(dataframe, x) == 'tinyint']

matches = get_matches(srcpartyaddressDF)
matches

The code above gives me the list of columns whose data type is 'tinyint'.

Result:

Out[67]: ['verified_flag', 'standard_flag', 'overseas_flag', 'active']

Now I want to do the following dynamically for each column in the list matches:

partyaddressDF = (
    srcpartyaddressDF
    .withColumn("verified_flag", col("verified_flag").cast(BooleanType()))
    .withColumn("standard_flag", col("standard_flag").cast(BooleanType()))
    .withColumn("overseas_flag", col("overseas_flag").cast(BooleanType()))
    .withColumn("active", col("active").cast(BooleanType()))
)

How can this be achieved in Python 3?


Solution

  • You can do something like this:

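    # In Python 3, reduce must be imported from functools (it was a builtin in Python 2)
    from functools import reduce

    from pyspark.sql.functions import col
    from pyspark.sql.types import BooleanType

    def do_cast(df, cl):
        # Return a new DataFrame with column cl cast to boolean
        return df.withColumn(cl, col(cl).cast(BooleanType()))

    matches = ['verified_flag', 'standard_flag', 'overseas_flag', 'active']
    # Fold do_cast over the column names, starting from srcpartyaddressDF
    partyaddressDF = reduce(do_cast, matches, srcpartyaddressDF)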
    

    Basically, reduce takes the initial value (srcpartyaddressDF) and calls do_cast on it with the first item from the list (a column name). It then calls do_cast again with the result of that first call and the second item, then with that result and the third item, and so on until the list is exhausted.
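
    To make the folding behaviour concrete, here is a minimal sketch that runs without Spark. It assumes a plain dict of column name to type as a stand-in for the DataFrame schema; the dict, the "id" column and this version of do_cast are illustrative only, not part of the PySpark API. It shows that reduce gives the same result as an explicit loop that reassigns the variable on each step.

```python
from functools import reduce

def do_cast(schema, column):
    # Illustrative stand-in for df.withColumn(column, col(column).cast(BooleanType())):
    # returns a NEW mapping with one column's type changed, leaving the input untouched.
    return {**schema, column: "boolean"}

# Hypothetical schema standing in for srcpartyaddressDF.dtypes
schema = {"id": "int", "verified_flag": "tinyint", "active": "tinyint"}
matches = [name for name, dtype in schema.items() if dtype == "tinyint"]

# reduce threads the accumulated result through do_cast, one column at a time
folded = reduce(do_cast, matches, schema)

# The same thing written as an explicit loop
looped = schema
for column in matches:
    looped = do_cast(looped, column)

assert folded == looped
```

    With a real DataFrame the loop form would be df = df.withColumn(...) inside the for loop; which of the two forms to use is mostly a matter of taste.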