Tags: python, apache-spark, pyspark, apache-spark-sql

Is it possible to cast multiple columns of a dataframe in pyspark?


I have a multi-column PySpark DataFrame, and I need to convert the string columns to their correct types.

Currently I'm doing it like this:

df = df.withColumn(col_name, col(col_name).cast('float')) \
    .withColumn(col_id, col(col_id).cast('int')) \
    .withColumn(col_city, col(col_city).cast('string')) \
    .withColumn(col_date, col(col_date).cast('date')) \
    .withColumn(col_code, col(col_code).cast('bigint'))

Is it possible to create a list of the types and apply it to all the columns at once?


Solution

  • You just need a mapping of column names to target types, for example as a dictionary, and then generate the correct select statement (you could use withColumn, but chaining many of them usually leads to performance problems). Something like this (a usage sketch follows the snippet):

    import pyspark.sql.functions as F

    mapping = {'col1': 'float', ....}   # column name -> target type
    df = ....                           # your input data

    # columns that keep their current type
    rest_cols = [F.col(cl) for cl in df.columns if cl not in mapping]
    # columns to cast, keeping their original names
    conv_cols = [F.col(cl_name).cast(cl_type).alias(cl_name)
                 for cl_name, cl_type in mapping.items()
                 if cl_name in df.columns]
    conv_df = df.select(*rest_cols, *conv_cols)
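
    As a quick usage sketch (assuming an active SparkSession named spark; the column names, sample values, and mapping below are made up for illustration), only the mapped columns change type:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # toy data: every column starts out as a string
    df = spark.createDataFrame(
        [("1.5", "42", "Boston")],
        ["col1", "col_id", "col_city"],
    )
    mapping = {"col1": "float", "col_id": "int"}

    rest_cols = [F.col(cl) for cl in df.columns if cl not in mapping]
    conv_cols = [F.col(cl_name).cast(cl_type).alias(cl_name)
                 for cl_name, cl_type in mapping.items()
                 if cl_name in df.columns]
    conv_df = df.select(*rest_cols, *conv_cols)

    conv_df.printSchema()
    # root
    #  |-- col_city: string (nullable = true)
    #  |-- col1: float (nullable = true)
    #  |-- col_id: integer (nullable = true)

    Note that the select puts the untouched columns first and the cast columns after them; reorder the two lists if you need to preserve the original column order.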