I have a multi-column PySpark DataFrame, and I need to convert the string columns to the correct types. For example, this is what I'm doing currently:
df = df.withColumn(col_name, col(col_name).cast('float')) \
    .withColumn(col_id, col(col_id).cast('int')) \
    .withColumn(col_city, col(col_city).cast('string')) \
    .withColumn(col_date, col(col_date).cast('date')) \
    .withColumn(col_code, col(col_code).cast('bigint'))
Is it possible to create a list with the types and pass it to all the columns at once?
You just need to have a mapping as a dictionary (or something similar) and then generate the correct select statement (you could use withColumn instead, but chaining many withColumn calls usually leads to performance problems). Something like this:
import pyspark.sql.functions as F
# map column names to their target types; extend with the rest of your columns
mapping = {'col1': 'float'}
df = ...  # your input data

# columns not in the mapping keep their current type
rest_cols = [F.col(cl) for cl in df.columns if cl not in mapping]
# columns in the mapping are cast to the requested type
conv_cols = [F.col(cl_name).cast(cl_type).alias(cl_name)
             for cl_name, cl_type in mapping.items()
             if cl_name in df.columns]
conv_df = df.select(*rest_cols, *conv_cols)
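If you want to keep the original column order instead of having the unconverted columns grouped first, you could build a single list comprehension over df.columns. This is a minimal sketch assuming the same mapping dictionary as above:

# Same idea, but preserving the original column order of df
conv_df = df.select(
    *[F.col(c).cast(mapping[c]).alias(c) if c in mapping else F.col(c)
      for c in df.columns]
)
conv_df.printSchema()  # verify that the columns now have the expected types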