I have a DataFrame with a lot of columns. Some of these columns are of the type array<string>.
I need to export a sample to CSV, and CSV doesn't support arrays.
Right now I'm doing this for every array column (and sometimes I miss one or more):
df_write = df\
.withColumn('col_a', F.concat_ws(',', 'col_a'))\
.withColumn('col_g', F.concat_ws(',', 'col_g'))\
....
Is there a way to use a loop and do this for every array column without specifying them one by one?
You can check the type of each column and use a list comprehension:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType
# collect the names of all array-typed columns from the schema
arr_col = [
i.name
for i in df.schema
if isinstance(i.dataType, ArrayType)
]
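With the columns from the question this would end up as something like ['col_a', 'col_g'] (assuming those are the array columns); printing it is a quick way to confirm nothing gets missed:
print(arr_col)
# e.g. ['col_a', 'col_g'] -- illustrative, depends on your schema
Then select every column, joining the array ones into strings: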
df_write = df.select([
F.concat_ws(',', c).alias(c)  # keep the original column name in the CSV header
if c in arr_col
else F.col(c)
for c in df.columns
])
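If the goal is just a small CSV sample, a minimal sketch of writing it out could look like the following (the sample fraction, seed, and output path are placeholders, not from the question):
(df_write
    .sample(fraction=0.01, seed=42)   # take a small sample, as described in the question
    .coalesce(1)                      # single output file, easier to inspect
    .write
    .option('header', True)
    .mode('overwrite')
    .csv('/tmp/array_cols_sample'))   # placeholder output path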
Actually, you don't need to use concat_ws. You can just cast all columns to string type before writing to CSV, e.g.
df_write = df.select([F.col(c).cast('string') for c in df.columns])
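Note that the two approaches produce different string representations for the array values. A quick sketch of the difference, using a made-up one-column DataFrame (the exact cast output may vary slightly between Spark versions):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
demo = spark.createDataFrame([(['a', 'b', 'c'],)], ['col_a'])

demo.select(
    F.concat_ws(',', 'col_a').alias('joined'),     # 'a,b,c'
    F.col('col_a').cast('string').alias('casted')  # '[a, b, c]'
).show()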