Tags: arrays, apache-spark, pyspark, apache-spark-sql, export-to-csv

Export array<string> columns to CSV using PySpark without specifying them one by one?


I have a DataFrame with a lot of columns, some of which are of type array<string>.
I need to export a sample to CSV, and CSV doesn't support array columns. Right now I'm handling every array column by hand (and sometimes I miss one or more):

import pyspark.sql.functions as F

df_write = df \
    .withColumn('col_a', F.concat_ws(',', 'col_a')) \
    .withColumn('col_g', F.concat_ws(',', 'col_g')) \
    ....

Is there a way to use a loop and do this for every array column without specifying them one by one?


Solution

  • You can check the data type of each column in the schema and use a list comprehension:

    import pyspark.sql.functions as F
    from pyspark.sql.types import ArrayType
    
    # collect the names of all array-typed columns from the schema
    arr_col = [
        f.name
        for f in df.schema
        if isinstance(f.dataType, ArrayType)
    ]
    
    # join array columns into comma-separated strings (keeping the original
    # column names via alias); leave all other columns untouched
    df_write = df.select([
        F.concat_ws(',', c).alias(c)
        if c in arr_col
        else F.col(c)
        for c in df.columns
    ])
    

    Actually, you don't need concat_ws at all: you can just cast every column to string before writing to CSV (note that in recent Spark versions an array cast to string comes out in a bracketed form like [a, b, c] rather than a plain comma-separated list), e.g.

    df_write = df.select([F.col(c).cast('string') for c in df.columns])
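
    For context, here is a minimal, self-contained sketch of the list-comprehension approach end to end; the toy DataFrame, column names, and output path are made up for illustration:

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession
    from pyspark.sql.types import ArrayType
    
    spark = SparkSession.builder.getOrCreate()
    
    # toy DataFrame with one plain column and one array<string> column (illustrative only)
    df = spark.createDataFrame(
        [(1, ["a", "b"]), (2, ["c"])],
        ["id", "tags"],
    )
    
    # find the array columns and flatten them, keeping the original column names
    arr_col = [f.name for f in df.schema if isinstance(f.dataType, ArrayType)]
    df_write = df.select([
        F.concat_ws(',', c).alias(c) if c in arr_col else F.col(c)
        for c in df.columns
    ])
    
    # write the now CSV-safe DataFrame; the path is a placeholder
    df_write.write.mode("overwrite").option("header", True).csv("/tmp/sample_csv")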