Tags: pyspark, export-to-csv, apache-zeppelin

PySpark on Zeppelin: unable to export to CSV format?


I'm trying to export a dataframe to a .csv file in an S3 bucket.

Unfortunately it is being saved in Parquet format instead.

Can someone please let me know how to export a PySpark dataframe to a .csv file?

I tried the code below: predictions.select("probability").write.format('csv').csv('s3a://bucketname/output/x1.csv')

It throws this error: CSV data source does not support struct,values:array> data type.

Any help is appreciated.

Note: my Spark setup runs on Zeppelin.

Thanks, Naseer


Solution

  • `probability` is an array column (it contains multiple values) and needs to be converted to a string before you can save it to CSV. One way to do that is with a udf (user-defined function):

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType
    
    # Render the array as a bracketed, comma-separated string, e.g. [0.1,0.9]
    def string_from_array(input_list):
        return '[' + ','.join(str(item) for item in input_list) + ']'
    
    ats_udf = udf(string_from_array, StringType())
    
    predictions = predictions.withColumn('probability_string', ats_udf(col("probability")))
    
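    The helper itself is plain Python, so you can sanity-check its output locally before registering it as a udf (a quick check that needs no Spark session):

    ```python
    # Same helper as above: joins the values with commas and wraps them in brackets.
    def string_from_array(input_list):
        return '[' + ','.join(str(item) for item in input_list) + ']'

    print(string_from_array([0.25, 0.75]))  # prints [0.25,0.75]
    print(string_from_array([]))            # prints []
    ```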

    Then you can save your dataset:

    predictions.select("probability_string").write.csv('s3a://bucketname/output/x1.csv')
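    Keep in mind that Spark writes `x1.csv` as a directory of part files, not a single CSV file. If you want a single file, you can coalesce to one partition before writing (a sketch against the same bucket path; it needs a live Spark session and S3 credentials, so it is illustration only):

    ```python
    # coalesce(1) forces a single part file inside the output directory;
    # header/mode options add a header row and overwrite any previous run.
    (predictions
        .select("probability_string")
        .coalesce(1)
        .write
        .option("header", True)
        .mode("overwrite")
        .csv("s3a://bucketname/output/x1.csv"))
    ```

    Coalescing to one partition funnels the whole dataset through a single task, so only do this when the output is small enough to fit comfortably on one executor.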