I'm trying to export a DataFrame to a .csv file in an S3 bucket.
Unfortunately it is being saved as parquet files instead.
Can someone please let me know how to export a PySpark DataFrame to a .csv file?
I tried the code below:
predictions.select("probability").write.format('csv').csv('s3a://bucketname/output/x1.csv')
It throws this error: CSV data source does not support struct<...,values:array<...>> data type.
Any help is appreciated.
Note: my Spark setup runs in Zeppelin.
Thanks, Naseer
probability is a vector/array column (it contains multiple values) and needs to be converted to a string before you can save it to CSV. One way to do it is with a UDF (user-defined function):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def string_from_array(input_list):
    # Join the values into a comma-separated, bracketed string, e.g. [0.1,0.9]
    return '[' + ','.join(str(item) for item in input_list) + ']'

ats_udf = udf(string_from_array, StringType())
predictions = predictions.withColumn('probability_string', ats_udf(col("probability")))
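As a quick sanity check of what the UDF will produce, here is the same helper applied to a plain Python list (no Spark needed); the sample values [0.1, 0.9] are just made-up probabilities:

```python
def string_from_array(input_list):
    # Join the values into a comma-separated, bracketed string
    return '[' + ','.join(str(item) for item in input_list) + ']'

# Hypothetical probability values for illustration
result = string_from_array([0.1, 0.9])
print(result)  # [0.1,0.9]
```

Each row's probability vector is serialized the same way, so the column becomes a plain string that the CSV writer can handle.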
Then you can save your dataset:
predictions.select("probability_string").write.csv('s3a://bucketname/output/x1.csv')