Tags: apache-spark, dataframe, apache-spark-sql, pyspark

Save a large Spark DataFrame as a single JSON file in S3


I'm trying to save a Spark DataFrame (more than 20 GB) as a single JSON file in Amazon S3. My code to save the dataframe looks like this:

dataframe.repartition(1).save("s3n://mybucket/testfile","json")

But I'm getting an error from S3: "Your proposed upload exceeds the maximum allowed size". I know that the maximum size Amazon allows for a single upload is 5 GB.

Is it possible to use S3 multipart upload with Spark, or is there another way to solve this?

By the way, I need the data in a single file because another user is going to download it afterwards.

*I'm using Apache Spark 1.3.1 on a 3-node cluster created with the spark-ec2 script.

Thanks a lot

JG


Solution

  • I would try splitting the large DataFrame into a series of smaller DataFrames that you then append to the same target path (a fuller sketch follows below the snippet).

    df.write.mode('append').json(yourtargetpath)
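
    A minimal, self-contained sketch of that approach, assuming PySpark on Spark 1.4+ (where the DataFrameWriter API used above is available); the source path, target path, and the 10-way split are placeholder assumptions:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="append-in-chunks")
    sqlContext = SQLContext(sc)

    # Placeholder paths -- substitute your own bucket and prefixes.
    sourcepath = "s3n://mybucket/source/"
    yourtargetpath = "s3n://mybucket/testfile/"

    dataframe = sqlContext.read.json(sourcepath)

    # randomSplit returns a list of smaller DataFrames whose union is the original.
    chunks = dataframe.randomSplit([1.0] * 10)

    for chunk in chunks:
        # Each append adds new part files under the same target path,
        # so no single object has to exceed S3's 5 GB single-upload limit.
        chunk.write.mode("append").json(yourtargetpath)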