apache-spark amazon-s3

Spark Write to S3 Storage Option


I am saving a Spark DataFrame to an S3 bucket. The default storage class for the saved file is STANDARD; I need it to be STANDARD_IA. What is the option to achieve this? I have looked into the Spark source code and found no such option on DataFrameWriter in https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala

Below is the code I am using to write to S3:

val df = spark.sql(<sql>)
df.coalesce(1).write.mode("overwrite").parquet(<s3path>)

Edit: I am now using CopyObjectRequest to change the storage class of the created parquet file after it is written:

import com.amazonaws.services.s3.model.CopyObjectRequest

val copyObjectRequest = new CopyObjectRequest(bucket, key, bucket, key).withStorageClass(<storageClass>)
s3Client.copyObject(copyObjectRequest)
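
Since Spark writes a whole directory of part files, copying a single key only changes one of them. Below is a minimal sketch of the same workaround applied to everything under the output prefix, assuming the AWS SDK for Java v1; bucket and outputPrefix are placeholder names for the location Spark wrote to:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{CopyObjectRequest, StorageClass}
import scala.collection.JavaConverters._

val s3Client = AmazonS3ClientBuilder.defaultClient()

// listObjectsV2 returns up to 1000 keys per call; paginate for larger outputs
val listing = s3Client.listObjectsV2(bucket, outputPrefix)
listing.getObjectSummaries.asScala.foreach { summary =>
  // copy each object onto itself with the new storage class
  val request = new CopyObjectRequest(bucket, summary.getKey, bucket, summary.getKey)
    .withStorageClass(StorageClass.StandardInfrequentAccess)
  s3Client.copyObject(request)
}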

Solution

  • As of July 2022 this has been implemented in the Hadoop source tree in HADOOP-12020 by AWS S3 engineers.

    It is still stabilising and should be out in the next feature release of Hadoop 3.3.x, due late 2022.

    • Anyone reading this before it ships: the code is there to build yourself.
    • Anyone reading this in 2023+: upgrade to Hadoop 3.3.5 or later and set the new option (see the sketch below).
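
Once you are on a release that carries HADOOP-12020, the storage class can be requested at write time through an S3A configuration option rather than a post-write copy. A minimal sketch, assuming the fs.s3a.create.storage.class option documented for the S3A connector in Hadoop 3.3.5+ (the <sql> and <s3path> placeholders are the same as in the question):

// ask the S3A connector to create new objects with the STANDARD_IA storage class
// assumes Hadoop 3.3.5+ with the option added by HADOOP-12020
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.create.storage.class", "standard_ia")

val df = spark.sql(<sql>)
df.coalesce(1).write.mode("overwrite").parquet(<s3path>)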