apache-spark, amazon-s3, aws-glue

Write to S3 bucket with limited permissions using Apache Spark


I am using the S3a protocol to write into a bucket that belongs to someone else. I am allowed to use only a limited set of S3 Actions (I don't know which exactly).

When trying to write the data with Spark on AWS Glue, I get a 403 AccessDenied error.

Using s3distcp from EMR works, but I would have to change how the infrastructure is set up. Using a bucket with all S3 actions allowed works as well, but I assume the bucket owner would not want to change the permissions.

Is there a way to tell Spark to write the data without requiring so many permissions?

Edit: Spark needs the s3:DeleteObject permission. Is there a way to circumvent this?

Here is the code:

// Per-bucket S3A credentials for the shared bucket
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.bucket.some-bucket.access.key", "accesskey")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.bucket.some-bucket.secret.key", "secretkey")

data.write.csv("s3a://some-bucket/test")
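
For reference, the same per-bucket S3A credentials can also be supplied when the session is built, via Spark's spark.hadoop.* passthrough; this is only a sketch of an equivalent setup, with the bucket name and key values being the placeholders from the question:

import org.apache.spark.sql.SparkSession

// Sketch: per-bucket S3A credentials set at SparkSession build time instead of on
// the running context; functionally equivalent to the hadoopConfiguration calls above.
val sparkSession = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.bucket.some-bucket.access.key", "accesskey")
  .config("spark.hadoop.fs.s3a.bucket.some-bucket.secret.key", "secretkey")
  .getOrCreate()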

Solution

  • Spark needs the s3:DeleteObject permission. Is there a way to circumvent this?

    No.

    The delete permission is needed to:

    • prune directory marker objects
    • implement rename() as copy + delete
    • clean up job attempt directories
    • delete directory trees before writing to them

    The Hadoop 3.1+ S3A connector should be able to cope without delete access all the way up the tree. Negotiate with the admin team for your IAM account to have delete rights on the path of the bucket where all output goes.
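
    If upgrading the connector is an option, a newer S3A client can also reduce one class of deletes by keeping directory markers instead of pruning them. A minimal sketch, assuming a Hadoop 3.3+ S3A connector and that every client reading the bucket understands retained markers; renames and job cleanup still issue deletes, so this reduces rather than removes the need for s3:DeleteObject:

    // Assumes Hadoop 3.3+: "keep" retains directory marker objects instead of
    // deleting them after files are written under a path.
    sparkSession.sparkContext.hadoopConfiguration
      .set("fs.s3a.directory.marker.retention", "keep")

    data.write.csv("s3a://some-bucket/test")

    In practice the simplest arrangement is still the one above: ask for s3:DeleteObject scoped to the single output prefix (for example some-bucket/test/*) so the bucket owner does not have to open up the whole bucket.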