apache-spark, pyspark, apache-spark-sql, hadoop2

Pyspark 2.4.0 hadoopConfiguration to write to S3


Pyspark version 2.4.0

I'm writing files to an S3 bucket I don't own, and the other users then have trouble reading those files. I think the issue is similar to this question: How to assign the access control list (ACL) when writing a CSV file to AWS in pyspark (2.2.0)?

But that solution no longer seems to work. I searched the PySpark docs but didn't find an answer. I tried:

from pyspark.sql import SparkSession
spark = SparkSession.\
    builder.\
    master("yarn").\
    appName(app_name).\
    enableHiveSupport().\
    getOrCreate()
spark.sparkContext.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")

This is giving me: ERROR - {"exception": "'SparkContext' object has no attribute 'hadoopConfiguration'"


Solution

  • There are two issues at hand.

    1. In order to set a new config, you need to stop the existing SparkContext and call getOrCreate() on your SparkSession builder again with the new config; you can't just set it on a running session. For example:
    import pyspark
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.master("local").getOrCreate()
    sc = spark.sparkContext
    conf = pyspark.SparkConf().setAll([('spark.executor.memory', '1g')])
    
    # stop the sparkContext and set new conf
    sc.stop()
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    
    2. In order to set Hadoop config through Spark, you need to prefix the keys with spark.hadoop. This means your config becomes (a complete end-to-end sketch follows below):
    conf = pyspark.SparkConf().setAll([("spark.hadoop.fs.s3a.acl.default", "BucketOwnerFullControl")])
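
    Putting both pieces together, here is a minimal end-to-end sketch. The app name, bucket path, and sample DataFrame are placeholders for illustration, not details taken from the question:

    import pyspark
    from pyspark.sql import SparkSession

    # S3A/Hadoop options must be passed as Spark properties with the
    # "spark.hadoop." prefix so they reach the Hadoop Configuration.
    conf = pyspark.SparkConf().setAll([
        ("spark.hadoop.fs.s3a.acl.default", "BucketOwnerFullControl"),
    ])

    # If a session is already running, stop it so the new conf takes effect.
    SparkSession.builder.getOrCreate().sparkContext.stop()

    spark = (
        SparkSession.builder
        .master("yarn")
        .appName("my_app")  # placeholder app name
        .config(conf=conf)
        .enableHiveSupport()
        .getOrCreate()
    )

    # Writes from this session should now carry the canned ACL, giving the
    # bucket owner full control of the written objects.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.mode("overwrite").csv("s3a://some-bucket/some/prefix")  # placeholder path

    You can confirm the property was picked up with spark.sparkContext.getConf().get("spark.hadoop.fs.s3a.acl.default").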
    

    Hope this helps.