apache-spark, pyspark, apache-spark-sql, hadoop2

Pyspark 2.4.0 hadoopConfiguration to write to S3


Pyspark version 2.4.0

I'm writing files to an S3 bucket I don't own, and the other users then have trouble reading those files. I think the issue is similar to this question: How to assign the access control list (ACL) when writing a CSV file to AWS in pyspark (2.2.0)?

But that solution no longer seems to work. I searched the PySpark docs but didn't find an answer. I tried:

from pyspark.sql import SparkSession
spark = SparkSession.\
    builder.\
    master("yarn").\
    appName(app_name).\
    enableHiveSupport().\
    getOrCreate()
spark.sparkContext.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")

This is giving me: ERROR - {"exception": "'SparkContext' object has no attribute 'hadoopConfiguration'"


Solution

  • There are two issues at hand.

    1. In order to set a new config, you need to stop the existing SparkContext and call getOrCreate() on your SparkSession builder again with the new config; you can't just set it on a running session. For example:
    import pyspark
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.master("local").getOrCreate()
    sc = spark.sparkContext
    conf = pyspark.SparkConf().setAll([('spark.executor.memory', '1g')])
    
    # stop the sparkContext and set new conf
    sc.stop()
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    
    2. In order to set Hadoop config through Spark, you need to prefix the keys with spark.hadoop. This means your config becomes (a complete end-to-end sketch follows below):
    conf = pyspark.SparkConf().setAll([("spark.hadoop.fs.s3a.acl.default", "BucketOwnerFullControl")])
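
    Putting both pieces together, here is a minimal end-to-end sketch. The app name, bucket path, and sample DataFrame are placeholders for illustration, not details taken from the question:

    import pyspark
    from pyspark.sql import SparkSession

    # S3A/Hadoop options must be passed as Spark properties with the
    # "spark.hadoop." prefix so they reach the Hadoop Configuration.
    conf = pyspark.SparkConf().setAll([
        ("spark.hadoop.fs.s3a.acl.default", "BucketOwnerFullControl"),
    ])

    # If a session is already running, stop it so the new conf takes effect.
    SparkSession.builder.getOrCreate().sparkContext.stop()

    spark = (
        SparkSession.builder
        .master("yarn")
        .appName("my_app")  # placeholder app name
        .config(conf=conf)
        .enableHiveSupport()
        .getOrCreate()
    )

    # Writes from this session should now carry the canned ACL, giving the
    # bucket owner full control of the written objects.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.write.mode("overwrite").csv("s3a://some-bucket/some/prefix")  # placeholder path

    You can confirm the property was picked up with spark.sparkContext.getConf().get("spark.hadoop.fs.s3a.acl.default").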
    

    Hope this helps.