apache-spark, pyspark

Configurations Usage in Pyspark


I came across various methods for configuring settings in PySpark. I'd like to understand when each of the following should be used. Are all three of the options below equivalent?

spark.conf.set()
spark.sparkContext._conf.set()
spark.sparkContext._jsc.hadoopConfiguration().set()

Solution

  • The first two are effectively identical, and the third you will probably never have to use directly. The leading "_" in those attribute names signals that they are internal and you shouldn't rely on them.

    spark.conf wraps self._jsparkSession.conf(), and the context-level setConf calls self.sparkSession.conf.set, i.e. the context delegates to the session's conf.
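
    A minimal PySpark sketch of that relationship (the config key used here is just an arbitrary runtime SQL setting for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("conf-demo").master("local[*]").getOrCreate()

    # Public API: runtime configuration on the session.
    spark.conf.set("spark.sql.shuffle.partitions", "8")
    print(spark.conf.get("spark.sql.shuffle.partitions"))    # 8

    # Internal SparkConf behind the context; the leading underscore marks it
    # as private, and mutating it after the context exists is not supported.
    print(spark.sparkContext._conf.get("spark.app.name"))    # conf-demo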

    With respect to the Hadoop configuration, you are better off going through the SparkSession builder, e.g.:

    sparkSessionBuilder.config("spark.hadoop.fs.file.impl", "..")
    

    The spark.hadoop. prefix pattern is applied when Spark bootstraps any Hadoop functionality; this is needed, for example, when using the bare naked local filesystem (BareLocalFileSystem) on Windows. By the time you can touch the Hadoop configuration directly, that bootstrapping may already have been forced, rendering the direct route useless (particularly for filesystems).

    Edit (the above wasn't clear enough):

    This code:

    SparkSession spark = SparkSession.builder().appName("Foo Bar").master("local").getOrCreate();
    spark.sparkContext().hadoopConfiguration().setClass("fs.file.impl", BareLocalFileSystem.class, FileSystem.class);
    

    will fail to pick up the filesystem, because calling getOrCreate has already forced the filesystem registration.
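
    By contrast, routing the same setting through the builder works, because it is applied while Spark bootstraps Hadoop, before the session forces filesystem registration. A rough PySpark sketch of that route (the fully qualified BareLocalFileSystem class name is assumed from the bare naked local filesystem project and is only illustrative):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("Foo Bar")
        .master("local")
        # The spark.hadoop. prefix copies the key into the Hadoop Configuration
        # during bootstrapping, i.e. before any filesystem is instantiated.
        .config("spark.hadoop.fs.file.impl",
                "com.globalmentor.apache.hadoop.fs.BareLocalFileSystem")  # assumed FQCN
        .getOrCreate()
    )

    # The setting is visible on the Hadoop side once the session exists.
    print(spark.sparkContext._jsc.hadoopConfiguration().get("fs.file.impl"))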

    If you aren't looking to control Spark behaviour but to let your UDFs use configuration set from the driver, prefer constants defined in your own code. The session, and therefore the config, does not exist on the executors, and the internal config sharing (SQLConf) is not exposed to users in the Python API.
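
    A small sketch of that last point (all names here are illustrative): capture a plain Python constant in the UDF's closure rather than trying to read spark.conf on the executor side.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # A plain constant defined in driver code; it is serialized with the
    # UDF's closure and shipped to the executors alongside the function.
    GREETING_PREFIX = "hello-"

    spark = SparkSession.builder.appName("udf-config-demo").master("local[*]").getOrCreate()

    @udf(returnType=StringType())
    def greet(name):
        # No SparkSession or spark.conf here: neither exists on the executors.
        return GREETING_PREFIX + name

    spark.createDataFrame([("world",)], ["name"]).select(greet("name")).show()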