Tags: apache-spark, pyspark

Retrieving specific Spark configuration values


I have not specified any Spark properties in my application, so it is running with the default values. How do I find the value of a specific Spark property that is actually being used?

In the case below, why can't we find the value of spark.executor.cores, and why do we run into an error?

  • Running the following raises the error org.apache.spark.SparkNoSuchElementException: [SQL_CONF_NOT_FOUND] The SQL config "spark.executor.cores" cannot be found. Please verify that the config exists. (The same error occurs for spark.executor.instances.)

    print(spark.conf.get("spark.executor.cores"))

  • Running the following returns a value of 200:

    print(spark.conf.get("spark.sql.shuffle.partitions"))

  • Running the following does not show either of the above configurations in its output:

    print(spark.sparkContext.getConf().getAll())

Solution

  • You're running into 2 different types of configuration parameters here, which is what is causing the confusion:

    • SparkContext config parameters: these are generic to your application and cluster. You can access them with spark.sparkContext.getConf().get("my-conf"). They apply whether or not you're using Spark SQL.
    • Spark SQL config parameters: configuration parameters whose names start with the spark.sql. prefix. These are relevant when you're using Spark SQL (when using DataFrames, for example). You can access them with spark.conf.get("my-sql-conf"). A minimal sketch of both accessors follows below.
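
    As a quick illustration, here is how the two accessors behave. This is a minimal sketch, assuming an already-running SparkSession named spark (for example in a pyspark shell):

    # SparkContext-level (application/cluster) configuration:
    # returns the value if it was set explicitly, otherwise None.
    print(spark.sparkContext.getConf().get("spark.executor.cores"))

    # Spark SQL runtime configuration: keys under spark.sql.* are resolvable here.
    print(spark.conf.get("spark.sql.shuffle.partitions"))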

    Now that we know this, let's answer the following 3 questions:

    Why does print(spark.conf.get("spark.executor.cores")) throw an error?

    spark.executor.cores is not a SQL configuration parameter, so it is not accessible with spark.conf.get().
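
    If you still want to read it without hitting the exception, a minimal sketch (using the same spark session as in the question):

    # Read it from the SparkContext configuration instead; this returns
    # None rather than raising if the parameter was never set explicitly.
    print(spark.sparkContext.getConf().get("spark.executor.cores"))

    # Alternatively, spark.conf.get() accepts a fallback value, which should
    # avoid the SQL_CONF_NOT_FOUND error for keys it cannot resolve.
    print(spark.conf.get("spark.executor.cores", "not set"))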

    Why does print(spark.conf.get("spark.sql.shuffle.partitions")) return 200?

    Since this config param starts with spark.sql, it IS a SQL config parameter. So it's accessible with spark.conf.get().
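
    Because it lives in the SQL runtime config, you can also change it at runtime and read the new value back:

    print(spark.conf.get("spark.sql.shuffle.partitions"))   # 200 by default

    # spark.sql.shuffle.partitions is a runtime-modifiable SQL config
    spark.conf.set("spark.sql.shuffle.partitions", "64")
    print(spark.conf.get("spark.sql.shuffle.partitions"))   # 64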

    But why don't you find spark.executor.cores in the output of the following command?

    print(spark.sparkContext.getConf().getAll())
    

    The reason is that you probably started your SparkContext without explicitly setting that configuration parameter. Configuration parameters that were never set explicitly don't show up in the list returned by .getAll(). Of course, they do have values, but those values are simply the defaults.

    For the following code, I will be using spark.sparkContext.getConf().get() instead of .getAll(), but they do essentially the same thing: .get() just returns the value of a single config param.
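
    To make the difference concrete, here is a short sketch of the relevant SparkConf accessors, again assuming a spark session that was started without setting spark.executor.cores:

    conf = spark.sparkContext.getConf()

    # .getAll() only lists parameters that were set explicitly
    print(dict(conf.getAll()).get("spark.executor.cores"))   # None

    # .contains() tells you whether a key was set explicitly at all
    print(conf.contains("spark.executor.cores"))             # False

    # .get() with a fallback makes the "not explicitly set" case obvious
    print(conf.get("spark.executor.cores", "Spark default"))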

    If I start up a pyspark shell without explicitly setting that configuration parameter, I get the following:

    pyspark
    
    # Wait until the pyspark REPL is up and running
    >>> print(spark.sparkContext.getConf().get("spark.executor.cores"))
    None
    

    If I start up a pyspark shell and explicitly set the spark.executor.cores value, I get the following:

    pyspark --conf spark.executor.cores=3
    
    # Wait until the pyspark REPL is up and running
    >>> print(spark.sparkContext.getConf().get("spark.executor.cores"))
    3
    
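    The same behaviour can be reproduced from a standalone script instead of the shell. A minimal sketch, assuming a local test run:

    from pyspark.sql import SparkSession

    # Setting spark.executor.cores at build time makes it show up in getConf()
    spark = (
        SparkSession.builder
        .master("local[*]")                     # assumption: local run, just for testing
        .appName("conf-demo")
        .config("spark.executor.cores", "3")
        .getOrCreate()
    )

    print(spark.sparkContext.getConf().get("spark.executor.cores"))  # 3
    spark.stop()
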

    Conclusion

    There are 2 things to remember here:

    • SparkContext config params != Spark SQL config parameters
    • spark.sparkContext.getConf().get() does not always return a value for a configuration parameter. For some configuration parameters, it only returns a value if you explicitly set it.
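
    If you often need to look a key up in both places, a small helper along these lines can save some typing. The name get_any_conf and its behaviour are my own, not part of the Spark API:

    def get_any_conf(spark, key, default=None):
        """Hypothetical helper: check the SparkContext conf first, then the
        Spark SQL runtime conf, falling back to `default`."""
        value = spark.sparkContext.getConf().get(key)
        if value is not None:
            return value
        try:
            return spark.conf.get(key)
        except Exception:  # e.g. SQL_CONF_NOT_FOUND for keys Spark can't resolve
            return default

    print(get_any_conf(spark, "spark.executor.cores", "Spark default"))
    print(get_any_conf(spark, "spark.sql.shuffle.partitions"))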