apache-spark, pyspark, apache-spark-sql

Spark session value not updating


I am setting the Spark session configuration using the code below:

    spark = (SparkSession
             .builder
             .appName('LoadDev1')
             # .config("spark.master", "local[2]")
             .config("spark.master", "yarn")
             .config("spark.yarn.queue", "uldp")
             .config("spark.tez.queue", "uldp")
             .config("spark.executor.instances", "5")
             .enableHiveSupport()
             .getOrCreate())
    return spark

    spark-submit \
        --jars /app/spark3.3.1/jars/iceberg-spark-runtime-3.3_2.12-1.1.0.jar \
        --conf spark.sql.shuffle.partitions=100 \
        --conf spark.hive.vectorized.execution.enabled=false \
        --py-files /home/path/SparkFactory_iceberg1.py

But when I print the values inside my program, for example spark.executor.instances, I get 10 instead of 5, even though the appName changes whenever I change it in the code. That makes me believe the configuration is indeed read, but that the values are somehow being overwritten.
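
For illustration, checking the effective values inside the driver can look roughly like this (a sketch, using the session object named spark returned above):

    # Print the values the running session actually uses
    print(spark.conf.get("spark.app.name"))
    print(spark.conf.get("spark.executor.instances"))
    print(spark.conf.get("spark.yarn.queue"))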

If I provide the value using --conf it is reflected, but I want to use a config file rather than --conf.

Please help me with this.


Solution

  • spark-submit is a separate application from the session you create at the top, so you need to pass those configs to the spark-submit command itself. You can create a properties file, which overrides the default Spark config at conf/spark-defaults.conf, with configs like this:

    app.conf

    spark.master yarn
    spark.yarn.queue uldp
    spark.tez.queue uldp
    spark.executor.instances 5
    spark.sql.shuffle.partitions 100
    spark.hive.vectorized.execution.enabled false
    
    $ spark-submit \
        --properties-file <PATH>/app.conf \
        --jars /app/spark3.3.1/jars/iceberg-spark-runtime-3.3_2.12-1.1.0.jar \
        --py-files /home/path/SparkFactory_iceberg1.py \
        /home/path/main.py
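
As a sanity check, the driver script can then print the effective values, which should now match app.conf (a sketch; the builder no longer needs to set these configs itself):

    # main.py (sketch) -- the values now come from --properties-file, not the builder
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    print(spark.conf.get("spark.yarn.queue"))               # expect: uldp
    print(spark.conf.get("spark.executor.instances"))       # expect: 5
    print(spark.conf.get("spark.sql.shuffle.partitions"))   # expect: 100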