python apache-spark pyspark hadoop-yarn

Spark on YARN cluster creates a Spark job with far fewer workers than specified in the Spark context


Spark on a YARN cluster creates a Spark job with far fewer workers (only 4) than the 100 specified in the Spark context. Here is how I create the Spark context and session:

import pyspark
from pyspark.sql import SparkSession

config_list = [
    ('spark.yarn.dist.archives','xxxxxxxxxxx'),
    ('spark.yarn.appMasterEnv.PYSPARK_PYTHON','xxxxxxxxx'),
    ('spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON','xxxxxxxxxxx'),
    ('spark.local.dir','xxxxxxxxxxxxxxxxxx'),
    ('spark.submit.deployMode','client'),
    ('spark.yarn.queue','somequeue'),
    ('spark.dynamicAllocation.minExecutors','100'),
    ('spark.dynamicAllocation.maxExecutors','100'),
    ('spark.executor.instances','100'),
    ('spark.executor.memory','40g'),
    ('spark.driver.memory','40g'),
    ('spark.yarn.executor.memoryOverhead','10g')
]

conf = pyspark.SparkConf().setAll(config_list)

spark = SparkSession.builder.master('yarn')\
    .config(conf=conf)\
    .appName('myapp')\
    .getOrCreate()

sc = spark.sparkContext

Would appreciate any ideas.


Solution

  • If you ask for a minimum number of executors that is greater than or equal to the number of workers/executors actually present in your cluster, the Spark session will allocate only as many executors as there are free worker nodes at the time your job runs. In other words, the 100 you request is an upper bound; YARN grants whatever capacity your cluster (and your queue) has free.

    You can check the executor count the session has configured by running the call below; a sketch for checking how many executors were actually granted follows at the end of this answer:

    sc._conf.get('spark.executor.instances')
    
    

    Hope this helps.
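    One caveat: sc._conf.get only echoes the value that was requested, not what YARN actually granted. A rough way to see the granted count is to go through the JVM SparkContext, as in the sketch below; sc._jsc and getExecutorMemoryStatus are internal Spark handles rather than a stable public API, so treat this as a best-effort check.

    # Compare the requested executor count with what was actually granted.
    # Assumes an active SparkSession named `spark` (as created in the question).
    sc = spark.sparkContext

    requested = sc._conf.get('spark.executor.instances')

    # getExecutorMemoryStatus() has one entry per executor plus one for the driver.
    granted = sc._jsc.sc().getExecutorMemoryStatus().size() - 1

    print(f"requested executors: {requested}, actually running: {granted}")

    If the granted count stays well below 100, check the free capacity of the YARN queue you are submitting to, since each executor in this configuration needs 40g of memory plus 10g of overhead.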