I am running a spark-submit job with Spark 2.2.0, Scala 2.11.11, and SBT, on Mesos 1.4.2.
I have a problem with tasks being lost and executors failing to register. Here are the symptoms:
MesosCoarseGrainedSchedulerBackend launches tasks until spark.cores.max is reached. For example, here it launches 6 tasks (spark.cores.max = 120 divided by spark.executor.cores = 20 gives 6 executors):
18/06/11 12:49:54 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585462 with attributes: Map() mem: 423417.0 cpu: 55.5 ports: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000)). Launching 2 Mesos tasks.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 2 with mem: 11264.0 cpu: 20.0 ports:
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 0 with mem: 11264.0 cpu: 20.0 ports:
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585463 with attributes: Map() mem: 300665.0 cpu: 71.5 ports: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000)). Launching 3 Mesos tasks.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 4 with mem: 11264.0 cpu: 20.0 ports:
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 3 with mem: 11264.0 cpu: 20.0 ports:
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 1 with mem: 11264.0 cpu: 20.0 ports:
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585464 with attributes: Map() mem: 423417.0 cpu: 55.5 ports: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000)). Launching 1 Mesos tasks.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 5 with mem: 11264.0 cpu: 20.0 ports:
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585465 with attributes: Map() mem: 300665.0 cpu: 71.5 port: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000)) for 120 seconds (reason: reached spark.cores.max)
Then, immediately afterwards, it starts to lose tasks and blacklist slaves, even though I have set spark.blacklist.enabled=false:
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 0 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S0 due to too many failures; is Spark installed on it?
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 4 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 3 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S1 due to too many failures; is Spark installed on it?
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S1 due to too many failures; is Spark installed on it?
Afterwards the non-existent executors are removed:
18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 2 with reason Executor finished with state LOST
18/06/11 12:49:56 INFO BlockManagerMaster: Removal of executor 2 requested
18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585466 with attributes: Map() mem: 300665.0 cpu: 71.5 port: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000))
18/06/11 12:49:56 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585467 with attributes: Map() mem: 412153.0 cpu: 35.5 port: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000))
18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 0 with reason Executor finished with state LOST
18/06/11 12:49:56 INFO BlockManagerMaster: Removal of executor 0 requested
18/06/11 12:49:56 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 4 with reason Executor finished with state LOST
18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 4 requested
18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 4
18/06/11 12:49:59 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 3 with reason Executor finished with state LOST
18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 3 requested
18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
18/06/11 12:49:59 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 1 with reason Executor finished with state LOST
18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 1 requested
18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
18/06/11 12:49:59 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_RUNNING
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
Note, however, that the single task 5 was not lost and its executor 5 was not removed:
18/06/11 12:49:59 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_RUNNING
18/06/11 12:50:01 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (SlaveIP:46884) with ID 5
18/06/11 12:50:01 INFO BlockManagerMasterEndpoint: Registering block manager SlaveIP:32840 with 5.2 GB RAM, BlockManagerId(5, SlaveIP, 32840, None)
Here is my SparkSession setup:
val spark = SparkSession.builder
  .config("spark.executor.cores", 20)
  .config("spark.executor.memory", "10g")
  .config("spark.sql.shuffle.partitions", numPartitionsShuffle)
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.network.timeout", "1200s")
  .config("spark.blacklist.enabled", false)
  .config("spark.blacklist.maxFailedTaskPerExecutor", 100)
  .config("spark.dynamicAllocation.enabled", false)
  .getOrCreate()
And here is my spark-submit script:
spark-submit \
  --class MyMainClass \
  --master mesos://masterIP:7077 \
  --total-executor-cores 120 \
  --driver-memory 200g \
  --deploy-mode cluster \
  --name MyMainClass \
  --conf "spark.shuffle.service.enabled=false" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.blacklist.enabled=false" \
  --conf "spark.blacklist.maxFailedTaskPerExecutor=100" \
  --verbose \
  myJar-assembly-0.1.0-SNAPSHOT.jar
Notes:

- The relationship between settings given programmatically via the SparkSession config and via spark-submit --conf is not always clear to me, which is why I set the blacklist options in both places (a sketch for checking the resolved values follows below).
- Only task 5 ever reaches TASK_RUNNING status; the other tasks are lost before their executors register.
- spark.blacklist.maxFailedTaskPerExecutor and spark.blacklist.enabled don't seem to have any effect.

I am completely lost as to what is going on here.
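One way to see which of the two sources actually wins is to read the resolved configuration back from the running session. A minimal sketch, assuming the spark session built above (the key list is just the contested keys from my setup):

// Print the values the driver actually resolved for the contested keys,
// to check whether the builder .config(...) or the spark-submit --conf won.
Seq(
  "spark.blacklist.enabled",
  "spark.blacklist.maxFailedTaskPerExecutor",
  "spark.cores.max",
  "spark.executor.cores"
).foreach { key =>
  // spark.conf.get(key, default) returns the resolved value, or the
  // supplied default if the key was never set anywhere
  println(s"$key = ${spark.conf.get(key, "<not set>")}")
}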
Questions:

- What is going on here? Why are the tasks lost immediately after launch?
- Why are the slaves blacklisted even though spark.blacklist.enabled=false?
- Why do the executors (except executor 5) never register?
I am answering my own question:
We found that our problem was twofold.
Solution:
I am not sure if this is a general solution to the problem, but we somewhat blindly searched for configs that regulate the various Mesos timeout mechanisms, and we came across a bug in Mesos 1.4 related to the failover timeout.

As a test we set the SparkSession config spark.mesos.driver.failoverTimeout=1.0. This seemed to solve our problem: we don't lose our executors at the beginning of the job anymore.
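For reference, here is a minimal sketch of how that flag can be set with the same builder pattern as in the question; only the failoverTimeout line is new, the rest of the config stays as it was:

// Same builder as in the question, plus the Mesos failover timeout.
// spark.mesos.driver.failoverTimeout is the time (in seconds) that the
// Mesos master waits for a disconnected driver to reconnect before it
// tears the framework down; the default of 0.0 tears it down immediately.
val spark = SparkSession.builder
  .config("spark.mesos.driver.failoverTimeout", "1.0")
  // ... the remaining .config(...) lines from the question ...
  .getOrCreate()

Passing it as --conf "spark.mesos.driver.failoverTimeout=1.0" to spark-submit should work just as well.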