Spark executors always EXIT


I have a two-node cluster with Spark installed as follows:

  • First node : A master and one worker
  • Second node : Another worker

Each time I start the cluster with the command $SPARK_HOME/sbin/start-all.sh, the second worker misbehaves: executors are created and exit immediately. As a result, the number of executors grows abnormally, and the jobs I submit very often fail.

(Screenshot: abnormal worker)

In the master logs, we can see it repeatedly launching and removing executors on the second worker (172.21.113.120):

Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b03-1.el7.x86_64/bin/java -cp /spark/spark-2.4.4/conf/:/spark/spark-2.4.4/assembly/target/scala-2.11/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host 172.21.113.119 --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/02/15 17:15:44 INFO Master: Started daemon with process name: [email protected]
20/02/15 17:15:44 INFO SignalUtils: Registered signal handler for TERM
20/02/15 17:15:44 INFO SignalUtils: Registered signal handler for HUP
20/02/15 17:15:44 INFO SignalUtils: Registered signal handler for INT
20/02/15 17:15:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/02/15 17:15:45 INFO SecurityManager: Changing view acls to: root
20/02/15 17:15:45 INFO SecurityManager: Changing modify acls to: root
20/02/15 17:15:45 INFO SecurityManager: Changing view acls groups to: 
20/02/15 17:15:45 INFO SecurityManager: Changing modify acls groups to: 
20/02/15 17:15:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/02/15 17:15:46 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
20/02/15 17:15:46 INFO Master: Starting Spark master at spark://172.21.113.119:7077
20/02/15 17:15:46 INFO Master: Running Spark version 2.4.4
20/02/15 17:15:46 INFO Utils: Successfully started service 'MasterUI' on port 8080.
20/02/15 17:15:46 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://ingest2.adcm.orangecm:8080
20/02/15 17:15:46 INFO Master: I have been elected leader! New state: ALIVE
20/02/15 17:15:51 INFO Master: Registering worker 172.21.113.119:37994 with 32 cores, 61.8 GB RAM
20/02/15 17:19:01 INFO Master: Registering worker 172.21.113.120:44767 with 32 cores, 61.8 GB RAM
20/02/15 17:19:39 INFO Master: Registering app org.apache.spark.ui.DeltaPipeline
20/02/15 17:19:39 INFO Master: Registered app org.apache.spark.ui.DeltaPipeline with ID app-20200215171939-0000
20/02/15 17:19:39 INFO Master: Launching executor app-20200215171939-0000/0 on worker worker-20200215171550-172.21.113.119-37994
20/02/15 17:19:40 INFO Master: Launching executor app-20200215171939-0000/1 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:19:43 INFO Master: Removing executor app-20200215171939-0000/1 because it is EXITED
20/02/15 17:19:43 INFO Master: Launching executor app-20200215171939-0000/2 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:19:47 INFO Master: Removing executor app-20200215171939-0000/2 because it is EXITED
20/02/15 17:19:47 INFO Master: Launching executor app-20200215171939-0000/3 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:19:50 INFO Master: Removing executor app-20200215171939-0000/3 because it is EXITED
20/02/15 17:19:50 INFO Master: Launching executor app-20200215171939-0000/4 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:19:54 INFO Master: Removing executor app-20200215171939-0000/4 because it is EXITED
20/02/15 17:19:54 INFO Master: Launching executor app-20200215171939-0000/5 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:19:57 INFO Master: Removing executor app-20200215171939-0000/5 because it is EXITED
20/02/15 17:19:57 INFO Master: Launching executor app-20200215171939-0000/6 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:20:00 INFO Master: Removing executor app-20200215171939-0000/6 because it is EXITED
20/02/15 17:20:00 INFO Master: Launching executor app-20200215171939-0000/7 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:20:04 INFO Master: Removing executor app-20200215171939-0000/7 because it is EXITED
20/02/15 17:20:04 INFO Master: Launching executor app-20200215171939-0000/8 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:20:07 INFO Master: Removing executor app-20200215171939-0000/8 because it is EXITED
20/02/15 17:20:07 INFO Master: Launching executor app-20200215171939-0000/9 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:20:10 INFO Master: Removing executor app-20200215171939-0000/9 because it is EXITED
20/02/15 17:20:10 INFO Master: Launching executor app-20200215171939-0000/10 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:20:12 INFO Master: Received unregister request from application app-20200215171939-0000
20/02/15 17:20:12 INFO Master: Removing app app-20200215171939-0000
20/02/15 17:20:12 INFO Master: 172.21.113.119:42912 got disassociated, removing it.
20/02/15 17:20:12 INFO Master: ingest2.adcm.orangecm:33195 got disassociated, removing it.
20/02/15 17:20:12 WARN Master: Got status update for unknown executor app-20200215171939-0000/10
20/02/15 17:20:12 WARN Master: Got status update for unknown executor app-20200215171939-0000/0
20/02/15 17:21:22 INFO Master: Registering app org.apache.spark.ui.DeltaPipeline
20/02/15 17:21:22 INFO Master: Registered app org.apache.spark.ui.DeltaPipeline with ID app-20200215172122-0001
20/02/15 17:21:22 INFO Master: Launching executor app-20200215172122-0001/0 on worker worker-20200215171550-172.21.113.119-37994
20/02/15 17:21:22 INFO Master: Launching executor app-20200215172122-0001/1 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:25 INFO Master: Removing executor app-20200215172122-0001/1 because it is EXITED
20/02/15 17:21:25 INFO Master: Launching executor app-20200215172122-0001/2 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:29 INFO Master: Removing executor app-20200215172122-0001/2 because it is EXITED
20/02/15 17:21:29 INFO Master: Launching executor app-20200215172122-0001/3 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:32 INFO Master: Removing executor app-20200215172122-0001/3 because it is EXITED
20/02/15 17:21:32 INFO Master: Launching executor app-20200215172122-0001/4 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:35 INFO Master: Removing executor app-20200215172122-0001/4 because it is EXITED
20/02/15 17:21:35 INFO Master: Launching executor app-20200215172122-0001/5 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:38 INFO Master: Removing executor app-20200215172122-0001/5 because it is EXITED
20/02/15 17:21:38 INFO Master: Launching executor app-20200215172122-0001/6 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:42 INFO Master: Removing executor app-20200215172122-0001/6 because it is EXITED
20/02/15 17:21:42 INFO Master: Launching executor app-20200215172122-0001/7 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:45 INFO Master: Removing executor app-20200215172122-0001/7 because it is EXITED
20/02/15 17:21:45 INFO Master: Launching executor app-20200215172122-0001/8 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:48 INFO Master: Removing executor app-20200215172122-0001/8 because it is EXITED
20/02/15 17:21:48 INFO Master: Launching executor app-20200215172122-0001/9 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:51 INFO Master: Removing executor app-20200215172122-0001/9 because it is EXITED
20/02/15 17:21:51 INFO Master: Launching executor app-20200215172122-0001/10 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:54 INFO Master: Removing executor app-20200215172122-0001/10 because it is EXITED
20/02/15 17:21:54 INFO Master: Launching executor app-20200215172122-0001/11 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:21:58 INFO Master: Removing executor app-20200215172122-0001/11 because it is EXITED
20/02/15 17:21:58 INFO Master: Launching executor app-20200215172122-0001/12 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:01 INFO Master: Removing executor app-20200215172122-0001/12 because it is EXITED
20/02/15 17:22:01 INFO Master: Launching executor app-20200215172122-0001/13 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:04 INFO Master: Removing executor app-20200215172122-0001/13 because it is EXITED
20/02/15 17:22:04 INFO Master: Launching executor app-20200215172122-0001/14 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:08 INFO Master: Removing executor app-20200215172122-0001/14 because it is EXITED
20/02/15 17:22:08 INFO Master: Launching executor app-20200215172122-0001/15 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:11 INFO Master: Removing executor app-20200215172122-0001/15 because it is EXITED
20/02/15 17:22:11 INFO Master: Launching executor app-20200215172122-0001/16 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:15 INFO Master: Removing executor app-20200215172122-0001/16 because it is EXITED
20/02/15 17:22:15 INFO Master: Launching executor app-20200215172122-0001/17 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:18 INFO Master: Removing executor app-20200215172122-0001/17 because it is EXITED
20/02/15 17:22:18 INFO Master: Launching executor app-20200215172122-0001/18 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:23 INFO Master: Removing executor app-20200215172122-0001/18 because it is EXITED
20/02/15 17:22:23 INFO Master: Launching executor app-20200215172122-0001/19 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:27 INFO Master: Removing executor app-20200215172122-0001/19 because it is EXITED
20/02/15 17:22:27 INFO Master: Launching executor app-20200215172122-0001/20 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:31 INFO Master: Removing executor app-20200215172122-0001/20 because it is EXITED
20/02/15 17:22:31 INFO Master: Launching executor app-20200215172122-0001/21 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:34 INFO Master: Removing executor app-20200215172122-0001/21 because it is EXITED
20/02/15 17:22:34 INFO Master: Launching executor app-20200215172122-0001/22 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:38 INFO Master: Removing executor app-20200215172122-0001/22 because it is EXITED
20/02/15 17:22:38 INFO Master: Launching executor app-20200215172122-0001/23 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:42 INFO Master: Removing executor app-20200215172122-0001/23 because it is EXITED
20/02/15 17:22:42 INFO Master: Launching executor app-20200215172122-0001/24 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:46 INFO Master: Removing executor app-20200215172122-0001/24 because it is EXITED
20/02/15 17:22:46 INFO Master: Launching executor app-20200215172122-0001/25 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:49 INFO Master: Removing executor app-20200215172122-0001/25 because it is EXITED
20/02/15 17:22:49 INFO Master: Launching executor app-20200215172122-0001/26 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:53 INFO Master: Removing executor app-20200215172122-0001/26 because it is EXITED
20/02/15 17:22:53 INFO Master: Launching executor app-20200215172122-0001/27 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:22:56 INFO Master: Removing executor app-20200215172122-0001/27 because it is EXITED
20/02/15 17:22:56 INFO Master: Launching executor app-20200215172122-0001/28 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:00 INFO Master: Removing executor app-20200215172122-0001/28 because it is EXITED
20/02/15 17:23:00 INFO Master: Launching executor app-20200215172122-0001/29 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:04 INFO Master: Removing executor app-20200215172122-0001/29 because it is EXITED
20/02/15 17:23:04 INFO Master: Launching executor app-20200215172122-0001/30 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:07 INFO Master: Removing executor app-20200215172122-0001/30 because it is EXITED
20/02/15 17:23:07 INFO Master: Launching executor app-20200215172122-0001/31 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:11 INFO Master: Removing executor app-20200215172122-0001/31 because it is EXITED
20/02/15 17:23:11 INFO Master: Launching executor app-20200215172122-0001/32 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:14 INFO Master: Removing executor app-20200215172122-0001/32 because it is EXITED
20/02/15 17:23:14 INFO Master: Launching executor app-20200215172122-0001/33 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:17 INFO Master: Removing executor app-20200215172122-0001/33 because it is EXITED
20/02/15 17:23:17 INFO Master: Launching executor app-20200215172122-0001/34 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:20 INFO Master: Removing executor app-20200215172122-0001/34 because it is EXITED
20/02/15 17:23:20 INFO Master: Launching executor app-20200215172122-0001/35 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:24 INFO Master: Removing executor app-20200215172122-0001/35 because it is EXITED
20/02/15 17:23:24 INFO Master: Launching executor app-20200215172122-0001/36 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:27 INFO Master: Removing executor app-20200215172122-0001/36 because it is EXITED
20/02/15 17:23:27 INFO Master: Launching executor app-20200215172122-0001/37 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:30 INFO Master: Removing executor app-20200215172122-0001/37 because it is EXITED
20/02/15 17:23:30 INFO Master: Launching executor app-20200215172122-0001/38 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:34 INFO Master: Removing executor app-20200215172122-0001/38 because it is EXITED
20/02/15 17:23:34 INFO Master: Launching executor app-20200215172122-0001/39 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:37 INFO Master: Removing executor app-20200215172122-0001/39 because it is EXITED
20/02/15 17:23:37 INFO Master: Launching executor app-20200215172122-0001/40 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:40 INFO Master: Removing executor app-20200215172122-0001/40 because it is EXITED
20/02/15 17:23:40 INFO Master: Launching executor app-20200215172122-0001/41 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:44 INFO Master: Removing executor app-20200215172122-0001/41 because it is EXITED
20/02/15 17:23:44 INFO Master: Launching executor app-20200215172122-0001/42 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:47 INFO Master: Removing executor app-20200215172122-0001/42 because it is EXITED
20/02/15 17:23:47 INFO Master: Launching executor app-20200215172122-0001/43 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:50 INFO Master: Removing executor app-20200215172122-0001/43 because it is EXITED
20/02/15 17:23:50 INFO Master: Launching executor app-20200215172122-0001/44 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:53 INFO Master: Removing executor app-20200215172122-0001/44 because it is EXITED
20/02/15 17:23:53 INFO Master: Launching executor app-20200215172122-0001/45 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:23:57 INFO Master: Removing executor app-20200215172122-0001/45 because it is EXITED
20/02/15 17:23:57 INFO Master: Launching executor app-20200215172122-0001/46 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:24:00 INFO Master: Removing executor app-20200215172122-0001/46 because it is EXITED
20/02/15 17:24:00 INFO Master: Launching executor app-20200215172122-0001/47 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:24:03 INFO Master: Removing executor app-20200215172122-0001/47 because it is EXITED
20/02/15 17:24:03 INFO Master: Launching executor app-20200215172122-0001/48 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:24:07 INFO Master: Removing executor app-20200215172122-0001/48 because it is EXITED
20/02/15 17:24:07 INFO Master: Launching executor app-20200215172122-0001/49 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:24:10 INFO Master: Removing executor app-20200215172122-0001/49 because it is EXITED
20/02/15 17:24:10 INFO Master: Launching executor app-20200215172122-0001/50 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:24:13 INFO Master: Removing executor app-20200215172122-0001/50 because it is EXITED
20/02/15 17:24:13 INFO Master: Launching executor app-20200215172122-0001/51 on worker worker-20200215171857-172.21.113.120-44767
20/02/15 17:24:14 INFO Master: Received unregister request from application app-20200215172122-0001
20/02/15 17:24:14 INFO Master: Removing app app-20200215172122-0001
20/02/15 17:24:14 WARN Master: Got status update for unknown executor app-20200215172122-0001/51
20/02/15 17:24:14 INFO Master: 172.21.113.119:42924 got disassociated, removing it.
20/02/15 17:24:14 INFO Master: ingest2.adcm.orangecm:33237 got disassociated, removing it.
20/02/15 17:24:14 WARN Master: Got status update for unknown executor app-20200215172122-0001/0
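
To quantify the churn in a master log like the one above, the launch and removal lines can be tallied per worker. The following is a minimal Python sketch, not part of the original post; the function name `executor_churn` and the regular expressions are assumptions based only on the log format shown here.

```python
import re
from collections import Counter

# Patterns inferred from the master log lines above (an assumption, not a Spark API).
LAUNCH = re.compile(
    r"Launching executor (\S+)/(\d+) on worker worker-\d+-([\d.]+)-\d+"
)
REMOVE = re.compile(r"Removing executor (\S+)/(\d+) because it is (\w+)")


def executor_churn(log_lines):
    """Count, per worker IP, how many executors were removed as EXITED."""
    placement = {}      # (app id, executor id) -> worker IP
    exited = Counter()  # worker IP -> number of EXITED executors
    for line in log_lines:
        launched = LAUNCH.search(line)
        if launched:
            app_id, exec_id, worker_ip = launched.groups()
            placement[(app_id, exec_id)] = worker_ip
            continue
        removed = REMOVE.search(line)
        if removed and removed.group(3) == "EXITED":
            key = (removed.group(1), removed.group(2))
            exited[placement.get(key, "unknown")] += 1
    return exited
```

Passing the lines of the master log file to `executor_churn` should show every EXITED executor attributed to 172.21.113.120 and none to 172.21.113.119, which points at the second node rather than at the application.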

Finally, the second worker's logs (from $SPARK_HOME/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-172.21.113.1.out):

Spark Command: /usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64/bin/java -cp /spark/spark-2.4.4/conf/:/spark/spark-2.4.4/assembly/target/scala-2.11/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://172.21.113.119:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/02/15 17:18:54 INFO Worker: Started daemon with process name: [email protected]
20/02/15 17:18:54 INFO SignalUtils: Registered signal handler for TERM
20/02/15 17:18:54 INFO SignalUtils: Registered signal handler for HUP
20/02/15 17:18:54 INFO SignalUtils: Registered signal handler for INT
20/02/15 17:18:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/02/15 17:18:56 INFO SecurityManager: Changing view acls to: root
20/02/15 17:18:56 INFO SecurityManager: Changing modify acls to: root
20/02/15 17:18:56 INFO SecurityManager: Changing view acls groups to: 
20/02/15 17:18:56 INFO SecurityManager: Changing modify acls groups to: 
20/02/15 17:18:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/02/15 17:18:57 INFO Utils: Successfully started service 'sparkWorker' on port 44767.
20/02/15 17:18:57 INFO Worker: Starting Spark worker 172.21.113.120:44767 with 32 cores, 61.8 GB RAM
20/02/15 17:18:57 INFO Worker: Running Spark version 2.4.4
20/02/15 17:18:57 INFO Worker: Spark home: /spark/spark-2.4.4
20/02/15 17:18:58 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
20/02/15 17:18:58 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://ingest2.adcm.orangecm:8081
20/02/15 17:18:58 INFO Worker: Connecting to master 172.21.113.119:7077...
20/02/15 17:18:58 INFO TransportClientFactory: Successfully created connection to /172.21.113.119:7077 after 73 ms (0 ms spent in bootstraps)
20/02/15 17:18:58 INFO Worker: Successfully registered with master spark://172.21.113.119:7077
20/02/15 17:19:37 INFO Worker: Asked to launch executor app-20200215171939-0000/1 for org.apache.spark.ui.DeltaPipeline
20/02/15 17:19:37 INFO SecurityManager: Changing view acls to: root
20/02/15 17:19:37 INFO SecurityManager: Changing modify acls to: root
20/02/15 17:19:37 INFO SecurityManager: Changing view acls groups to: 
20/02/15 17:19:37 INFO SecurityManager: Changing modify acls groups to: 
20/02/15 17:19:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/02/15 17:19:37 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/jre-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64/bin/java" "-cp" "/spark/spark-2.4.4/conf/:/spark/spark-2.4.4/assembly/target/scala-2.11/jars/*" "-Xmx24576M" "-Dspark.driver.port=33195" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:33195" "--executor-id" "1" "--hostname" "172.21.113.120" "--cores" "32" "--app-id" "app-20200215171939-0000" "--worker-url" "spark://[email protected]:44767"
20/02/15 17:19:41 INFO Worker: Executor app-20200215171939-0000/1 finished with state EXITED message Command exited with code 1 exitStatus 1
20/02/15 17:19:41 INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 1
20/02/15 17:19:41 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20200215171939-0000, execId=1)

Any idea what is going wrong?


Solution

  • I realized that the logs were more detailed in the Spark UI.

    The problem was connectivity between the two nodes. I just disabled the firewall between them.
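
A connectivity problem of this kind can be checked from the second node before changing any firewall rules. The sketch below is only an illustration under stated assumptions, not the author's method: `can_connect` is a hypothetical helper that tries to open a TCP connection. The master port 7077 and the driver port 33195 are taken from the logs in the question; the driver port is picked at random for each application unless `spark.driver.port` is set, so the second value changes from run to run.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run on 172.21.113.120 while an application is active, `can_connect("172.21.113.119", 7077)` and `can_connect("172.21.113.119", 33195)` would both be expected to return True once the nodes can reach each other. If the first succeeds and the second fails, that would be consistent with a firewall allowing the master port but blocking the driver port that executors connect back to.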