pyspark, hadoop-yarn

YARN container failing with exit codes -104 and 143 in Spark job


I am triggering a spark-submit job from an Oozie workflow on the Cloudera CDH 6.2.1 platform, but the YARN container fails with exit codes -104 and 143. Below is a log snippet:

Application application_1596360900040_33869 failed 2 times due to AM Container for appattempt_1596360900040_33869_000002 exited with  exitCode: -104
…………………………………………………………………………………………………………………………………………………………
…………………some more logs printing jar dependencies…………………………
………………………………………………………………………………………………………………………………………………………………
1001/lib/hadoop/client/xz-1.6.jar:/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p3757.1951001/lib/hadoop/client/xz.jar -Xmx8G org.apache.spark.deploy.SparkSubmit --master yarn --deploy-mode client --conf spark.yarn.am.memory=8G --conf spark.driver.memory=8G --conf spark.yarn.am.memoryOverhead=820 --conf spark.driver.memoryOverhead=820 --conf spark.executor.memoryOverhead=3280 --conf spark.sql.broadcastTimeout=3600 --num-executors 4 --executor-cores 8 --executor-memory 16G --principal username --keytab username.keytab main.py
[2020-08-14 05:30:26.153]Container killed on request. Exit code is 143
[2020-08-14 05:30:26.167]Container exited with a non-zero exit code 143.

The spark-submit parameters are as follows:

spark2-submit \
--master yarn \
--deploy-mode client \
--num-executors 4 \
--executor-cores 8 \
--executor-memory 16G \
--driver-memory 8G \
--principal ${user_name} \
--keytab ${user_name}.keytab \
--conf spark.sql.broadcastTimeout=3600 \
--conf spark.executor.memoryOverhead=3280 \
--conf spark.driver.memoryOverhead=820 \
--conf spark.yarn.am.memory=8G \
--conf spark.yarn.am.memoryOverhead=820 \
main.py 

I have tried different combinations of executor, driver, and application master memory, but they all result in the same error.


Solution

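  • The problem was resolved by changing the deploy mode from client to cluster. I trigger the Spark job from an Oozie workflow, so in client mode the driver starts inside the Oozie launcher JVM. Exit code -104 is YARN's status for a container killed for exceeding its physical memory limit, and 143 is the SIGTERM that follows the kill, which is consistent with the 8G driver heap running inside the launcher's container. Setting the mode to cluster moves the driver into its own YARN container instead.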
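
For reference, the change can be sketched as the command below. This is not the exact command from the fix, which was not posted; it assumes every other flag stays as in the question and only the deploy mode changes. The two `spark.yarn.am.*` settings are left out on the assumption that they are no longer needed, since Spark documents them as applying to client mode only (in cluster mode the driver settings size the application master container).

```shell
# Sketch only: same resources as the original command, deploy mode switched to cluster.
# Assumption: spark.yarn.am.memory and spark.yarn.am.memoryOverhead are dropped because
# they apply to client mode; in cluster mode the driver runs inside the application
# master, so --driver-memory and spark.driver.memoryOverhead size that container.
spark2-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 4 \
--executor-cores 8 \
--executor-memory 16G \
--driver-memory 8G \
--principal ${user_name} \
--keytab ${user_name}.keytab \
--conf spark.sql.broadcastTimeout=3600 \
--conf spark.executor.memoryOverhead=3280 \
--conf spark.driver.memoryOverhead=820 \
main.py
```

With this layout the Oozie launcher only submits the application, and the driver's 8G heap plus 820 MB overhead is requested from YARN as a separate container.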