All,
We have an Apache Spark v3.1.2 + YARN cluster on AKS (SQL Server 2019 BDC). We ran Python code refactored to PySpark, which resulted in the error below:
Application application_1635264473597_0181 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1635264473597_0181_000001 exited with exitCode: -104
Failing this attempt.Diagnostics: [2021-11-12 15:00:16.915]Container [pid=12990,containerID=container_1635264473597_0181_01_000001] is running 7282688B beyond the 'PHYSICAL' memory limit. Current usage: 2.0 GB of 2 GB physical memory used; 4.9 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_1635264473597_0181_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 13073 12999 12990 12990 (python3) 7333 112 1516236800 235753 /opt/bin/python3 /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp/3677222184783620782
|- 12999 12990 12990 12990 (java) 6266 586 3728748544 289538 /opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1 -Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.livy.rsc.driver.RSCDriverBootstrapper --properties-file /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties
|- 12990 12987 12990 12990 (bash) 0 0 4304896 775 /bin/bash -c /opt/mssql/lib/zulu-jre-8/bin/java -server -XX:ActiveProcessorCount=1 -Xmx1664m -Djava.io.tmpdir=/var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.livy.rsc.driver.RSCDriverBootstrapper' --properties-file /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_conf.properties --dist-cache-conf /var/opt/hadoop/temp/nm-local-dir/usercache/grajee/appcache/application_1635264473597_0181/container_1635264473597_0181_01_000001/spark_conf/spark_dist_cache.properties 1> /var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stdout 2> /var/log/yarnuser/userlogs/application_1635264473597_0181/container_1635264473597_0181_01_000001/stderr
[2021-11-12 15:00:16.921]Container killed on request. Exit code is 143
[2021-11-12 15:00:16.940]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: https://sparkhead-0.mssql-cluster.everestre.net:8090/cluster/app/application_1635264473597_0181 Then click on links to logs of each attempt.
. Failing the application.
The default settings are as below, and there are no runtime overrides:
"settings": {
"spark-defaults-conf.spark.driver.cores": "1",
"spark-defaults-conf.spark.driver.memory": "1664m",
"spark-defaults-conf.spark.driver.memoryOverhead": "384",
"spark-defaults-conf.spark.executor.instances": "1",
"spark-defaults-conf.spark.executor.cores": "2",
"spark-defaults-conf.spark.executor.memory": "3712m",
"spark-defaults-conf.spark.executor.memoryOverhead": "384",
"yarn-site.yarn.nodemanager.resource.memory-mb": "12288",
"yarn-site.yarn.nodemanager.resource.cpu-vcores": "6",
"yarn-site.yarn.scheduler.maximum-allocation-mb": "12288",
"yarn-site.yarn.scheduler.maximum-allocation-vcores": "6",
"yarn-site.yarn.scheduler.capacity.maximum-am-resource-percent": "0.34".
}
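For reference, spark.driver.memory (1664m) + spark.driver.memoryOverhead (384m) = 2048 MB = 2 GB, which matches the "2.0 GB of 2 GB physical memory used" limit in the error above. The process dump adds up the same way: assuming 4 KB pages, the java process (289538 pages) is about 1.1 GB resident and the python3 process (235753 pages) about 0.9 GB, roughly 2 GB together.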
Is the "AM Container" mentioned here the Application Master container or the Application Manager (of YARN)? If it is the Application Master, does that mean that in cluster mode the driver and the Application Master run in the same container?
Which runtime parameter do I change to make the PySpark code run successfully?
Thanks,
grajee
Likely you don't need to change any settings. Exit code 143 could mean a lot of things, including that you ran out of memory. To test whether you ran out of memory, I'd reduce the amount of data you are processing and see if your code starts to work (see the sketch below). If it does, you likely ran out of memory and should consider refactoring your code. In general, I suggest trying code changes before making Spark config changes.
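For example, here is a minimal sketch of running the same logic on a fraction of the input; the path and format are placeholders, not taken from your job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder source; substitute whatever your job actually reads.
df = spark.read.parquet("/data/input")

# Run the same transformations on a 10% sample (or use df.limit(n)).
# If the job now succeeds, memory pressure is the likely culprit.
sample = df.sample(fraction=0.1, seed=42)
sample.write.mode("overwrite").parquet("/tmp/sample_output")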
For an understanding of how the Spark driver works on YARN, here's a reasonable explanation: https://sujithjay.com/spark/with-yarn
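To answer your other question: yes, in cluster mode the driver runs inside the Application Master container, so the killed AM container's 2 GB limit is spark.driver.memory + spark.driver.memoryOverhead. If refactoring isn't enough, that's the pair to raise. Your process tree shows Livy (RSCDriverBootstrapper) launching the app, and Livy's POST /sessions body accepts a driverMemory field, so one option is to set it per session. A sketch, with the gateway URL and credentials as placeholders for your cluster; note 3g + 512m stays under the AM share implied by maximum-am-resource-percent (0.34 × 12288 MB ≈ 4177 MB):

import json
import requests

# Placeholder endpoint; adjust to your BDC gateway's Livy URL and auth scheme.
livy_url = "https://<gateway-host>:30443/gateway/default/livy/v1/sessions"

payload = {
    "kind": "pyspark",
    "driverMemory": "3g",                           # up from the 1664m default
    "conf": {"spark.driver.memoryOverhead": "512"}, # plus a little more headroom
}

resp = requests.post(
    livy_url,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
    auth=("<user>", "<password>"),  # placeholder credentials
)
print(resp.json())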