apache-spark, google-cloud-dataproc

Spark job `Container killed on request. Exit code is 137` on Dataproc 2.0


My Spark job on Dataproc 2.0 failed. In the driver log, there were many

ExecutorLostFailure (executor 45 exited caused by one of the running tasks)
Reason: Container from a bad node: ... Container killed on request. Exit code is 137

and

23/11/25 10:38:30 ERROR org.apache.spark.network.server.TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
        at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:176)
        at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)

What could be the possible causes? How do I fix it?


Solution

  • The exception org.apache.spark.SparkException: Could not find CoarseGrainedScheduler usually indicates that a Spark executor failed or was terminated unexpectedly, and the error Container killed on request. Exit code is 137 usually indicates that the executor's YARN container was killed by Earlyoom (available on Dataproc 2.0+).

    When the worker node as a whole is under memory pressure, Earlyoom is triggered to select and kill processes to free memory and prevent the node from becoming unhealthy, and YARN containers are often the processes selected. You can confirm this in /var/log/earlyoom.log on the node, or in Cloud Logging with

    resource.type="cloud_dataproc_cluster"
    resource.labels.cluster_name=...
    resource.labels.cluster_uuid=...
    earlyoom
    

    You might see logs like

    process is killed due to memory pressure. /usr/lib/jvm/.../java ... org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.
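
    If you prefer the command line, roughly the same check can be done with gcloud logging read. This is only a sketch; the project ID, cluster name, and limit are placeholders to adapt to your environment:

    gcloud logging read \
        'resource.type="cloud_dataproc_cluster" AND resource.labels.cluster_name="my-cluster" AND "earlyoom"' \
        --project=my-project \
        --limit=50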

    In this case, you need to reduce memory pressure on the node: either lower yarn.nodemanager.resource.memory-mb so more memory is left for other processes, or use worker nodes with more memory.
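
    For example, you can lower the YARN NodeManager memory at cluster creation time with a Dataproc cluster property. This is a sketch only; the cluster name, region, and the 24576 MB value are placeholders that depend on your machine type:

    gcloud dataproc clusters create my-cluster \
        --region=us-central1 \
        --properties=yarn:yarn.nodemanager.resource.memory-mb=24576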

    Note that Container killed on request. Exit code is 137 is usually NOT an indicator of an OOM inside the container itself. If the container itself ran out of memory, you would instead see errors like Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. In that case, consider increasing the Spark executor memory (spark.executor.memory) and/or memory overhead (spark.executor.memoryOverhead).
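
    A sketch of raising those settings when submitting a Dataproc Spark job (the cluster, region, class, jar, and the 8g/2g values are placeholders, not recommendations):

    gcloud dataproc jobs submit spark \
        --cluster=my-cluster \
        --region=us-central1 \
        --class=com.example.MyJob \
        --jars=gs://my-bucket/my-job.jar \
        --properties=spark.executor.memory=8g,spark.executor.memoryOverhead=2g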