apache-spark, kubernetes, cluster-mode

Spark Pod restarting every hour in Kubernetes


I have deployed Spark applications in cluster mode on Kubernetes. The Spark application pod restarts almost every hour. The driver log shows these messages before each restart:

20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, x.x.x.x, 44879, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 2 (epoch 1)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, y.y.y.y, 46191, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
20/07/11 13:34:02 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
20/07/11 13:34:16 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.

And the executor log shows:

20/07/11 15:55:01 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/07/11 15:55:01 INFO MemoryStore: MemoryStore cleared
20/07/11 15:55:01 INFO BlockManager: BlockManager stopped
20/07/11 15:55:01 INFO ShutdownHookManager: Shutdown hook called

How can I find out what is causing the executors to be deleted?

Deployment:

Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 0 max surge
Pod Template:
  Labels:       app=test
                chart=test-2.0.0
                heritage=Tiller
                product=testp
                release=test
                service=test-spark
  Containers:
   test-spark:
    Image:     test-spark:2df66df06c
    Port:       <none>
    Host Port:  <none>
    Command:
      /spark/bin/start-spark.sh
    Args:
      while true; do sleep 30; done;
    Limits:
      memory:  4Gi
    Requests:
      memory:  4Gi
    Liveness:  exec [/spark/bin/liveness-probe.sh] delay=300s timeout=1s period=30s #success=1 #failure=10
    Environment:
      JVM_ARGS:                             -Xms256m -Xmx1g
      KUBERNETES_MASTER:                    https://kubernetes.default.svc
      KUBERNETES_NAMESPACE:                 test-spark
      IMAGE_PULL_POLICY:                    Always
      DRIVER_CPU:                           1
      DRIVER_MEMORY:                        2048m
      EXECUTOR_CPU:                         1
      EXECUTOR_MEMORY:                      2048m
      EXECUTOR_INSTANCES:                   2
      KAFKA_ADVERTISED_HOST_NAME:           kafka.default:9092
      ENRICH_KAFKA_ENRICHED_EVENTS_TOPICS:  test-events
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   test-spark-5c5997b459 (1/1 replicas created)
Events:          <none>

Solution

  • I did some quick research on running Spark on Kubernetes, and it seems that Spark, by design, terminates the executor pods once the Spark application finishes running. Quoting the official Spark documentation:

    When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.

    Therefore, I believe the restarts are nothing to worry about, as long as your Spark instance still manages to start executor pods as and when required.

    Reference: https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#how-it-works
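
To check whether the executor deletions really follow a regular (for example hourly) pattern, the `Lost executor` lines from the driver log can be extracted and the gaps between them measured. The following is only a minimal sketch in Python, assuming the log line format shown in the question (Spark's default `yy/MM/dd HH:mm:ss` timestamp prefix); the function names `parse_lost_executors` and `gaps_between_losses` are hypothetical helpers, not part of Spark or Kubernetes.

```python
import re
from datetime import datetime

# Matches driver-log lines of the form shown in the question, e.g.
# "20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: <reason>"
LOST_EXECUTOR = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ERROR TaskSchedulerImpl: "
    r"Lost executor (?P<id>\S+) on (?P<host>\S+): (?P<reason>.*)$"
)


def parse_lost_executors(lines):
    """Return (timestamp, executor_id, host, reason) for each 'Lost executor' line."""
    events = []
    for line in lines:
        match = LOST_EXECUTOR.match(line.strip())
        if match:
            # Assumption: timestamps use the yy/MM/dd layout seen in the question.
            ts = datetime.strptime(match.group("ts"), "%y/%m/%d %H:%M:%S")
            events.append(
                (ts, match.group("id"), match.group("host"), match.group("reason"))
            )
    return events


def gaps_between_losses(events):
    """Return the time gaps between consecutive distinct loss timestamps."""
    stamps = sorted({ts for ts, _, _, _ in events})
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

For instance, after saving the driver output to a file (the name `driver.log` is an assumption), calling `gaps_between_losses(parse_lost_executors(open("driver.log")))` lists the intervals between deletions. Gaps that cluster around one hour would suggest a periodic trigger rather than random executor failures, and the `reason` field shows what Spark itself reported for each loss.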