Tags: java, spring, kubernetes, spring-batch, spring-cloud-dataflow

SCDF: Error handling when a pod fails to start


I'm working on a service that calls Spring Cloud Data Flow (SCDF) to spin up a new Kubernetes pod that runs a Spring Batch job.

Map<String, String> properties = Map.of("testApp.cpu", cpu, "testApp.memory", memory);
LOGGER.info("Create task '{}' with definition '{}'", taskName, taskDefinition);
taskOperations.create(taskName, taskDefinition);

LOGGER.info("Launching task '{}' with properties {} and arguments '{}'", taskName, properties, args);
return taskOperations.launch(taskName, properties, args);
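
For context, the taskOperations client above comes from the SCDF REST client. A minimal sketch of how it is typically wired up, assuming the SCDF server is reachable at a placeholder URL and the 1.7-era client API used in the question (where launch returns the task execution id), might look like this:

import java.net.URI;
import java.util.List;
import java.util.Map;

import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
import org.springframework.cloud.dataflow.rest.client.TaskOperations;

public class TaskLauncher {

    // Hypothetical SCDF server URL; replace with your own server address.
    private final TaskOperations taskOperations =
            new DataFlowTemplate(URI.create("http://scdf-server:9393")).taskOperations();

    public long createAndLaunch(String taskName, String taskDefinition,
                                Map<String, String> properties, List<String> args) {
        // Register the task definition, then launch it; SCDF asks the
        // Kubernetes deployer to create a pod for the execution.
        taskOperations.create(taskName, taskDefinition);
        return taskOperations.launch(taskName, properties, args);
    }
}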

Everything works fine. The problem is that whenever the launch references a non-existent image (e.g. due to some connection issue), the pod fails to start AND we end up with pending tasks (with NO batch jobs created at all).

For example, we end up with rows in the task_execution table (an SCDF table) with an empty end time:

(screenshot: task_execution rows with a null end time)

But there are no related jobs in the batch_job_execution table:

(screenshot: no related batch job for the task)
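
One way to see those stuck executions directly is to query the standard Spring Cloud Task schema for executions with no end time and no linked batch job. This is only a sketch against that schema; the DataSource wiring is an assumption:

import java.util.List;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class StuckTaskFinder {

    private final JdbcTemplate jdbcTemplate;

    public StuckTaskFinder(DataSource scdfDataSource) {
        this.jdbcTemplate = new JdbcTemplate(scdfDataSource);
    }

    // Task executions that never finished and never produced a batch job,
    // i.e. the "pending" executions that count against the concurrency limit.
    public List<Long> findStuckExecutionIds() {
        return jdbcTemplate.queryForList(
                "SELECT te.TASK_EXECUTION_ID "
                + "FROM TASK_EXECUTION te "
                + "LEFT JOIN TASK_TASK_BATCH ttb ON ttb.TASK_EXECUTION_ID = te.TASK_EXECUTION_ID "
                + "WHERE te.END_TIME IS NULL AND ttb.JOB_EXECUTION_ID IS NULL",
                Long.class);
    }
}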

This seems fine at first: since no pod is created, we don't consume any resources. But once the number of "pending" tasks reaches 20, we hit the well-known error:

Cannot launch task testApp. The maximum concurrent task executions is at its limit [20]

I'm trying to find a way to detect that the pod launch has failed (so that we can mark the task as an error), but so far to no avail.

Is there a way to detect that a task launch has failed when the launch creates a new k8s pod?
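
This isn't something our SCDF version exposes out of the box, but one way I could imagine detecting it myself is to look up the launched pod by its task-name label (visible in the kubectl describe output below) and treat image-pull failures as a launch error. This is just a sketch using the fabric8 Kubernetes client; the client setup, namespace and the set of failure reasons are assumptions:

import java.util.List;

import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PodLaunchChecker {

    public boolean hasLaunchFailed(String taskName, String namespace) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            // SCDF's Kubernetes deployer labels task pods with task-name.
            List<Pod> pods = client.pods()
                    .inNamespace(namespace)
                    .withLabel("task-name", taskName)
                    .list()
                    .getItems();

            for (Pod pod : pods) {
                if (pod.getStatus() == null) {
                    continue;
                }
                for (ContainerStatus status : pod.getStatus().getContainerStatuses()) {
                    if (status.getState().getWaiting() != null) {
                        String reason = status.getState().getWaiting().getReason();
                        // Treat image-pull problems as a failed launch.
                        if ("ErrImagePull".equals(reason) || "ImagePullBackOff".equals(reason)) {
                            return true;
                        }
                    }
                }
            }
            return false;
        }
    }
}

How the stuck task execution should then be cleared (e.g. by setting its end time and exit code) seems to depend on the SCDF version, so I have left that part out.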

UPDATE

Not sure if it is relevant, but we are using SCDF 1.7.3.RELEASE.

kubectl describe output for the failed pod:

Name:                 podname-lp2nyowgmm
Namespace:            my-namespace
Priority:             1000
Priority Class Name:  test-cluster-default
Node:                 some-ip.compute.internal/XX.XXX.XXX.XX
Start Time:           Thu, 14 Jan 2021 18:47:52 +0700
Labels:               role=spring-app
                      spring-app-id=podname-lp2nyowgmm
                      spring-deployment-id=podname-lp2nyowgmm
                      task-name=podname
Annotations:          iam.amazonaws.com/role: arn:aws:iam::XXXXXXXXXXXX:role/svc-XXXX-XXX-XX-XXXX-X-XXX-XXX-XXXXXXXXXXXXXXXXXXXX
                      kubernetes.io/psp: eks.privileged
Status:               Pending
IP:                   XX.XXX.XXX.XXX
IPs:
  IP:  XX.XXX.XXX.XXX
Containers:
  podname-lp2nyowgmm:
    Container ID:
    Image:         image_host:XXX/mysystem/myapp:notExist
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      --spring.datasource.username=postgres
      --spring.cloud.task.name=podname
      --spring.datasource.url=jdbc:postgresql://...
      --spring.datasource.driverClassName=org.postgresql.Driver
      --spring.datasource.password=XXXX
      --fileId=XXXXXXXXXXX
      --spring.application.name=app-name
      --fileName=file_name.csv
      ...
      --spring.cloud.task.executionid=3
    State:          Waiting
      Reason:       ErrImagePull
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  8Gi
    Requests:
      cpu:     2
      memory:  8Gi
    Environment:
      ELASTIC_SEARCH_PORT:               80
      ELASTIC_SEARCH_PROTOCOL:           http
      SPRING_RABBITMQ_PORT:              ${RABBITMQ_SERVICE_PORT}
      ELASTIC_SEARCH_URL:                elasticsearch
      SPRING_PROFILES_ACTIVE:            kubernetes
      CLIENT_SECRET:                     ${CLIENT_SECRET}
      SPRING_RABBITMQ_HOST:              ${RABBITMQ_SERVICE_HOST}
      RELEASE_ENV_NAME:                  QA_TEST
      SPRING_CLOUD_APPLICATION_GUID:     ${HOSTNAME}
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx(ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-xxxxx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-xxxxx
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  3m22s                 default-scheduler  Successfully assigned my-namespace/podname-lp2nyowgmm to some-ip.compute.internal
  Normal   Pulling    103s (x4 over 3m21s)  kubelet            Pulling image "image_host:XXX/mysystem/myapp:notExist"
  Warning  Failed     102s (x4 over 3m19s)  kubelet            Failed to pull image "image_host:XXX/mysystem/myapp:notExist": rpc error: code = Unknown desc = Error response from daemon: manifest for image_host:XXX/mysystem/myapp:notExist not found: manifest unknown: manifest unknown
  Warning  Failed     102s (x4 over 3m19s)  kubelet            Error: ErrImagePull
  Normal   BackOff    88s (x6 over 3m19s)   kubelet            Back-off pulling image "image_host:XXX/mysystem/myapp:notExist"
  Warning  Failed     73s (x7 over 3m19s)   kubelet            Error: ImagePullBackOff

Solution

  • 1.7.3 is a very old release; we just released 2.7. The original logic used the task execution tables instead of the pod status, so if the version you are using still behaves that way, it would explain what you are seeing. I strongly recommend an upgrade.