kubernetes, airflow, kubectl

Pod Failing with all Normal Events - How to Dig Deeper


Problem

I'm trying to deploy a pod, which is failing with an error I can't understand. The pod is run via Airflow to execute a particular task. Airflow shows the pod as failing, without any logs. When I run kubectl describe pod my-pod, I get the following output.

What should I do to determine the root cause of the issue?

The failing container section:

  base:
    Container ID:  <ID>
    Image:         <IMAGE>
    Image ID:      <ID>
    Port:          <none>
    Host Port:     <none>
    Command:
      airflow
      run
      /var/airflow/my_dag_name.py
      task_name
      2023-02-20T23:15:00+00:00
      --local
      --pool
      default_pool
      -sd
      /var/airflow/my_dag_name.py
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 20 Feb 2023 20:55:07 -0600
      Finished:     Mon, 20 Feb 2023 20:55:11 -0600
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                1
      ephemeral-storage:  100Gi
      memory:             8Gi
    Requests:
      cpu:                500m
      ephemeral-storage:  1Gi
      memory:             8Gi
    Environment:
      <ENV VARS>
    Mounts:
      <VARIOUS MOUNTS>

The events section (this is complete):

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  58s   default-scheduler  Successfully assigned <TASK> to <IP>
  Normal  Pulled     58s   kubelet            Container image <SIDECAR IMAGE 1> already present on machine
  Normal  Created    57s   kubelet            Created container <SIDECAR CONTAINER 1>
  Normal  Started    57s   kubelet            Started container <SIDECAR CONTAINER 1>
  Normal  Pulling    54s   kubelet            Pulling image <SIDECAR IMAGE 2>
  Normal  Pulled     53s   kubelet            Successfully pulled image <SIDECAR IMAGE 2> in 125.691281ms
  Normal  Created    53s   kubelet            Created container <SIDECAR CONTAINER 2>
  Normal  Started    53s   kubelet            Started container <SIDECAR CONTAINER 2>
  Normal  Pulled     52s   kubelet            Container image <FAILING POD IMAGE> already present on machine
  Normal  Created    52s   kubelet            Created container <FAILING POD CONTAINER>
  Normal  Started    52s   kubelet            Started container <FAILING POD CONTAINER>
  Normal  Pulled     52s   kubelet            Container image <SIDECAR IMAGE 3> already present on machine
  Normal  Created    52s   kubelet            Created container <SIDECAR CONTAINER 3>
  Normal  Started    52s   kubelet            Started container <SIDECAR CONTAINER 3>
  Normal  Pulled     52s   kubelet            Container image <SIDECAR IMAGE 4> already present on machine
  Normal  Created    52s   kubelet            Created container <SIDECAR CONTAINER 4>
  Normal  Started    51s   kubelet            Started container <SIDECAR CONTAINER 4>

Context

The pods use these temporary sidecars to connect to systems / inject information / etc.


Solution

  • In Kubernetes, container exit codes are very helpful when diagnosing pod issues. If a pod is unhealthy, the problem can be investigated with the command below:

    kubectl describe pod [POD_NAME]
    

    You have already provided this output, which shows the following:

    State: Terminated 
    Reason: Error 
    Exit Code: 1
    

    Since the container terminated with exit code 1, the container and its application need to be investigated thoroughly, as this exit code usually indicates an application error or an invalid reference (for example, the image spec refers to a file that is not present in the container image).
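
    If the describe output is long, the terminated state of every container can also be pulled out directly with jsonpath; a minimal sketch, assuming the pod is named my-pod (substitute the real pod name):

    # Hypothetical pod name; prints each container's name, termination reason, and exit code
    kubectl get pod my-pod -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.terminated.reason}{"\t"}{.state.terminated.exitCode}{"\n"}{end}'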

    As a first step, as suggested by Harsh Manvar, check the logs of the affected pod with the command below, which retrieves the logs for the first container in the pod.

    kubectl logs <pod-name> -p
    

    -p stands for --previous, which means that if the container has been restarted, the command returns the logs for the previous instance of the container.
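
    Since this pod runs several sidecar containers alongside the failing one, it also helps to target a specific container by name. A sketch, using the container name base from the describe output above (my-pod is a placeholder pod name):

    # Fetch logs for the failing container only; it is named "base" in the describe output
    kubectl logs my-pod -c base
    # Or fetch logs from every container in the pod at once
    kubectl logs my-pod --all-containers=true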

    The logs should reveal the root cause of the exit code 1, and that information can be used to fix the command field in the pod’s YAML file. Once updated, re-apply it to the cluster with the kubectl apply command.
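
    A minimal sketch of that cycle, assuming the corrected manifest is saved in a hypothetical file my-pod.yaml; note that a pod's command field is immutable, so the failed pod must be deleted before the fixed manifest is applied:

    # Hypothetical names; delete the failed pod, then recreate it from the corrected manifest
    kubectl delete pod my-pod
    kubectl apply -f my-pod.yaml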

    The above information is derived from an article written by James Walker.