
Airflow Helm Chart Worker Node Error - CrashLoopBackOff


I am using the official Helm chart for Airflow. Every pod works properly except the worker pod.

Even in that worker pod, two of the three containers (git-sync and worker-log-groomer) work fine.

The error happens in the third container (worker), which is in CrashLoopBackOff with exit code 137, reason OOMKilled (137 = 128 + 9, i.e. the container was killed with SIGKILL by the kernel's OOM killer).

In my OpenShift console, memory usage shows at around 70%.

Although this error usually points to a memory leak, that doesn't seem to be the case here. Please help, I have been stuck on this for a week now.
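
A quick way to see per-container memory usage directly, assuming metrics-server is available in the cluster:

kubectl top pod airflow-worker-0 --containers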

kubectl describe pod airflow-worker-0 output:

worker:
    Container ID:  <>
    Image:         <>
    Image ID:     <>
    Port:          <>
    Host Port:     <>
    Args:
      bash
      -c
      exec \
      airflow celery worker
    State:          Running
      Started:      <>
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      <>
      Finished:     <>
    Ready:          True
    Restart Count:  3
    Limits:
      ephemeral-storage:  30G
      memory:             1Gi
    Requests:
      cpu:                50m
      ephemeral-storage:  100M
      memory:             409Mi
    Environment:
      DUMB_INIT_SETSID:                        0
      AIRFLOW__CORE__FERNET_KEY:               <>                     Optional: false
    Mounts:
      <>
  git-sync:
    Container ID:   <>
    Image:          <>
    Image ID:       <>
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      <>
    Ready:          True
    Restart Count:  0
    Limits:
      ephemeral-storage:  30G
      memory:             1Gi
    Requests:
      cpu:                50m
      ephemeral-storage:  100M
      memory:             409Mi
    Environment:
      GIT_SYNC_REV:                HEAD
    Mounts:
      <>
  worker-log-groomer:
    Container ID:  <>
    Image:         <>
    Image ID:      <>
    Port:          <none>
    Host Port:     <none>
    Args:
      bash
      /clean-logs
    State:          Running
      Started:      <>
    Ready:          True
    Restart Count:  0
    Limits:
      ephemeral-storage:  30G
      memory:             1Gi
    Requests:
      cpu:                50m
      ephemeral-storage:  100M
      memory:             409Mi
    Environment:
      AIRFLOW__LOG_RETENTION_DAYS:  5
    Mounts:
      <>
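
For reference, the Last State / Reason fields above can also be pulled out directly with a jsonpath query (pod and container names are taken from the output above):

kubectl get pod airflow-worker-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="worker")].lastState.terminated.reason}'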

I am pretty sure you know the answer. I have read all your articles on Airflow. Thank you :) https://stackoverflow.com/users/1376561/marc-lamberti


Solution

  • The issue comes down to how the "resources" section is set in the Helm chart's values.yaml for each of the pods.

    By default it is -

    resources: {}
    

    but this causes an issue, as the pods can then consume as much memory as they want, with no bound.

    By changing it to -

    resources:
      limits:
        cpu: 200m
        memory: 2Gi
      requests:
        cpu: 100m
        memory: 512Mi
    

    This makes it explicit how much memory and CPU the pod can request and how much it is allowed to use. This solved my issue.
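
    In the official chart each component has its own resources block; for the worker it sits under the workers key. A minimal sketch of the relevant part of values.yaml (key path taken from the chart's default layout, values as above):

    workers:
      resources:
        limits:
          cpu: 200m
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 512Mi

    The change can then be rolled out with helm upgrade (release name, namespace and chart reference here are assumptions; substitute your own):

    helm upgrade airflow apache-airflow/airflow -n airflow -f values.yaml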