Tags: python, kubernetes, rabbitmq, pika

RabbitMQ clients orphaned on the server because of K8S SIGTERM


We have a bunch of pods that use RabbitMQ. If the pods are shut down by K8S with SIGTERM, we have found that our RMQ client (Python Pika) has no time to close its connection to the RMQ server, so the server thinks those clients are still alive until two heartbeats are missed.
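
For context, here is roughly the graceful shutdown we're trying to achieve (a simplified sketch; the broker host and queue names are placeholders): a SIGTERM handler sets a flag, and the main loop then publishes one last message and closes the Pika connection inside the termination grace period. The sketch assumes outbound traffic is still possible during that window, which is exactly what isn't happening for us.

    import signal
    import sys
    import time

    import pika

    # Placeholder host/queue names, for illustration only.
    RABBIT_HOST = "rabbitmq.default.svc.cluster.local"
    QUEUE = "events"

    shutting_down = False

    def handle_sigterm(signum, frame):
        # Only set a flag here; do the real cleanup in the main loop.
        global shutting_down
        shutting_down = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=RABBIT_HOST))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)

    while not shutting_down:
        channel.basic_publish(exchange="", routing_key=QUEUE, body=b"ping")
        time.sleep(1)

    # On SIGTERM: publish one final message and close cleanly, so the broker
    # drops the connection right away instead of waiting for two missed
    # heartbeats. This only works if outbound traffic is still possible
    # during the grace period.
    channel.basic_publish(exchange="", routing_key=QUEUE, body=b"goodbye")
    connection.close()
    sys.exit(0)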

Our investigation turned up that on SIGTERM, K8S kills all inbound and, most importantly, outbound TCP connections, among other things (removing endpoints, etc.). We tried to see whether any connections were still possible during a preStop hook, but preStop seems very internally focused and no traffic got out.

Has anybody else experienced this issue and solved it? All we need is to be able to get a message out the door before the kubelet slams it shut. Our pods are not K8S "Services", so some of the suggestions we found didn't help.

Steps to reproduce:

  1. add a preStop hook (sleep 30s) to the Sender pod
  2. tail the logs of the Receiver pod to watch inbound requests
  3. enter the Sender container's shell and curl the Receiver in a loop - requests appear in the Receiver logs
  4. kubectl delete pod to start termination of the Sender pod
  5. curl requests immediately begin to hang in the Sender; nothing appears in the Receiver logs

Solution

  • We tested this extensively and found that new EKS clusters with Calico installed (see below) will experience this problem unless Calico is upgraded. Networking is killed immediately when a pod is sent SIGTERM instead of waiting for the grace period. If you're experiencing this problem and are using Calico, check your Calico version against this thread:

    https://github.com/projectcalico/calico/issues/4518

    If you're installing Calico using the AWS yaml found here: https://github.com/aws/amazon-vpc-cni-k8s/tree/master/config

    Be advised that the fixes have NOT landed in any released version; we had to install from master, like so:

      kubectl apply \
      -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/master/calico-operator.yaml \
      -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/master/calico-crs.yaml
    

    and we also upgraded the AWS CNI for good measure, although that wasn't explicitly required to solve our issue:

      kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.8.0/config/v1.8/aws-k8s-cni.yaml
      kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.9.1/config/v1.9/aws-k8s-cni.yaml
    

    Here's a bunch of confusing documentation from AWS that makes it seem like you should switch to the new AWS "add-ons" to manage this stuff, but after an extensive discussion with support, we were advised against it.