Tags: kubernetes, calico, strimzi

Temporary failure in name resolution for pod running on kubeadm worker node


I run Kafka inside a Kubernetes cluster on VMware, with a control plane node and one worker node. From the control plane node my client can communicate with Kafka, but from the worker node it fails with this error:

   %3|1638529687.405|FAIL|apollo-prototype-765f4d8bcf-bjpf4#producer-2| [thrd:sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap]: sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap:9092': Temporary failure in name resolution (after 20016ms in state CONNECT, 2 identical error(s) suppressed)
   %3|1638529687.406|ERROR|apollo-prototype-765f4d8bcf-bjpf4#producer-2| [thrd:app]: apollo-prototype-765f4d8bcf-bjpf4#producer-2: sasl_plaintext://my-cluster-kafka-bootstrap:9092/bootstrap: Failed to resolve 'my-cluster-kafka-bootstrap:9092': Temporary failure in name resolution (after 20016ms in state CONNECT, 2 identical error(s) suppressed)
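
For reference, in-cluster DNS from a pod on the worker node can be checked with something like this (the node name worker-1 and the kafka namespace are placeholders for my setup):

    # Run a one-off busybox pod pinned to the worker node and try to resolve the bootstrap service
    kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
      --overrides='{"apiVersion":"v1","spec":{"nodeName":"worker-1"}}' \
      -- nslookup my-cluster-kafka-bootstrap.kafka.svc.cluster.local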

This is my Kafka cluster manifest (using Strimzi):

listeners:
  - name: plain
    port: 9092
    type: internal
    tls: false
    authentication:
      type: scram-sha-512
  - name: external
    port: 9094
    type: ingress
    tls: true
    authentication:
      type: scram-sha-512
    configuration:
      class: nginx
      bootstrap:
        host: localb.kafka.xxx.com
      brokers:
      - broker: 0
        host: local.kafka.xxx.com
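
For context, Strimzi creates a bootstrap service for the plain listener that clients resolve by name; a quick sanity check that those services exist (assuming the cluster runs in a kafka namespace) is:

    # List the services Strimzi created for this cluster; the plain listener is
    # served by my-cluster-kafka-bootstrap on port 9092
    kubectl get svc -n kafka -l strimzi.io/cluster=my-cluster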

It's worth mentioning that exactly the same config works flawlessly when I run it in the cloud.

Telnet and nslookup (from both nodes) fail with an error. The CoreDNS logs do not even mention the request, and the firewall is disabled on both nodes.
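
Concretely, these are the kinds of checks run from the worker node (10.96.0.10 is the default kubeadm cluster DNS IP; the namespace and service CIDR are assumptions for my setup):

    # Is the kube-dns/CoreDNS service there, and does it have endpoints?
    kubectl get svc -n kube-system kube-dns
    kubectl get endpoints -n kube-system kube-dns

    # Query CoreDNS directly for the bootstrap service FQDN, then try the port
    nslookup my-cluster-kafka-bootstrap.kafka.svc.cluster.local 10.96.0.10
    telnet my-cluster-kafka-bootstrap.kafka.svc.cluster.local 9092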

Could you please help me out? Thanks!


UPDATE: SOLUTION. The Calico pod (on the worker node) was complaining that bird: Netlink: Network is down, even though it was not crashing:

2021-12-03 09:39:58.051 [INFO][90] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{}}
2021-12-03 09:39:58.051 [INFO][90] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{}}
2021-12-03 09:39:58.052 [INFO][90] felix/ipsets.go 130: Queueing IP set for creation family="inet" setID="this-host" setType="hash:ip"
2021-12-03 09:39:58.057 [INFO][90] felix/ipsets.go 785: Doing full IP set rewrite family="inet" numMembersInPendingReplace=3 setID="this-host"
2021-12-03 09:39:58.059 [INFO][90] felix/int_dataplane.go 1036: Linux interface state changed. ifIndex=13 ifaceName="tunl0" state="down"
2021-12-03 09:39:58.082 [INFO][90] felix/int_dataplane.go 1521: Received interface update msg=&intdataplane.ifaceUpdate{Name:"tunl0", State:"down", Index:13}
bird: Netlink: Network is down
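
To see this, it helps to look at the calico-node pod scheduled on the worker and at the tunl0 interface on the host (the label selector and namespace depend on how Calico was installed; the pod name is a placeholder):

    # Find the calico-node pod running on the worker and grep its logs for bird errors
    kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
    kubectl logs -n kube-system <calico-node-pod-on-worker> -c calico-node | grep -i "network is down"

    # On the worker node itself, confirm the IPIP tunnel interface is down
    ip link show tunl0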

Here is what I did, and it worked like a charm!

The fault was caused by a mismatch in the kernel modules loaded on the nodes: the ipip module was loaded on the new node, but the old node had not loaded it, which caused the Calico exception. Removing the ipip module brought things back to normal.

[root@k8s-node236-232 ~]# lsmod  | grep ipip
ipip                   16384  0 
tunnel4                16384  1 ipip
ip_tunnel              24576  1 ipip
[root@k8s-node236-232 ~]# modprobe -r ipip
[root@k8s-node236-232 ~]# lsmod  | grep ipip
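
A rough way to confirm the fix took effect is to bounce the calico-node pod on that worker and re-run the DNS check from a pod (the pod name and namespace are placeholders):

    # Recreate the calico-node pod on the affected worker so BIRD re-reads the interfaces
    kubectl delete pod -n kube-system <calico-node-pod-on-worker>

    # Once it is Running again, DNS from a pod on the worker should resolve
    kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
      -- nslookup my-cluster-kafka-bootstrap.kafka.svc.cluster.local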
