Tags: python, amazon-web-services, ubuntu, kubernetes, powerdns

Timeout Issues inside Kubernetes Cluster Powerdns


I'm running a PowerDNS Recursor inside my k8s cluster. My Python script runs in a different pod and does reverse DNS (rDNS) lookups against the recursor. My HPA max replica count is set to 8, but I don't think load is the problem here. I'm not sure how to resolve the timeout error shown below. Increasing the replicas fixes it temporarily, but then it happens again.

[ipmetadata][MainThread][source.py][144][WARNING]: dns_error code=12, message=Timeout while contacting DNS servers

It seems like my pods are rejecting incoming traffic, which is why I'm getting dns_error code=12.

Here is the part of my script that runs the rDNS lookup:

        return_value = {
            'rdns': None
        }
        try:
            async for attempt in AsyncRetrying(stop=stop_after_attempt(3)):
                with attempt:
                    try:
                        if ip:
                            result = await self._resolver.query(ip_address(ip).reverse_pointer, 'PTR')
                            return_value['rdns'] = result.name
                        return return_value
                    except DNSError as dns_error:
                        # 1  = DNS server returned answer with no data
                        # 4  = Domain name not found
                        # (seems to just be a failure of rdns lookup no sense in retrying)
                        # 11 = Could not contact DNS servers
                        if int(dns_error.args[0]) in [1, 4, 11]:
                            return return_value
                        LOG.warning('dns_error code=%d, message=%s, ip=%s', dns_error.args[0], dns_error.args[1], ip)
                        raise

        except RetryError as retry_ex:
            inner_exception = retry_ex.last_attempt.exception()
            if isinstance(inner_exception, DNSError):
                # 12 = Timeout while contacting DNS servers
                LOG.error('dns_error code=%d, message=%s, ip=%s', inner_exception.args[0], inner_exception.args[1], ip)
            else:
                LOG.exception('rdns lookup failed')
            return return_value


Solution

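  • Error code 12 with the message "Timeout while contacting DNS servers" is raised on the client side. The codes and messages in the log match those of the c-ares library, which the script appears to use through aiodns. It means the resolver in the Python pod gave up waiting for a reply from the DNS server it queried, here the PowerDNS Recursor. On its own it does not show that the pods rejected traffic; a refused connection would normally surface as code 11, which the script returns on without logging. The recursor may have answered too slowly because the authoritative servers for the reverse zones were slow or unreachable, or the query or reply may have been dropped along the way because of network issues, firewall rules, rate limiting, or misconfiguration of the recursor.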
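    As a rough illustration of how the codes in the question relate to retry decisions, here is a minimal sketch. The `classify` helper and the choice to treat code 11 as retryable are assumptions made for this example, not part of the original script or of any library.

```python
# Status codes and messages as they appear in the question's log and comments.
DNS_ERROR_CODES = {
    1: "DNS server returned answer with no data",
    4: "Domain name not found",
    11: "Could not contact DNS servers",
    12: "Timeout while contacting DNS servers",
}

# Codes where retrying the same lookup is unlikely to change the outcome.
NON_RETRYABLE = {1, 4}

# Assumption for this sketch: connection failures (11) and timeouts (12)
# are transient, so they are retried and logged rather than ignored.
TRANSIENT = {11, 12}


def classify(code: int) -> str:
    """Return 'give-up', 'retry', or 'unknown' for a DNS error code."""
    if code in NON_RETRYABLE:
        return "give-up"
    if code in TRANSIENT:
        return "retry"
    return "unknown"
```

    Logging code 11 instead of returning silently would make it easier to tell refused connections apart from plain timeouts.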

    Possible solutions

    There are a few things you can try to resolve this timeout error:

    • Check the network connectivity and latency between your Python pod and your recursor pod, and between your recursor pod and the authoritative servers. You can use tools like ping, traceroute, or dig to diagnose network problems. Of these, only dig exercises port 53 directly.
    • Check the firewall rules on your k8s cluster, including any Kubernetes NetworkPolicy objects. Make sure they allow UDP and TCP traffic on port 53 between the Python pod and the recursor, and from the recursor out to the internet. You can use tools like iptables, nftables, or ufw to inspect firewall rules on the nodes.
    • Check for rate limiting or capacity limits on your recursor. Rate limiting is a mechanism to prevent denial-of-service attacks or abuse of DNS resources by limiting the number of queries accepted from a given source. On the recursor, use rec_control to inspect statistics and settings (pdns_control and pdnsutil belong to the PowerDNS authoritative server, not the recursor). The authoritative servers for reverse zones are usually run by third parties, so their limits can only be observed, not changed.
    • Check the configuration of your recursor. Make sure the listen addresses, allowed client ranges, forwarders (if any), and DNSSEC settings are what you expect. These are set in the recursor's configuration file, and rec_control can report the values currently in effect.
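    Because ping only proves ICMP reachability, it can help to send a real DNS query over UDP and time it. The following is a minimal sketch using only the Python standard library; the function names `build_ptr_query` and `probe`, the 2-second timeout, and the placeholder address 10.0.0.1 are assumptions for illustration.

```python
import ipaddress
import socket
import struct
import time


def build_ptr_query(ip: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS PTR query packet for the given IP address."""
    name = ipaddress.ip_address(ip).reverse_pointer  # e.g. 4.4.8.8.in-addr.arpa
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    # QTYPE 12 = PTR, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 12, 1)


def probe(server: str, ip: str, timeout: float = 2.0, port: int = 53):
    """Send one PTR query over UDP; return elapsed seconds, or None on timeout."""
    packet = build_ptr_query(ip)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (server, port))
            sock.recvfrom(4096)  # the reply is not parsed, only timed
        except socket.timeout:
            return None
        return time.monotonic() - start


# Example (replace 10.0.0.1 with the recursor's pod or service IP):
#   print(probe("10.0.0.1", "8.8.8.8"))
```

    Running `probe` in a loop from the Python pod while the timeouts are occurring would show whether replies are slow or missing altogether.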

    Examples

    Here are some examples of how to use the tools mentioned above to troubleshoot the timeout error:

    • To ping the recursor pod from the Python pod, you can run the following Python snippet inside the Python pod:
    import subprocess
    recursor_pod_ip = "10.0.0.1" # replace with the actual IP address of the recursor pod
    ping_result = subprocess.run(["ping", "-c", "4", recursor_pod_ip], capture_output=True)
    print(ping_result.stdout.decode())
    

    This will send four ICMP packets to the recursor pod and print the output. You should see something like this:

    PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
    64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.123 ms
    64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.098 ms
    64 bytes from 10.0.0.1: icmp_seq=3 ttl=64 time=0.102 ms
    64 bytes from 10.0.0.1: icmp_seq=4 ttl=64 time=0.101 ms
    
    --- 10.0.0.1 ping statistics ---
    4 packets transmitted, 4 received, 0% packet loss, time 3060ms
    rtt min/avg/max/mdev = 0.098/0.106/0.123/0.010 ms
    

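    This indicates that basic network connectivity and latency between the Python pod and the recursor pod are good. It does not prove that DNS traffic on UDP or TCP port 53 gets through, since ping only uses ICMP.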

    • To trace the route from the recursor pod to an external DNS server, you can use the following command:
    kubectl exec -it recursor-pod -- traceroute 8.8.8.8
    

    This will trace the route taken by packets from the recursor pod to 8.8.8.8. That address is Google Public DNS, a public recursive resolver rather than an authoritative server, and is used here only as a well-known external target. You should see something like this:

    traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
     1  10.0.0.1 (10.0.0.1)  0.123 ms  0.098 ms  0.102 ms
     2  10.0.1.1 (10.0.1.1)  0.456 ms  0.432 ms  0.419 ms
     3  10.0.2.1 (10.0.2.1)  0.789 ms  0.765 ms  0.752 ms
     4  192.168.0.1 (192.168.0.1)  1.123 ms  1.098 ms  1.085 ms
     5  192.168.1.1 (192.168.1.1)  1.456 ms  1.432 ms  1.419 ms
     6  192.168.2.1 (192.168.2.1)  1.789 ms  1.765 ms  1.752 ms
     7  192.168.3.1 (192.168.3.1)  2.123 ms  2.098 ms  2.085 ms
     8  192.168.4.1 (192.168.4.1)  2.456 ms  2.432 ms  2.419 ms
     9  192.168.5.1 (192.168.5.1)  2.789 ms  2.765 ms  2.752 ms
    10  8.8.8.8 (8.8.8.8)  3.123 ms  3.098 ms  3.085 ms
    

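    This indicates that the recursor pod can reach the internet and shows where latency accumulates along the path. It does not rule out a firewall that blocks port 53 specifically, because traceroute does not send DNS traffic.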

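    • To query the recursor with dig from inside the recursor pod, you can use the following command (replace 10.0.0.1 with the address the recursor listens on; without the @ argument, dig uses the resolver from the pod's /etc/resolv.conf, which is normally the cluster DNS rather than the recursor):
    kubectl exec -it recursor-pod -- dig @10.0.0.1 example.com
    

    This will send a DNS query for the domain name example.com to the recursor and print the response. You should see something like this:

    ; <<>> DiG 9.11.5-P4-5.1ubuntu2.1-Ubuntu <<>> @10.0.0.1 example.com
    ; (1 server found)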
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12345
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;example.com.           IN  A
    
    ;; ANSWER SECTION:
    example.com.        3600    IN  A   93.184.216.34
    
    ;; Query time: 12 msec
    ;; SERVER: 10.0.0.1#53(10.0.0.1)
    ;; WHEN: Tue Jun 15 12:34:56 UTC 2021
    ;; MSG SIZE  rcvd: 56
    

    This indicates that the recursor resolved example.com successfully and answered within 12 ms. Since the failing lookups are PTR queries, it is worth repeating the test with dig -x and one of the IP addresses that timed out.

    • To check load-related counters and settings on the recursor pod, use rec_control, the recursor's control tool (pdns_control belongs to the PowerDNS authoritative server):
    kubectl exec -it recursor-pod -- rec_control get-all
    

    This prints the recursor's runtime statistics rather than its configuration. Counters such as outgoing-timeouts (queries to authoritative servers that timed out), over-capacity-drops (client queries dropped because too many were already in flight), and concurrent-queries are the most relevant to client-side timeouts. To read configuration values, ask for them by name:

    kubectl exec -it recursor-pod -- rec_control get-parameter max-cache-entries max-packetcache-entries max-tcp-clients max-mthreads
    

    These settings control the maximum number of record cache entries, packet cache entries, simultaneous TCP clients, and queries in flight per thread. Defaults vary between recursor versions, so compare the reported values with the documentation for the version you run, and adjust them according to your needs and resources. Most settings cannot be changed at runtime: set the new value in recursor.conf (for example max-mthreads=4096, an illustrative value) and restart the recursor pods.
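    Raising server-side limits is one option; another is to bound how many lookups the client has in flight at once, so that bursts do not exceed what the recursor can absorb. The following is a minimal sketch with asyncio. The names `lookup_all` and `demo_peak_concurrency`, the fake lookup, and the limit of 100 are assumptions for illustration; in the original script the `lookup` argument would be the coroutine that performs the PTR query.

```python
import asyncio


async def lookup_all(ips, lookup, max_in_flight=100):
    """Run lookup(ip) for every ip with at most max_in_flight running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(ip):
        # Wait for a free slot before starting the lookup.
        async with semaphore:
            return await lookup(ip)

    # gather() preserves the order of the input list in its results.
    return await asyncio.gather(*(guarded(ip) for ip in ips))


async def demo_peak_concurrency(count, limit):
    """Run fake lookups and return the highest number seen in flight."""
    in_flight = 0
    peak = 0

    async def fake_lookup(ip):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for the network round trip
        in_flight -= 1
        return {"rdns": None}

    ips = [f"192.0.2.{i}" for i in range(count)]
    await lookup_all(ips, fake_lookup, max_in_flight=limit)
    return peak
```

    A suitable limit depends on the recursor's capacity and replica count, so it is best chosen by watching the recursor's drop and timeout counters while adjusting it.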

    • To identify the software running on an upstream DNS server, for example the public resolver at 8.8.8.8 (a recursive resolver, not an authoritative server), you can use the following command:
    dig +short CHAOS TXT version.bind @8.8.8.8
    

    This asks the server for its version string. Whether and what a server answers is up to its operator, and many public resolvers return nothing or refuse the query, so an empty result is not a sign of a problem. The same query against your own recursor is usually more informative. You can also ask a server for its instance identifier:

    dig +short CHAOS TXT id.server @8.8.8.8
    

    Neither query says anything about DNSSEC or EDNS0 support. To check whether a resolver performs DNSSEC validation, query a signed domain with dig +dnssec and look for the ad flag in the response header.