I just solved a latency issue in our infrastructure. It was caused by the following snippet, which triggered a call to getaddrinfo on every run:
sock = UDPSocket.open
sock.send("#{key}|#{value}", 0, GRAPHITE_SERVER, STATSD_PORT)
sock.close
Because we use statsd and graphite for high-volume event and stats monitoring, we were effectively triggering numerous calls to getaddrinfo on every API call, potentially tens of thousands every minute.
I modified this code to use the internal IP address, not the DNS name, of our graphite server, and was able to resolve the latency issue (presumably because the internal AWS VPC DNS server was not equipped to handle such a high volume of requests).
Now that my issue is resolved, I would love to know why the UDP implementation in Ruby does not use a cached IP address (presumably based on the TTL of the domain name entry). Here is the function in full; the relevant line is the call to rsock_addrinfo near the end:
static VALUE
udp_send(int argc, VALUE *argv, VALUE sock)
{
    VALUE flags, host, port;
    struct udp_send_arg arg;
    VALUE ret;

    if (argc == 2 || argc == 3) {
        return rsock_bsock_send(argc, argv, sock);
    }
    rb_scan_args(argc, argv, "4", &arg.sarg.mesg, &flags, &host, &port);

    StringValue(arg.sarg.mesg);
    GetOpenFile(sock, arg.fptr);
    arg.sarg.fd = arg.fptr->fd;
    arg.sarg.flags = NUM2INT(flags);
    arg.res = rsock_addrinfo(host, port, rsock_fd_family(arg.fptr->fd), SOCK_DGRAM, 0);
    ret = rb_ensure(udp_send_internal, (VALUE)&arg,
                    rsock_freeaddrinfo, (VALUE)arg.res);
    if (!ret) rsock_sys_fail_host_port("sendto(2)", host, port);
    return ret;
}
I assume this decision is intentional and would love to learn more about the reasons why.
getaddrinfo does not return any TTL data. In fact it may not have any at all, since the resolution is not necessarily done over DNS (it could come from the hosts file, LDAP, etc.; see /etc/nsswitch.conf).
From its manual here is the structure returned:
int getaddrinfo(const char *hostname, const char *servname,
                const struct addrinfo *hints, struct addrinfo **res);

struct addrinfo {
    int ai_flags;             /* input flags */
    int ai_family;            /* protocol family for socket */
    int ai_socktype;          /* socket type */
    int ai_protocol;          /* protocol for socket */
    socklen_t ai_addrlen;     /* length of socket-address */
    struct sockaddr *ai_addr; /* socket-address for socket */
    char *ai_canonname;       /* canonical name for service location */
    struct addrinfo *ai_next; /* pointer to next in list */
};
After a successful call to getaddrinfo(), *res is a pointer to a linked list of one or more addrinfo structures.
So it is up to whatever sits "behind" getaddrinfo to cache or not, because getaddrinfo may or may not have used DNS to retrieve the data.
Some DNS-specific APIs, such as getdnsapi, do give the caller information about the TTL; see https://getdnsapi.net/documentation/spec/ and its example 6.2:
6·2 Get IPv4 and IPv6 Addresses for a Domain Name
This example is similar to the previous one, except that it retrieves more information than just the addresses, so it traverses the replies_tree. In this case, it gets both the addresses and their TTLs.
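For a Ruby flavour of the same idea, here is a minimal sketch using the standard library's Resolv::DNS, which talks to DNS directly and exposes a ttl on each record. The helper name ipv4_addresses_with_ttl and its dns: parameter are made up for illustration. Note that this bypasses /etc/nsswitch.conf (so hosts-file or LDAP entries are not seen), and that behind a caching resolver the TTL you get is the remaining one.

```ruby
require "resolv"

# Hypothetical helper: returns each IPv4 address of `name` with its TTL.
# `dns` only needs to respond to getresources, as Resolv::DNS does.
def ipv4_addresses_with_ttl(name, dns: Resolv::DNS.new)
  dns.getresources(name, Resolv::DNS::Resource::IN::A).map do |record|
    # record.address is a Resolv::IPv4, record.ttl is in seconds
    { address: record.address.to_s, ttl: record.ttl }
  end
end
```

Calling ipv4_addresses_with_ttl(GRAPHITE_SERVER) would then give you a list of address/TTL pairs you could use to decide how long to keep a cached address.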
Without a cache layer somewhere, every new send that is given a hostname must trigger a resolution in some way or form, since UDP is stateless.
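If you want that cache layer in the application itself, a rough sketch could look like the following. CachingUDPSender, deliver and the resolver: parameter are hypothetical names, and the ttl is an arbitrary lifetime you pick yourself, precisely because getaddrinfo does not report one. It relies on the three-argument form of send, which in the udp_send source quoted in the question goes straight to rsock_bsock_send without calling rsock_addrinfo.

```ruby
require "socket"

# Hypothetical helper, not part of Ruby's socket library.
class CachingUDPSender
  # ttl is an application-chosen lifetime in seconds (getaddrinfo gives none).
  def initialize(host, port, ttl: 60, resolver: nil)
    @host = host
    @port = port
    @ttl = ttl
    # Default resolver: one getaddrinfo call, restricted to IPv4 datagram
    # addresses so the result matches the AF_INET socket created below.
    @resolver = resolver || lambda do |h, p|
      Addrinfo.getaddrinfo(h, p, Socket::AF_INET, Socket::SOCK_DGRAM).first
    end
    @sock = UDPSocket.new(Socket::AF_INET)
    @sockaddr = nil
    @expires_at = 0.0
  end

  def deliver(message)
    # Three-argument send takes a packed sockaddr, so no lookup happens here.
    @sock.send(message, 0, destination)
  end

  def close
    @sock.close
  end

  private

  # Re-resolve only when nothing is cached yet or the cached entry expired.
  def destination
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if @sockaddr.nil? || now >= @expires_at
      @sockaddr = @resolver.call(@host, @port).to_sockaddr
      @expires_at = now + @ttl
    end
    @sockaddr
  end
end
```

Usage would be something like sender = CachingUDPSender.new(GRAPHITE_SERVER, STATSD_PORT) once, then sender.deliver("#{key}|#{value}") per event. If the address can stay fixed for the life of the socket, calling sock.connect(host, port) once and then the two-argument sock.send(msg, 0) should also avoid the per-send lookup.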
You said:
"modified this code to use the internal IP address, not the DNS name"
You should instead install a local (on the box) recursive caching nameserver, such as unbound. All your local applications will benefit from it through faster DNS resolution (depending also on how /etc/nsswitch.conf, /etc/resolv.conf and /etc/hosts are set up).
For the associated bug report hinted at by @Casper, the core issue seems to be more about IPv6 vs IPv4. That could be solved either by adjusting /etc/gai.conf (or its equivalent) or by some more clever programming around opening the connection, using the so-called "Happy Eyeballs" algorithm. There you try to resolve both A and AAAA at the same time, which means two parallel DNS queries (the protocol does not let you combine them into one), and you use whichever comes back fastest. If you want to be in the modern camp you give a slight preference to AAAA: you fire the A query only a given number of milliseconds after the AAAA one, to catch the case where you get no reply at all for AAAA, or a negative one. See RFC 6555 for details.
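To make the "AAAA first, A shortly after" idea concrete, here is a simplified sketch of the resolution part only. It is not the full algorithm from the RFC; resolve_preferring_ipv6, head_start and lookup: are names I made up, and the 50 ms default is an arbitrary choice.

```ruby
require "socket"

# Hypothetical helper: ask for AAAA first and fire the A lookup only if AAAA
# has not produced a usable answer within `head_start` seconds.
def resolve_preferring_ipv6(host, head_start: 0.05, lookup: nil)
  lookup ||= ->(h, family) { Addrinfo.getaddrinfo(h, nil, family, :DGRAM) }
  answers = Queue.new
  spawn = lambda do |family|
    Thread.new do
      addrs = begin
        lookup.call(host, family)
      rescue SocketError
        [] # treat a failed lookup like a negative answer
      end
      answers << addrs
    end
  end

  v6 = spawn.call(Socket::AF_INET6)
  outstanding = 1
  if v6.join(head_start) # AAAA answered within its head start
    addrs = answers.pop
    return addrs unless addrs.empty?
    outstanding = 0
  end

  spawn.call(Socket::AF_INET) # no usable AAAA yet: also ask for A
  outstanding += 1
  outstanding.times do
    addrs = answers.pop # whichever query finishes first
    return addrs unless addrs.empty?
  end
  []
end
```

The lookup: parameter is there only so the resolver can be swapped out (for a stub, or for a DNS-only client); by default both queries go through getaddrinfo, one per address family.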