We make heavy use of multicast messaging across many Linux servers on a LAN, and we are seeing a lot of delays. We send an enormous number of small packets and are more concerned with latency than throughput. The machines are all modern multi-core machines (at least four cores, generally eight, 16 if you count hyperthreading), always with a load average of 2.0 or less and usually below 1.0. The networking hardware is also under 50% capacity.
The delays we see look like queuing delays: packet latency quickly starts increasing until the packets seem to jam up, and then it returns to normal.
The messaging structure is basically this: in the "sending thread", pull messages from a queue, add a timestamp (using gettimeofday()), then call send(). The receiving program receives the message, timestamps the receive time, and pushes it onto a queue. A separate thread processes that queue, analyzing the difference between the sending and receiving timestamps. (Note that our internal queues are not part of the problem, since the timestamps are taken outside our internal queuing.)
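A minimal C sketch of that measurement, assuming a UDP socket that has already been connect()ed to the multicast group address (so plain send() works) and a hypothetical message layout whose first 16 bytes carry the sender's gettimeofday() value. The names stamped_header, stamp_message(), send_stamped() and latency_usec() are illustrative, not part of any library. Keep in mind that subtracting timestamps taken on two different hosts measures the latency plus the offset between the two clocks, so the numbers are only as good as the hosts' clock synchronization.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical wire layout: sender's timestamp, then the payload.
 * Fields are in host byte order, which assumes all hosts share endianness. */
struct stamped_header {
    int64_t sec;   /* seconds from gettimeofday() */
    int64_t usec;  /* microseconds from gettimeofday() */
};

/* Write the current time into the front of buf (buf must hold the header). */
void stamp_message(unsigned char *buf, size_t len)
{
    struct timeval now;
    struct stamped_header h;

    assert(len >= sizeof h);
    gettimeofday(&now, NULL);
    h.sec = now.tv_sec;
    h.usec = now.tv_usec;
    memcpy(buf, &h, sizeof h);   /* memcpy avoids unaligned stores */
}

/* Sender side: stamp as late as possible, immediately before send(). */
ssize_t send_stamped(int fd, unsigned char *buf, size_t len)
{
    stamp_message(buf, len);
    return send(fd, buf, len, 0);
}

/* Receiver side: microseconds between the embedded send time and 'received'. */
int64_t latency_usec(const unsigned char *buf, size_t len,
                     const struct timeval *received)
{
    struct stamped_header h;

    assert(len >= sizeof h);
    memcpy(&h, buf, sizeof h);
    return ((int64_t)received->tv_sec - h.sec) * 1000000
         + ((int64_t)received->tv_usec - h.usec);
}
```

On the receiving side, call gettimeofday() immediately after recv() returns and pass that value to latency_usec() before the message goes onto the internal queue.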
We don't really know where to start looking for an answer to this problem. We're not familiar with Linux internals. Our suspicion is that the kernel is queuing or buffering the packets, either on the send side or the receive side (or both). But we don't know how to track this down and trace it.
For what it's worth, we're using CentOS 4.x (RHEL kernel 2.6.9).
Packets can queue up in the kernel on both the send and receive sides, in the NIC, and in the networking infrastructure. You will find a plethora of things you can test and tweak.
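On the receive side, one way to tell whether packets are waiting in the kernel is the SO_TIMESTAMP socket option: the kernel then attaches its own arrival time to each datagram, which you can compare with the time recvmsg() returns in your program. Both timestamps come from the same host, so clock synchronization between machines does not matter here. A sketch, assuming an IPv4 UDP socket; the helper names are hypothetical and joining the multicast group (IP_ADD_MEMBERSHIP) is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Ask the kernel to attach its receive timestamp (SCM_TIMESTAMP) to each datagram. */
static int enable_kernel_timestamps(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof on);
}

/* Hypothetical helper: UDP socket bound to 'port' on all interfaces, timestamps on.
 * Joining a multicast group (IP_ADD_MEMBERSHIP) is omitted for brevity. */
int open_timestamped_udp(uint16_t port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        enable_kernel_timestamps(fd) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Receive one datagram. *kernel_tv gets the kernel's arrival time (zeroed if
 * none was delivered), *user_tv the time recvmsg() returned. */
ssize_t recv_with_timestamps(int fd, void *buf, size_t len,
                             struct timeval *kernel_tv, struct timeval *user_tv)
{
    union {
        char data[CMSG_SPACE(sizeof(struct timeval))];
        struct cmsghdr align;   /* forces correct alignment of the buffer */
    } ctrl;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    struct cmsghdr *c;
    ssize_t n;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.data;
    msg.msg_controllen = sizeof ctrl.data;

    n = recvmsg(fd, &msg, 0);
    gettimeofday(user_tv, NULL);
    if (n < 0)
        return -1;

    memset(kernel_tv, 0, sizeof *kernel_tv);
    for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMP)
            memcpy(kernel_tv, CMSG_DATA(c), sizeof *kernel_tv);
    return n;
}

/* Microseconds the datagram spent between kernel arrival and recvmsg() returning. */
long queued_usec(const struct timeval *kernel_tv, const struct timeval *user_tv)
{
    return (long)(user_tv->tv_sec - kernel_tv->tv_sec) * 1000000L
         + (long)(user_tv->tv_usec - kernel_tv->tv_usec);
}
```

If queued_usec() grows during the latency spikes, the delay is between the kernel's receive path and your recv() call (for example, the receiving thread not being scheduled in time); if it stays small, look at the sender, the NICs and the network instead.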
For the NIC you can usually find interrupt coalescing parameters: how long the NIC will wait, batching packets, before notifying the kernel or sending to the wire.
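On Linux these settings are exposed through ethtool: ethtool -c eth0 shows them and ethtool -C changes them, if the driver supports it. The same data can be read programmatically with the SIOCETHTOOL ioctl and ETHTOOL_GCOALESCE; a sketch follows, where the interface name and the helper names are placeholders. Which fields a driver actually honors varies, so treat the output as a starting point.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Read the NIC's interrupt-coalescing settings (what "ethtool -c IFACE" shows).
 * Returns 0 on success, -1 on error (bad name, no such device, unsupported driver). */
int get_coalesce(const char *ifname, struct ethtool_coalesce *ec)
{
    struct ifreq ifr;
    int fd, rc;

    if (strlen(ifname) >= IFNAMSIZ)
        return -1;
    memset(&ifr, 0, sizeof ifr);
    memset(ec, 0, sizeof *ec);
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ec->cmd = ETHTOOL_GCOALESCE;     /* which ethtool sub-command to run */
    ifr.ifr_data = (char *)ec;       /* kernel fills *ec in place */

    fd = socket(AF_INET, SOCK_DGRAM, 0);  /* any socket works as an ioctl handle */
    if (fd < 0)
        return -1;
    rc = ioctl(fd, SIOCETHTOOL, &ifr);
    close(fd);
    return rc < 0 ? -1 : 0;
}

/* Print the fields most relevant to latency. */
void print_coalesce(const char *ifname, const struct ethtool_coalesce *ec)
{
    printf("%s: rx-usecs=%u rx-frames=%u tx-usecs=%u tx-frames=%u adaptive-rx=%u\n",
           ifname, ec->rx_coalesce_usecs, ec->rx_max_coalesced_frames,
           ec->tx_coalesce_usecs, ec->tx_max_coalesced_frames,
           ec->use_adaptive_rx_coalesce);
}
```

For example, get_coalesce("eth0", &ec) followed by print_coalesce("eth0", &ec) on each host gives a quick inventory of how long the NICs may hold packets before raising an interrupt.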
In Linux you have the send and receive "buffers"; the larger they are, the more likely you are to experience higher latency, as packets get handled in batched operations.
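A sketch of inspecting and setting those sizes per socket with SO_SNDBUF and SO_RCVBUF (the helper names are hypothetical). The right size is something to measure rather than guess: for UDP, a receive buffer that is too small makes the kernel drop datagrams when the reader falls behind.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/socket.h>

/* Request send/receive buffer sizes in bytes. Linux caps each request at
 * /proc/sys/net/core/wmem_max or rmem_max, then doubles it to allow for
 * bookkeeping overhead; getsockopt() reports the doubled value. */
int set_socket_buffers(int fd, int sndbuf, int rcvbuf)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof rcvbuf) < 0)
        return -1;
    return 0;
}

/* Read back the size the kernel actually uses (optname: SO_SNDBUF or SO_RCVBUF). */
int get_socket_buffer(int fd, int optname)
{
    int val;
    socklen_t len = sizeof val;

    if (getsockopt(fd, SOL_SOCKET, optname, &val, &len) < 0)
        return -1;
    return val;
}
```

The system-wide limits can be inspected and changed with sysctl as net.core.rmem_max and net.core.wmem_max.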
For your architecture and Linux version, be aware of how expensive context switches are and whether kernel locking or pre-emptive scheduling is enabled. Consider minimizing the number of running applications and using process affinity to pin processes to particular cores.
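Pinning can be done from inside the program with sched_setaffinity(). Below is a minimal sketch (the helper names are hypothetical) that pins the calling thread to a single CPU; which CPU to use on each machine is an assumption you have to choose.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    int cpu;

    if (sched_getaffinity(0, sizeof set, &set) < 0)
        return -1;
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Pin the calling thread (pid 0) to a single CPU. 0 on success, -1 on error. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}
```

From the shell, taskset -c 2 ./receiver does the same for a whole process. In a multi-threaded program, pthread_setaffinity_np() lets you pin, for example, the receiving thread and the queue-processing thread to different cores.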
Don't forget timing: the Linux kernel version you are using has pretty terrible accuracy on the gettimeofday() clock (2-4 ms), and it is quite an expensive call. Consider alternatives such as reading the core's TSC or an external HPET device.
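A sketch of reading the TSC with the __rdtsc() compiler intrinsic (GCC/Clang, x86 only) and calibrating it against CLOCK_MONOTONIC; the helper names are hypothetical. Caveats: the counter is per machine, so sender and receiver TSC values cannot be subtracted from each other, and it is best suited to measuring intervals on one host, such as how long send() takes. On older CPUs the TSC rate can change with power management and may not be synchronized across cores, so pin the measuring thread to one core; newer kernels report a constant_tsc flag in /proc/cpuinfo when the rate is fixed.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>   /* clock_gettime: link with -lrt on glibc older than 2.17 */

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read this core's time-stamp counter (RDTSC instruction). */
static inline uint64_t read_tsc(void) { return __rdtsc(); }
#else
/* Non-x86 fallback so the sketch still builds: monotonic nanoseconds. */
static inline uint64_t read_tsc(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Estimate counter ticks per microsecond by timing a ~10 ms sleep. */
double calibrate_ticks_per_usec(void)
{
    struct timespec t0, t1, nap = { 0, 10 * 1000 * 1000 };
    uint64_t c0, c1;
    double usec;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    c0 = read_tsc();
    nanosleep(&nap, NULL);
    c1 = read_tsc();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    usec = (double)(t1.tv_sec - t0.tv_sec) * 1e6
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
    return (double)(c1 - c0) / usec;
}

/* Convert a tick difference measured on one machine to microseconds. */
double ticks_to_usec(uint64_t start, uint64_t end, double ticks_per_usec)
{
    return (double)(end - start) / ticks_per_usec;
}
```

Calibrate once at startup, then wrap the call you want to time with two read_tsc() calls and convert the difference with ticks_to_usec().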
Diagram from Intel: http://www.theinquirer.net/IMG/142/96142/latency-580x358.png?1272514422