Search code examples
network-programmingtcpserverlatencylibevent

Extremely high latency when network gets busy, TCP, libevent


In our C/S based online game project, we use TCP for network transmission. We include Libevent, utilise a bufferevent for each connection to handling with the network I/O automatically.

It works well before,but the lagging problem comes to the surface recently. When I do some stress testing to make the network busier, the latency becomes extremely high, several seconds or more. The server sinks into a confusing state:

  • the average CPU usage decreased (0%-60%-0%-60% repeat, waiting something?)
  • the net traffic decreased (nethogs)
  • the clients connected to server still alive (netstat & tcpdump)

It looks like something magically slowed all system down, but new connection to server responded quit in time.

When I changed the protocol to UDP, it works well on the same situation: no obvious latency, the system runs fast. Net traffic is around 3M/S.

The project is running on an Intranet. I also tested the max download speed, nearly 18M/S.

I studied part of Libevent's header files and ducumentations, tried to setup a rate limit to all connections. It did some improvements, but not completely resolved the problem even though I had tried several different configurations. Here is my parameters: read_rate 163840, read_burst 163840, write_rate 163840, write_burst 163840, tick_len 500ms.

Thank you for your help!


Solution

  • TCP = Transmission Control Protocol. It responds to packet loss by retransmitting unacknowledged packets after a delay. In the case of repeated loss, it will exponentially back off. Take a look at this network capture of an attempt to open a connection to host that is not responding:

    enter image description here

    It sends the initial SYN, and then after not getting an ack for 1s it tries again. After not getting an ack it then sends another after ~2s, then ~4s, then ~8s, and so on. So you can see that you can get some serious latency in the face of repeated packet loss.

    Since you said you were deliberately stressing the network, and that the CPU usage is inconsistent, one possible explanation is that TCP is waiting to retransmit lost packets.

    The best way to see what is going on is to get a network capture of what is actually transmitted. If your hosts are connected to a single switch, you can "span" a port of interest to the port of another host where you can make the capture.

    If your switch isn't capable of this, or if you don't have the administrative control of the switch, then you will have to get the capture from one of hosts involved in your online game. The disadvantage of this is that taking the capture will possibly alter what happens, and it doesn't see what is actually on the wire. For example, you might have TCP segmentation offload enabled for your interface, in which case the capture will see large packets that will be broken up by the network interface.

    I would suggest installing wireshark to analyse the network capture (which you can do in real time by using wireshark to do the capture as well). Any time you are working with a networked system I would recommend using wireshark so that you have some visibility into what is actually happening on the network. The first filter I would suggest you use is tcp.analysis.flags which will show you packets suggestive of problems.

    I would also suggest turning off the rate limiting first to try to see what is going on (rate limiting is adding another reason to not send packets, which is probably going to make it harder to diagnose what is going on). Also, 500ms might be a longish tick_len depending on how your game operates. If your burst configuration allows the rate to be used up in 100ms, you will end up waiting 400ms before you can transmit again. The IO Graph is a very helpful feature of Wireshark in this regard. It can help you see transmission rates, although the default tick interval and unit are not very helpful in this regard. Here is an example of a bursty flow being rate limited to 200mbit/s:

    enter image description here

    Note that the tick interval is 1ms and the unit is bits/tick, which makes the top of the chart 1gb/s, the speed of the interface in question.