c++performance network-programming udp iocp

Using IOCP with UDP?

I'm pretty familiar with what Input/Output Completion Ports are for when it comes to TCP.

But what, if I am for example coding a FPS game, or anything where need for low latency can be a deal breaker - I want immediate response to the player to provide the best playing experience, even at cost of losing some spatial data on the go. It becomes obvious that I should use UDP and aside from sending coordinate updates frequently, I should also implement kind of semi-reliable protocol (afaik TCP induces packet loss in UDP so we should avoid mixing these two) to handle such events like chat messages, or gunshots where packet loss may be crucial.

Let's say I'm aiming at performance which would apply to MMOFPS game that allows to meet hundreds of players in one, persistent world, and aside from fighting with guns, it allows them to communicate through chat messages etc. - something like this actually exists and works well - check out PlanetSide 2.

Many articles there on the net (e.g. these from msdn) say overlapped sockets are the best and IOCP is god-tier concept, but they don't seem to distinguish the cases where we use other protocols than TCP.

So there is almost no reliable information about I/O techniques used when developing such a server, I've looked at this, but the topic seems to be highly controversial, and I've also seen this , but considering discussions in the first link, I don't know if I should follow assumptions of the second one, whether I should use IOCP with UDP at all, and if not, what is the most scalable and efficient I/O concept when it comes to UDP.

Or maybe am I just making another premature optimization and no thinking ahead is required for the moment ?

Thought about posting it on gamedev.stackexchange.com, but this question better applies to general-purpose networking I think.

Solution

I do not recommend using this, but technically the most efficient way to receive UDP datagrams would be to just block in recvfrom (or WSARecvFrom if you will). Of course, you'll need a dedicated thread for that, or not much will happen otherwise while you block.

Other than with TCP, you do not have a connection built into the protocol, and you do not have a stream without defined borders. That means you get the sender's address with every datagram that comes in, and you get a whole message or nothing. Always. No exceptions.
Now, blocking on recvfrom means one context switch to the kernel, and one context switch back when something was received. It won't go any faster by having several overlapped reads in flight either, because only one datagram can arrive on the wire at the same time, which is by far the most limiting factor (CPU time is not the bottleneck!). Using an IOCP means at least 4 context switches, two for the receive and two for the notification. Alternatively, an overlapped receive with completion callback is not much better either, because you must NtTestAlert or SleepEx to run the APC queue, so again you have at least 2 extra context switches (though, it's only +2 for all notifications together, and you might incidentially already sleep anyway).

However:
Using an IOCP and overlapped reads is nevertheless the best way to do it, even if it is not the most efficient one. Completion ports are irrespective from using TCP, they work just fine with UDP, too. As long as you use an overlapped read, it does not matter what protocol you use (or even whether it's network or disk, or some other waitable or alertable kernel object).
It also does not really matter for either latency or CPU load whether you burn a few hundred cycles extra for the completion port. We're talking about "nano" versus "milli" here, a factor of one to one million. On the other hand, completion ports are overall a very comfortable, sound, and efficient system.

You can for example trivially implement logic for resending when you did not receive an ACK in time (which you must do when a form of reliability is desired, UDP does not do it for you), as well as keepalive.
For keepalive, add a waitable timer (maybe firing after 15 or 20 seconds) that you reset every time you receive anything. If your completion port ever tells you that this timer went off, you know the connection is dead.
For resends, you could e.g. set a timeout on GetQueuedCompletionStatus, and every time you wake up find all packets that are more than so-and-so old and have not been ACKed yet.
The entire logic happens in one place, which is very nice. It's versatile, efficient, and hard to do wrong.

You can even have several threads (and, indeed, more threads than your CPU has cores) block on the completion port. Many threads sounds like an unwise design, but it is in fact the best thing to do.

A completion port wakes up to N threads in last-in-first-out order, N being the number of cores unless you tell it to do something different. If any of these threads block, another one is woken to handle outstanding events. This means that in the worst case, an extra thread may be running for a short time, but this is tolerable. In the average case, it keeps processor usage close to 100% as long as there is some work to do and zero otherwise, which is very nice. LIFO waking is favourable for processor caches and keeps switching thread contexts low.

This means you can block and wait for an incoming datagram and handle it (decrypt, decompress, perform logic, read someting from disk, whatever) and another thread will be immediately ready to handle the next datagram that might come in the next microsecond. You can use overlapped disk IO with the same completion port, too. If you have compute work (such as AI) to do that can be split into tasks, you can manually post (PostQueuedCompletionStatus) those on the completion port as well and you have a parallel task scheduler for free. All you have to do is wrap an OVERLAPPED into a structure that has some extra data after it, and use a key that you will recognize. No worrying about thread synchronization, it just magically works (you don't even strictly need to have an OVERLAPPED in your custom structure when posting your own notifications, it will work with any structure you pass, but I don't like lying to the operating system, you never know...).

It does not even matter much whether you block, for example when reading from disk. Sometimes this just happens and you can't help it. So what, one thread blocks, but your system still receives messages and reacts to it! The completion port automatically pulls another thread from its pool when it's necessary.

About TCP inducing packet loss on UDP, this is something that I am inclined to call an urban myth (although it is somewhat correct). The way this common mantra is worded is however misleading. It may have been true once upon a time (there exists research on that matter, which is, however, close to a decade old) that routers would drop UDP in favour of TCP, thereby inducing packet loss. That is, however, certainly not the case nowadays.
A more truthful point of view is that anything you send induces packet loss. TCP induces packet loss on TCP and UDP induces packet loss on TCP and vice versa, this is a normal condition (it's how TCP implements congestion control, by the way). A router will generally forward one incoming packet if the cable on the other plug is "silent", it will queue a few packets with a hard deadline (buffers are often deliberately small), optionally it may apply some form of QoS, and it will simply and silently drop everything else.
A lot of applications with rather harsh realtime requirements (VoIP, video streaming, you name it) nowadays use UDP, and while they cope well with a lost packet or two, they do not at all like significant, recurring packet loss. Still, they demonstrably work fine on networks that have a lot of TCP traffic. My phone (like the phones of millions of people) works exclusively over VoIP, data going over the same router as internet traffic. There is no way I can provoke a dropout with TCP, no matter how hard I try.
From that everyday observation, one can tell for certain that UDP is definitively not dropped in favour of TCP. If anything, QoS might favour UDP over TCP, but it most certainly doesn't penaltize it.
Otherwise, services like VoIP would stutter as soon as you open a website and be unavailable alltogether if you download something the size of a DVD ISO file.

EDIT:
To give somewhat of an idea of how simple life with IOCP can be (somewhat stripped down, utility functions missing):

for(;;)
{
    if(GetQueuedCompletionStatus(iocp, &n, &k, (OVERLAPPED**)&o, 100) == 0)
    {
        if(o == 0) // ---> timeout, mark and sweep
        {
            CheckAndResendMarkedDgrams();  // resend those from last pass
            MarkUnackedDgrams();           // mark new ones
        } 
        else
        {   // zero return value but lpOverlapped is not null:
            // this means an error occurred
            HandleError(k, o);
        }
        continue;
    }

    if(n == 0 && k == 0 && o == 0)
    {
        // zero size and zero handle is my termination message
        // re-post, then break, so all threads on the IOCP will
        // one by one wake up and exit in a controlled manner
        PostQueuedCompletionStatus(iocp, 0, 0, 0);
        break;
    }
    else if(n == -1) // my magic value for "execute user task"
    {
        TaskStruct *t = (TaskStruct*)o;
        t->funcptr(t->arg);
    }
    else
    {
        /* received data or finished file I/O, do whatever you do */
    }
}

Note how the entire logic for both handling completion messages, user tasks, and thread control happens in one simple loop, no obscure stuff, no complicated paths, every thread only executes this same, identical loop.
The same code works for 1 thread serving 1 socket, or for 16 threads out of a pool of 50 serving 5,000 sockets, 10 overlapped file transfers, and executing parallel computations.