Tags: c++, windows, winsock, winsock2

Windows: Event-based Overlapped IO vs IO Completion Ports, Real World Performance


I've been looking into overlapped IO for sockets for a server application I'm building, and I keep seeing comments like "never use hEvent" or "IO completion ports will be faster", but no one ever says WHY not to use hEvent, and no one provides real-world numbers showing that completion ports are faster, or by how much. Signalling completion through the OVERLAPPED hEvent and waiting with WaitForMultipleObjects() fits better into my application, so if the speed difference is marginal I'm inclined to use that, but I don't want to commit without some real data telling me how big a sacrifice I'm making. I've googled repeatedly and can't find any benchmarks or articles comparing the two strategies, aside from a few StackOverflow answers saying "don't use this one" without giving a reason.

Can anyone provide me with some real information or numbers here on the practical, real world difference between using hEvent and completion ports?


Solution

  • This answer originates from a comment Harry Johnston left on the question. With a bit of searching I found some more details that make WaitForMultipleObjects look terrifying.

    The maximum number of objects you can wait on in a single call is 64 (MAXIMUM_WAIT_OBJECTS). That alone means the WaitForMultipleObjects (WFMO) approach scales very poorly. Looking further, I found this thread, which explains what a wait costs inside the kernel: https://groups.google.com/forum/#!topic/comp.os.ms-windows.programmer.win32/okwnsYetF6g
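
To make the scaling problem concrete, here is a minimal sketch of what the 64-handle cap forces on an event-per-socket design: the handles have to be split into batches, and each batch needs its own waiting thread. The constant mirrors the Win32 MAXIMUM_WAIT_OBJECTS value. `partition_for_wait` and `Handle` are hypothetical names made up for illustration, and the code is plain C++ with no Windows calls, so it only shows the bookkeeping, not the waits themselves.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors the Win32 MAXIMUM_WAIT_OBJECTS constant (64), the per-call
// limit of WaitForMultipleObjects.
constexpr std::size_t kMaxWaitObjects = 64;

// Hypothetical stand-in for a Win32 HANDLE so the sketch compiles anywhere.
using Handle = void*;

// Splits `handles` into consecutive batches of at most kMaxWaitObjects.
// In an event-based design, each batch would need its own thread blocked
// in WaitForMultipleObjects.
std::vector<std::vector<Handle>> partition_for_wait(
        const std::vector<Handle>& handles) {
    std::vector<std::vector<Handle>> batches;
    for (std::size_t i = 0; i < handles.size(); i += kMaxWaitObjects) {
        const std::size_t end = std::min(i + kMaxWaitObjects, handles.size());
        batches.emplace_back(
            handles.begin() + static_cast<std::ptrdiff_t>(i),
            handles.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return batches;
}
```

With 1,000 connections this yields 16 batches (15 full ones plus one of 40 handles), so 16 threads would exist only to block on events.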

    In NT terms, to enter the wait, a wait block has to be allocated for every object, and each wait block is queued to the object you're waiting for and then cross-linked to the thread. When any of those objects is signalled, all those wait blocks have to be dequeued, unlinked, and deallocated back to pool. All of that happens at DISPATCH_LEVEL, and everything except the pool allocation and free happens with the dispatcher spinlock held.

    (WFMO with fAll == TRUE is even MORE expensive. Every time ANY of the objects is signalled, all the others have to be checked. This all happens, you guessed it, at DISPATCH_LEVEL with the dispatcher spinlock held.)

    According to that description, while the dispatcher spinlock is held at DISPATCH_LEVEL, preemption and timeslicing of threads are blocked across the whole system, even with multiple cores. That's terrifying, and a good reason never to use WFMO if you're waiting for more than 3 objects (the thread has 3 wait blocks pre-allocated, so a wait on 3 or fewer objects avoids a lot of that work).
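
The per-wait work described above can be restated as a toy cost model. This is only arithmetic derived from the description in the thread, not a measurement and not kernel code. The names `WaitCost`, `wait_any_cost`, and `wait_all_rechecks` are hypothetical, and the wait-all figure assumes the objects are signalled one at a time.

```cpp
#include <cassert>
#include <cstddef>

// Number of wait blocks pre-allocated in each thread, per the text above.
constexpr std::size_t kBuiltInWaitBlocks = 3;

// Work attributed to a single wait on n objects in this toy model.
struct WaitCost {
    std::size_t pool_wait_blocks;  // blocks taken from, then freed to, pool
    std::size_t link_ops;          // blocks queued to objects on entry
    std::size_t unlink_ops;        // blocks dequeued when the wait ends
};

// Wait-any (fAll == FALSE): one wait block per object is linked on entry
// and unlinked again as soon as any one object is signalled. Pool blocks
// are only needed beyond the thread's built-in wait blocks.
constexpr WaitCost wait_any_cost(std::size_t n) {
    return WaitCost{n > kBuiltInWaitBlocks ? n : 0, n, n};
}

// Wait-all (fAll == TRUE): each signal makes the kernel check the other
// n - 1 objects. If the n objects are signalled one at a time, that adds
// up to n * (n - 1) checks before the wait is finally satisfied.
constexpr std::size_t wait_all_rechecks(std::size_t n) {
    return n == 0 ? 0 : n * (n - 1);
}
```

In this model a wait on 64 events costs 64 pool wait blocks, 64 link operations, and 64 unlink operations for every single wake-up, while a wait on 3 events needs no pool blocks at all. A wait-all on 64 events signalled one by one comes to 64 * 63 = 4032 checks.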