Tags: c, multithreading, sockets, posix-select

Why doesn't select() respect its timeout, especially with multiple threads?


We have a client application that listens to a UDP multicast feed and processes the incoming data. It is portable and runs on both Windows and Linux. The main processing loop uses select() to wait for data, typically on one or two non-blocking UDP sockets:

while (!stopRequested)
{
   fd_set io;
   FD_ZERO(&io);
   FD_SET(sock, &io);

   struct timeval waitInterval = { 0 };
   waitInterval.tv_usec = 10000; // 10 milliseconds

   int r = select(sock + 1, &io, NULL, NULL, &waitInterval);
   if (r == 0)
   {
      // Process timeout
   }
   else
   {
      // Data or error processing
   }
}

The code works well, but there is a problem with timeout accuracy when no data is available. We measured the time actually spent inside select() during several seconds of guaranteed idle time (no data was sent), and the distribution looks like this:

<1 usec     : 170 time(s)
<2000 usec  : 1 time(s)
<10000 usec : 11973 time(s)
<12000 usec : 6558 time(s)
<15000 usec : 64 time(s)
<20000 usec : 47 time(s)

There were no errors; select() always returned 0. As the numbers show, in 170 cases select() returned almost immediately, without waiting for the timeout to expire.

So the question is: why is the timeout not respected in those cases? Similar results are obtained on both Windows (Win7 x64) and Linux (CentOS/RHEL 6.0 x64).

Moreover, things become much worse when multithreading is used. When two threads execute the code above (both calling select() on the same socket, but with fd_set and waitInterval as local objects), the distribution of times inside select() looks like this (for each thread):

<1 usec     : 13800827 time(s)
<10 usec    : 1639 time(s)
<100 usec   : 8660 time(s)
<1000 usec  : 16 time(s)
<12000 usec : 768 time(s)
<15000 usec : 39 time(s)

It looks as if select() almost never respects the timeout under concurrent calls and instead returns 0 immediately.

Is there any explanation for this confusing behavior? We have checked the common pitfall of not re-initializing the fd_set and timeout parameters before each call, and that is definitely not the cause here.
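One way to rule out measurement artifacts is to take the timestamps immediately around the select() call itself. The following is a minimal POSIX-only sketch of that idea, not the code used for the numbers above: now_usec(), timed_select() and measure_idle_wait() are hypothetical helper names, the socket is an unbound UDP socket that never receives data, and a Windows build would need a different clock and the Winsock headers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Microseconds on the monotonic clock, unaffected by wall-clock adjustments. */
int64_t now_usec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Hypothetical helper: waits up to timeout_usec for fd to become readable
 * and reports how long the select() call itself took. */
int timed_select(int fd, long timeout_usec, int64_t *elapsed_usec)
{
    assert(fd >= 0 && fd < FD_SETSIZE);

    /* Re-initialize both the set and the timeout for every call. */
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(fd, &readable);

    struct timeval tv;
    tv.tv_sec = timeout_usec / 1000000;
    tv.tv_usec = timeout_usec % 1000000;

    /* Timestamps wrap only the select() call, nothing else. */
    int64_t start = now_usec();
    int r = select(fd + 1, &readable, NULL, NULL, &tv);
    *elapsed_usec = now_usec() - start;
    return r;
}

/* Times one select() call on an idle, unbound UDP socket.
 * Returns the elapsed microseconds, or -1 if setup or select() failed. */
int64_t measure_idle_wait(long timeout_usec)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0)
        return -1;

    int64_t elapsed = 0;
    int r = timed_select(sock, timeout_usec, &elapsed);
    close(sock);
    return r == 0 ? elapsed : -1;
}
```

Calling measure_idle_wait(10000) repeatedly and collecting the results into a histogram gives a distribution that reflects select() alone, which can then be compared with the figures measured at a higher level in the application.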


Solution

  • Actually, there is nothing wrong with select(); it respects the timeout correctly. The measurement was wrong: I was not timing the select() call itself, but a wrapper function that checked the fd_set and skipped the select() call when the set was empty. None of the times below 10000 usec are related to the select() call.

    This question should probably be deleted as not relevant.