Search code examples
clinuxapachewatchdogboa

Programmatically detect if local web server has hung


I realise that I'll get at least one answer along the lines of "(re)write the code so it doesn't hang" but let's assume we don't live in that shiny happy utopia just yet...

In our embedded system we have a big SDK including a web-server (Boa) which is the primary method of user interaction.

It's possible, during certain phases of the moon, that something can cause the web server to hang or become otherwise stuck in such a way that the process appears running normally (not crashed/dead/using 100% CPU) but does not serve any web pages.

So, the question is, how do we test/detect this situation?


Solution

  • To test whether the server is hung, create a TCP socket and connect to port 80 on IP address 127.0.0.1 (loopback address). Then send the following text over the socket

    GET / HTTP/1.1\r\n\r\n
    

    Most servers will interpret that as a request for index.html. Alternatively, you could implement an undocumented URL for testing (which allows for a shorter, predetermined response), e.g.

    GET /test/fdoaoqfaf12491r2h1rfda HTTP/1.1\r\n\r\n
    

    You then need to read the response from the server. This involves using select with a reasonable timeout to determine whether any data came back from the server, and if so, use recv to read the data. The response from the server will consist of a header followed by content. The header consists of lines of text, with a blank line at the end of the header. Lines end with \r\n, so the end of the header is \r\n\r\n.

    Getting the content involves calling select and recv until recv returns 0. This assumes that the server will send the response and then close the socket. Some sophisticated servers will leave a socket open to allow multiple requests over the same socket. A simple embedded server should not be doing that. (If your server is trying to use the same socket for multiple requests, then you need to figure out how to turn that feature off.)


    That's all very well and good, but you really need to rewrite your code so it doesn't hang.

    The mostly likely cause of the problem is that the server has a bunch of dangling sockets, i.e. connections from clients that were never properly cleaned up. Dangling sockets will eventually prevent the server from accepting more connections, either because the server has a limit on the number of open connections, or because the process that's running the server uses up all of its file descriptors.

    The first thing to check is the TCP timeout value. One project that I worked on had a default timeout of 5 hours, which meant that dangling sockets stayed open for 5 hours. A reasonable timeout is 1 minute.

    Then you need to create a client that deliberately misbehaves. Clients can misbehave by

    • leaving a socket open without reading the server's response
    • abruptly closing the socket while reading the response
    • gracefully closing the socket while reading the response

    The first situation should be handled by the TCP timeout. The other two need to be properly handled by the server code. Graceful and abrupt socket closure is controlled via the SO_LINGER option of ioctl and the shutdown function. After the client misbehaves, check the number of open file descriptors in the server process, to verify that the server has handled the situation correctly.