Tags: python-2.7, timeout, deadlock, cherrypy, freeze

Cherrypy - random requests hanging then server becomes unresponsive


We have a server built on the following:

  • CherryPy 3.2.2 for Windows
  • Python 2.7.1
  • Mako 0.3.6
  • SQLAlchemy 0.7.2
  • pyodbc 2.1.8
  • various other minor components

It is running on the following platform:

  • Windows Server 2008
  • Microsoft SQL Server 2008 Web
  • Hosted VM
  • xampp Apache gateway

Over the past few months we have had several instances where the CherryPy server has become unresponsive. The first symptom is that AJAX requests time out, then reloading the page in the browser times out, and eventually Apache returns a 502 because CherryPy will no longer accept any connections. Restarting the Python service resolves the problem. Sometimes it goes from timeouts to 502s within 10 minutes; other times it can keep timing out for over half an hour before we notice and have to restart.

The server can run for weeks with no issues, but then some days the problem can occur 5 times within a few hours.

We have implemented some extra logging of all requests to identify patterns. It does not appear to be triggered by any one specific type of request, time of day, load, or user. We do occasionally get SQL deadlock errors, and sometimes CherryPy request timeout errors. The SQL deadlocks do not appear to be related: the requests clean up after them, and they do not correlate with the times the server freezes. The request timeouts did occur at a similar time, but I have only seen them a couple of times; they do not happen every freeze.

Within each entry point of each module we've added the following logging:

    util.write_to_debuglog("ajax", "START", _path)

    # Call AJAX target providing the session object
    _args.update(session=session)
    result = target(**_args)

    util.write_to_debuglog("ajax", "FINISH", _path)

    return result

Within the debug log we also print out the session ID and user name so I can see exactly where the requests are coming from.
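For reference, the helper is essentially a thread-safe append to a log file. The exact production implementation differs; this is an illustrative sketch (the function name matches ours, but the signature and format shown here are simplified):

```python
import datetime
import threading

_debuglog_lock = threading.Lock()

def write_to_debuglog(category, stage, path, session_id="-", user="-"):
    """Append one line per request event (illustrative sketch only)."""
    line = "%s %s %s %s %s %s\n" % (
        datetime.datetime.now().isoformat(), category, stage, path,
        session_id, user)
    # Serialize writes, since CherryPy handles requests on multiple
    # worker threads and interleaved writes would garble the log.
    with _debuglog_lock:
        with open("debug.log", "a") as f:
            f.write(line)
```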

Yesterday when it went down, this is what the debug log showed (working backwards):

  • I restarted the server at 11:59:08
  • Previous log entry was 11:23:11 with START ajax get_orders User1 (no corresponding FINISH)
  • Between 11:23:04 and 11:22:33 were 9 pairs of successful START and FINISH requests by different users and different functions
  • At 11:22:31 was a START ajax get_orders User2 (no FINISH)
  • 11:22:27 to 11:22:23 4 pairs of START and FINISH, different users and functions
  • 11:22:22 a START ajax do_substitutions User3 (no FINISH)
  • 11:22:19 to 11:15:52 299 pairs of START and FINISH, different users (including User1, User2, and User3) and different functions (including those which hung)
  • 11:15:52 a START ajax get_filling_pack User4 (no FINISH)
  • 11:15:51 to 11:15:45 13 pairs of START and FINISH
  • 11:15:45 a START ajax echo User5 (no FINISH)
  • 11:15:43 to 11:14:56 63 pairs of START and FINISH
  • 11:14:56 a START ajax echo User6 (no FINISH)
  • 11:14:56 to 11:13:40 104 pairs of START and FINISH
  • 11:13:40 a START ajax update_supplies User7 (no FINISH)
  • 11:13:38 to 11:13:17 36 pairs of START and FINISH
  • 11:13:17 a START post set_status User8 (no FINISH)
  • Then normal activity back to midnight when server scheduled restart

Between 11:23:11 and 11:59:08, if you tried to access the server the browser would eventually time out, whereas at other times it has reached a point where you get an immediate 502 Bad Gateway from Apache. This tells me that CherryPy is still accepting socket connections during this period, but the log shows that the requests are not coming through. Even the CherryPy access log shows nothing in between these two entries:

127.0.0.1 - - [05/Jan/2017:11:23:04] "GET /***/get_joblist HTTP/1.1" 200 18 "" "***/2.0"
127.0.0.1 - - [05/Jan/2017:11:59:09] "GET /***/upload_printer HTTP/1.1" 200 38 "" "***/2.0"

On this particular day there were two request timeouts in the error log file, however this is not common.

    [05/Jan/2017:11:22:04]  Traceback (most recent call last):
      File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 169, in trap
        return func(*args, **kwargs)
      File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 96, in __call__
        return self.nextapp(environ, start_response)
      File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 379, in tail
        return self.response_class(environ, start_response, self.cpapp)
      File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 222, in __init__
        self.run()
      File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 320, in run
        request.run(meth, path, qs, rproto, headers, rfile)
      File "c:\python27\lib\site-packages\cherrypy\_cprequest.py", line 603, in run
        raise cherrypy.TimeoutError()
    TimeoutError

    [05/Jan/2017:11:22:05]  Traceback (most recent call last):
      File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 169, in trap
        return func(*args, **kwargs)
      File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 96, in __call__
        return self.nextapp(environ, start_response)
      File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 379, in tail
        return self.response_class(environ, start_response, self.cpapp)
      File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 222, in __init__
        self.run()
      File "c:\python27\lib\site-packages\cherrypy\_cpwsgi.py", line 320, in run
        request.run(meth, path, qs, rproto, headers, rfile)
      File "c:\python27\lib\site-packages\cherrypy\_cprequest.py", line 603, in run
        raise cherrypy.TimeoutError()
    TimeoutError

We are not changing the response timeout, so it should be the default of 5 minutes. On that assumption the errors do not correlate with any of the requests that hung: a timeout at 11:22:04 would imply a request started at 11:18:04, but all requests at that time were successful, whereas the 8 requests that did hang were much older than 5 minutes and never timed out.
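(For context, the five-minute figure comes from CherryPy's default request timeout. We are not setting it anywhere; if we were, it would look like this in config:)

```python
import cherrypy

# CherryPy aborts responses that run longer than response.timeout
# seconds; the default is 300 (5 minutes). Shown only to document
# where the 5-minute figure comes from -- we do NOT override it.
cherrypy.config.update({'response.timeout': 300})
```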

Why would one type of request hang for one user but continue to work successfully for other users?

Why would they hang at all if they have been working for days or weeks before?

Why isn't the request timeout cleaning up all of these?

Why is the server reaching a point where it won't take any requests at all? Surely 8 concurrent requests isn't the server maximum?

Why is the server still accepting socket connections but not processing requests?
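One suspicion on our part (unconfirmed) is worker-thread exhaustion: CherryPy serves requests from a fixed pool of worker threads, 10 by default, so once enough workers are wedged the listener can still accept TCP connections but no worker ever picks them up. We have not changed the pool size; for reference, it would be configured like this:

```python
import cherrypy

# CherryPy's default worker pool is 10 threads. With 8 requests hung,
# only 2 workers would remain to serve all traffic -- which would fit
# the symptoms above (hypothesis only, not confirmed).
cherrypy.config.update({'server.thread_pool': 10})
```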

Any suggestions for how I can diagnose or resolve these issues would be greatly appreciated.

Thanks, Patrick.


Solution

  • I believe we finally tracked this down.

    Under a very specific set of conditions, the code was executing a query on a separate database cursor from within a transaction (i.e. the second query was not part of the transaction). One connection was holding a table lock in SQL and the second connection was waiting for that lock, but both originated from the same Python thread. There was also no timeout set on the database connections (the default is infinite), so the thread would sit in its own private deadlock forever. The system would still work as long as nobody touched the tables locked by the transaction, but eventually, when other users/threads tried to access the same locked area of the database, they would also be made to wait forever.
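    The failure mode can be shown in miniature with a non-reentrant lock: the same thread holds a lock and then waits on it again through a second handle, so the only thread that could release it is the one waiting. This is a plain-Python analogy, not our actual database code:

```python
import threading

table_lock = threading.Lock()  # stands in for the SQL table lock

# "Connection 1": the open transaction acquires the lock...
table_lock.acquire()

# "Connection 2": a second cursor, driven by the SAME thread, now
# wants the same lock. A blocking acquire here would wait forever,
# so we probe non-blockingly to demonstrate without hanging.
would_deadlock = not table_lock.acquire(False)
print(would_deadlock)  # True: a blocking wait would never return

table_lock.release()
```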

    I don't think SQL Server saw this as a deadlock, because from its point of view one connection was simply expected to finish its transaction and release the lock. And I don't think CherryPy could terminate the thread, because it was blocked inside the database driver rather than in Python code.
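    Besides fixing the query structure, an extra safeguard is to give database operations a finite timeout, so a blocked query raises an error instead of silently waiting forever. A sketch using pyodbc's per-connection query timeout (the connection string here is hypothetical):

```python
import pyodbc

conn = pyodbc.connect("DSN=ourdb")  # hypothetical connection string

# pyodbc's query timeout defaults to 0, i.e. wait forever. A finite
# value makes a query that blocks on a lock fail with an error after
# 30 seconds instead of wedging the worker thread indefinitely.
conn.timeout = 30
```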