Search code examples
flasktimeoutgunicorn

Is it possible to get a stack trace when a gunicorn worker hits a timeout?


Specifically, I'm running a Flask app with default workers in gunicorn. I'm trying to figure out how to debug / trace what is happening when a worker is killed due to timeout while serving a request. Is there a way to get a stack trace or profile the request to debug this?


Solution

  • Honestly a very good question, I was not sure if it is even possible to achieve the same or not. But as I started digging I found lot of interesting threads

    Showing the stack trace from a running Python application

    Get stacktrace from stuck python process

    https://github.com/khamidou/lptrace

    https://gist.github.com/reywood/e221c4061bbf2eccea885c9b2e4ef496

    So first I created a simple flask app with below code

    app.py

    from flask import Flask
    app = Flask(__name__)
    import time
    
    def a():
      b()
    
    def b():
      c ()
    
    def c():
       i = 0
       while i < 900:
           time.sleep(1)
           i += 1
    
    @app.route('/', defaults={'path': ''})
    @app.route('/<path:path>')
    def catch_all(path):
        a()
        return 'You want path: %s' % path
    
    if __name__ == '__main__':
        app.run()
    

    wsgi.py

    from app import app
    
    if __name__ == "__main__":
        app.run()
    

    Now running the app like below and doing curl localhost:8000/abc

    $ gunicorn wsgi:app
    [2019-08-01 08:19:06 +0000] [26825] [INFO] Starting gunicorn 19.9.0
    [2019-08-01 08:19:06 +0000] [26825] [INFO] Listening at: http://127.0.0.1:8000 (26825)
    [2019-08-01 08:19:06 +0000] [26825] [INFO] Using worker: sync
    [2019-08-01 08:19:06 +0000] [26828] [INFO] Booting worker with pid: 26828
    [2019-08-01 08:19:40 +0000] [26825] [CRITICAL] WORKER TIMEOUT (pid:26828)
    [2019-08-01 08:19:40 +0000] [26828] [INFO] Worker exiting (pid: 26828)
    [2019-08-01 08:19:40 +0000] [26832] [INFO] Booting worker with pid: 26832
    

    Now what we need is a hook which can be called before the worked is killed. gunicorn supports server events in the configuration file

    So now we create a config file

    gunicorn_config.py

    timeout = 3
    
    def worker_abort(worker):
        pid = worker.pid
        print("worker is being killed - {}".format(pid))
    

    And our output is now

    $ gunicorn -c gunicorn_config.py wsgi:app
    [2019-08-01 08:22:17 +0000] [26837] [INFO] Starting gunicorn 19.9.0
    [2019-08-01 08:22:17 +0000] [26837] [INFO] Listening at: http://127.0.0.1:8000 (26837)
    [2019-08-01 08:22:17 +0000] [26837] [INFO] Using worker: sync
    [2019-08-01 08:22:17 +0000] [26840] [INFO] Booting worker with pid: 26840
    [2019-08-01 08:22:22 +0000] [26837] [CRITICAL] WORKER TIMEOUT (pid:26840)
    worker is being killed - 26840
    [2019-08-01 08:22:22 +0000] [26840] [INFO] Worker exiting (pid: 26840)
    [2019-08-01 08:22:22 +0000] [26844] [INFO] Booting worker with pid: 26844
    

    This is good, now we need mix our previous knowledge on pyrasite and this to get the stack. So we updated the config file like below

    gunicorn_config.py

    timeout = 3
    
    __code_dump_stack__ = """
    import sys, traceback
    
    for thread, frame in sys._current_frames().items():
        print('Thread 0x%x' % thread)
        traceback.print_stack(frame)
        print()
    """
    
    def dump_stack_for_process(pid):
        import pyrasite
    
        ipc = pyrasite.PyrasiteIPC(pid)
        ipc.connect()
        print(ipc.cmd(__code_dump_stack__))
        ipc.close()
    
    def worker_abort(worker):
        pid = worker.pid
        print("worker is being killed - {}".format(pid))
        dump_stack_for_process(pid)
    

    And now our output is

    $ [2019-08-01 08:25:29 +0000] [26848] [INFO] Starting gunicorn 19.9.0
    [2019-08-01 08:25:29 +0000] [26848] [INFO] Listening at: http://127.0.0.1:8000 (26848)
    [2019-08-01 08:25:29 +0000] [26848] [INFO] Using worker: sync
    [2019-08-01 08:25:29 +0000] [26851] [INFO] Booting worker with pid: 26851
    [2019-08-01 08:25:38 +0000] [26848] [CRITICAL] WORKER TIMEOUT (pid:26851)
    worker is being killed - 26851
    Thread 0x7ff0a7a4b700
      File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap
        self._bootstrap_inner()
      File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
        self.run()
      File "<string>", line 72, in run
      File "<string>", line 92, in on_command
      File "<string>", line 6, in <module>
    
    Thread 0x7ff0ac512700
      File "/home/vagrant/.local/bin/gunicorn", line 11, in <module>
        sys.exit(run())
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/wsgiapp.py", line 61, in run
        WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/base.py", line 223, in run
        super(Application, self).run()
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/app/base.py", line 72, in run
        Arbiter(self).run()
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 203, in run
        self.manage_workers()
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 545, in manage_workers
        self.spawn_workers()
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 616, in spawn_workers
        self.spawn_worker()
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
        worker.init_process()
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 134, in init_process
        self.run()
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 124, in run
        self.run_for_one(timeout)
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 68, in run_for_one
        self.accept(listener)
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 30, in accept
        self.handle(listener, client, addr)
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 135, in handle
        self.handle_request(listener, req, client, addr)
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 176, in handle_request
        respiter = self.wsgi(environ, resp.start_response)
      File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 2463, in __call__
        return self.wsgi_app(environ, start_response)
      File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 2446, in wsgi_app
        response = self.full_dispatch_request()
      File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 1949, in full_dispatch_request
        rv = self.dispatch_request()
      File "/home/vagrant/.local/lib/python3.5/site-packages/flask/app.py", line 1935, in dispatch_request
        return self.view_functions[rule.endpoint](**req.view_args)
      File "/home/vagrant/remotedebug/app.py", line 20, in catch_all
        a()
      File "/home/vagrant/remotedebug/app.py", line 6, in a
        b()
      File "/home/vagrant/remotedebug/app.py", line 9, in b
        c ()
      File "/home/vagrant/remotedebug/app.py", line 14, in c
        time.sleep(1)
      File "/home/vagrant/.local/lib/python3.5/site-packages/gunicorn/workers/base.py", line 195, in handle_abort
        self.cfg.worker_abort(self)
      File "gunicorn_config.py", line 23, in worker_abort
        dump_stack_for_process(pid)
      File "gunicorn_config.py", line 17, in dump_stack_for_process
        print(ipc.cmd(__code_dump_stack__))
      File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 161, in cmd
        return self.recv()
      File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 174, in recv
        header_data = self.recv_bytes(4)
      File "/home/vagrant/.local/lib/python3.5/site-packages/pyrasite/ipc.py", line 187, in recv_bytes
        chunk = self.sock.recv(n - len(data))
    
    
    [2019-08-01 08:25:38 +0000] [26851] [INFO] Worker exiting (pid: 26851)
    [2019-08-01 08:25:38 +0000] [26862] [INFO] Booting worker with pid: 26862
    

    The stack trace is large but it gives us what we need