Tags: python, python-2.7, curl, tornado, pycurl

HTTPS request Memory leak with CurlAsyncHTTPClient


My handler file

# -*- coding:utf-8 -*-
import sys
from tornado import gen, web, httpclient

url = "https://mdetail.tmall.com/templates/pages/desc?id=527485572414"

class SearchHandler(web.RequestHandler):
    @gen.coroutine
    def get(self):
        async_client = httpclient.AsyncHTTPClient()
        print sys.getrefcount(async_client)  # first request: less than 10; afterwards always greater than 200
        req = httpclient.HTTPRequest(url, "GET", headers=headers)
        req_lists = [async_client.fetch(req) for _ in range(200)]
        r = yield req_lists
        print sys.getrefcount(async_client)  # always greater than 200
        # The longer req_lists is, the more memory is consumed, and it never decreases

My configuration file

tornado.httpclient.AsyncHTTPClient.configure(client, max_clients=1000)
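For reference, a minimal sketch of the two configurations compared below; the assumption is that client is either the CurlAsyncHTTPClient implementation string or None:

from tornado import httpclient

# Variant that shows the memory growth: curl-based client
httpclient.AsyncHTTPClient.configure(
    "tornado.curl_httpclient.CurlAsyncHTTPClient", max_clients=1000)

# Variant where memory barely grows: client=None keeps the default SimpleAsyncHTTPClient
httpclient.AsyncHTTPClient.configure(None, max_clients=1000)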

If my client is "tornado.curl_httpclient.CurlAsyncHTTPClient", then when I visit my handler in a browser, htop shows memory usage increasing by about 6 GB, and as long as the process is running the memory usage never decreases.

If I change range(200) to range(500) or higher, memory usage grows even further.

If my client is None, memory barely increases.

I found that only fetching https:// URLs causes the memory issue.

How can I solve the memory problem with CurlAsyncHTTPClient?

Environment:

Ubuntu 16.10 x64
python2.7.12
Tornado 4.5.1

Solution

  • The reference counts you see are expected, because with max_clients=1000, Tornado will cache and reuse 1000 pycurl.Curl instances, each of which may hold a reference to the client’s _curl_header_callback. You can see it with objgraph.show_backrefs.
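    A minimal inspection sketch, assuming the objgraph package (and Graphviz, for the image output) is installed; it would go at the end of SearchHandler.get() above, after the yield:

    import objgraph  # at the top of the handler file

    # After `r = yield req_lists`: dump a graph of everything still referring to
    # async_client; the cached pycurl.Curl handles show up via the bound
    # _curl_header_callback they hold.
    objgraph.show_backrefs([async_client], max_depth=2,
                           filename="client_backrefs.png")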

    Do you really need max_clients=1000 — that is, up to 1000 requests in parallel? (I’m hoping they’re not all to the same server, as in your example!)

    Anyway, the Curl instances seem to be taking up a lot of memory.

    On my system (Ubuntu 16.04), I can reproduce the problem when using PycURL linked against the system-wide libcurl3-gnutls 7.47.0:

    $ /usr/bin/time -v python getter.py 
    6
    207
    ^C
    [...]
        Maximum resident set size (kbytes): 4853544
    

    When I link PycURL with a freshly built libcurl 7.54.1 (still with GnuTLS backend), I get a much better result:

    $ LD_LIBRARY_PATH=$PWD/curl-prefix/lib /usr/bin/time -v python getter.py 
    6
    207
    ^C
    [...]
        Maximum resident set size (kbytes): 1016084
    

    And if I use libcurl with the OpenSSL backend, the result is better still:

        Maximum resident set size (kbytes): 275572
    

    There are other reports of memory problems with GnuTLS: curl issue #1086.

    So, if you do need a large max_clients, try using a newer libcurl built with the OpenSSL backend.
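    To confirm which libcurl and TLS backend your PycURL build is actually linked against, you can print pycurl's version string (the exact contents depend on the build):

    import pycurl
    # Prints something like "PycURL/7.43.0 libcurl/7.54.1 OpenSSL/1.0.2g ...";
    # the TLS backend named here is the one that matters for this issue.
    print pycurl.version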