
run function parallel on python tornado


I'm currently developing in python3 (still a beginner) on the Tornado framework, and I have a function which I would like to run in the background. To be more precise, the task of the function is to download a big file (chunk by chunk) and probably do some more things after each chunk is downloaded. But the calling function should not wait for the download function to complete; it should rather continue execution.

Here are some code examples:

@gen.coroutine
def dosomethingfunc(self, env):
    print("Do something")

    self.downloadfunc(file_url, target_path) #I don't want to wait here

    print("Do something else")


@gen.coroutine
def downloadfunc(self, file_url, target_path):

    response = urllib.request.urlopen(file_url)
    CHUNK = 16 * 1024

    with open(target_path, 'wb') as f:
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                break
            f.write(chunk)
            time.sleep(0.1) #do something after a chunk is downloaded - sleep only as example

I've read this answer on Stack Overflow https://stackoverflow.com/a/25083098/2492068 and tried to use it.

I thought that if I used @gen.coroutine but no yield, dosomethingfunc would continue without waiting for downloadfunc to finish. But the behaviour is the same with or without yield: "Do something else" is only printed after downloadfunc has finished the download.

What am I missing here?


Solution

  • To benefit from Tornado's asynchronicity, a non-blocking function must be yielded at some point. Since the code of downloadfunc is entirely blocking, dosomethingfunc does not get control back until the called function has finished.
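    The difference is easiest to see in a small self-contained sketch (stdlib asyncio is used here purely to avoid extra dependencies; Tornado coroutines follow the same rule). A coroutine that calls a blocking function never returns control to the event loop, so nothing else runs until it finishes; one that awaits a non-blocking call lets other work interleave:

```python
import asyncio
import time

order = []

async def blocking_task():
    order.append("blocking start")
    time.sleep(0.05)           # blocks the whole event loop
    order.append("blocking end")

async def nonblocking_task():
    order.append("nonblocking start")
    await asyncio.sleep(0.05)  # yields control back to the loop
    order.append("nonblocking end")

async def other():
    order.append("other ran")

async def main():
    # run each variant alongside a second task
    await asyncio.gather(blocking_task(), other())
    await asyncio.gather(nonblocking_task(), other())

asyncio.run(main())
print(order)
```

    In the blocking case "other ran" only appears after "blocking end"; in the non-blocking case it appears in between, because awaiting the sleep handed control back to the loop.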

    There are a couple of issues with your code: urllib.request.urlopen blocks until data arrives, and time.sleep blocks the whole IOLoop. Using Tornado's non-blocking AsyncHTTPClient and tornado.gen.sleep instead, downloadfunc could look like:

    @gen.coroutine
    def downloadfunc(self, file_url, target_path):

        client = tornado.httpclient.AsyncHTTPClient()

        # the code below will start downloading and
        # give back control to the IOLoop while waiting for data
        res = yield client.fetch(file_url)

        with open(target_path, 'wb') as f:
            # fetch resolves to an HTTPResponse; the raw bytes are in .body
            f.write(res.body)
            yield tornado.gen.sleep(0.1)
    

    To implement it with streaming (by chunk) support, you might want to do it like this:

    # for large files you must increase max_body_size,
    # because the default body limit in Tornado is set to 100MB

    tornado.httpclient.AsyncHTTPClient.configure(None, max_body_size=2*1024**3)
    
    import functools

    @gen.coroutine
    def downloadfunc(self, file_url, target_path):

        client = tornado.httpclient.AsyncHTTPClient()

        # the streaming_callback will be called with each received portion of
        # data; functools.partial binds target_path, which would otherwise
        # not be in scope inside write_chunk
        yield client.fetch(
            file_url,
            streaming_callback=functools.partial(write_chunk, target_path))

    def write_chunk(target_path, chunk):
        # note the "a" mode, to append to the file
        with open(target_path, 'ab') as f:
            print('chunk %s' % len(chunk))
            f.write(chunk)
    
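    Reopening the file for every chunk works, but keeping one handle open and writing through a closure is cheaper. A minimal sketch, with an in-memory list of chunks standing in for the HTTP stream (all names here are illustrative):

```python
import os
import tempfile

def make_chunk_writer(f):
    # returns a callback suitable for streaming_callback-style use:
    # each call appends one chunk to the already-open file object
    def write_chunk(chunk):
        f.write(chunk)
    return write_chunk

# fake "download": three chunks instead of a real HTTP response
chunks = [b"a" * 4, b"b" * 4, b"c" * 2]

path = os.path.join(tempfile.mkdtemp(), "out.bin")
with open(path, "wb") as f:
    write_chunk = make_chunk_writer(f)
    for chunk in chunks:       # stands in for streaming_callback invocations
        write_chunk(chunk)

with open(path, "rb") as f:
    data = f.read()

print(len(data))  # 10 bytes written in total
```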

    Now you can call it in dosomethingfunc without yield, and the rest of the function will proceed.
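    The resulting control flow, i.e. kick off the download and keep going, looks like this in a stdlib asyncio sketch (asyncio.create_task plays the role that scheduling a coroutine on Tornado's IOLoop plays here; the function names mirror the question and are illustrative):

```python
import asyncio

log = []

async def downloadfunc():
    log.append("download started")
    await asyncio.sleep(0.05)   # stands in for non-blocking chunk downloads
    log.append("download finished")

async def dosomethingfunc():
    log.append("do something")
    # schedule the download without awaiting it: execution continues at once
    asyncio.create_task(downloadfunc())
    log.append("do something else")
    # keep the loop alive long enough for the background task to finish
    await asyncio.sleep(0.1)

asyncio.run(dosomethingfunc())
print(log)
```

    "do something else" is logged before the download even starts, because the task only begins running once dosomethingfunc yields control at its own await.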

    edit

    Modifying the chunk size is not supported (exposed) on either the server or the client side. You may also look at https://groups.google.com/forum/#!topic/python-tornado/K8zerl1JB5o