aiohttp Error Rate Increases with Number of Connections


I am trying to get the status code from millions of different sites using asyncio and aiohttp. When I run the code below with different numbers of connections (but the same request timeout), I get very different results; specifically, a much higher count of the following exception:

'concurrent.futures._base.TimeoutError'

The code

import pandas as pd
import asyncio
import aiohttp

out = []
CONNECTIONS = 1000
TIMEOUT = 10

async def fetch(url, session, loop):
    try:
        async with session.get(url,timeout=TIMEOUT) as response:
            res = response.status
            out.append(res)
            return res
    except Exception as e:
        _exception = 'Error: '+str(type(e))
        out.append(_exception)
        return _exception

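async def bound_fetch(sem, url, session, loop):
    # The semaphore caps how many fetches are in flight at once.
    async with sem:
        # Return the result so that gather() collects it instead of None.
        return await fetch(url, session, loop)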

async def run(urls, loop):
    tasks = []
    sem = asyncio.Semaphore(value=CONNECTIONS,loop=loop)
    _connector = aiohttp.TCPConnector(limit=CONNECTIONS, loop=loop)
    async with aiohttp.ClientSession(connector=_connector,loop=loop) as session:
        for url in urls:
            task = asyncio.ensure_future(bound_fetch(sem, url, session, loop))
            tasks.append(task)
        responses = await asyncio.gather(*tasks,return_exceptions=True)
        return responses

## BEGIN ##

with open('data/sample_1k.txt') as f:
    tlds = f.read().splitlines()
urls = ['http://{}'.format(x) for x in tlds[1:]]

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(urls,loop))
ans = loop.run_until_complete(future)

print(str(pd.Series(out).value_counts()))

Results

CONNECTIONS=1000

[image: value_counts output, not preserved]

CONNECTIONS=100

[image: value_counts output, not preserved]

Is this a bug? These sites do respond with a status code, and when I run the requests sequentially or with fewer connections there is no timeout error, so why is this happening? The counts of the other exceptions stay stable as the number of connections changes. The ClientOSErrors come from sites that genuinely time out or don't respond; I honestly don't know where the concurrent.futures._base.TimeoutError errors are coming from.


Solution

  • Imagine opening 1000 URLs in a browser simultaneously. I bet you'd notice that many of them haven't loaded after 10 seconds. It's not a bug; it's a limit of your machine's resources.

    The more parallel requests you make, the less network capacity, CPU time, and RAM each one gets, and the higher the chance that a request won't finish before its timeout.

    If you see many timeouts with 1000 connections, make fewer connections (and maybe increase the timeout). Based on the aiohttp documentation, using different ClientSession instances may also help:

    Unless you are connecting to a large, unknown number of different servers over the lifetime of your application, it is suggested you use a single session for the lifetime of your application
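
To make the resource argument concrete, here is a minimal simulation that does no real networking. It rests on a loudly simplified assumption: the machine has a fixed total throughput that is shared evenly by all requests in flight, so each request's latency grows with the concurrency limit. The names CAPACITY, fake_fetch and count_timeouts, and all the numbers, are hypothetical and chosen only so the demo runs in about a second; this is a sketch of the effect, not a model of aiohttp itself.

```python
import asyncio

# All numbers are hypothetical and scaled down so the demo finishes quickly.
CAPACITY = 500   # simulated requests per second the machine can complete in total
TIMEOUT = 0.5    # per-request timeout in seconds, identical for every run
N_URLS = 400     # number of simulated requests


async def fake_fetch(sem, connections):
    """Stand-in for session.get(): no network I/O, just a sleep."""
    async with sem:
        # Toy model (an assumption): capacity is shared evenly by the requests
        # in flight, so latency grows with the concurrency limit.
        latency = min(connections, N_URLS) / CAPACITY
        try:
            await asyncio.wait_for(asyncio.sleep(latency), timeout=TIMEOUT)
            return "ok"
        except asyncio.TimeoutError:
            return "timeout"


async def count_timeouts(connections):
    """Run N_URLS simulated requests with at most `connections` in flight."""
    sem = asyncio.Semaphore(connections)
    results = await asyncio.gather(
        *(fake_fetch(sem, connections) for _ in range(N_URLS))
    )
    return results.count("timeout")


if __name__ == "__main__":
    for connections in (50, 400):
        timeouts = asyncio.run(count_timeouts(connections))
        print(f"CONNECTIONS={connections}: {timeouts} of {N_URLS} timed out")
```

Under this toy model, a limit of 50 keeps every simulated request well inside the timeout, while a limit of 400 pushes every request past it, even though the timeout itself never changes. Real traffic is much noisier, but the direction of the effect is the same: lowering the connection limit or raising the timeout reduces the timeout count.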