python python-3.x python-asyncio aiohttp

HEAD requests with aiohttp are dog slow


Given a list of 50k website URLs, I've been tasked with finding out which of them are up/reachable. The idea is just to send a HEAD request to each URL and look at the response status. From what I hear, an asynchronous approach is the way to go, and for now I'm using asyncio with aiohttp.

I came up with the following code, but the speed is pretty abysmal: 1000 URLs take approximately 200 seconds on my 10 Mbit connection. I don't know what speeds to expect, but I'm new to asynchronous programming in Python, so I figure I've gone wrong somewhere. As you can see, I've tried increasing the number of allowed simultaneous connections to 1000 (up from the default of 100) and the duration for which DNS lookups are kept in the cache; neither had much effect. The environment is Python 3.6 with aiohttp 3.5.4.

Code review unrelated to the question is also appreciated.

import asyncio
import time
from socket import gaierror
from typing import List, Tuple

import aiohttp
from aiohttp.client_exceptions import TooManyRedirects

# Using a non-default user-agent seems to avoid lots of 403 (Forbidden) errors
HEADERS = {
    'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/45.0.2454.101 Safari/537.36'),
}


async def get_status_code(session: aiohttp.ClientSession, url: str) -> Tuple[int, str]:
    try:
        # A HEAD request is quicker than a GET request
        resp = await session.head(url, allow_redirects=True, ssl=False, headers=HEADERS)
        async with resp:
            status = resp.status
            reason = resp.reason
        if status == 405:
            # HEAD request not allowed, fall back on GET
            resp = await session.get(
                url, allow_redirects=True, ssl=False, headers=HEADERS)
            async with resp:
                status = resp.status
                reason = resp.reason
        return (status, reason)
    except aiohttp.InvalidURL as e:
        return (900, str(e))
    except aiohttp.ClientConnectorError:
        return (901, "Unreachable")
    except gaierror as e:
        return (902, str(e))
    except aiohttp.ServerDisconnectedError as e:
        return (903, str(e))
    except aiohttp.ClientOSError as e:
        return (904, str(e))
    except TooManyRedirects as e:
        return (905, str(e))
    except aiohttp.ClientResponseError as e:
        return (906, str(e))
    except aiohttp.ServerTimeoutError:
        return (907, "Connection timeout")
    except asyncio.TimeoutError:
        return (908, "Connection timeout")


async def get_status_codes(loop: asyncio.events.AbstractEventLoop, urls: List[str],
                           timeout: int) -> List[Tuple[int, str]]:
    conn = aiohttp.TCPConnector(limit=1000, ttl_dns_cache=300)
    client_timeout = aiohttp.ClientTimeout(connect=timeout)
    async with aiohttp.ClientSession(
            loop=loop, timeout=client_timeout, connector=conn) as session:
        codes = await asyncio.gather(*(get_status_code(session, url) for url in urls))
        return codes


def poll_urls(urls: List[str], timeout=20) -> List[Tuple[int, str]]:
    """
    :param timeout: in seconds
    """
    print("Started polling")
    time1 = time.time()
    loop = asyncio.get_event_loop()
    codes = loop.run_until_complete(get_status_codes(loop, urls, timeout))
    time2 = time.time()
    dt = time2 - time1
    print(f"Polled {len(urls)} websites in {dt:.1f} seconds "
          f"at {len(urls)/dt:.3f} URLs/sec")
    return codes

Solution

  • Right now you're launching all your requests at once, so a bottleneck has probably appeared somewhere. To avoid this, you can use a semaphore to limit how many requests are in flight at the same time:

    # code
    
    sem = asyncio.Semaphore(200)
    
    
    async def get_status_code(session: aiohttp.ClientSession, url: str) -> Tuple[int, str]:
        try:
            async with sem:
                resp = await session.head(url, allow_redirects=True, ssl=False, headers=HEADERS)
                # code
    

    I tested it the following way:

    poll_urls([
        'http://httpbin.org/delay/1' 
        for _ 
        in range(2000)
    ])
    

    And got:

    Started polling
    Polled 2000 websites in 13.2 seconds at 151.300 URLs/sec
    

    Although this test requests a single host, it shows that the asynchronous approach does the job: about 13 seconds, versus roughly 2000 seconds if the 2000 one-second requests were made one after another.
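    To see the effect of the semaphore in isolation, here is a minimal sketch that needs no network. It assumes Python 3.7+ (for `asyncio.run`); `fake_request`, `run_bounded` and the `state` dictionary are hypothetical names made up for this illustration, and `asyncio.sleep` stands in for the real HEAD request.

```python
import asyncio
from typing import Dict, List


async def fake_request(sem: asyncio.Semaphore, state: Dict[str, int], delay: float) -> int:
    # Only `limit` coroutines can be inside this block at the same time.
    async with sem:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        try:
            await asyncio.sleep(delay)  # stand-in for session.head(...)
        finally:
            state["active"] -= 1
    return 200  # pretend status code


async def run_bounded(n_tasks: int, limit: int, delay: float = 0.01) -> dict:
    # Create the semaphore inside the running loop so it is bound to that loop.
    sem = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    codes: List[int] = await asyncio.gather(
        *(fake_request(sem, state, delay) for _ in range(n_tasks)))
    return {"peak": state["peak"], "codes": codes}


result = asyncio.run(run_bounded(n_tasks=50, limit=5))
print(f"peak concurrency: {result['peak']}, finished: {len(result['codes'])}")
```

    All 50 tasks are still started at once by `gather`, but at most 5 of them are past the `async with sem` line at any moment, which is the behaviour the semaphore in the snippet above relies on.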

    Several more things can be done:

    • You should play with the semaphore value to get the best performance for your particular environment and task.

    • Try lowering the timeout from 20 to, say, 5 seconds: since you're only doing a HEAD request, it shouldn't take long. If a request hangs for 5 seconds, there's a good chance it won't succeed at all.

    • Monitoring your system resources (network/CPU/RAM) while the script is running can help you find out whether a bottleneck is still present.

    • By the way, did you install aiodns (as the docs suggest)?

    • Does disabling ssl change anything?

    • Try enabling debug-level logging to see whether it shows anything useful.

    • Try setting up client tracing, and in particular measure the time for each request step to see which ones take the most time.
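    On the timeout suggestion, one detail may be worth checking. As far as I understand the aiohttp docs, `ClientTimeout(connect=...)` only bounds the time spent obtaining a connection from the pool (including waiting for a free slot), and a `ClientTimeout` built with only `connect` leaves `total` unset. A hedged sketch of bounding the whole request instead (`make_timeout` is a hypothetical helper name, and 5 seconds is just the value suggested above):

```python
import aiohttp


def make_timeout(seconds: float = 5.0) -> aiohttp.ClientTimeout:
    # total:   upper bound for the whole request (connect + send + read).
    # connect: upper bound for getting a connection from the pool,
    #          including waiting for a free slot when the limit is reached.
    return aiohttp.ClientTimeout(total=seconds, connect=seconds)


timeout = make_timeout(5)

# Usage inside get_status_codes() would look roughly like:
#     async with aiohttp.ClientSession(timeout=timeout, connector=conn) as session:
#         ...
```

    With `total` set, a server that accepts the connection but never answers should end up in the `asyncio.TimeoutError` branch after about 5 seconds instead of holding a slot indefinitely.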

    It's difficult to say more without a fully reproducible example.
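    For the client tracing suggestion, a rough sketch of what the setup could look like is below. It uses aiohttp's `TraceConfig` signals; `build_trace_config`, `timings` and `main` are hypothetical names for this illustration, and the choice of which steps to time (DNS, connection creation, whole request) is an assumption about what is most likely to matter here.

```python
import time
from typing import Dict, List

import aiohttp


def build_trace_config(timings: Dict[str, List[float]]) -> aiohttp.TraceConfig:
    """Collect per-step durations (in seconds) into `timings`."""

    # Each callback receives a per-request context object (`ctx`) that can
    # carry state from a *_start signal to the matching *_end signal.
    async def on_request_start(session, ctx, params):
        ctx.request_start = time.monotonic()

    async def on_request_end(session, ctx, params):
        timings.setdefault("total", []).append(time.monotonic() - ctx.request_start)

    async def on_dns_start(session, ctx, params):
        ctx.dns_start = time.monotonic()

    async def on_dns_end(session, ctx, params):
        timings.setdefault("dns", []).append(time.monotonic() - ctx.dns_start)

    async def on_connect_start(session, ctx, params):
        ctx.connect_start = time.monotonic()

    async def on_connect_end(session, ctx, params):
        timings.setdefault("connect", []).append(time.monotonic() - ctx.connect_start)

    trace_config = aiohttp.TraceConfig()
    trace_config.on_request_start.append(on_request_start)
    trace_config.on_request_end.append(on_request_end)
    trace_config.on_dns_resolvehost_start.append(on_dns_start)
    trace_config.on_dns_resolvehost_end.append(on_dns_end)
    trace_config.on_connection_create_start.append(on_connect_start)
    trace_config.on_connection_create_end.append(on_connect_end)
    return trace_config


async def main(url: str) -> None:
    # Not called here; run it with an event loop to try it against a real URL.
    timings: Dict[str, List[float]] = {}
    trace = build_trace_config(timings)
    async with aiohttp.ClientSession(trace_configs=[trace]) as session:
        async with session.head(url) as resp:
            print(resp.status, timings)
```

    Comparing the "dns", "connect" and "total" lists over a few hundred URLs should show whether name resolution, connection setup or waiting for the response dominates. Failed requests fire a separate exception signal, so they will not appear under "total" in this sketch.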