python-3.x python-asyncio

Strange errors with asynchronous requests


async def rss_downloader(rss):
    global counter
    async with download_limit:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'
        }
        try:
            response = await httpx.get(rss, headers=headers, verify=False)
            if response.status_code == 200:
                r_text = response.text
                await downloaded_rss.put({'url': rss, 'feed': r_text})
            else:
                counter += 1
                print(f'№{counter} - {response.status_code} - {rss}')
        except (
            ConnectTimeout, ConnectionClosed
        ):
            not_found_rss.append(rss)
        except Exception:
            not_found_rss.append(rss)
            logging.exception(f'{rss}')


async def main():
    parser_task = asyncio.create_task(parser_queue())
    tasks = [
        asyncio.create_task(rss_downloader(item['url'])) for item in db[config['mongodb']['channels_collection']].find({'url': {'$ne': 'No RSS'}})
    ]
    await asyncio.gather(*tasks, parser_task)

This code very often fails to load some pages and raises various errors. But when I load the same pages one at a time, everything works:

In [1]: import httpx

In [2]: r = await httpx.get('http://www.spinmob.com/nirvanictrance.xml')

In [3]: r
Out[3]: <Response [200 OK]>


I use a semaphore to limit the downloader to 20 concurrent workers, which is not many. I also tried lower limits, but the same errors still appear. Why does this happen, and what can I do about it?


Solution

  • The httpx documentation covers the ReadTimeout you are experiencing:

    HTTPX is careful to enforce timeouts everywhere by default. The default behavior is to raise a TimeoutException after 5 seconds of network inactivity. The read timeout specifies the maximum duration to wait for a chunk of data to be received (for example, a chunk of the response body). If HTTPX is unable to receive data within this time frame, a ReadTimeout exception is raised.

    First try disabling the read timeout (adapted from the example in the httpx documentation):

    timeout = httpx.Timeout(10.0, read_timeout=None)
    response = await httpx.get(rss, headers=headers, verify=False, timeout=timeout)
    

    And then experiment with different timeout durations to see what is reasonable for your use-case.


    EDIT: the API has since been updated, and the parameter has been renamed from read_timeout to read:

    timeout = httpx.Timeout(10.0, read=None)