Tags: python, python-requests, python-asyncio

Read websites with requests in parallel?


I would like to read the HTML content of websites in parallel, and I use the following code, which generally works fine:

import asyncio
import timeit

import requests

# websites: list of URLs, defined earlier

resultText = {}
start = timeit.default_timer()

async def main():
    loop = asyncio.get_event_loop()
    futures = [
        loop.run_in_executor(
            None,
            requests.get,
            websites[i]
        )
        for i in range(22)
    ]
    for i, response in enumerate(await asyncio.gather(*futures)):
        resultText[websites[i]] = response.text

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
stop = timeit.default_timer()
print(f"Time for whole process: {round((stop - start) / 60, 2)} min")
for k, v in resultText.items():
    print(k, len(v))
print(len(resultText))

But it only seems to work for 22 sites. When I change the for-loop from 22 to e.g. 23, it stops with the following error:

Traceback (most recent call last):
  File "C:\DEV\Fiverr\TRY\robalf\checkPages2.py", line 67, in <module>
    loop.run_until_complete(main())
  File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 642, in run_until_complete
    return future.result()
  File "C:\DEV\Fiverr\TRY\robalf\checkPages2.py", line 62, in main
    for i, response in enumerate(await asyncio.gather(*futures)):
  File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\concurrent\futures\thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\WRSPOL\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'Eine vorhandene Verbindung wurde vom Remotehost geschlossen' [German: 'An existing connection was closed by the remote host'], None, 10054, None))

How can I read more than 22 sites? (It's not necessary to read all of the sites in parallel; it would be enough for me to run the first 22 sites in parallel, then the next 22, and so on.) But when I try to put a loop around the async workflow, I seem to get the same error.
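The batching idea described above ("the first 22 in parallel, then the next 22") could be sketched like this; `fake_fetch` is a stand-in for the real request call, so the pattern runs without any network access:

```python
import asyncio

def chunked(items, size):
    """Yield successive `size`-sized slices of `items`."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

async def fake_fetch(url):
    await asyncio.sleep(0)  # placeholder for the real network request
    return f"<html>{url}</html>"

async def fetch_in_batches(urls, batch_size=22):
    """Fetch `urls` in sequential batches of at most `batch_size`."""
    results = {}
    for batch in chunked(urls, batch_size):
        # Each batch runs fully in parallel; batches run one after another.
        responses = await asyncio.gather(*(fake_fetch(u) for u in batch))
        results.update(zip(batch, responses))
    return results

urls = [f"site{i}" for i in range(50)]
pages = asyncio.run(fetch_in_batches(urls))
print(len(pages))  # 50
```

Replacing `fake_fetch` with the real fetch coroutine keeps the structure the same; only the innermost call changes.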


Solution

  • You can use httpx instead:

    import httpx
    
    async def get_stock_price_data(stock):
        # `async with` ensures the client is closed when the request is done
        async with httpx.AsyncClient() as client:
            stock_page = await client.get(f'https://finance.yahoo.com/quote/{stock}')
            return stock_page.text
    

    There's a full article that describes await/async in detail here: https://pythonhowtoprogram.com/python-await-async-tutorial-with-real-examples-and-simple-explanations/
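    To avoid the connection-reset error when fetching many sites, one approach is to cap how many requests run at once with a semaphore rather than batching. A minimal sketch of that pattern, with a stand-in `fake_fetch` instead of a real HTTP call so it runs offline (with httpx you would pass `client.get(url)` coroutines instead):

    ```python
    import asyncio

    async def gather_limited(coros, limit):
        """Run coroutines concurrently, but at most `limit` at a time."""
        sem = asyncio.Semaphore(limit)

        async def _run(coro):
            async with sem:  # wait for a free slot before starting
                return await coro

        return await asyncio.gather(*(_run(c) for c in coros))

    async def fake_fetch(url):
        await asyncio.sleep(0)  # stands in for the network I/O
        return f"<html>{url}</html>"

    pages = asyncio.run(
        gather_limited([fake_fetch(f"site{i}") for i in range(50)], limit=22)
    )
    print(len(pages))  # 50 results, fetched at most 22 at a time
    ```

    Because coroutines only start running when awaited, wrapping each one in `_run` means the semaphore genuinely limits how many requests are in flight.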