I'm using aiohttp with asyncio to make a batch of requests. My first approach was to create a session inside the fetch() function (which starts an asyncio.gather job) and then pass the session object to the functions that perform the POST requests (get_info):
def batch_starter(item_list):
    return_value = loop.run_until_complete(fetch(item_list))
    return return_value

async def fetch(item_list):
    async with aiohttp.ClientSession() as session:  # <- session started here
        results = await asyncio.gather(*[asyncio.ensure_future(get_info(session, item)) for item in item_list])
        return results  # return the results so batch_starter() receives them

async def get_info(session, item):  # <- session passed to the function
    async with session.post("some_url", data={"id": item}) as resp:
        html = await resp.json()
        some_info = html.get('info')
        return some_info
But since I'm confused, I am now leaning towards instantiating the session as soon as the script is imported, at the top of the file, like below:
import asyncio
import aiohttp
import json

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
session = aiohttp.ClientSession()  # <- session started at top of file

def batch_starter(item_list):
    return_value = loop.run_until_complete(fetch(item_list))
    return return_value

async def fetch(item_list):
    results = await asyncio.gather(*[asyncio.ensure_future(get_info(item)) for item in item_list])
    return results  # return the results so batch_starter() receives them

async def get_info(item):
    async with session.post("some_url", data={"id": item}) as resp:  # <- session from outer scope is used
        html = await resp.json()
        some_info = html.get('info')
        return some_info
The docs explain that opening a session for every request is a "very bad" idea (obviously). But this is stated right after an example that apparently does exactly that (my first approach)? Which of these is correct, and how is the session going to behave when it is created at the top of the file, as in the second approach? Wouldn't the session just stay open forever if I'm using the second approach?
The batch_starter() function is not going to be called often, but each call gets 9000+ items in item_list. I assumed this already reduced the number of sessions to one (per gather job), but apparently this is the "bad idea" example and needs to be corrected? The docs are a bit unclear about this...
Solution 2 consists of creating a session in the global scope. This solution should be avoided. The aiohttp FAQ states this explicitly here: Why is creating a ClientSession outside of an event loop dangerous?:
All asyncio object should be correctly finished/disconnected/closed before event loop shutdown. Otherwise user can get unexpected behavior. In the best case it is a warning about unclosed resource, in the worst case the program just hangs [...]
In other words, using solution 2 means your event loop can shut down while the session is still open, which can lead to bugs. Now, concerning your question:
wouldn't the session just stay open forever if I'm using the second approach?
Yes, the session will never be closed in your example. You could try to call session.close() (a coroutine, so it has to be awaited) somewhere in your code, but I'm not sure it's a good idea either.
Solution 1 consists of starting a session for each batch. Depending on your use case, solution 1 might be all you need and, in any case, you would use something that looks a lot like it.
If you don't care about the overhead of setting up new TCP connections for every batch, then it's fine. Like you said, if you don't have too many batches and run this on your local machine to scrape data from some website/API, then solution 1 will most likely suffice.
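To make the session lifetime in solution 1 visible, here is a minimal sketch that runs without network access. It is not the aiohttp API itself: FakeSession is a hypothetical stand-in for aiohttp.ClientSession that only counts how often it is opened and closed, its post() is simplified to return the decoded payload directly, and asyncio.run() replaces the manually created loop from the question.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for aiohttp.ClientSession, for illustration only."""

    opened = 0  # number of sessions created so far
    closed = 0  # number of sessions closed so far

    def __init__(self):
        FakeSession.opened += 1

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # A real session would release its pooled connections here.
        FakeSession.closed += 1

    async def post(self, url, data):
        await asyncio.sleep(0)  # pretend to wait for the network
        return {"info": f"info-{data['id']}"}


async def get_info(session, item):
    payload = await session.post("some_url", data={"id": item})
    return payload.get("info")


async def fetch(item_list):
    # One session per batch: opened once, shared by every request, then closed.
    async with FakeSession() as session:
        return await asyncio.gather(*(get_info(session, item) for item in item_list))


def batch_starter(item_list):
    # asyncio.run() creates an event loop, runs the batch and closes the loop.
    return asyncio.run(fetch(item_list))
```

Calling batch_starter(range(9000)) runs 9000 fake requests, and afterwards FakeSession.opened and FakeSession.closed are both 1: the whole batch shares a single session, which is closed before the event loop shuts down.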
If you want to optimize for latency, then you would have to share one session for your whole application. This is what is typically done in web servers.
Creating a single shared session is quite important when your code runs on a web server. The proposed solution would not be applicable directly because of how a web server works, but the goal is the same.
Also, in that case, your server should respond as fast as possible to incoming (user) requests, which means saving a bit of time is important.
The good news is, it's not too hard a problem to solve. One simple solution is to use a singleton class for your aiohttp client. We can instantiate this class in the global scope, then open the session when the server starts and close it when it stops (the hard part is finding how to inject code into the startup/shutdown routines).
For example, FastAPI (and Starlette) use a lifespan context manager to handle the server life cycle. We can use that lifespan option to open/close our session:
"""Entry point of your web server, usually app.py file"""
from contextlib import asynccontextmanager

from starlette.applications import Starlette
from fastapi import FastAPI

from clients import HTTPScraper

scraper = HTTPScraper()  # created globally, but the session is only opened at startup

@asynccontextmanager
async def contexts_managers(app: Starlette):
    scraper.start()  # open the session once the event loop is running
    try:
        yield
    finally:
        await scraper.stop()  # close the session on shutdown

app = FastAPI(lifespan=contexts_managers)

@app.post("/scrape-x")
async def x_scraping_task():
    async with scraper.session.get("https://url.com") as response:
        return await response.json()
The code of the HTTPScraper class could, for example, look like this:
"""clients.py file"""
from typing import Optional

import aiohttp

class HTTPScraper:
    session: Optional[aiohttp.ClientSession] = None

    def start(self):
        # Call this while the event loop is running (e.g. from the lifespan handler).
        self.session = aiohttp.ClientSession()
        return self.session

    async def stop(self):
        if self.session is not None:
            await self.session.close()
            self.session = None
If you want, you could even hide the fact that a session exists and expose a simpler interface, as this solution suggests.
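One way such an interface could look is sketched below. This is only an illustration, not the linked solution: InfoClient, StubSession and StubResponse are hypothetical names, and the stubs imitate just the small part of aiohttp used here (post() as an async context manager, resp.json(), session.close()) so the example runs without a network. In real code you would presumably pass aiohttp.ClientSession as the session factory.

```python
import asyncio


class InfoClient:
    """Hides the session; callers only see start(), stop() and get_info()."""

    def __init__(self, session_factory):
        # Assumption: in real code session_factory would be aiohttp.ClientSession.
        self._session_factory = session_factory
        self._session = None

    async def start(self):
        # Run this inside the event loop, e.g. from the lifespan handler.
        self._session = self._session_factory()

    async def stop(self):
        if self._session is not None:
            await self._session.close()
            self._session = None

    async def get_info(self, item):
        if self._session is None:
            raise RuntimeError("call start() before get_info()")
        async with self._session.post("some_url", data={"id": item}) as resp:
            payload = await resp.json()
        return payload.get("info")


class StubResponse:
    """Hypothetical stand-in for an aiohttp response."""

    def __init__(self, payload):
        self._payload = payload

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False

    async def json(self):
        return self._payload


class StubSession:
    """Hypothetical stand-in for aiohttp.ClientSession."""

    def post(self, url, data):
        return StubResponse({"info": f"info-{data['id']}"})

    async def close(self):
        pass


async def demo():
    client = InfoClient(StubSession)
    await client.start()
    try:
        return await asyncio.gather(*(client.get_info(i) for i in range(3)))
    finally:
        await client.stop()
```

With this shape, request handlers call client.get_info(item) and never touch the session, so the decision about when it is opened and closed stays in one place.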