Tags: python, asynchronous, async-await, python-asyncio, aiohttp

aiohttp download large list of pdf files


I am trying to download a large number of PDF files asynchronously; Python's requests library does not work well with async code.

I am finding aiohttp hard to use for PDF downloads, though, and I can't find a thread on this specific task that is easy to follow for someone new to Python's async world.

It can be done with a ThreadPoolExecutor (roughly like the sketch below), but in this case I would rather keep everything on one thread.
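
For reference, this is the kind of blocking threadpool version I mean; the worker count and file names here are just illustrative:

from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical URL list; the real one would hold ~100 different PDFs
urls = ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"] * 3


def download_sync(url, dest_file):
    # Each worker thread blocks on its own HTTP request
    resp = requests.get(url)
    resp.raise_for_status()
    with open(dest_file, "wb") as f:
        f.write(resp.content)


with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [
        pool.submit(download_sync, url, f"download_{i}.pdf")
        for i, url in enumerate(urls)
    ]
    for fut in futures:
        fut.result()  # re-raise any download error in the main thread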

The aiohttp snippet below works for a single file, but I need to do the same for 100 or so URLs asynchronously:

import asyncio

import aiofiles
import aiohttp


async def main():
    url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            if resp.status == 200:
                f = await aiofiles.open('download_pdf.pdf', mode='wb')
                await f.write(await resp.read())
                await f.close()


asyncio.run(main())

Thanks in advance.


Solution

  • You could try something like this. For the sake of simplicity, the same dummy PDF is downloaded multiple times to disk under different file names:

    from asyncio import Semaphore, gather, run, wait_for
    from random import randint
    
    import aiofiles
    from aiohttp.client import ClientSession
    
    # Mock a list of different pdfs to download
    pdf_list = [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
    ]
    
    MAX_TASKS = 5
    MAX_TIME = 5
    
    
    async def download(pdf_list):
        tasks = []
        sem = Semaphore(MAX_TASKS)
    
        async with ClientSession() as sess:
            for pdf_url in pdf_list:
                # Mock a different file name each iteration
                dest_file = str(randint(1, 100000)) + ".pdf"
                tasks.append(
                    # Wait max 5 seconds for each download
                    wait_for(
                        download_one(pdf_url, sess, sem, dest_file),
                        timeout=MAX_TIME,
                    )
                )
    
            return await gather(*tasks)
    
    
    async def download_one(url, sess, sem, dest_file):
        async with sem:
            print(f"Downloading {url}")
            async with sess.get(url) as res:
                # Check everything went well before reading the body
                if res.status != 200:
                    print(f"Download failed: {res.status}")
                    return
                content = await res.read()

            async with aiofiles.open(dest_file, "wb") as f:
                await f.write(content)
                # No need to call f.close() when using a with statement
    
    
    if __name__ == "__main__":
        run(download(pdf_list))
    

    Keep in mind that firing multiple concurrent requests at a server might get your IP banned for a period of time. In that case, consider adding a sleep call (which somewhat defeats the purpose of using aiohttp) or switching to a classic sequential script. To keep things concurrent but kinder to the server, the script fires at most 5 requests at any given time (MAX_TASKS).
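
    If you want to be even gentler on the server, or would rather skip a slow file instead of letting a single TimeoutError cancel the whole batch, a variation along these lines should work. This is only a sketch: the 0.5 s delay and the silent skip on a bad status are assumptions, not part of the original answer.

    from asyncio import Semaphore, gather, run, sleep, wait_for

    import aiofiles
    from aiohttp.client import ClientSession

    pdf_list = [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
    ] * 3

    MAX_TASKS = 5
    MAX_TIME = 5
    THROTTLE = 0.5  # hypothetical per-request delay in seconds


    async def download(pdf_list):
        sem = Semaphore(MAX_TASKS)
        async with ClientSession() as sess:
            tasks = [
                wait_for(download_one(url, sess, sem, f"{i}.pdf"), timeout=MAX_TIME)
                for i, url in enumerate(pdf_list)
            ]
            # return_exceptions=True: a TimeoutError is returned in the result
            # list instead of cancelling the other downloads
            return await gather(*tasks, return_exceptions=True)


    async def download_one(url, sess, sem, dest_file):
        async with sem:
            # Politeness delay so the workers do not hit the server back-to-back
            await sleep(THROTTLE)
            async with sess.get(url) as res:
                if res.status != 200:
                    return
                content = await res.read()
            async with aiofiles.open(dest_file, "wb") as f:
                await f.write(content)


    if __name__ == "__main__":
        run(download(pdf_list))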