Tags: python, asynchronous, async-await, python-asyncio, aiohttp

aiohttp download large list of pdf files


I am trying to download a large number of PDF files asynchronously; Python's requests library does not work well with async code.

I am finding aiohttp hard to use for PDF downloads, though, and I can't find a thread on this specific task that is easy to follow for someone new to Python's async world.

It can be done with a ThreadPoolExecutor (roughly like the sketch below), but in this case I would rather keep everything on one thread.
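
For reference, this is the kind of blocking threadpool version I mean; the worker count and file names here are just illustrative:

from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical URL list; the real one would hold ~100 different PDFs
urls = ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"] * 3


def download_sync(url, dest_file):
    # Each worker thread blocks on its own HTTP request
    resp = requests.get(url)
    resp.raise_for_status()
    with open(dest_file, "wb") as f:
        f.write(resp.content)


with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [
        pool.submit(download_sync, url, f"download_{i}.pdf")
        for i, url in enumerate(urls)
    ]
    for fut in futures:
        fut.result()  # re-raise any download error in the main thread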

The aiohttp snippet below works for a single file, but I need to do the same for 100 or so URLs asynchronously:

import asyncio

import aiofiles
import aiohttp


async def main():
    url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            if resp.status == 200:
                f = await aiofiles.open('download_pdf.pdf', mode='wb')
                await f.write(await resp.read())
                await f.close()


asyncio.run(main())

Thanks in advance.


Solution

  • You could try something like this. For the sake of simplicity, the same dummy PDF is downloaded multiple times to disk under different file names:

    from asyncio import Semaphore, gather, run, wait_for
    from random import randint
    
    import aiofiles
    from aiohttp.client import ClientSession
    
    # Mock a list of different pdfs to download
    pdf_list = [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
    ]
    
    MAX_TASKS = 5
    MAX_TIME = 5
    
    
    async def download(pdf_list):
        tasks = []
        sem = Semaphore(MAX_TASKS)
    
        async with ClientSession() as sess:
            for pdf_url in pdf_list:
                # Mock a different file name each iteration
                dest_file = str(randint(1, 100000)) + ".pdf"
                tasks.append(
                    # Wait max 5 seconds for each download
                    wait_for(
                        download_one(pdf_url, sess, sem, dest_file),
                        timeout=MAX_TIME,
                    )
                )
    
            return await gather(*tasks)
    
    
    async def download_one(url, sess, sem, dest_file):
        async with sem:
            print(f"Downloading {url}")
            async with sess.get(url) as res:
                # Check everything went well before reading the body
                if res.status != 200:
                    print(f"Download failed: {res.status}")
                    return
                content = await res.read()

            async with aiofiles.open(dest_file, "wb") as f:
                await f.write(content)
                # No need to call f.close() when using a with statement
    
    
    if __name__ == "__main__":
        run(download(pdf_list))
    

    Keep in mind that firing multiple concurrent requests at a server might get your IP banned for a period of time. In that case, consider adding a sleep call (which somewhat defeats the purpose of using aiohttp) or switching to a classic sequential script. To keep things concurrent but kinder to the server, the script fires at most 5 requests at any given time (MAX_TASKS).
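
    If you want to be even gentler on the server, or would rather skip a slow file instead of letting a single TimeoutError cancel the whole batch, a variation along these lines should work. This is only a sketch: the 0.5 s delay and the silent skip on a bad status are assumptions, not part of the original answer.

    from asyncio import Semaphore, gather, run, sleep, wait_for

    import aiofiles
    from aiohttp.client import ClientSession

    pdf_list = [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
    ] * 3

    MAX_TASKS = 5
    MAX_TIME = 5
    THROTTLE = 0.5  # hypothetical per-request delay in seconds


    async def download(pdf_list):
        sem = Semaphore(MAX_TASKS)
        async with ClientSession() as sess:
            tasks = [
                wait_for(download_one(url, sess, sem, f"{i}.pdf"), timeout=MAX_TIME)
                for i, url in enumerate(pdf_list)
            ]
            # return_exceptions=True: a TimeoutError is returned in the result
            # list instead of cancelling the other downloads
            return await gather(*tasks, return_exceptions=True)


    async def download_one(url, sess, sem, dest_file):
        async with sem:
            # Politeness delay so the workers do not hit the server back-to-back
            await sleep(THROTTLE)
            async with sess.get(url) as res:
                if res.status != 200:
                    return
                content = await res.read()
            async with aiofiles.open(dest_file, "wb") as f:
                await f.write(content)


    if __name__ == "__main__":
        run(download(pdf_list))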