
Faster way to iterate over dataframe?


I have a dataframe where each row is a record, and I need to send each record in the body of a POST request. Right now I am looping over the dataframe to accomplish this. I am constrained by the fact that each record must be posted individually. Is there a faster way to do this?


Solution

  • Iterating over the data frame is not the issue here. The issue is that you have to wait for the server to respond to each of your requests. A network request takes eons compared to the CPU time needed to iterate over the data frame. In other words, your program is I/O bound, not CPU bound.

    One way to speed it up is to use coroutines. Say you have to make 1000 requests. Instead of firing one request, waiting for the response, then firing the next, and so on, you fire all 1000 requests at once and tell Python to wait until all 1000 responses have arrived.

    Since you didn't provide any code, here's a small program to illustrate the point:

    import aiohttp
    import asyncio
    import numpy as np
    import time
    
    from typing import List
    
    async def send_single_request(session: aiohttp.ClientSession, url: str):
        async with session.get(url) as response:
            return await response.json()
    
    async def send_all_requests(urls: List[str]):
        async with aiohttp.ClientSession() as session:
            # Make 1 coroutine for each request
            coroutines = [send_single_request(session, url) for url in urls]
            # Wait until all coroutines have finished
            return await asyncio.gather(*coroutines)
    
    # We will make 10 requests to httpbin.org. Each request will take at least d
    # seconds. If you were to fire them sequentially, they would have taken at least
    # delays.sum() seconds to complete.
    np.random.seed(42)
    delays = np.random.randint(0, 5, 10)
    urls = [f"https://httpbin.org/delay/{d}" for d in delays]
    
    # Instead, we will fire all 10 requests at once, then wait until all 10 have
    # finished.
    t1 = time.time()
    result = asyncio.run(send_all_requests(urls))
    t2 = time.time()
    
    print(f"Expected time: {delays.sum()} seconds")
    print(f"Actual time: {t2 - t1:.2f} seconds")
    

    Output:

    Expected time: 28 seconds
    Actual time: 4.57 seconds
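
    Adapted to your use case, the same pattern works for POST requests. Here is a minimal sketch, assuming a hypothetical endpoint that accepts each record as a JSON body (the URL and the example dataframe are placeholders, not from your question):

    import aiohttp
    import asyncio
    import pandas as pd
    
    async def post_single_record(session: aiohttp.ClientSession, url: str, record: dict):
        # POST one record as a JSON body; return the response status
        async with session.post(url, json=record) as response:
            return response.status
    
    async def post_all_records(url: str, records: list):
        async with aiohttp.ClientSession() as session:
            # One coroutine per record, all fired concurrently
            coroutines = [post_single_record(session, url, record) for record in records]
            return await asyncio.gather(*coroutines)
    
    # Placeholder dataframe; substitute your own.
    df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
    # Convert each row to a plain dict; recent pandas returns native Python
    # types here, which is what the JSON encoder needs.
    records = df.to_dict("records")
    statuses = asyncio.run(post_all_records("https://httpbin.org/post", records))
    print(statuses)
    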
    

    You have to read up a bit on coroutines and how they work, but for the most part they are not too complicated for your use case. This comes with a couple of caveats:

    1. All your requests must be independent of each other.
    2. The rate limit on the server must be sufficient for your workload. For example, if the server restricts you to 2 requests per minute, there is no way around that other than upgrading to a different service tier. You can, however, cap how many requests are in flight at once; see the sketch after this list.
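
    For the second caveat, a common mitigation is to cap how many requests are in flight at any one time with asyncio.Semaphore. A minimal sketch of that idea (the limit of 5 is an arbitrary example, and note that this caps concurrency, not requests per minute):

    import aiohttp
    import asyncio
    
    MAX_IN_FLIGHT = 5  # arbitrary example limit
    
    async def send_with_limit(semaphore: asyncio.Semaphore,
                              session: aiohttp.ClientSession, url: str):
        # At most MAX_IN_FLIGHT coroutines can be inside this block at once;
        # the rest wait their turn on the semaphore.
        async with semaphore:
            async with session.get(url) as response:
                return await response.json()
    
    async def send_all_limited(urls):
        semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
        async with aiohttp.ClientSession() as session:
            coroutines = [send_with_limit(semaphore, session, url) for url in urls]
            return await asyncio.gather(*coroutines)
    

    A strict requests-per-minute budget needs actual timing logic on top of this, but a concurrency cap is often enough to stay within a server's limits.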