Tags: python, python-asyncio, aiohttp

Asynchronous status check on multiple HTTP requests


I have a list containing a few thousand URLs pointing to images/videos on a remote server. Something like:

urls = ['https://foo.bar/baz.jpeg', 'https://foo.bar/jaz.mp4', ...]

When fetching those urls, some of the responses come back as 404 Not Found, and that's ok because the data on the server might be outdated and removed. What I'm trying to do is identify which of the urls will give me a 404, in a fast way.

When I open my browser and type one of the faulty urls into the address bar, the Not Found error takes roughly 200ms to come back. Doing some naive calculations, I expect that ~1,000 calls would take no more than 4 seconds to complete if made in an async manner.

However, when using this code, which I believe is more or less appropriate:

import asyncio
from aiohttp import ClientSession


def async_check(urls):

    async def fetch(session, url):
        async with session.get(url) as response:
            if response.status != 200:
                return False
            else:
                return True

    async def run(urls):
        async with ClientSession() as session:
            return await asyncio.gather(*[fetch(session, url) for url in urls])

    return asyncio.get_event_loop().run_until_complete(run(urls))

the elapsed time is much longer, and sometimes it actually times out.

I believe that's due to the non-faulty urls in the list, which point to images and videos that can take a long time to load as response objects and end up consuming a lot of time before the task completes.

After putting some thought into how I can verify the 404s, I came up with a flow that looks more or less like this:

For each url, asynchronously fetch it with a GET request, and also asynchronously sleep for a relatively long amount of time (say, 1 second). When the sleep is done, check whether the response is "ready" and, if it is, add the url to my list of faulty urls when the status code is 404 (or anything other than 200). If the response is not "ready" after sleeping, I will assume that it's because a heavy image/video is being loaded, and consider the url non-faulty.

Since the upper limit of awaiting time for each call is 1 second, I expect that it would run relatively fast for a bunch of urls.
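
Roughly, I imagine something like this (an untested sketch; check_with_sleep and the faulty list are names I made up, fetch is the coroutine from the code above, and error handling is omitted):

async def check_with_sleep(session, url, faulty):
    task = asyncio.ensure_future(fetch(session, url))  # start the GET in the background
    await asyncio.sleep(1)                             # give it at most ~1 second
    if task.done():
        if not task.result():                          # fetch() returned False, i.e. not a 200
            faulty.append(url)
    else:
        task.cancel()                                  # still loading, assume a heavy but non-faulty asset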

Would this be considered a reasonable way to tackle the problem, or is there a smarter way of doing it?


Solution

  • I believe that's due to the non-faulty urls in the list, which point to images and videos that can take a long time to load as response objects and end up consuming a lot of time before the task completes.

    It's hard to tell in advance whether this is actually true, but you can certainly test it by adding code that uses time.time() to measure time elapsed for each request and print its status.
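
    For example, a minimal instrumented wrapper might look like this (timed_fetch is just an illustrative name, reusing the fetch coroutine from your code):

        import time

        async def timed_fetch(session, url):
            start = time.time()
            ok = await fetch(session, url)  # the fetch() from the question
            print(f"{url}: ok={ok}, elapsed={time.time() - start:.3f}s")
            return ok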

    Note that, unless you await response.read() or equivalent, the response body is not "loaded" by the client, only the headers are. Still, it is quite possible that some non-faulty URLs take a long time to return the header. It is also possible that some faulty ones take a long time to return the error status, perhaps those you didn't check manually. asyncio.gather() will take as long as the longest URL in the list, so if you have thousands of URLs, at least some of them are bound to lag.

    But assuming your premise is correct, you can implement the limit by wrapping fetch inside wait_for:

        async def fetch_with_limit(session, url):
            try:
                return await asyncio.wait_for(fetch(session, url), 1)
            except asyncio.TimeoutError:
                return True  # took more than 1s, probably non-faulty
    

    Now you can use fetch_with_limit instead of fetch.
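
    For example, here is a sketch of run() from your question adapted to collect the urls judged faulty (the list comprehension and variable names are mine):

        async def run(urls):
            async with ClientSession() as session:
                results = await asyncio.gather(
                    *[fetch_with_limit(session, url) for url in urls])
            # False means the url answered within 1s with a non-200 status
            return [url for url, ok in zip(urls, results) if not ok]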