Tags: python, pandas, python-requests, python-imaging-library, urllib

How to halt the process only when something like "No Internet" or "Network Error" occurs while downloading images using requests


I have written a script to download and save images to a directory from the URLs provided. It uses requests to access the URLs given in a DataFrame (CSV file) and downloads the images into my directory using Pillow. The name of each image is the index number of its URL in my CSV. If there is any bad URL, which is not accessible, it just increases the counter. It starts downloading from the maximum existing index up to the desired index every time I run the script. My code is working fine. It is something like this:

import pandas as pd

import os
from os import listdir
from os.path import isfile, join
import sys

from PIL import Image

import requests
from io import BytesIO

import argparse


arg_parser = argparse.ArgumentParser(allow_abbrev=True, description='Download images from URLs into a directory',)

arg_parser.add_argument('-d','--DIR',required=True,
                       help='Directory name where images will be saved')

arg_parser.add_argument('-c','--CSV',required=True,
                       help='CSV file name which contains the URLs')

arg_parser.add_argument('-i','--index',type=int,
                       help='Index number of column which contain the urls')

arg_parser.add_argument('-e','--end',type=int,
                       help='How many images to download')

args = vars(arg_parser.parse_args())


def load_save_image_from_url(url,OUT_DIR,img_name):
    # fetch the image and decode it from the in-memory bytes
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    # use the URL's file extension as the image format
    img_format = url.split('.')[-1]
    img_name = img_name+'.'+img_format
    img.save(OUT_DIR+img_name)
    return None


csv = args['CSV']
DIR = args['DIR']

ind = 0
if args.get('index'):
    ind = args['index']

df = pd.read_csv(csv) # read csv
indices = [int(f.split('.')[0]) for f in listdir(DIR) if isfile(join(DIR, f))] # get existing images

total_images_already = len(indices)
print(f'There are already {total_images_already} images present in the directory -{DIR}-\n')
start = 0
if len(indices):
    start = max(indices)+1 # set starting index
    
end = 5000 # next n numbers of images to download
if args.get('end'):
    end = args['end']

print(f'Downloaded a total of {total_images_already} images up to index: {start-1}. Downloading the next {end} images from -{csv}-\n')

count = 0
for i in range(start, start+end):
    if count%250==0:
        print(f"Total {total_images_already+count} images downloaded in directory. {end-count} remaining from the current defined\n")

    url = df.iloc[i,ind]
    try:
        load_save_image_from_url(url,DIR,str(i))
        count+=1

    except (KeyboardInterrupt, SystemExit):
        sys.exit("Forced exit prompted by User: Quitting....")

    except Exception as e:
        print(f"Error at index {i}: {e}\n")

I want to add functionality so that when something like "No Internet" or a connection error occurs, instead of moving forward, the script stops the process for, say, 5 minutes. After 5 tries, i.e. 25 minutes, if the problem still persists, it should quit the program instead of increasing the counter. I want to add this because if there is no internet for even 2 minutes and then it comes back, the loop keeps running and starts downloading the images from whatever index it has reached. The next time I run this program, it will think the missing URLs were bad, when really there was just no internet connection.

How can I do this?

Note: Obviously, I am thinking about using time.sleep(), but I want to know which error directly reflects "No Internet" or a connection error in requests. One is from requests.exceptions import ConnectionError. If I have to use this, how can I use it to keep retrying the same i counter for up to 5 attempts, quit the program if still unsuccessful, and on a successful connection resume the regular loop?


Solution

  • Better than a plain sleep() is to use exponential backoff.

    import requests
    from requests.adapters import HTTPAdapter
    from requests.packages.urllib3.util.retry import Retry

    retry_strategy = Retry(
        total=3,
        backoff_factor=10,  # sleep between failed requests, see formula below
        status_forcelist=[429, 500, 502, 503, 504],
        method_whitelist=["HEAD", "GET", "OPTIONS"]  # renamed to allowed_methods in urllib3 >= 1.26
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    http = requests.Session()
    http.mount("https://", adapter)
    http.mount("http://", adapter)

    response = http.get(url)
    

    Here, you can configure the parameters as follows:

    1. total=3 - The total number of retry attempts to make.
    2. backoff_factor=10 - Controls how long the process sleeps between failed requests.

    The formula for the back-off sleep time is: {backoff factor} * (2 ** ({number of total retries} - 1))

    So a backoff factor of 10 seconds gives sleep times of 5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560 seconds between subsequent requests.
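
    The Retry adapter above transparently retries flaky HTTP responses, but a dead network surfaces as requests.exceptions.ConnectionError, which is exactly the exception you asked about. Below is a minimal sketch of how the two could be combined in your loop, reusing the names from your script (df, start, end, ind, DIR, count, Image, BytesIO) and the http session from above; MAX_ATTEMPTS and WAIT_SECONDS are hypothetical constants taken from the 5-tries/5-minutes requirement in the question:

    import sys
    import time
    from requests.exceptions import ConnectionError

    MAX_ATTEMPTS = 5       # 5 tries, per the question
    WAIT_SECONDS = 5 * 60  # 5 minutes between tries, per the question

    def load_save_image_from_url(url, OUT_DIR, img_name):
        response = http.get(url)  # session with the Retry adapter mounted
        img = Image.open(BytesIO(response.content))
        img_format = url.split('.')[-1]
        img.save(OUT_DIR + img_name + '.' + img_format)

    for i in range(start, start + end):
        url = df.iloc[i, ind]
        attempts = 0
        while True:
            try:
                load_save_image_from_url(url, DIR, str(i))
                count += 1
                break
            except ConnectionError:
                # no internet: wait 5 minutes and retry the SAME index i
                attempts += 1
                if attempts >= MAX_ATTEMPTS:
                    sys.exit("No internet connection after 5 attempts: Quitting....")
                print(f"Connection lost at index {i}: retrying in 5 minutes "
                      f"(attempt {attempts}/{MAX_ATTEMPTS})")
                time.sleep(WAIT_SECONDS)
            except (KeyboardInterrupt, SystemExit):
                sys.exit("Forced exit prompted by User: Quitting....")
            except Exception as e:
                # any other error (bad URL etc.): skip this index, as before
                print(f"Error at index {i}: {e}\n")
                break

    This way the adapter absorbs short, transient failures (HTTP 429/5xx) with exponential back-off, while the outer except ConnectionError branch handles a dropped network: the same i is retried up to 5 times with a 5-minute wait, so a temporary outage can no longer make good URLs look bad.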