I have written a script to download and save images to a directory from the URLs provided. It uses requests to access each URL given in a DataFrame (read from a CSV file) and downloads the images into my directory using Pillow. The name of each image is the index number of its URL in my CSV. If a URL is bad (not accessible), the script just skips that index and moves on. Every time I run the script, it starts downloading from the maximum existing index up to the desired number of images. My code is working fine. It is something like this:
import pandas as pd
import os
from os import listdir
from os.path import isfile, join
import sys
from PIL import Image
import requests
from io import BytesIO
import argparse

arg_parser = argparse.ArgumentParser(allow_abbrev=True,
                                     description='Download images from url in a directory')
arg_parser.add_argument('-d', '--DIR', required=True,
                        help='Directory name where images will be saved')
arg_parser.add_argument('-c', '--CSV', required=True,
                        help='CSV file name which contains the URLs')
arg_parser.add_argument('-i', '--index', type=int,
                        help='Index number of column which contain the urls')
arg_parser.add_argument('-e', '--end', type=int,
                        help='How many images to download')
args = vars(arg_parser.parse_args())

def load_save_image_from_url(url, OUT_DIR, img_name):
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    img_format = url.split('.')[-1]
    img_name = img_name + '.' + img_format
    img.save(OUT_DIR + img_name)
    return None

csv = args['CSV']
DIR = args['DIR']
ind = 0
if args.get('index'):
    ind = args['index']

df = pd.read_csv(csv)  # read csv
indices = [int(f.split('.')[0]) for f in listdir(DIR) if isfile(join(DIR, f))]  # get existing images
total_images_already = len(indices)
print(f'There are already {len(indices)} images present in the directory -{DIR}-\n')

start = 0
if len(indices):
    start = max(indices) + 1  # set starting index

end = 5000  # next n numbers of images to download
if args.get('end'):
    end = args['end']

print(f'Downloaded a total of {total_images_already} images up to index: {start-1}. Downloading the next {end} images from -{csv}-\n')

count = 0
for i in range(start, start + end):
    if count % 250 == 0:
        print(f"Total {total_images_already+count} images downloaded in directory. {end-count} remaining in the current run\n")
    url = df.iloc[i, ind]
    try:
        load_save_image_from_url(url, DIR, str(i))
        count += 1
    except (KeyboardInterrupt, SystemExit):
        sys.exit("Forced exit prompted by User: Quitting....")
    except Exception as e:
        print(f"Error at index {i}: {e}\n")
        pass
I want to add functionality so that when something like no internet or a connection error occurs, instead of going forward, the script pauses for, say, 5 minutes. After 5 tries (i.e. 25 minutes), if the problem still persists, it should quit the program instead of advancing the counter. I want to add this because if the internet drops for even 2 minutes and then comes back, the loop keeps running and resumes downloading from whatever index it has reached by then. The next time I run the program, it will treat the skipped URLs as bad ones, when really there was just no internet connection.
How can I do this?
Note: Obviously, I am thinking about using time.sleep(), but I want to know which error in requests directly reflects no internet or a connection error. One candidate is from requests.exceptions import ConnectionError. If I have to use this, how can I use it to keep retrying the same i counter for up to 5 attempts, quit the program if still unsuccessful, and continue the regular loop once the connection succeeds?
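Something like this rough sketch is what I have in mind, dropping into the existing script and assuming ConnectionError is the right exception to catch (the 5-minute wait and 5-attempt limit are just my placeholders):

from requests.exceptions import ConnectionError
import time

MAX_ATTEMPTS = 5
WAIT_SECONDS = 300  # 5 minutes

for i in range(start, start + end):
    url = df.iloc[i, ind]
    for attempt in range(MAX_ATTEMPTS):
        try:
            load_save_image_from_url(url, DIR, str(i))
            count += 1
            break  # success: move on to the next index
        except ConnectionError:
            print(f"Connection error at index {i}, attempt {attempt + 1}/{MAX_ATTEMPTS}; sleeping {WAIT_SECONDS}s")
            time.sleep(WAIT_SECONDS)
        # (bad-URL handling from the original loop omitted for brevity)
    else:
        # all attempts failed without a break: quit instead of skipping the index
        sys.exit("No connection after 5 attempts: Quitting....")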
Better than a plain sleep is to use retries with exponential backoff, which requests supports through urllib3's Retry mounted on a Session:

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # or: from requests.packages.urllib3.util.retry import Retry

retry_strategy = Retry(
    total=3,
    backoff_factor=10,  # required for exponential backoff; the default is 0 (no sleep)
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["HEAD", "GET", "OPTIONS"],  # called method_whitelist in urllib3 < 1.26
)
adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)
http.mount("http://", adapter)

response = http.get(url)
Here, total is the maximum number of retries, status_forcelist is the set of HTTP status codes that force a retry, and backoff_factor controls the sleep between attempts. The formula for the back-off sleep is:

{backoff factor} * (2 ** ({number of total retries} - 1))

So a backoff factor of 10s gives sleeps of 5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560 seconds between subsequent requests.
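To tie this back to the loop in the question: build the session once and route the download through it, so that a ConnectionError only escapes after the adapter's own retries are exhausted. A rough sketch, reusing the http session defined above:

from io import BytesIO
from PIL import Image

def load_save_image_from_url(url, OUT_DIR, img_name):
    # Same function as in the question, except the request goes through
    # the retrying session instead of a bare requests.get()
    response = http.get(url)
    img = Image.open(BytesIO(response.content))
    img_format = url.split('.')[-1]
    img.save(OUT_DIR + img_name + '.' + img_format)

Any ConnectionError that still escapes this call means the adapter has already retried with backoff and failed, so the outer 5-attempt/5-minute loop from the question can treat it as a genuine outage and quit with sys.exit instead of skipping the index.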