Tags: python, dataframe, csv, dictionary, urlretrieve

Multiple image downloader using a CSV file and Python


I am facing an error with this code. Can anyone help me fix it so I can automate downloading all the images whose URLs are listed in the CSV file?

The error I am getting is:

        URLError                                  Traceback (most recent call last)
      <ipython-input-320-dcd87f841181> in <module>
         19         urlShort = re.search(filejpg, str(r)).group()
         20         print(urlShort)
    ---> 21         download(x, f'{di}/{urlShort}')
         22         print(type(x))
         URLError: <urlopen error unknown url type: {'https>

This is the code I am using:

from pathlib import Path
from shutil import rmtree as delete
from urllib.request import urlretrieve as download
from gazpacho import get, Soup
import re
import pandas as pd
import numpy as np


#import data
df = pd.read_csv('urlReady1.csv')
df.shape
#locate folder
di = 'Dubai'
Path(di).mkdir(exist_ok=True)

#change data to dict
dict_copy = df.to_dict('records')

#iterate over every row of the data and download the jpg file
for r in dict_copy:
    if r == 'urlready':
        print("header")
    else:
        x = str(r)
        filejpg = "[\d]{1,}\.jpg"
        urlShort = re.search(filejpg, str(r)).group()
        print(urlShort)
        download(x, f'{di}/{urlShort}')
        print(type(x))

Solution

  • I can't see your data set, but I think pandas to_dict('records') is returning a list of dicts (which you are storing as dict_copy). So when you iterate through it with for r in dict_copy:, r isn't a URL but a dict that contains the URL. str(r) then converts that dict {<stuff>} into the string '{<stuff>}', and that is what you are sending off as your URL.

    I think that's why you are seeing the error URLError: <urlopen error unknown url type: {'https>

    Adding a print statement after the DataFrame dump (print(dict_copy) right after dict_copy = df.to_dict('records')) and another at the top of your loop (print(r) right after for r in dict_copy:) would help you see what's going on and test/confirm this hypothesis.
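
    For instance, here is a minimal sketch (using a made-up two-row DataFrame in place of urlReady1.csv) showing the shape that to_dict('records') produces and why str(r) is not a usable URL:

        import pandas as pd

        # stand-in for pd.read_csv('urlReady1.csv')
        df = pd.DataFrame({'urlReady': ['https://example.com/1.jpg',
                                        'https://example.com/2.jpg']})

        dict_copy = df.to_dict('records')
        print(dict_copy)
        # [{'urlReady': 'https://example.com/1.jpg'}, {'urlReady': 'https://example.com/2.jpg'}]

        r = dict_copy[0]
        print(str(r))          # "{'urlReady': 'https://example.com/1.jpg'}" -- a dict rendered as text, not a URL
        print(r['urlReady'])   # 'https://example.com/1.jpg' -- the actual URL string you want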

    Thanks for adding sample data! So dict_copy is something like [{'urlReady': 'mobile.****.***.**/****/43153.jpg'}, {'urlReady': 'mobile.****.***.**/****/46137.jpg'}]

    So yes, dict_copy is a list of dicts, each with 'urlReady' as the key and a URL string as the value. You want to retrieve the URL from each dict using that key. The best approach may depend on things like whether some rows lack valid URLs, but this can get you started and give you a quick view of the data so you can spot anything weird:

    for r in dict_copy:
        urlstr = r.get('urlReady', '')  # .get with a default of '' means you can safely use string methods on the result
        print('\nurl check: type is', type(urlstr), 'url is', urlstr)
        if isinstance(urlstr, str) and '.jpg' in urlstr:  # make sure the value looks like a jpg URL; replace with another check if that makes more sense
            filejpg = r"[\d]{1,}\.jpg"  # raw string so \d isn't treated as a string escape
            urlShort = re.search(filejpg, urlstr).group()
            print('downloading from', urlstr, 'to', f'{di}/{urlShort}')
            download(urlstr, f'{di}/{urlShort}')
        else:
            print('bad data! dict:', r, 'urlstr:', urlstr)
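
    If some rows still fail after that (for example, URLs stored without an http:// or https:// scheme, which urlretrieve also rejects with a URLError), you could wrap the download call in a small helper and skip bad rows instead of crashing the loop. This is only a sketch of that idea, not part of the code above, and defaulting to https:// is an assumption about your data:

        from urllib.error import URLError
        from urllib.request import urlretrieve as download

        def safe_download(url, dest):
            # assumption: https works if the CSV stored bare hostnames without a scheme
            if not url.startswith(('http://', 'https://')):
                url = 'https://' + url
            try:
                download(url, dest)
                return True
            except (URLError, ValueError) as exc:
                print('skipping', url, '->', exc)
                return False

    You would then call safe_download(urlstr, f'{di}/{urlShort}') in place of the download(...) line above.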