python urllib urllib2

Using Python (urllib/urllib2) to download images is very slow


I'm trying to download Magic: The Gathering card images from scryfall.com. They provide a JSON file with information about every single card (including the URL of its image), so I wrote code that reads every URL from that JSON file and attempts to save the image. The problem is that the request part of the code takes more than 5 minutes per image to run, and I have no idea why. (Each image I'm fetching is less than 100 kB and opens instantaneously in the browser.)

I have tried urllib.urlretrieve and urllib2.urlopen, and the result is the same. I tried running it on both Python 2 and Python 3.

There are no error messages and the code actually works; it's only the long time it takes that makes it unviable to carry on with it.

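Edit:

# Python 2
import json
import urllib2

a=open("cards.json")
b=a.read()
a.close()

try:
    content=json.loads(b)
except ValueError:
    # the file is not valid JSON, so there is nothing to download
    print("could not parse cards.json")
    exit()

count=0
for j in content:
    count=count+1
    # only regular, physically printed cards
    if j['layout']=='normal' and j['digital']==False:
        url=str(j['image_uris']['normal'])
        print(url)
        print("a")
        # urlopen returns a file-like object; urlretrieve returns a
        # (filename, headers) tuple, which has no read() method
        i1=urllib2.urlopen(url)
        print("b")
        i2=i1.read()
        file=open(str(count),'wb')
        file.write(i2)
        file.close()

    # stop after the first few cards while testing
    if count>5:
        exit()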

Edit 2: the link to the JSON I'm using: https://archive.scryfall.com/json/scryfall-default-cards.json


Solution

  • This code gets the image in less than 1 second:

    import requests
    
    url = 'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651'
    r = requests.get(url)
    
    with open('image.jpg', 'wb') as f:
        f.write(r.content)
    

    The same is true of this code:

    import urllib.request
    
    url = 'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651'
    urllib.request.urlretrieve(url, 'image.jpg')
    

    I didn't check with more images. Maybe the problem is that the server sees too many requests from one IP in a short time and then blocks them.
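If throttling really is the cause (this is only a guess), one way to test the idea is to space the requests out and time each one individually. The following is a minimal sketch, not a confirmed fix: `fetch_bytes`, `download_politely`, the `User-Agent` string, the 10-second timeout, and the 0.1-second delay are all placeholder names and values chosen for illustration.

```python
import time
import urllib.request


def fetch_bytes(url, timeout=10.0):
    """Download one URL; raise if the server stalls longer than `timeout` seconds."""
    # The User-Agent value is a made-up placeholder, not something the site requires.
    request = urllib.request.Request(url, headers={"User-Agent": "card-image-fetcher/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def download_politely(urls, fetch=fetch_bytes, delay=0.1, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests.

    Returns a list of (url, data, seconds) tuples so slow requests stand out.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # spread the requests out instead of sending a burst
        started = time.perf_counter()
        data = fetch(url)
        results.append((url, data, time.perf_counter() - started))
    return results
```

With a timeout in place, a stalled connection raises an exception after a few seconds instead of hanging for minutes, and the per-request timings show whether every image is slow or only some of them.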


    EDIT: I used this code to download 10 images and display the time taken:

    import urllib.request
    import time
    import json
    
    print('load json')
    
    start = time.time()
    # a with-block closes the file once it has been parsed
    with open("scryfall-default-cards.json", encoding="utf-8") as f:
        content = json.load(f)
    end = time.time()
    print('time:', end-start)
    
    # ---
    
    start = time.time()
    
    all_urls = len(content)
    
    urls_to_download = 0
    for item in content:
        if item['layout'] == 'normal' and item['digital'] is False:
            urls_to_download += 1
    
    print('urls:', all_urls, urls_to_download)
    
    end = time.time()
    print('time:', end-start)
    
    # ----
    
    start = time.time()
    count = 0
    for item in content:
        if item['layout'] == 'normal' and item['digital'] is False:
            count += 1
            url = item['image_uris']['normal']
            name = url.split('?')[0].split('/')[-1]
            print(name)
            # the imgs/ directory must already exist, or urlretrieve raises an error
            urllib.request.urlretrieve(url, 'imgs/' + name)
        if count >= 10:
            break
    end = time.time()
    print('time:', end-start)
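The file name in the loop above comes from cutting the URL at `?` and taking the last path segment. As a small sketch, the same thing can be done with the standard library's URL parser, which also ignores a `#fragment`. The helper name `filename_from_url` is made up for this example.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of `url`, ignoring query string and fragment."""
    return posixpath.basename(urlsplit(url).path)


print(filename_from_url(
    "https://img.scryfall.com/cards/normal/front/2/c/"
    "2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651"
))
# prints: 2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg
```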
    

    Results

    load json
    time: 3.9926743507385254
    urls: 47237 41805
    time: 0.054879188537597656
    2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg
    37bc0128-a8d0-477c-abcf-2bdc9e38b872.jpg
    2ae1bb79-a931-4d2e-9cc9-a06862dc5cde.jpg
    4889a668-0f01-4447-ad2e-91b329258f22.jpg
    5b13ba5a-f4b0-420a-9e4f-a65e57721fa4.jpg
    893b309d-5e8f-47fa-9f54-eaf16a5f96e3.jpg
    27d30285-7729-4130-a768-71867aefe9b3.jpg
    783616d6-e3ea-43fd-97eb-6e4c5a2c711f.jpg
    cc101b90-3e17-4beb-a606-3e76088e362c.jpg
    36da00e3-3ef6-4ad5-a53d-e71cfdafc1e6.jpg
    42e1033b-383e-49b4-875f-ccdc94e08c9d.jpg
    time: 2.656561851501465
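To put these numbers in perspective, here is a rough back-of-the-envelope estimate. It assumes the 2.66 seconds covers the 10 images the loop was meant to fetch, and that the server keeps responding at the same speed for the whole run, which is not guaranteed.

```python
elapsed_seconds = 2.656561851501465  # last "time:" line in the results above
images_timed = 10                    # assumption: the loop's intended limit
images_total = 41805                 # second number on the "urls:" line

seconds_per_image = elapsed_seconds / images_timed
estimated_hours = seconds_per_image * images_total / 3600

print(f"{seconds_per_image:.3f} s per image")
print(f"about {estimated_hours:.1f} hours for all {images_total} images")
```

At that rate each image takes about a quarter of a second, so downloading everything one image at a time would take roughly three hours, nowhere near 5 minutes per image.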