I'm trying to download Magic: The Gathering card images from scryfall.com. They provide a JSON file with all the information about every single card (including the URL for its image), so I wrote code that reads every URL from that JSON file and attempts to save the image. The problem is that the request part of the code takes more than 5 minutes per image to run, and I have no idea why. (Each image I'm fetching is less than 100 kB and opens instantly in the browser.)
I have tried urllib.urlretrieve and urllib2.urlopen, and it's all the same. I tried running it on both Python 2 and Python 3.
There are no error messages; the code actually works, but the long time it takes makes it unviable to carry on with it.
edit:
import json
import urllib  # Python 2; on Python 3 use urllib.request

a = open("cards.json")
b = a.read()
data = [b]
count = 0
for elem in data:
    try:
        content = json.loads(elem)
    except ValueError:
        print content
        exit()
    for j in content:
        count = count + 1
        if j['layout'] == 'normal' and j['digital'] == False:
            url = str(j['image_uris']['normal'])
            print(url)
            # urllib.urlretrieve() returns a (filename, headers) tuple,
            # not a file object, so read the response via urlopen() instead
            i1 = urllib.urlopen(url)
            i2 = i1.read()
            f = open(str(count), 'wb')
            f.write(i2)
            f.close()
            if count > 5:
                exit()
edit2: the link to the JSON I'm using: https://archive.scryfall.com/json/scryfall-default-cards.json
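The bulk file linked above is one large JSON array of card objects. A small helper to pull out just the URLs worth downloading, sketched under the field names already used in the code above (the function name is my own):

```python
import json

def normal_image_urls(cards):
    """Return the 'normal'-size image URL for every physical card
    with the plain 'normal' layout (skips digital-only cards)."""
    return [card['image_uris']['normal']
            for card in cards
            if card['layout'] == 'normal' and card['digital'] is False]

# usage, assuming the bulk file has been saved locally:
# with open('scryfall-default-cards.json') as f:
#     urls = normal_image_urls(json.load(f))
```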
This code gets the image in less than 1 second:
import requests

url = 'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651'
r = requests.get(url)
with open('image.jpg', 'wb') as f:
    f.write(r.content)
The same works with this code:
import urllib.request
url = 'https://img.scryfall.com/cards/normal/front/2/c/2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg?1561567651'
urllib.request.urlretrieve(url, 'image.jpg')
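Both snippets above open a fresh connection for every image. If each request really takes minutes, one thing worth trying is reusing a single connection via requests.Session, which keeps the TCP/TLS connection alive between downloads. A sketch; the folder name and the save_image/image_filename helpers are my own:

```python
import requests

def image_filename(url):
    # '.../2c23b39b-....jpg?1561567651' -> '2c23b39b-....jpg'
    return url.split('?')[0].split('/')[-1]

session = requests.Session()  # keep-alive: reuses one TCP/TLS connection

def save_image(url, folder='imgs'):
    r = session.get(url)
    r.raise_for_status()  # fail loudly instead of saving an error page
    with open(folder + '/' + image_filename(url), 'wb') as f:
        f.write(r.content)
```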
I didn't check more images. Maybe the problem appears when the server sees too many requests from one IP in a short time and then blocks or throttles them.
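If server-side throttling is the cause, pausing briefly between requests may help; as far as I recall, Scryfall's API guidelines ask clients to leave a small delay (on the order of 50-100 ms) between requests. A minimal sketch, with the 100 ms figure as my assumption:

```python
import time

def throttled(items, delay=0.1):
    """Yield items, sleeping `delay` seconds between consecutive ones."""
    for i, item in enumerate(items):
        if i:
            time.sleep(delay)
        yield item

# usage, assuming `urls` and a `download(url)` helper exist:
# for url in throttled(urls, delay=0.1):
#     download(url)
```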
EDIT: I used this code to download 10 images and print the timings:
import urllib.request
import time
import json

print('load json')
start = time.time()
content = json.loads(open("scryfall-default-cards.json").read())
end = time.time()
print('time:', end-start)

# ---

start = time.time()
all_urls = len(content)
urls_to_download = 0
for item in content:
    if item['layout'] == 'normal' and item['digital'] is False:
        urls_to_download += 1
print('urls:', all_urls, urls_to_download)
end = time.time()
print('time:', end-start)

# ----

start = time.time()
count = 0
for item in content:
    if item['layout'] == 'normal' and item['digital'] is False:
        count += 1
        url = item['image_uris']['normal']
        name = url.split('?')[0].split('/')[-1]
        print(name)
        urllib.request.urlretrieve(url, 'imgs/' + name)
        if count >= 10:
            break
end = time.time()
print('time:', end-start)
Results
load json
time: 3.9926743507385254
urls: 47237 41805
time: 0.054879188537597656
2c23b39b-a4d6-4f10-8ced-fa4b1ed2cf74.jpg
37bc0128-a8d0-477c-abcf-2bdc9e38b872.jpg
2ae1bb79-a931-4d2e-9cc9-a06862dc5cde.jpg
4889a668-0f01-4447-ad2e-91b329258f22.jpg
5b13ba5a-f4b0-420a-9e4f-a65e57721fa4.jpg
893b309d-5e8f-47fa-9f54-eaf16a5f96e3.jpg
27d30285-7729-4130-a768-71867aefe9b3.jpg
783616d6-e3ea-43fd-97eb-6e4c5a2c711f.jpg
cc101b90-3e17-4beb-a606-3e76088e362c.jpg
36da00e3-3ef6-4ad5-a53d-e71cfdafc1e6.jpg
42e1033b-383e-49b4-875f-ccdc94e08c9d.jpg
time: 2.656561851501465