I'm running a scraper that collects the URLs of all the images it can find on r/dankmemes on Reddit, stores them in a list, and then tries to download the files. For some reason an error occurs during the download. Can someone please explain what I'm doing wrong? I'm new to Python.
The traceback points to line 38: urllib.request.urlretrieve(image[0],'/Users/CENSORED/Desktop/Instagrammemes/image_' + str(img_count) + ".jpg")
The Error Message:
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
Traceback (most recent call last):
  File "/Users/CENSORED/Desktop/FirstImages/scraper.py", line 38, in <module>
    urllib.request.urlretrieve(image[0],'/Users/CENSORED/Desktop/Instagrammemes/image_' + str(img_count) + ".jpg")
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 510, in open
    req = Request(fullurl, data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 328, in __init__
    self.full_url = url
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 354, in full_url
    self._parse()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 383, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'h'
The code that I think is causing the problem:
with open('/Users/CENSORED/Desktop/FirstImages/file.csv') as images :
    images = csv.reader(images)
    img_count = 1
    for image in images:
        image = url.strip('\'"')
        urllib.parse.quote(':')
        urllib.request.urlretrieve(image[0],'/Users/CENSORED/Desktop/Instagrammemes/image_' + str(img_count) + ".jpg")
        img_count += 1
The text file:
['https://a.thumbs.redditmedia.com/JkyImC_zyl4XzE_yW-G4KOUTTFB6MRHUR3eEHvrpq64.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png', 'https://www.redditstatic.com/desktop2x/img/gold/badges/award-silver-cartoon.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://preview.redd.it/i6sdyng7n3h21.jpg?width=640&crop=smart&auto=webp&s=1abb4b30f2b74f114f2743cf66bf3d0e7f618abf',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://i.redd.it/m9q2841su3h21.jpg',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://i.redd.it/tsp8qpamc3h21.png',
'https://www.redditstatic.com/desktop2x/img/gold/badges/award-silver-cartoon.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://external-preview.redd.it/Ho2XSQOhaHGN3LhkLnPAf2OTkXwtuBTKQ9FXgdumH-I.jpg?width=640&crop=smart&auto=webp&s=54356f6b63ea9f51953f6a42d6c77fa4bf47df44',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://preview.redd.it/9j8389cno3h21.jpg?width=640&crop=smart&auto=webp&s=23c0ef3307b8b8ebdc7c4bcc3d16837ad58e460a',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://preview.redd.it/up1ouzug13h21.jpg?width=640&crop=smart&auto=webp&s=584bb8c90056156c3d2483d6f4b1030f7bf4e27d',
'https://a.thumbs.redditmedia.com/JkyImC_zyl4XzE_yW-G4KOUTTFB6MRHUR3eEHvrpq64.png',
'https://styles.redditmedia.com/t5_2zmfe/styles/image_widget_3xmxw4p2gqu01.png',
'https://b.thumbs.redditmedia.com/aRUO-zIbXgMTDVJOcxKjY8P6rGkakMdyVXn4k1VN-Mk.png', 'https://b.thumbs.redditmedia.com/iL0Rq5QLIS6xVLwoYKL8na6ZaSa9tILrBbhBlMfjVdI.png', 'https://b.thumbs.redditmedia.com/9aAIqRjSQwF2C7Xohx1u2Q8nAUqmUsHqdYtAlhQZsgE.png',
'https://b.thumbs.redditmedia.com/voAwqXNBDO4JwIODmO4HXXkUJbnVo_mL_bENHeagDNo.png']
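The 'h' in the ValueError is a useful clue. If image is a string by the time urlretrieve is called, image[0] is only its first character, so the function receives 'h' instead of the whole URL. A minimal sketch that reproduces the message without touching the network; it assumes image is a plain string at that point, and the URL is simply one taken from the list above:

```python
import urllib.request

# Assumption: inside the loop, image ends up being a plain string, not a list.
image = 'https://i.redd.it/m9q2841su3h21.jpg'

first_char = image[0]  # indexing a string yields a single character
print(repr(first_char))

try:
    # Building a Request only parses the URL; no connection is opened.
    urllib.request.Request(first_char)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Passing image itself rather than image[0] hands the full URL to urlretrieve.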
Assuming the input file (urls.json) looks like this:
["https://i.redd.it/m9q2841su3h21.jpg",
"https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png",
"https://i.redd.it/tsp8qpamc3h21.png"]
The code below downloads the images to c:\temp
import json
import urllib.request as req

# Load the list of URLs from the JSON file.
with open('urls.json') as images:
    images = json.load(images)

for idx, image_url in enumerate(images):
    image_url = image_url.strip()
    # Use the text after the last '.' in the URL as the file extension.
    file_name = 'c:\\temp\\{}.{}'.format(idx, image_url.split('.')[-1])
    print('About to download {} to file {}'.format(image_url, file_name))
    req.urlretrieve(image_url, file_name)
Output:
About to download https://i.redd.it/m9q2841su3h21.jpg to file c:\temp\0.jpg
About to download https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png to file c:\temp\1.png
About to download https://i.redd.it/tsp8qpamc3h21.png to file c:\temp\2.png
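Two caveats apply when running this against the file from the question. First, that file uses single quotes, which is not valid JSON, so json.load would reject it. Second, some of the URLs carry a query string, so the text after the last '.' is not a clean extension. The sketch below shows one possible way to handle both. The helper names parse_url_list and extension_from_url, the '.jpg' fallback, and the removal of repeated URLs are assumptions made for this example, not part of the original code:

```python
import ast
import os
from urllib.parse import urlsplit

def parse_url_list(text):
    """Parse a Python-style list such as "['https://a', 'https://b']"."""
    urls = ast.literal_eval(text)  # evaluates literals only, never arbitrary code
    if not isinstance(urls, list):
        raise ValueError('expected a list of URL strings')
    # Strip whitespace and drop repeated URLs while keeping the original order.
    return list(dict.fromkeys(url.strip() for url in urls))

def extension_from_url(url, default='.jpg'):
    """Return the extension found in the URL path, ignoring any query string."""
    path = urlsplit(url).path        # '/i6sdyng7n3h21.jpg' for the preview URL below
    ext = os.path.splitext(path)[1]  # '.jpg', or '' when the path has no extension
    return ext or default

# Shortened sample in the same single-quoted format as the file in the question.
sample = """['https://i.redd.it/m9q2841su3h21.jpg',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://preview.redd.it/i6sdyng7n3h21.jpg?width=640&crop=smart&auto=webp&s=1abb4b30f2b74f114f2743cf66bf3d0e7f618abf']"""

urls = parse_url_list(sample)
file_names = ['{}{}'.format(idx, extension_from_url(url)) for idx, url in enumerate(urls)]
print(file_names)
```

Each name could then be joined with the target folder and passed to req.urlretrieve(url, file_name), as in the download loop above.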