Tags: python, python-3.x, csv, beautifulsoup, urllib3

Why does my web scraper not convert the URLs from a CSV file properly for downloading?


I'm running a scraper that collects the URLs of all images it can find on r/dankmemes on Reddit, converts them to a list, and then tries to download the files, but an error occurs. Can someone please explain what I'm doing wrong? I'm new to Python.

The traceback points to line 38: urllib.request.urlretrieve(image[0],'/Users/CENSORED/Desktop/Instagrammemes/image_' + str(img_count) + ".jpg")

The Error Message:

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
Traceback (most recent call last):
  File "/Users/CENSORED/Desktop/FirstImages/scraper.py", line 38, in <module>
    urllib.request.urlretrieve(image[0],'/Users/CENSORED/Desktop/Instagrammemes/image_' + str(img_count) + ".jpg")
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 510, in open
    req = Request(fullurl, data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 328, in __init__
    self.full_url = url
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 354, in full_url
    self._parse()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 383, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'h'

The code that I think is causing the problem:

with open('/Users/CENSORED/Desktop/FirstImages/file.csv') as images :
    images = csv.reader(images)
    img_count = 1
    for image in images:
        image = url.strip('\'"')
        urllib.parse.quote(':')
        urllib.request.urlretrieve(image[0],'/Users/CENSORED/Desktop/Instagrammemes/image_' + str(img_count) + ".jpg")
        img_count += 1

The text file:

['https://a.thumbs.redditmedia.com/JkyImC_zyl4XzE_yW-G4KOUTTFB6MRHUR3eEHvrpq64.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://www.redditstatic.com/desktop2x/img/gold/badges/award-silver-cartoon.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://preview.redd.it/i6sdyng7n3h21.jpg?width=640&crop=smart&auto=webp&s=1abb4b30f2b74f114f2743cf66bf3d0e7f618abf',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://i.redd.it/m9q2841su3h21.jpg',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://i.redd.it/tsp8qpamc3h21.png',
'https://www.redditstatic.com/desktop2x/img/gold/badges/award-silver-cartoon.png',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://external-preview.redd.it/Ho2XSQOhaHGN3LhkLnPAf2OTkXwtuBTKQ9FXgdumH-I.jpg?width=640&crop=smart&auto=webp&s=54356f6b63ea9f51953f6a42d6c77fa4bf47df44',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://preview.redd.it/9j8389cno3h21.jpg?width=640&crop=smart&auto=webp&s=23c0ef3307b8b8ebdc7c4bcc3d16837ad58e460a',
'https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png',
'https://preview.redd.it/up1ouzug13h21.jpg?width=640&crop=smart&auto=webp&s=584bb8c90056156c3d2483d6f4b1030f7bf4e27d',
'https://a.thumbs.redditmedia.com/JkyImC_zyl4XzE_yW-G4KOUTTFB6MRHUR3eEHvrpq64.png',
'https://styles.redditmedia.com/t5_2zmfe/styles/image_widget_3xmxw4p2gqu01.png',
'https://b.thumbs.redditmedia.com/aRUO-zIbXgMTDVJOcxKjY8P6rGkakMdyVXn4k1VN-Mk.png',
'https://b.thumbs.redditmedia.com/iL0Rq5QLIS6xVLwoYKL8na6ZaSa9tILrBbhBlMfjVdI.png',
'https://b.thumbs.redditmedia.com/9aAIqRjSQwF2C7Xohx1u2Q8nAUqmUsHqdYtAlhQZsgE.png',
'https://b.thumbs.redditmedia.com/voAwqXNBDO4JwIODmO4HXXkUJbnVo_mL_bENHeagDNo.png']

Solution
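
The message "unknown url type: 'h'" suggests that a single character, not a full URL, reached urlretrieve. In the question's loop, image is a string after strip(), so image[0] is most likely just its first character. The sketch below reproduces that under this assumption; the URL is one taken from the question's file, and building a Request object parses the URL without making any network call.

```python
import urllib.request

# One URL from the question's file, used here only for illustration.
url = "https://i.redd.it/m9q2841su3h21.jpg"

# Indexing a string returns a single character, not the whole URL.
first_char = url[0]

try:
    # Request() parses the URL when it is constructed; nothing is downloaded.
    urllib.request.Request(first_char)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(first_char)
print(error_message)
```

Passing the whole string (image rather than image[0]) avoids this particular error.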

  • Assuming the input file (urls.json) used in the code below looks like:

    ["https://i.redd.it/m9q2841su3h21.jpg",
    "https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png",
    "https://i.redd.it/tsp8qpamc3h21.png"] 
    

    The code below downloads the images to c:\temp:

    import json
    import urllib.request as req

    with open('urls.json') as url_file:
        image_urls = json.load(url_file)

    for idx, image_url in enumerate(image_urls):
        image_url = image_url.strip()
        # Use the text after the last dot as the file extension
        file_name = 'c:\\temp\\{}.{}'.format(idx, image_url.split('.')[-1])
        print('About to download {} to file {}'.format(image_url, file_name))
        req.urlretrieve(image_url, file_name)
    

    Output:

    About to download https://i.redd.it/m9q2841su3h21.jpg to file c:\temp\0.jpg
    About to download https://www.redditstatic.com/desktop2x/img/renderTimingPixel.png to file c:\temp\1.png
    About to download https://i.redd.it/tsp8qpamc3h21.png to file c:\temp\2.png
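
The code above assumes a double-quoted JSON array and URLs that end in their file extension. The file in the question differs on both points: it uses single quotes, which json.load rejects, and some URLs carry a query string after the extension. A minimal sketch of one way to handle both, assuming the file holds a single Python-style list literal; the helper names parse_url_list and extension_from_url are hypothetical, and the sample URLs are copied from the question:

```python
import ast
import os
from urllib.parse import urlparse


def parse_url_list(text):
    """Parse a Python-style list literal (single or double quotes) into URLs."""
    urls = ast.literal_eval(text)
    return [u.strip() for u in urls]


def extension_from_url(image_url, default="jpg"):
    """Return the extension from the URL path, ignoring any query string."""
    path = urlparse(image_url).path
    ext = os.path.splitext(path)[1].lstrip(".")
    # Fall back to an assumed default when the path has no extension.
    return ext or default


# Sample input in the same single-quoted style as the question's file.
sample_text = """['https://i.redd.it/m9q2841su3h21.jpg',
 'https://preview.redd.it/i6sdyng7n3h21.jpg?width=640&crop=smart&auto=webp&s=1abb4b30f2b74f114f2743cf66bf3d0e7f618abf',
 'https://i.redd.it/tsp8qpamc3h21.png']"""

urls = parse_url_list(sample_text)
file_names = ["{}.{}".format(idx, extension_from_url(u)) for idx, u in enumerate(urls)]
print(file_names)
```

Each resulting name can then be joined with the target directory and passed to urlretrieve together with the full URL.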