Tags: python, beautifulsoup, downloading-website-files

Problem with files downloaded by using python


I am trying to download some jpgs from a site and save them to my hard drive, but I can't open the saved files: they appear to be corrupted, and every one of them is 115 KB for some reason.

I've tried changing the chunk size and played around with the request a little, but it didn't help. There are no errors in the shell, and the website URL is correct.

import os

import bs4
import requests

url = 'http://<site>'
os.makedirs('photos', exist_ok = True)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
elem = soup.select('a img')
if elem == []:
    print('no images')
else:
    for i in range(len(elem)):
        link = elem[i].get('src')
        if link != None:
            plik = open(os.path.join('photos', os.path.basename(link)), 'wb')
            for chunk in res.iter_content(100000):
                plik.write(chunk)
            plik.close()
            print('downloaded %s' % os.path.basename(link))

Solution (in the 'for i...' loop):

url = 'http://<site>'
os.makedirs('photos', exist_ok = True)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
elem = soup.select('a img')
if elem == []:
    print('no images')
else:
    for i in range(len(elem)):
        link = elem[i].get('src')
        if link is not None:
            link = url + link
            # make a second request for the image itself
            res2 = requests.get(link)
            res2.raise_for_status()
            plik = open(os.path.join('photos', os.path.basename(link)), 'wb')
            for chunk in res2.iter_content(100000):
                plik.write(chunk)
            plik.close()
            print('downloaded %s' % os.path.basename(link))

Solution

  • After reading the HTML page response and extracting the src of each image, you have to use that URL to make another HTTP(S) request to stream the image itself.

    At the moment you are trying to keep reading from the initial response (res), which only contains the page's HTML — that is why every "image" file has the same size and won't open.

    Note: for every link and image, a browser makes a further HTTP request; your script has to do the same.
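The fix described above can be sketched as follows. This is a minimal version, not the asker's exact code: it assumes a hypothetical page URL and uses urljoin() so that both relative and absolute src values resolve correctly, which plain string concatenation gets wrong.

```python
import os
from urllib.parse import urljoin

import bs4
import requests


def download_images(page_url, dest='photos'):
    """Fetch a page, then issue a second request per image to save its bytes."""
    os.makedirs(dest, exist_ok=True)

    # First request: the HTML page only.
    res = requests.get(page_url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, features='html.parser')

    for img in soup.select('a img'):
        src = img.get('src')
        if src is None:
            continue
        # Resolve the src against the page URL (handles relative paths).
        img_url = urljoin(page_url, src)

        # Second request: the actual image bytes, streamed in chunks.
        res2 = requests.get(img_url, stream=True)
        res2.raise_for_status()
        with open(os.path.join(dest, os.path.basename(src)), 'wb') as f:
            for chunk in res2.iter_content(100000):
                f.write(chunk)
        print('downloaded %s' % os.path.basename(src))


# usage (hypothetical URL):
# download_images('http://<site>')
```

Using stream=True together with iter_content() keeps large images out of memory, and the with block closes each file even if a write fails.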