web-scraping beautifulsoup urllib urlopen

Avoid downloading images using BeautifulSoup and urllib.request


I am using BeautifulSoup (with the 'lxml' parser) and urllib.request.urlopen() to get text information from a website. However, when I check the network section in Activity Monitor, I see that Python downloads a lot of data. This suggests that not only the text is downloaded, but the images as well.

Is it possible to avoid downloading images when web scraping with BeautifulSoup?


Solution

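  • It's unlikely that the images are being downloaded. Images are not part of the HTML page itself; the page only references them through tags such as <img src="/here/goes/this/img"... A browser makes separate trips to fetch static files such as JS, images, and CSS, but urllib.request.urlopen() fetches only the URL you give it, and BeautifulSoup only parses the HTML it is handed. The traffic you see is most likely the HTML document itself. One way to reduce its size is to request compressed content.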

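    Add an "Accept-Encoding": "gzip" header to the Request object. If the server supports it, the response body arrives compressed, which usually shrinks HTML considerably. Check the response's Content-Encoding header; if it is gzip, pass the body to gzip.decompress() to get the raw bytes, then decode those bytes to get string data.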
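
The gzip approach described above could look roughly like the following. This is a minimal sketch, not a definitive implementation: the helper names decode_body and fetch_html are made up for illustration, and the fallback to UTF-8 when the server does not declare a charset is an assumption.

```python
import gzip
import urllib.request


def decode_body(raw: bytes, content_encoding: str, charset: str = "utf-8") -> str:
    """Decompress the body if the server gzipped it, then decode it to str.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if content_encoding.lower() == "gzip":
        raw = gzip.decompress(raw)  # returns bytes, not str
    return raw.decode(charset, errors="replace")


def fetch_html(url: str) -> str:
    """Fetch a single URL, asking the server for a gzip-compressed response.

    Only this one URL is requested; images, CSS, and JS referenced by the
    page are not fetched.
    """
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        # The server may ignore Accept-Encoding, so check what it actually sent.
        content_encoding = response.headers.get("Content-Encoding", "")
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
    return decode_body(raw, content_encoding, charset)
```

The string returned by fetch_html(url) can then be passed to BeautifulSoup(html, 'lxml') as before. Checking Content-Encoding rather than always calling gzip.decompress() matters because a server that does not support compression will send the body uncompressed, and decompressing that would raise an error.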