web-scraping beautifulsoup urllib urlopen

Avoid downloading images using Beautifulsoup and urllib.request

I am using BeautifulSoup ('lxml' parser) with urllib.request.urlopen() to get text information from a website. However, when I check the network section in my Activity Monitor, I see that Python downloads a lot of data. This suggests that not only the text is downloaded, but the images as well.

Is it possible to avoid downloading images when web scraping with BeautifulSoup?

Solution

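It's unlikely that the images are being downloaded. Images are not part of the page's HTML; the HTML only references them, as in <img src="/here/goes/this/img"...>. A browser (or your own code using urllib) has to make a separate request for each static file such as JS, images, or CSS, and urlopen() on the page URL does not do that for you. One possible way to reduce the download size is to request compressed content.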

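Add an "Accept-Encoding": "gzip" header to the Request object. If the server supports it, the response can be considerably smaller, especially for text-heavy HTML. urllib does not decompress the response for you, so check the Content-Encoding response header and, if it is gzip, pass the body to gzip.decompress(). That returns bytes, which you then decode to get string data.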
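
A minimal sketch of the above, using only the standard library. The helper names fetch_html and decode_body are made up for illustration, and falling back to UTF-8 when the server sends no charset is an assumption that may not suit every site.

```python
import gzip
import urllib.request


def decode_body(raw: bytes, content_encoding: str = "", charset: str = "utf-8") -> str:
    """Decompress the body if it was gzipped, then decode the bytes to str."""
    if content_encoding.lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset, errors="replace")


def fetch_html(url: str) -> str:
    """Request a page with gzip allowed and return its HTML as text."""
    # Tell the server we can handle a gzip-compressed response.
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        # The server may ignore the header, so check what it actually sent.
        content_encoding = response.headers.get("Content-Encoding", "")
        # Assumption: fall back to UTF-8 if no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
    return decode_body(raw, content_encoding, charset)
```

The returned string can be passed to BeautifulSoup as usual, for example BeautifulSoup(fetch_html(url), "lxml"). Only the single HTML document is requested here; nothing referenced by <img> tags is fetched.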