I am using BeautifulSoup (with the 'lxml' parser) together with urllib.request.urlopen() to get text information from a website. However, when I check the Network tab in Activity Monitor, I see that Python downloads a lot of data. This suggests that not only the text is being downloaded, but the images as well.
Is it possible to avoid downloading images when web scraping with BeautifulSoup?
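That's unlikely: the images are not part of the page itself, they are only referenced from it, as in <img src="/here/goes/this/img"... A client (a browser, or your own code using urllib) has to make a separate request for each static file such as JS, images, or CSS, and urlopen() on the page URL fetches only the HTML document. One possible way to reduce the download size is to request compressed content.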
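Add the "Accept-Encoding": "gzip" header to the Request object. If the server supports it, the response can be considerably smaller, since HTML usually compresses well. You then call gzip.decompress() on the response body to get the raw bytes, and decode those bytes to get string data.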
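
A minimal sketch of the gzip approach described above, using only the standard library. The function names decode_body and fetch_html are hypothetical, chosen for illustration. The sketch assumes the server reports compression through the Content-Encoding response header, and it falls back to UTF-8 when the response declares no charset.

```python
import gzip
import urllib.request


def decode_body(raw: bytes, content_encoding: str = "", charset: str = "utf-8") -> str:
    """Return the response body as text, decompressing it first if it is gzipped."""
    # A server may ignore Accept-Encoding, so only decompress when it
    # actually reports that the body is gzip-encoded.
    if content_encoding.lower() == "gzip":
        raw = gzip.decompress(raw)
    # gzip.decompress() returns bytes; decode them to get a str.
    return raw.decode(charset, errors="replace")


def fetch_html(url: str) -> str:
    """Fetch a single URL, asking the server for a gzip-compressed response."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        content_encoding = response.headers.get("Content-Encoding", "")
        # Assumption: fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
    return decode_body(raw, content_encoding, charset)
```

The string returned by fetch_html(url) can then be passed to BeautifulSoup(html, 'lxml') as before. Only the one URL is requested; nothing referenced by <img> tags is fetched unless you request those URLs yourself.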