python, beautifulsoup, urllib

What does read() in urlopen('http.....').read() do? [urllib]


Hi, I'm reading "Web Scraping with Python" (2015). I saw the following two ways of opening a URL, with and without .read(). See bs1 and bs2:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs1 = BeautifulSoup(html.read(), 'html.parser')

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs2 = BeautifulSoup(html, 'html.parser')

bs1 == bs2 # True


print(bs1.prettify()[0:100])
print(bs2.prettify()[0:100]) # prints same thing

So is .read() redundant? Thanks

Code on p. 7 of Web Scraping with Python (uses .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())

Code on p. 15 (without .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)

Solution

  • Quoting the BS docs:

    To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

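    When you call .read() yourself, you are using the "string" interface: urlopen() returns a file-like response object, and .read() returns its whole body as bytes. When you don't, you are using the "filehandle" interface.

    Effectively it works the same way, so in your case .read() is redundant. As far as I can tell, BS4 does not read a file-like object lazily: the constructor calls .read() on it itself, so the whole content ends up in memory either way.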
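
To make the two interfaces concrete, here is a minimal sketch that runs without network access. It assumes io.BytesIO as a stand-in for the response object returned by urlopen(), and as_markup is a hypothetical helper written only for illustration. It is not BeautifulSoup's actual code, just the general "string or filehandle" idea.

```python
import io


def as_markup(source):
    """Hypothetical helper: accept a document or an open file-like object.

    A simplified illustration of the "string or filehandle" idea,
    not BeautifulSoup's real implementation.
    """
    if hasattr(source, "read"):  # file-like object: read it ourselves
        return source.read()
    return source                # already a str/bytes document


page = b"<html><body><h1>An Interesting Title</h1></body></html>"

# io.BytesIO stands in for the urlopen() response; both are file-like
# objects whose .read() returns bytes.
with_read = as_markup(io.BytesIO(page).read())  # "string" interface
without_read = as_markup(io.BytesIO(page))      # "filehandle" interface
same = with_read == without_read

# read() consumes the stream, so a second call finds nothing left.
response = io.BytesIO(page)
first = response.read()
second = response.read()
```

Both calls hand the same bytes on to the parser, which is why bs1 and bs2 compare equal. The last three lines also show why a response cannot be passed to a parser after .read() has already drained it: the second read returns an empty result.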