Hi I'm reading "Web Scraping with Python (2015)". I saw the following two ways of opening url, with and without using .read()
. See bs1
and bs2
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs1 = BeautifulSoup(html.read(), 'html.parser')
html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs2 = BeautifulSoup(html, 'html.parser')
bs1 == bs2 # true
print(bs1.prettify()[0:100])
print(bs2.prettify()[0:100]) # prints same thing
So is .read()
redundant? Thanks
Code on p7 of Web scpraing with python: (use .read()
)
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())
Code on p15 (without .read()
)
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)
Quoting BS docs:
To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:
When you're using .read() method you use an "string" inteface. When you are not, you're using "filehandle" interface.
Effectively it works same way (although BS4 may read file-like object in lazy way). In your case whole content is read to string object (it's may consume more memory unnecessarily).