Tags: python, beautifulsoup, html-parsing

What is the difference between BeautifulSoup's site.content and site.read()?


When I parse a local HTML file stored on my laptop, the following code

from bs4 import BeautifulSoup
site = open('smpl.htm', 'r')
page = BeautifulSoup(site.content, 'html.parser')
print(page)

returns (in the cmd):

Traceback (most recent call last):
  File "c:/~~~~~~/python/h.py", line 3, in <module>
    page = BeautifulSoup(site.content, 'html.parser')
AttributeError: '_io.TextIOWrapper' object has no attribute 'content'

but by replacing site.content with site.read(), the code shows the correct HTML and performs operations on it without any problems.

However, if I get my HTML file from the web through requests, then I'll have to write site.content and not site.read() to parse it.

What is the difference between content and read() and which is appropriate for what?


Solution

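  • Opening an HTML file on your laptop with open() returns a TextIOWrapper (a text file object), which has a read() method that returns the contents of the file as a string. It has no content attribute, which is why site.content raises the AttributeError above.

    Fetching a web page with requests gives you a different class with different attributes: requests.get() returns a Response object, whose content attribute holds the response body as bytes (its text attribute holds the same body decoded to a string). A Response has no read() method, so use content or text there. Neither attribute belongs to BeautifulSoup itself; it simply parses whatever string or bytes you pass it.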
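
A minimal sketch of that distinction, written so it runs without a file on disk or a network call. The helper name get_markup is hypothetical (it is not part of bs4 or requests), io.StringIO stands in for a file opened with open(), and the SimpleNamespace object is an assumed stand-in that mimics only the content attribute of a real requests Response.

```python
import io
from types import SimpleNamespace


def get_markup(source):
    """Return raw markup from a file object or a requests-style response.

    Hypothetical helper for illustration; not part of bs4 or requests.
    """
    if hasattr(source, "content"):
        # requests.Response exposes the body as the bytes attribute `content`.
        return source.content
    if hasattr(source, "read"):
        # File objects (e.g. the TextIOWrapper from open()) need a read() call.
        return source.read()
    raise TypeError(f"cannot get markup from {type(source).__name__}")


# Stand-ins (assumptions, see the note above the code):
# io.StringIO behaves like a text file opened with open(..., 'r'), and
# SimpleNamespace mimics only the `content` attribute of a Response.
local_file = io.StringIO("<p>local</p>")
fake_response = SimpleNamespace(content=b"<p>web</p>")

local_markup = get_markup(local_file)    # str, obtained via read()
web_markup = get_markup(fake_response)   # bytes, obtained via .content
```

Either result can then be passed as the first argument to BeautifulSoup(..., 'html.parser'), which accepts both strings and bytes.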