I am trying to parse a local HTML document using the following code:
import os, sys
from bs4 import BeautifulSoup
path = os.path.abspath(os.path.dirname(sys.argv[0]))
fnHTML = os.path.join(path, "inp.html")
page = open(fnHTML)
soup = BeautifulSoup (page.read(), 'lxml')
worker = soup.find("span")
wHeadLine = worker.text.strip()
wPara = worker.find_next("td").text.strip()
print(wHeadLine)
print(wPara)
The output looks like this:
Find your faves—faster
We’ve made it easier than ever to see what’s on now and continue watching your recordings, favorite teams and more.
But the text in the HTML file looks different (see the picture).
Why is the text not printed with "—" and "We’ve"?
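This is caused by the Windows default character encoding (cp1252): open() uses it when no encoding is given, so the UTF-8 bytes in the file are decoded incorrectly. Pass encoding='utf-8' when opening the file. Also, use a context manager so that the file stream is closed properly.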
import os, sys
from bs4 import BeautifulSoup
path = os.path.abspath(os.path.dirname(sys.argv[0]))
fnHTML = os.path.join(path, "inp.html")
with open(fnHTML, encoding='utf-8') as file:  # read the file as UTF-8
    soup = BeautifulSoup(file.read(), 'lxml')
worker = soup.find("span")
wHeadLine = worker.text.strip()
wPara = worker.find_next("td").text.strip()
print(wHeadLine)
print(wPara)
When opening a text file, especially on Windows, always pass the encoding explicitly (usually utf-8). Without it, open() falls back to a locale-dependent default, so the same script can behave differently on another OS or machine.
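To see the effect in isolation, here is a minimal sketch using only the standard library. It assumes the HTML file is saved as UTF-8; the sample string and the temporary file are made up for illustration and are not the original inp.html. It writes the text as UTF-8 bytes, reads it back once as cp1252 and once as UTF-8, and prints the locale default that open() would use if no encoding were given.

```python
import locale
import os
import tempfile

# Sample text with an em dash (U+2014) and a curly apostrophe (U+2019).
# This is a stand-in for the file content, not the original inp.html.
text = "Find your faves\u2014faster. We\u2019ve made it easier."

# Write the text to a temporary file as UTF-8 bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    tmp_path = tmp.name

try:
    # Decoding UTF-8 bytes as cp1252 turns each special character
    # into three unrelated characters.
    with open(tmp_path, encoding="cp1252") as f:
        garbled = f.read()

    # Decoding with the encoding the file was written in restores the text.
    with open(tmp_path, encoding="utf-8") as f:
        correct = f.read()
finally:
    os.remove(tmp_path)

print(ascii(garbled))   # escaped form, safe to print on any console
print(correct == text)  # True when the round trip is lossless

# The encoding open() falls back to when none is passed (platform dependent).
print(locale.getpreferredencoding(False))
```

If the last line prints cp1252 on your machine, that is the encoding the original code was silently using.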