
Parse data from local html-file using bs4?


I am trying to parse a local HTML document using the following code:

import os, sys
from bs4 import BeautifulSoup

path = os.path.abspath(os.path.dirname(sys.argv[0])) 
fnHTML = os.path.join(path, "inp.html")
page = open(fnHTML)
soup = BeautifulSoup (page.read(), 'lxml')  

worker = soup.find("span")
wHeadLine = worker.text.strip()
wPara = worker.find_next("td").text.strip()
print(wHeadLine)
print(wPara)

The output looks like this:

Find your faves—faster
We’ve made it easier than ever to see what’s on now and continue  watching your recordings, favorite teams and more.

But the text in the HTML file looks different (see the picture in the original post).

Why is the text not output with "—" and "We’ve"?


Solution

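  • This is caused by the Windows default character encoding (typically cp1252): open() without an encoding argument uses the system default rather than UTF-8. Pass encoding='utf-8' explicitly. Also, use a context manager so the file stream is closed gracefully: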

    import os, sys
    from bs4 import BeautifulSoup
    
    path = os.path.abspath(os.path.dirname(sys.argv[0])) 
    fnHTML = os.path.join(path, "inp.html")
    with open(fnHTML, encoding='utf-8') as file:  # added encoding utf-8
        soup = BeautifulSoup(file.read(), 'lxml')
    worker = soup.find("span")
    wHeadLine = worker.text.strip()
    wPara = worker.find_next("td").text.strip()
    print(wHeadLine)
    print(wPara)
    

    When opening a text file, especially on Windows, you want to set the encoding explicitly (here UTF-8) instead of relying on the system default. This keeps the script portable and prevents unexpected behavior on other operating systems.
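To see why the default encoding matters, here is a minimal sketch that uses only the standard library. It assumes the HTML file is saved as UTF-8 and that the Windows default is cp1252. The sample string and the temporary file are made up for illustration and are not the asker's inp.html.

```python
import os
import tempfile

# Hypothetical sample text containing an em dash (U+2014) and a curly apostrophe (U+2019).
text = "Find your faves\u2014faster. We\u2019ve made it easier."

# Save the text as UTF-8, the way the HTML file is assumed to be stored on disk.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".html", delete=False
) as tmp:
    tmp.write(text)
    tmp_path = tmp.name

try:
    # Reading with cp1252 mimics open() falling back to the Windows default.
    with open(tmp_path, encoding="cp1252") as f:
        wrong = f.read()
    # Reading with the encoding the file was actually written in.
    with open(tmp_path, encoding="utf-8") as f:
        right = f.read()
finally:
    os.remove(tmp_path)

print(wrong)  # each special character shows up as three unrelated characters
print(right)  # the original text
```

Each of the two special characters is stored as three bytes in UTF-8. cp1252 maps every byte to its own character, so a single dash or apostrophe turns into three garbled ones when the file is read with the wrong encoding.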