
UnicodeDecodeError: 'charmap' when using BeautifulSoup


I'm working through the 100 Days of Code boot camp on Udemy. I am currently on the web scraping lesson using BeautifulSoup, but I have not been able to complete the classes because I am getting an error that I do not understand and do not know how to fix, even though the code is very simple. Here is my Python code:

from bs4 import BeautifulSoup

with open("website.html") as file:
    html_doc = file.read()

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.name)

Here is the error:

Traceback (most recent call last):
  File "C:\Users\xarss\Desktop\100 days of python\Webdev_projects\Websrapingproyect\main.py", line 12, in <module>
    html_doc = file.read()
  File "C:\Users\xarss\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 281: character maps to <undefined>

I have already tried reinstalling the Beautiful Soup package, but I still get the same error. I also tried other HTML files, and the problem persists. Here is the HTML file:

<!DOCTYPE html>
<html>

<head>
    <meta charset="utf-8">
    <title>Angela's Personal Site</title>
</head>

<body>
    <h1 id="name">Angela Yu</h1>
    <p><em>Founder of <strong><a href="https://www.appbrewery.co/">The App Brewery</a></strong>.</em></p>
    <p>I am an iOS and Web Developer. I ❤️ coffee and motorcycles.</p>
    <hr>
    <h3 class="heading">Books and Teaching</h3>
    <ul>
        <li>The Complete iOS App Development Bootcamp</li>
        <li>The Complete Web Development Bootcamp</li>
        <li>100 Days of Code - The Complete Python Bootcamp</li>
    </ul>
    <hr>
    <h3 class="heading">Other Pages</h3>
    <a href="https://angelabauer.github.io/cv/hobbies.html">My Hobbies</a>
    <a href="https://angelabauer.github.io/cv/contact-me.html">Contact Me</a>
</body>

</html>

Solution

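  • This is a common error when a file is opened without specifying its encoding. With no encoding argument, open() uses the platform default, which is cp1252 here (see the cp1252.py frame in the traceback). The file is UTF-8, as its <meta charset="utf-8"> tag says, so decoding fails on byte 0x9d, which cp1252 leaves undefined. BeautifulSoup is not involved; the error is raised by file.read() before the parser runs.

    The most direct fix is to pass the file's actual encoding, encoding="utf-8", to open(). If the encoding is unknown, one of the methods below may work. Each line replaces the open() call in the question; the file.read() body stays the same.

    with open("website.html", errors="ignore") as file:   # silently drops bytes that cannot be decoded
    
    with open("website.html", errors='replace') as file:  # substitutes U+FFFD for bytes that cannot be decoded
    
    with open("website.html", 'rb') as file:              # reads raw bytes; BeautifulSoup accepts bytes and detects the encoding itself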
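
    As a rough illustration of the explicit-encoding fix, the sketch below writes a small UTF-8 file and reads it back twice, once as UTF-8 and once as cp1252. The names read_html and SAMPLE_HTML and the temporary file are made up for this example and merely stand in for website.html; the sketch assumes the real file is UTF-8, as its meta tag declares.

```python
import os
import tempfile

# Hypothetical sample standing in for website.html; \u2764 is the heart character.
SAMPLE_HTML = (
    '<html><head><meta charset="utf-8"><title>Demo</title></head>'
    "<body><p>I \u2764 coffee.</p></body></html>"
)


def read_html(path, encoding="utf-8"):
    """Read an HTML file as text using an explicit encoding."""
    with open(path, encoding=encoding) as file:
        return file.read()


# Write the sample to a temporary file as UTF-8 bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(SAMPLE_HTML.encode("utf-8"))
    tmp_path = tmp.name

try:
    # Decoding with the file's real encoding succeeds.
    html_doc = read_html(tmp_path)

    # Decoding as cp1252 reproduces the error from the question.
    try:
        read_html(tmp_path, encoding="cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True
finally:
    os.remove(tmp_path)
```

    In the question's code, the equivalent change would be open("website.html", encoding="utf-8"); the resulting string can then be passed to BeautifulSoup(html_doc, 'html.parser') as before.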