Ambiguous encoding of a BeautifulSoup object constructed from a website with a duplicate meta header. How do I make sure the encoding is not mixed up?

I have fetched data from a website using the BeautifulSoup module. I know from the meta header that the source encoding of this document is 'iso-8859-1'. I also know that BeautifulSoup automatically transcodes to 'UTF-8' when the BeautifulSoup object is created.

import requests
from bs4 import BeautifulSoup

url = ""
r = requests.get(url)
soup_data = BeautifulSoup(r.content, 'lxml')


Unfortunately, the page contains a duplicate meta element.

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Upon inspection of the BeautifulSoup object using prettify, I realized that BeautifulSoup converted only one of these meta tags.

<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>

I'm therefore confused about what the actual encoding of my BeautifulSoup object is.

Also, during data processing I realized that some of the text elements of this object are not displayed properly in my PyCharm console. These strings contain 'iso-8859-1' escape codes. Therefore, I suspect that the object is either still in ISO encoding or, even worse, somehow mixed up.

['\xa0\xa0\xa0\xa0M. le président.' '\xa0\xa0\xa0\xa0M. le président.'

I first saw these ISO characters after I ran a numpy function.

series = np.apply_along_axis(lambda x: x[0].get_text(), 0, [df])

Any suggestions on how to proceed from this situation? I would like to convert the object to UTF-8 (and be 100% sure it's fully in UTF-8).


  • BeautifulSoup used the ISO-8859-1 encoding to decode r.content (a bytes object) into Unicode (a str object). A str is not encoded at all; it is made up of Unicode code points.

    It turns out the data wasn't encoded in ISO-8859-1. It was encoded in Windows-1252, a similar encoding that maps most of the bytes 0x80-0x9F to printable characters where ISO-8859-1 has only control codes.

    The requests response reports both the encoding the website declared (r.encoding) and the encoding guessed by its detection code (r.apparent_encoding). Here are some differences in the actual text I found:

    import requests
    from bs4 import BeautifulSoup
    url = ""
    r = requests.get(url)
    # Using the encoding declared in the meta tag (ISO-8859-1)
    soup_data = BeautifulSoup(r.content, 'lxml')
    # Using the correct encoding
    soup_data = BeautifulSoup(r.content, 'lxml', from_encoding='Windows-1252')

    Output. Note the \x85 and \x92 code points in "censure…" and "d’accessibilité" in the first instance. The characters … (U+2026) and ’ (U+2019) don't exist in ISO-8859-1, where the bytes 0x85 and 0x92 translate to U+0085 and U+0092 respectively, which are unprintable control codes. I've used repr() to show them as escape codes.

    'Autres scrutins solennels (déclarations, motions de censure\x85)'
    'Politique d\x92accessibilité'
    'Autres scrutins solennels (déclarations, motions de censure…)'
    'Politique d’accessibilité'
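The difference can be reproduced without the website. The following is a minimal sketch, assuming a hand-written byte string that contains the two bytes discussed above; `raw` and the helper `repair_latin1_as_cp1252` are hypothetical names introduced for illustration and are not part of BeautifulSoup or requests. It also shows one possible way to repair a str that was already decoded with the wrong codec, and how to obtain UTF-8 bytes at the end.

```python
# Hypothetical sample bytes: 0x85 and 0x92 are the bytes discussed above,
# 0xe9 is "é" in both ISO-8859-1 and Windows-1252.
raw = b"censure\x85 d\x92accessibilit\xe9"

# ISO-8859-1 maps every byte to the code point with the same number,
# so 0x85 and 0x92 become the control codes U+0085 and U+0092.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 maps 0x85 to U+2026 (ellipsis) and 0x92 to U+2019 (apostrophe).
as_cp1252 = raw.decode("windows-1252")


def repair_latin1_as_cp1252(text: str) -> str:
    """Re-decode text that was wrongly decoded as ISO-8859-1.

    Assumption: every character in `text` is below U+0100, and none of the
    original bytes is one that Python's Windows-1252 codec leaves undefined
    (0x81, 0x8D, 0x8F, 0x90, 0x9D); otherwise this raises a Unicode error.
    """
    return text.encode("iso-8859-1").decode("windows-1252")


fixed = repair_latin1_as_cp1252(as_latin1)

# A str has no encoding; UTF-8 only appears once the str is encoded to bytes.
utf8_bytes = fixed.encode("utf-8")

print(repr(as_latin1))
print(repr(as_cp1252))
print(repr(fixed))
```

Passing from_encoding='Windows-1252' when the soup is built avoids the repair step entirely; the helper is only useful for strings that have already been extracted.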