
Ambiguous encoding of a BeautifulSoup object constructed from a website with a duplicate meta header. How do I make sure the encoding is not mixed up?


I have fetched data from a website using the BeautifulSoup module. I know from the meta header that the source encoding for this document is 'iso-8859-1'. I also know that BeautifulSoup automatically transcodes to 'UTF-8' upon creation of the BeautifulSoup object.

import requests
from bs4 import BeautifulSoup

url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
r = requests.get(url)
soup_data = BeautifulSoup(r.content, 'lxml')

print(soup_data.prettify())

Unfortunately, the website has a duplicate meta element.

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Upon inspection of the BeautifulSoup object using prettify, I realized that BeautifulSoup converted only one of these meta tags.

<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>

I'm therefore confused about what the actual encoding of my BeautifulSoup object is.
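
To check which encoding BeautifulSoup actually used for decoding, its original_encoding attribute can be inspected (if I read the docs correctly, it records the encoding the raw bytes were decoded from):

print(soup_data.original_encoding)  # presumably 'iso-8859-1', per the meta declaration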

Also, during data processing I realized that some text elements of this object are not displayed properly in my PyCharm console. These strings show up as 'iso-8859-1'-style escape codes. Therefore, I suspect that the object is either still in ISO encoding or, even worse, somehow mixed up.

['\xa0\xa0\xa0\xa0M. le président.' '\xa0\xa0\xa0\xa0M. le président.'

I saw these ISO characters for the first time after I ran a NumPy function.

import numpy as np  # df is a DataFrame of parsed elements, defined earlier in my pipeline
series = np.apply_along_axis(lambda x: x[0].get_text(), 0, [df])
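
Inspecting one of these characters directly suggests it is an ordinary no-break space, although that alone doesn't tell me whether the rest of the document decoded correctly:

import unicodedata

print(unicodedata.name('\xa0'))  # NO-BREAK SPACE (U+00A0); repr() merely displays it as an escape code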

Any suggestions on how to proceed? I would like to convert the object to UTF-8 (and be 100% sure it is fully in UTF-8).


Solution

  • BeautifulSoup used the ISO-8859-1 encoding to decode r.content (a bytes object) into Unicode (a str object). A str is not encoded at all; it is made of Unicode code points.

    It turns out the data wasn't encoded in ISO-8859-1. It was encoded in Windows-1252, a similar encoding that additionally maps the 0x80-0x9F range to printable characters such as curly quotes and the ellipsis.
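
    A minimal sketch of the difference, decoding the same bytes both ways (the 0x85 byte is the giveaway):

    raw = b'censure\x85'
    print(repr(raw.decode('iso-8859-1')))    # 'censure\x85' -> U+0085, an unprintable control code
    print(repr(raw.decode('windows-1252')))  # 'censure…'    -> U+2026, HORIZONTAL ELLIPSIS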

    The requests response exposes both the encoding the website declared (r.encoding) and the encoding guessed by its detection library (r.apparent_encoding). Here are some differences in the actual text I found:

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
    r = requests.get(url)
    print(f'{r.encoding=}')
    print(f'{r.apparent_encoding=}')
    print()
    soup_data = BeautifulSoup(r.content, 'lxml')
    print(repr(soup_data.find('a',href="http://www2.assemblee-nationale.fr/scrutins/liste/(legislature)/15/(type)/AUT").text))
    print(repr(soup_data.find('a',href="#",accesskey="0").text))
    print()
    # Using the correct encoding
    soup_data = BeautifulSoup(r.content, 'lxml', from_encoding='Windows-1252')
    print(repr(soup_data.find('a',href="http://www2.assemblee-nationale.fr/scrutins/liste/(legislature)/15/(type)/AUT").text))
    print(repr(soup_data.find('a',href="#",accesskey="0").text))
    

    Output. Note the \x85 and \x92 code points in "censure…" and "d’accessibilité" in the first instance. The … (U+2026) and ’ (U+2019) characters don't exist in ISO-8859-1, so the bytes 0x85 and 0x92 translate to U+0085 and U+0092 respectively, which are unprintable control codes. I've used repr() to show them as escape codes.

    r.encoding='ISO-8859-1'
    r.apparent_encoding='Windows-1252'
    
    'Autres scrutins solennels (déclarations, motions de censure\x85)'
    'Politique d\x92accessibilité'
    
    'Autres scrutins solennels (déclarations, motions de censure…)'
    'Politique d’accessibilité'
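
    As a more general pattern (a sketch, not part of the original answer): instead of hard-coding Windows-1252, the encoding that requests detected can be passed straight to BeautifulSoup:

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.assemblee-nationale.fr/12/cri/2003-2004/20040001.asp"
    r = requests.get(url)

    # from_encoding overrides BeautifulSoup's own sniffing with requests' detected encoding
    soup_data = BeautifulSoup(r.content, 'lxml', from_encoding=r.apparent_encoding)

    Either way, the parsed soup holds plain Unicode str data; "converting to UTF-8" only becomes meaningful again when you encode back to bytes, e.g. str(soup_data).encode('utf-8').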