
HTML parser: convert html ISO-8859-1 encoded text to UTF-8


I am trying to build an HTML mail parser in Python 3.7 with Beautiful Soup.

The Content-Type in the email header is: text/html; charset="iso-8859-1"

Here is some html code:

<div dir='3D"ltr"' id='3D"divRplyFwdMsg"'>
         <font color='3D"#000000"' face='=3D"Calibri,' sans-serif"="" style='3D"font-size:11pt"'>
          <b>
           Enviado:
          </b>
          jueves, 9 de mayo de 2019 11:16
          <br/>
          <b>
           Para:
          </b>
          DealReg
          <br/>
          <b>
           Asunto:
          </b>
          Integrated Quoting - Deal Registration ID 001009814954 pa=
ra Cliente client_name Revisi=F3n completa
         </font>
         <div>
         </div>

I need to encode the text correctly with UTF-8.

Where the text reads "Integrated Quoting - Deal Registration ID 001009814954 pa= ra Cliente client_name Revisi=F3n completa", I expect "Integrated Quoting - Deal Registration ID 001009814954 para Cliente client_name Revisión completa".

I found some solutions, but none of them work for me:

[1].

import codecs

with codecs.open(html_path, "r", encoding="utf-8") as html_file:
    text = html_file.read()

[2].

import io

with io.open(html_path, "r", encoding="utf-8") as html_file:
    text = html_file.read()

[3].

>>> a = "Revisi=F3n"
>>> b = a.encode("iso-8859-1").decode("utf-8")
>>> print(b)
Revisi=F3n

In [3] I also tried encoding with ascii, latin-1, and cp1252, and the result was the same.

Thanks!


Solution

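  • It looks like the non-ASCII characters have been encoded with the quoted-printable encoding, which is common in email bodies such as this one. Changing the character encoding alone cannot help, because sequences like =F3 are plain ASCII text. The quopri module can be used to decode them to bytes, which may then be decoded to str using the charset from the Content-Type header (ISO-8859-1 here).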

    >>> import quopri
    >>> s = 'Revisi=F3n'      
    >>> quopri.decodestring(s)
    b'Revisi\xf3n'   # bytes
    >>> quopri.decodestring(s).decode('ISO-8859-1')
    'Revisión'
    

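    The quopri.decode function will decode an entire file: it reads quoted-printable data from one binary file object and writes the decoded bytes to another.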
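
As a rough sketch of how this could be applied to the fragment in the question: the helper names `decode_qp_text` and `qp_file_to_utf8`, the file paths, and the shortened `sample` are made up for illustration, and the charset is assumed to be the one declared in the Content-Type header. The mangled attributes in the question (`dir='3D"ltr"'`) suggest the markup was parsed before it was decoded, so it is probably best to run this step first and hand the result to Beautiful Soup afterwards.

```python
import io
import quopri


def decode_qp_text(raw: bytes, charset: str = "iso-8859-1") -> str:
    """Undo quoted-printable, then decode the resulting bytes to str."""
    # decodestring() turns =F3 into the byte 0xF3, =3D into "=", and
    # removes soft line breaks (a trailing "=" followed by a newline).
    return quopri.decodestring(raw).decode(charset)


def qp_file_to_utf8(src_path: str, dst_path: str,
                    charset: str = "iso-8859-1") -> None:
    """Decode a quoted-printable file and rewrite it as UTF-8 text."""
    buffer = io.BytesIO()
    # quopri.decode() expects binary file objects for input and output.
    with open(src_path, "rb") as src:
        quopri.decode(src, buffer)
    text = buffer.getvalue().decode(charset)
    # newline="" keeps the line endings exactly as they were decoded.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Shortened version of the body from the question (hypothetical sample).
sample = (b'<div dir=3D"ltr">ID 001009814954 pa=\n'
          b'ra Cliente client_name Revisi=F3n completa</div>')
decoded = decode_qp_text(sample)
print(decoded)
```

The returned value is an ordinary Python str, so it only becomes UTF-8 when it is written out or encoded, for example with `decoded.encode("utf-8")`.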