I am trying to build an HTML mail parser in Python 3.7 with Beautiful Soup.
The Content-Type in the email header is: text/html; charset="iso-8859-1"
Here is some html code:
<div dir='3D"ltr"' id='3D"divRplyFwdMsg"'>
<font color='3D"#000000"' face='=3D"Calibri,' sans-serif"="" style='3D"font-size:11pt"'>
<b>
Enviado:
</b>
jueves, 9 de mayo de 2019 11:16
<br/>
<b>
Para:
</b>
DealReg
<br/>
<b>
Asunto:
</b>
Integrated Quoting - Deal Registration ID 001009814954 pa=
ra Cliente client_name Revisi=F3n completa
</font>
<div>
</div>
I need to encode the text correctly with UTF-8.
Where the text reads "Integrated Quoting - Deal Registration ID 001009814954 pa= ra Cliente client_name Revisi=F3n completa", I expect "Integrated Quoting - Deal Registration ID 001009814954 para Cliente client_name Revisión completa".
I found some solutions but none of them works for me:
[1].
with codecs.open(html_path, "r", encoding="utf-8") as html_file:
    text = html_file.read()
[2].
with io.open(html_path, "r", encoding="utf-8") as html_file:
    text = html_file.read()
[3].
>>> a = "Revisi=F3n"
>>> b = a.encode("iso-8859-1").decode("utf-8")
>>> print(b)
Revisi=F3n
In [3] I also tried encoding with ascii, latin-1 and cp1252, and the result was the same.
Thanks!
It looks like the non-ASCII characters have been encoded using the quoted-printable encoding (perhaps this HTML is from an email?). The quopri module can be used to decode them to bytes, which may then be decoded to str.
>>> import quopri
>>> s = 'Revisi=F3n'
>>> quopri.decodestring(s)
b'Revisi\xf3n' # bytes
>>> quopri.decodestring(s).decode('ISO-8859-1')
'Revisión'
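The same call should also handle the soft line break in the question's text (the trailing "=" in "pa=" followed by a newline) and the "=3D" sequences in the tag attributes, since both are part of quoted-printable. A minimal sketch, assuming the raw message body is available as bytes; the variable names `raw` and `decoded` are only illustrative:

```python
import quopri

# Fragment of the subject line from the question, still quoted-printable encoded.
# "=\n" is a soft line break and "=F3" is the ISO-8859-1 byte for "ó".
raw = b"Deal Registration ID 001009814954 pa=\nra Cliente client_name Revisi=F3n completa"

# Undo the quoted-printable layer first, then decode the bytes with the
# charset declared in the Content-Type header.
decoded = quopri.decodestring(raw).decode("iso-8859-1")
print(decoded)
```

Doing this before handing the markup to Beautiful Soup should also avoid attribute values such as 3D"ltr", because "=3D" is simply the quoted-printable form of "=".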
The quopri.decode function will decode an entire file, reading from one binary file object and writing to another.
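A rough sketch of the file-based variant follows. It uses in-memory `io.BytesIO` objects as stand-ins for real files opened in binary mode ("rb" / "wb"); the sample markup and the names `encoded`, `decoded_buf` and `html` are assumptions made for illustration:

```python
import io
import quopri

# Stand-in for the quoted-printable HTML file, opened in binary mode.
encoded = io.BytesIO(b'<div dir=3D"ltr">Revisi=F3n completa</div>')
# Stand-in for the output file that receives the decoded bytes.
decoded_buf = io.BytesIO()

# quopri.decode reads from the first binary file object and writes to the second.
quopri.decode(encoded, decoded_buf)

# The result is still bytes; decode with the charset from the Content-Type header.
html = decoded_buf.getvalue().decode("iso-8859-1")
print(html)
```

The resulting string can then be passed to Beautiful Soup as usual.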