I need to work with a page that has an unfortunate mix of correct and malformed HTML entities, for instance:
<i>Kristj&aacuten V&iacute;ctor</i>
Firefox 67 does eventually render this correctly, as "Kristján Víctor".
However, in "View Source", Firefox uses syntax coloring to flag the first HTML entity as faulty.
And indeed it is: the semicolon at the end of the first HTML entity is missing. Somehow Firefox still figures it out and renders the right character.
Now, if I try to work with that in lxml:
#!/usr/bin/env python3
import lxml.html as LH
import lxml.html.clean as LHclean
testhtmlstring = "<i>Kristj&aacuten V&iacute;ctor</i>"
myhtml = LH.fromstring( testhtmlstring )
myhtml = LHclean.clean_html( myhtml )
myitem = myhtml.xpath("//i")[0]
myitemstr = myitem.text_content()
print(myitemstr)
... the code prints this in the terminal (Ubuntu 18.04):
Kristj&aacuten Víctor
... so, obviously, the broken HTML entity did not get converted to the right character.
Is there something I can use, so I get the right character in my output string from lxml, even in case of a broken htmlentity (as Firefox does)?
The HTML5 standard defines a subset of named entities that are still parsed when the trailing semicolon is missing, because these entities were historically defined with the semicolon being optional.
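If you want to check which names belong to that subset, Python's standard library ships the same table as `html.entities.html5`. Names that may omit the semicolon appear there both with and without the trailing `;`. A minimal sketch (the variable name `legacy_names` is only for illustration):

```python
from html.entities import html5

# The table stores most names with a trailing ";" (e.g. "hearts;").
# Names that are also accepted without it are listed a second time
# without the semicolon, so filtering on that gives the legacy subset.
legacy_names = sorted(name for name in html5 if not name.endswith(";"))

print("aacute" in legacy_names)  # the entity from the question
print("hearts" in legacy_names)  # only "hearts;" is defined
print(repr(html5["aacute"]))     # the character it maps to
```

So `&aacute` qualifies for semicolon-less parsing, while an entity such as `&hearts` would not.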
The html.unescape() function explicitly supports those; use it as a second pass over the text you get from lxml to clear up this issue:
>>> from html import unescape
>>> unescape("Kristj&aacuten V&iacute;ctor")
'Kristján Víctor'
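One thing to keep in mind with a second pass: it decodes everything that looks like an entity, including text that was meant to show an entity literally (for example a page that explains HTML escapes). A small sketch of both cases, using made-up sample strings rather than anything from a real page:

```python
from html import unescape

# Text as a first parser might hand it back: one reference was left
# undecoded because its semicolon is missing.
leftover = "Kristj&aacuten Víctor"
fixed = unescape(leftover)
print(fixed)

# Caveat: text that is supposed to contain the characters "&aacute;"
# literally gets decoded as well, so only apply the second pass where
# that is acceptable.
literal = "Write &aacute; to get an accented a"
decoded_literal = unescape(literal)
print(decoded_literal)
```

Whether that trade-off is fine depends on the pages you are scraping; for names and titles it usually is.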
If you install html5lib, then you can have lxml behave the same way, via the lxml.html.html5parser module (which wraps html5lib's own html5lib.treebuilders.etree_lxml adapter):
>>> from lxml.html import html5parser as etree
>>> etree.fromstring("Kristj&aacuten V&iacute;ctor").text
'Kristján Víctor'