Using lxml.html with broken html entities?


I need to work with a page that has an unfortunate mix of correct and incorrect HTML entities, for instance:

<i>Kristj&aacuten V&iacute;ctor</i>

Firefox 67 does end up rendering this correctly:

ff-htmlent1.png

... however, if we do "View Source", Firefox indicates via syntax coloring that something is wrong with the first HTML entity:

ff-htmlent2.png

... and indeed there is: the semicolon at the end of the HTML entity is missing. Somehow Firefox figures it out anyway and renders the right character.

Now, if I try to work with that in lxml:

#!/usr/bin/env python3

import lxml.html as LH
import lxml.html.clean as LHclean

testhtmlstring = "<i>Kristj&aacuten V&iacute;ctor</i>"

myhtml = LH.fromstring( testhtmlstring )
myhtml = LHclean.clean_html( myhtml )
myitem = myhtml.xpath("//i")[0]
myitemstr = myitem.text_content()
print(myitemstr)

... the code prints this in the terminal (Ubuntu 18.04):

Kristj&aacuten Víctor

... so, obviously, the broken HTML entity did not get converted to the right character.

Is there something I can use to get the right character in my output string from lxml, even when the HTML entity is broken (as Firefox does)?


Solution

  • The HTML5 standard defines a subset of named entities that are parsed even when the trailing semicolon is missing, because these entities were historically defined with the semicolon being optional.

    The html.unescape() function explicitly supports those; use it as a second pass to clear up this issue:

    >>> from html import unescape
    >>> unescape("Kristj&aacuten Víctor")
    'Kristján Víctor'
    

    If you install html5lib, you can have lxml behave the same way via its lxml.html.html5parser module (which wraps html5lib's own html5lib.treebuilders.etree_lxml adapter):

    >>> from lxml.html import html5parser as etree
    >>> etree.fromstring("Kristj&aacuten Víctor").text
    'Kristján Víctor'
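
To make the html.unescape() second pass a bit more concrete, here is a minimal standard-library sketch. The helper names semicolon_optional and second_pass are made up for illustration, and the input string is simply a copy of what lxml printed in the question. It assumes the html.entities.html5 table lists the semicolon-optional names a second time without the trailing ";", which is how you can tell them apart from names that need it.

```python
from html import unescape
from html.entities import html5


def semicolon_optional(name):
    """Return True if the entity name is decoded even without a trailing ';'.

    Hypothetical helper: html.entities.html5 has keys such as 'aacute' and
    'aacute;' for the legacy names, but only 'hellip;' for the others.
    """
    return not name.endswith(";") and name in html5


def second_pass(text):
    """Hypothetical helper: decode entities a first parser left as literal text."""
    return unescape(text)


print(semicolon_optional("aacute"))  # legacy name, semicolon optional
print(semicolon_optional("hellip"))  # only defined with its semicolon

# The string lxml produced in the question, one entity still undecoded.
print(second_pass("Kristj&aacuten Víctor"))

# Caveat: text that was deliberately escaped in the page is decoded as well.
# A parser yields the literal text "&lt;b&gt;" for the markup "&amp;lt;b&amp;gt;".
print(second_pass("&lt;b&gt;"))
```

One thing to keep in mind with the second pass: it cannot tell a broken entity from text the author escaped on purpose, so it is probably best applied only to strings where leftover entities are known to be mistakes.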