
Why is lxml closing this "ol" tag when parsing?


Here is some HTML:

<ol><ul><li>item</li></ul></ol>

and some Python 3 code that uses lxml to parse it and print it back out:

import sys
from lxml import etree, html

document_root = html.fromstring(sys.stdin.read())
print(etree.tostring(document_root, encoding='unicode'))

Here is the output:

<div><ol/><ul><li>item</li></ul>
</div>

In the output, lxml closes the ol before the ul starts, which changes the list structure.

Why is it doing that?

Can I get lxml to parse HTML in such a way as to preserve the list structure?

EDIT: NOTE that this example parses fine if I replace ul with ol (<ol><ol><li>item</li></ol></ol>), or if I replace ol with ul (<ul><ul><li>item</li></ul></ul>). The output is unchanged from the input.

I don't have control over the HTML; it could come from anywhere.

I'm using lxml 4.6.3, installed from PyPI, and Python 3.9.

OR, is there another way to parse HTML in a way that I can pull list text out of it preserving the list structure in Python?

Just so you know, I'm using lxml to drop attributes, so below is code that is closer to my use case. However, I wanted to give the smallest reproducible test case first.

Code closer to my use case:

import sys

import lxml.html.clean as clean
from lxml import etree, html

document_root = html.fromstring(sys.stdin.read())

cleaner = clean.Cleaner(safe_attrs_only=True, safe_attrs=frozenset())
cleansed = cleaner.clean_html(document_root)

# Do something with the lists in cleansed, defined by ol, ul, and li ..

print(etree.tostring(cleansed, encoding='unicode'))

Solution

  • I think neither HTML 4 nor HTML5 allows a ul element as a direct child of an ol element. Only li elements can be direct children of a list.

    That might be why an HTML parser builds a tree that does not reflect the nesting in your input markup. Whether a "traditional" HTML 4 parser, such as the one lxml presumably gets from libxml2, has always restructured the markup this way is something I don't remember, and I am not sure where to test it.

    While two HTML5 validators flag your ul as a disallowed child of ol, current browsers seem to preserve that nesting.
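
If the goal is only to pull list text out while keeping the nesting as written, one possible approach is to skip tree building altogether. The sketch below uses the standard library's html.parser, which reports tags in the order they appear and does not try to repair the structure. The names ListOutline and list_outline are made up for this illustration, and the sketch assumes every li is closed explicitly; markup that relies on implied </li> tags would need extra handling.

```python
from html.parser import HTMLParser


class ListOutline(HTMLParser):
    """Record (depth, text) for each li, counting ol/ul nesting as written.

    Hypothetical helper for illustration; assumes every <li> has a </li>.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of ol/ul elements currently open
        self.open_items = []  # stack of li entries that are still open
        self.items = []       # every li entry, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.depth += 1
        elif tag == "li":
            item = [self.depth, ""]
            self.items.append(item)
            self.open_items.append(item)

    def handle_endtag(self, tag):
        if tag in ("ol", "ul"):
            self.depth = max(0, self.depth - 1)
        elif tag == "li" and self.open_items:
            self.open_items.pop()

    def handle_data(self, data):
        # Text belongs to the innermost li that is still open.
        if self.open_items:
            self.open_items[-1][1] += data


def list_outline(markup):
    """Return a list of (depth, text) tuples for the li elements in markup."""
    parser = ListOutline()
    parser.feed(markup)
    parser.close()
    return [(depth, text.strip()) for depth, text in parser.items]


if __name__ == "__main__":
    print(list_outline("<ol><ul><li>item</li></ul></ol>"))
```

Here the item from the question is reported at depth 2, since both the ol and the ul are counted. This does not give you an element tree, so it would not replace the attribute-cleaning step; it only covers extracting the list text.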