Here is some HTML:
<ol><ul><li>item</li></ul></ol>
and some Python 3 code with lxml to parse it and re-print it:
import sys
from lxml import etree, html
document_root = html.fromstring(sys.stdin.read())
print(etree.tostring(document_root, encoding='unicode'))
Here is the output:
<div><ol/><ul><li>item</li></ul>
</div>
In the output, lxml closes the ol before the ul starts, which changes the list structure.
Why is it doing that?
Can I get lxml to parse HTML in such a way as to preserve the list structure?
EDIT: Note that this example parses fine if I replace ul with ol (<ol><ol><li>item</li></ol></ol>), or if I replace ol with ul (<ul><ul><li>item</li></ul></ul>). In both cases the output is unchanged from the input.
I don't have control over the HTML; it could come from anywhere.
I'm using lxml 4.6.3, installed from PyPI, and Python 3.9.
Alternatively, is there another way to parse HTML in Python so that I can pull the list text out of it while preserving the list structure?
Just so you know, I'm using lxml to drop attributes, so below is code that is closer to my use case. However, I wanted to give the smallest reproducible test case first.
Code closer to my use case:
import sys
import lxml.html.clean as clean
from lxml import etree, html
document_root = html.fromstring(sys.stdin.read())
cleaner = clean.Cleaner(safe_attrs_only=True, safe_attrs=frozenset())
cleansed = cleaner.clean_html(document_root)
# Do something with the lists in cleansed, defined by ol, ul, and li ..
print(etree.tostring(cleansed, encoding='unicode'))
I think neither HTML 4 nor HTML5 allows a ul element as a child of an ol element. Only li elements can be direct children.
That might be why an HTML parser builds a tree that does not reflect the nesting in your input markup. Whether a "traditional" HTML 4 parser, which is probably what the lxml/libxml2 HTML parser implements, restructures the input in the same way is something I don't remember, and I am not sure where to test it.
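To see where incoming markup breaks the "only li as a direct child" rule before handing it to lxml, something like the following rough sketch might help. It uses only the standard library's html.parser, which reports tags in source order without repairing anything. The names ListChildChecker and find_invalid_list_children are made up for this example, and the check is deliberately simplified: it assumes end tags are written out explicitly, and it would also flag script or template children, which HTML5 does permit inside lists.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ol", "ul"}
# Void elements never have an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ListChildChecker(HTMLParser):
    """Hypothetical helper: records (parent, child) pairs where a direct
    child of ol/ul is something other than li."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # currently open elements, outermost first
        self.problems = []    # (list tag, offending child tag) pairs

    def handle_starttag(self, tag, attrs):
        parent = self.open_tags[-1] if self.open_tags else None
        if parent in LIST_TAGS and tag != "li":
            self.problems.append((parent, tag))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, plus anything
        # left open inside it; ignore stray end tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break


def find_invalid_list_children(markup):
    checker = ListChildChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


print(find_invalid_list_children("<ol><ul><li>item</li></ul></ol>"))
print(find_invalid_list_children("<ol><ol><li>item</li></ol></ol>"))
```

Both inputs are reported as having a non-li child directly under ol, so by this rule the ol-in-ol variant from the question is just as invalid as the ol-in-ul one, even though lxml leaves it unchanged.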
While two HTML5 validators flag your ul as a disallowed child of ol, current browsers seem to preserve that nesting.
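If all you need is the list text with the nesting exactly as written, one possible route is to avoid a tree-repairing parser altogether. Below is a minimal sketch, not a drop-in replacement for lxml, built on the standard library's html.parser, which only emits start-tag, end-tag, and text events and never rearranges elements. ListOutlineParser, parse_lists, and outline are hypothetical names for this example. It assumes li elements are closed explicitly (an omitted </li> would make the next li look nested), and it ignores attributes entirely, which may suit the goal of dropping them anyway.

```python
from html.parser import HTMLParser

LIST_TAGS = {"ol", "ul"}


class ListOutlineParser(HTMLParser):
    """Hypothetical helper: records ol/ul/li nesting exactly as written."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Only list-related elements become nodes; attributes are ignored.
        if tag in LIST_TAGS or tag == "li":
            node = {"tag": tag, "text": "", "children": []}
            self.stack[-1]["children"].append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open node with this tag (never the root).
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Keep text only when the innermost open list node is an li.
        if self.stack[-1]["tag"] == "li":
            self.stack[-1]["text"] += data


def parse_lists(markup):
    parser = ListOutlineParser()
    parser.feed(markup)
    parser.close()
    return parser.root["children"]


def outline(nodes, depth=0):
    """Render the nested nodes as indented lines, one per element."""
    lines = []
    for node in nodes:
        label = node["tag"]
        text = node["text"].strip()
        if text:
            label += ": " + text
        lines.append("  " * depth + label)
        lines.extend(outline(node["children"], depth + 1))
    return lines


print("\n".join(outline(parse_lists("<ol><ul><li>item</li></ul></ol>"))))
# ol
#   ul
#     li: item
```

The trade-off is that nothing is validated or repaired, so badly broken markup (unclosed or misordered tags) comes through just as broken as it went in.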