I have thousands of poorly formatted HTML documents, and I have to fix the formatting errors using only PHP. So far I have been doing well with SimpleXML and XPath. Now I have stumbled over this:
<ul>
Lorem ipsum <strong>dolor sit amet,</strong> consectetur
adipiscing elit, <em>sed</em> do eiusmod tempor
<li>incididunt</li>
<li>ut</li>
<li>labo</li>
</ul>
The text Lorem…tempor belongs outside of the <ul>, while everything else (incididunt…labo) should remain a list item.
So my idea was to select the child nodes of <ul> that are not <li>, including text nodes. Can I do this with XPath?
You can take the union of two XPath expressions. The first selects all child elements of ul that are not li, the second selects the text nodes directly under ul:
//ul/*[name() != "li"] | //ul/text()
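
To show how the selected nodes could then be moved out of the list, here is a minimal sketch. It uses DOMDocument and DOMXPath rather than SimpleXML, because text nodes are easier to handle as DOM nodes. It assumes the input is well-formed enough for loadXML(); the wrapping <div> and the variable names are only there for illustration and are not from the question. Note that text() also matches the whitespace-only text nodes between the <li> elements, so the sketch skips those.

```php
<?php
// Sample markup from the question, wrapped in a <div> (an assumption,
// so that the <ul> has a parent element to move nodes into).
$html = <<<HTML
<div><ul>
Lorem ipsum <strong>dolor sit amet,</strong> consectetur
adipiscing elit, <em>sed</em> do eiusmod tempor
<li>incididunt</li>
<li>ut</li>
<li>labo</li>
</ul></div>
HTML;

$doc = new DOMDocument();
$doc->loadXML($html); // assumes well-formed input

$xpath = new DOMXPath($doc);

// Union: child elements of <ul> that are not <li>, plus its text nodes.
$nodes = $xpath->query('//ul/*[name() != "li"] | //ul/text()');

// Copy the result into an array before modifying the tree.
foreach (iterator_to_array($nodes) as $node) {
    // Leave whitespace-only text nodes (e.g. between <li> elements) alone.
    if ($node instanceof DOMText && trim($node->nodeValue) === '') {
        continue;
    }
    // Move the node so that it sits directly before its <ul>.
    $ul = $node->parentNode;
    $ul->parentNode->insertBefore($node, $ul);
}

echo $doc->saveXML($doc->documentElement), "\n";
```

If the documents are not well-formed XML, loadHTML() may be a better fit than loadXML(), but that is a separate decision. The same selection can probably also be written as a single expression, //ul/node()[not(self::li)], which matches every child node of ul except li elements.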