xpath simplexml

Selecting and manipulating mixed nodes


I have thousands of poorly formatted HTML documents and have to fix their formatting errors using only PHP. So far SimpleXML and XPath have worked well, but now I have stumbled over this:

<ul>
  Lorem ipsum <strong>dolor sit amet,</strong> consectetur 
  adipiscing elit, <em>sed</em> do eiusmod tempor
  <li>incididunt</li>
  <li>ut</li>
  <li>labo</li>
</ul>

The text Lorem…tempor belongs outside the <ul>, while everything else (incididunt…labo) should remain list items.

So my idea was to select the child nodes of <ul> that are not <li>, including text nodes. Can this be done with XPath?


Solution

  • You can take the union of two XPath expressions: the first finds all child elements of ul that are not li, the second finds the text nodes directly under ul.

    //ul/*[name() != "li"] | //ul/text()
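
To show how the union expression could be used to actually move the stray content, here is a minimal sketch. It uses DOMDocument and DOMXPath rather than SimpleXML, on the assumption that handling individual text nodes is easier there (dom_import_simplexml() can convert an existing SimpleXML element). The wrapping <div>, the variable names, and the choice to skip whitespace-only text nodes are assumptions made for this example, not part of the original answer.

```php
<?php
// Sample markup from the question. The <div> wrapper is an assumption:
// it only gives the moved nodes a parent to land in.
$html = '<div><ul>
  Lorem ipsum <strong>dolor sit amet,</strong> consectetur
  adipiscing elit, <em>sed</em> do eiusmod tempor
  <li>incididunt</li>
  <li>ut</li>
  <li>labo</li>
</ul></div>';

$doc = new DOMDocument();
$doc->loadXML($html);
$xpath = new DOMXPath($doc);

// The union from the answer: non-li child elements plus text-node children.
$nodes = $xpath->query('//ul/*[name() != "li"] | //ul/text()');

// Copy the result into an array as a precaution before moving nodes around.
foreach (iterator_to_array($nodes) as $node) {
    // Leave the indentation between the <li> elements where it is.
    if ($node instanceof DOMText && trim($node->nodeValue) === '') {
        continue;
    }
    $ul = $node->parentNode;
    // Inserting each node directly before the <ul> keeps the original order.
    $ul->parentNode->insertBefore($node, $ul);
}

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
```

After this runs, the text and the inline <strong> and <em> elements sit directly before the <ul>, and the list contains only the three <li> items. For real-world HTML that is not well-formed XML, loadHTML() may be a better fit than loadXML().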