Tags: xml, parsing, pear, querypath

Parsing problematic XML in QueryPath (dots in element names)


I am trying to parse a NewsML document (http://www.iptc.org/std/NewsML-G2/2.7/examples/LISTING2_NewsML-G2_Complete.xml) with QueryPath, but I have trouble with the dots in some element names, such as <body.head>.

In some Firefox QueryPath plugins I am able to escape the dot with a backslash, but in the PHP PEAR library this does not work.

Any ideas?

(I am looking for a solution within QueryPath, not for workarounds.)


Solution

  • In the past, I've used the Tidy PHP extension (http://us3.php.net/manual/en/book.tidy.php) to clean up HTML/XML before passing it into QueryPath.

    The XML you referenced above is pretty clean, and also pretty small.

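    If the only issue is dots in element names, preprocessing with a regular expression would probably work, too, and it would likely be the fastest solution. I'm guessing you could do a preg_replace('/<(\/?)body\./', '<${1}body-', $xml) and have it fixed. (That would replace body.content with body-content and so on, in both opening and closing tags. PHP's preg_replace() already replaces every match, and it rejects a g modifier, so none is used here.)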
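
    The regular-expression preprocessing described above could be sketched roughly as follows. This is only an illustration: the helper name replaceDotsInBodyTags() and the sample XML string are made up for this example and are not part of QueryPath or the NewsML file. It assumes the dotted names all start with "body." and carry no namespace prefix.

```php
<?php
// Hypothetical helper (not part of QueryPath): rewrite dotted element names
// such as <body.head> to <body-head> before handing the XML to a parser.
function replaceDotsInBodyTags(string $xml): string
{
    // Matches "<body." and "</body."; the optional slash is captured so the
    // closing tags are renamed as well. No "g" modifier: preg_replace()
    // replaces all occurrences by default.
    return preg_replace('/<(\/?)body\./', '<${1}body-', $xml);
}

// Made-up sample input that mimics the dotted element names in question.
$xml = '<body><body.head><hedline>Title</hedline></body.head>'
     . '<body.content><p>Text</p></body.content></body>';

$fixed = replaceDotsInBodyTags($xml);
echo $fixed, PHP_EOL;
```

    The resulting string can then be passed to QueryPath in place of the original document. Keep in mind that this is a plain text substitution: it would also rewrite a literal "<body." inside a CDATA section, and it would miss prefixed names such as <nitf:body.head>.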