php xml simplexml

How can I escape data coming from simplexml_load_file?


simplexml_load_file($htmlstring)

This is my simple pull from a third-party database. We started pulling a comments section, which unfortunately contains & and a few < characters, and that is barfing the XML build. The error is:

Unescaped '<' not allowed in attributes values

How can I get to those incorrectly formatted results and CDATA them or something before it tries to build the XML set? I have looked all over http://php.net/manual/en/function.simplexml-load-file.php but don't seem to have the smarts to find a solution!


Solution

  • If the input file is invalid, and you can't influence the third party to fix it, your options are rather limited.

    One thing which might be worth trying is using DOM in HTML mode to load the file. This uses a more forgiving parser, but then creates the same data structure.

    The nice thing is that you don't actually have to use the DOM with all its verbosity, because you can "import" the DOM object into SimpleXML. This doesn't require any re-parsing, because both interfaces use the same data structures internally (libxml).

    From there - assuming this worked - you can carry on as though you'd just run simplexml_load_file in the first place.

    So instead of this:

    $sxml = simplexml_load_file($file_path);
    

    You'd write this:

    // loadHTMLFile() is an instance method, so create the document first
    $dom = new DOMDocument();
    $dom->loadHTMLFile($file_path);
    $sxml = simplexml_import_dom($dom);
    

    Then carry on as you were.

    (If you have a string of data instead of a file path, you'd be using simplexml_load_string() and DOMDocument::loadHTML() respectively.)
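
For the string case, a minimal sketch of the same idea might look like the following. The sample `$raw` data is made up for illustration, and the calls to `libxml_use_internal_errors()` are an addition not mentioned above: they only keep the parser's warnings about the broken markup from being printed. Treat this as a starting point under those assumptions rather than a guaranteed fix for every malformed file.

```php
<?php
// Hypothetical sample input: the unescaped "<" and "&" make it invalid XML.
$raw = '<comments><comment text="a < b & c">Tom & Jerry</comment></comments>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

// The strict XML parser gives up on this input and returns false.
$strict = simplexml_load_string($raw);

// The HTML parser is more forgiving and still builds a tree.
$dom = new DOMDocument();
$dom->loadHTML($raw);

// Discard the collected errors and restore the previous error handling.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Hand the same libxml tree over to SimpleXML; nothing is re-parsed.
$sxml = simplexml_import_dom($dom);

// The HTML parser wraps the content in <html><body>, so look the element
// up with XPath rather than assuming it sits at the root.
$comment = $sxml->xpath('//comment')[0];

echo $comment['text'], "\n";
echo (string) $comment, "\n";
```

Note that the result is rooted at an `html` element rather than at your own root element, so paths that worked against the original XML may need adjusting (or an XPath lookup as above).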