I'm using SimpleXML to load some XML files (which I didn't write and can't really change the format of).
Occasionally (e.g. one or two files out of every 50 or so) they don't escape special characters (mostly &, but sometimes other random invalid things too). This creates an issue because SimpleXML in PHP just fails, and I don't know of any good way to handle parsing invalid XML.
My first idea was to preprocess the XML as a string and wrap ALL fields in CDATA so it would parse, but for some ungodly reason the XML I need to process puts all of its data in attributes, and CDATA sections can't be used inside attribute values. Thus I can't use the CDATA idea. An example of the XML:
<Author v="By Someone & Someone" />
What's the best way to replace all the invalid characters in the XML before I load it with SimpleXML?
What you need is something that uses libxml's internal errors to locate the invalid characters and escape them accordingly. Here's a mockup of how I'd write it. Take a look at the result of libxml_get_errors() for the error info.
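To see what that error info looks like, a minimal sketch along these lines prints the line, column and message of each reported error. It reuses the sample element from the question; the variable names are only illustrative, and the exact messages and columns depend on the libxml version.

```php
<?php
// Sketch: list what libxml reports for a malformed document.
$xml = '<Author v="By Someone & Someone" />';

// Collect errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
libxml_clear_errors();

$sxe = simplexml_load_string($xml); // false here, because of the bare "&"

$errors = libxml_get_errors();
foreach ($errors as $error) {
    // Each entry is a LibXMLError exposing level, code, line, column and message.
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}

libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the caller's setting
```

The mockup below works from the same error list.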
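function load_invalid_xml($xml)
{
    // Collect libxml errors internally instead of emitting PHP warnings
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors();
    $sxe = simplexml_load_string($xml);
    // Compare against false: a valid but empty element is falsy in a boolean test
    if ($sxe !== false)
    {
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }
    $fixed_xml = '';
    $last_pos = 0;
    // Assumes the errors are reported in document order
    foreach (libxml_get_errors() as $error)
    {
        // $pos is the offset of the faulty character in $xml.
        // compute_position() is a placeholder: you have to write it yourself,
        // and it needs $xml to turn a line/column pair into an offset
        $pos = compute_position($xml, $error->line, $error->column);
        $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($xml[$pos]);
        $last_pos = $pos + 1;
    }
    $fixed_xml .= substr($xml, $last_pos);
    libxml_clear_errors();
    libxml_use_internal_errors($use_internal_errors);
    return simplexml_load_string($fixed_xml);
}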
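The compute_position() placeholder could be filled in roughly as follows. This is only a sketch under assumptions: it takes the XML string as its first argument, since a line and column alone cannot be mapped to an offset, and it treats both line and column as 1-based byte counts. libxml may report a column just past the offending character rather than on it, so check the result against real errors before relying on it.

```php
<?php
// Hypothetical helper: map a 1-based (line, column) pair to a 0-based byte offset in $xml.
// Lines are assumed to end in "\n"; with "\r\n" the "\r" simply counts as part of the line.
function compute_position(string $xml, int $line, int $column): int
{
    $offset = 0;
    // Skip over the ($line - 1) lines that precede the reported one.
    for ($i = 1; $i < $line; $i++) {
        $newline = strpos($xml, "\n", $offset);
        if ($newline === false) {
            // The string has fewer lines than reported: clamp to the last byte.
            return max(0, strlen($xml) - 1);
        }
        $offset = $newline + 1;
    }
    // Clamp so the result is always a valid index into a non-empty $xml.
    return max(0, min($offset + $column - 1, strlen($xml) - 1));
}
```

If the reported column turns out to sit after the bad character, searching backwards from the returned offset for the nearest & (for example with strrpos() and a negative offset) is one way to adjust it.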