I'm receiving XML files I don't have control of and I need to extract the data from them. Here is my code:
public function importXML($filePath)
{
$dom = new \DOMDocument();
$dom->load($filePath);
$xml = simplexml_import_dom($dom);
foreach ($xml->PLU as $item) {
$name = $item->NAME;
I read somewhere that DOMDocument sanitizes part of the XML, so it's better to load the file there first and then import it via simplexml_import_dom(). Currently this code works about 70% of the time and does everything I want, but the other 30% of the time I get this error:
[ExceptionError] DOMDocument::load(): PCDATA invalid char Value 31 in /path/to/file.xml, line 2
I did some digging and found two possible solutions, but neither works in my case:
1st option:
function utf8_for_xml($string)
{
return preg_replace ('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $string);
}
However, I tried passing my loaded $dom through it before calling simplexml_import_dom(), and it still gives the same error.
2nd option:
function stripInvalidXml($value)
{
$ret = "";
if (empty($value))
{
return $ret;
}
$length = strlen($value);
for ($i=0; $i < $length; $i++)
{
$current = ord($value[$i]);
if (($current == 0x9) ||
($current == 0xA) ||
($current == 0xD) ||
(($current >= 0x20) && ($current <= 0xD7FF)) ||
(($current >= 0xE000) && ($current <= 0xFFFD)) ||
(($current >= 0x10000) && ($current <= 0x10FFFF)))
{
$ret .= chr($current);
}
else
{
$ret .= " ";
}
}
return $ret;
}
I had no luck with that either; the error kept occurring. The XML file encoding is "WINDOWS-1251" and some of the files contain Cyrillic, if that helps.
Is the problem the encoding, or is it something about the validity of the XML file (opening and closing tags, etc.)?
Any help would be greatly appreciated.
Thanks to @NigelRen I did the following and it worked well:
private function stripInvalidXml($value)
{
$ret = "";
if (empty($value))
{
return $ret;
}
$length = strlen($value);
for ($i=0; $i < $length; $i++)
{
// Use $value[$i]: the $value{$i} syntax was deprecated in PHP 7.4 and removed in PHP 8.0
$current = ord($value[$i]);
if (($current == 0x9) ||
($current == 0xA) ||
($current == 0xD) ||
(($current >= 0x20) && ($current <= 0xD7FF)) ||
(($current >= 0xE000) && ($current <= 0xFFFD)) ||
(($current >= 0x10000) && ($current <= 0x10FFFF)))
{
$ret .= chr($current);
}
else
{
$ret .= " ";
}
}
return $ret;
}
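Note that ord() only ever returns 0 to 255, so the loop above effectively replaces the bytes 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F with spaces; the larger ranges never match. A shorter sketch of the same idea, assuming a single-byte encoding such as Windows-1251 (the function name stripControlBytes is made up for this example):

```php
<?php
// Hypothetical helper: replace control bytes that XML 1.0 forbids with spaces.
// No /u modifier on purpose: the input is treated as raw bytes, which suits
// single-byte encodings such as Windows-1251. With /u, a subject that is not
// valid UTF-8 makes preg_replace() return null instead of a cleaned string.
function stripControlBytes(string $value): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $value);
}
```

Tab (0x09), line feed (0x0A) and carriage return (0x0D) are left alone, same as in the loop version.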
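I used the second sanitizing function I had found, but applied it to the raw file contents: I read the file with file_get_contents(), cleaned the string, and only then parsed it with loadXML(). The error is raised by DOMDocument::load() itself, so the cleanup has to happen before parsing: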
public function importXML($filePath)
{
$content = file_get_contents($filePath);
$modified = $this->stripInvalidXml($content);
$dom = new \DOMDocument();
$dom->loadXML($modified);
$xml = simplexml_import_dom($dom);
Now $xml is valid and can be processed as needed.
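If some files might still fail to parse for other reasons, it may help to check the result of loadXML() instead of relying on warnings. Here is a rough, self-contained sketch of that idea; the function name loadSanitizedXml and the sample XML are made up for illustration, and it uses a byte-wise regex in place of the loop, assuming a single-byte encoding:

```php
<?php
// Hypothetical wrapper: sanitize a raw XML string, parse it, and throw on failure.
function loadSanitizedXml(string $content): SimpleXMLElement
{
    // Replace control bytes that XML 1.0 forbids (0x1F is the "Value 31" from the error).
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', ' ', $content);

    // Collect libxml errors instead of letting them surface as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $ok = $dom->loadXML($clean);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        $message = $errors ? trim($errors[0]->message) : 'unknown parse error';
        throw new RuntimeException('Invalid XML: ' . $message);
    }

    return simplexml_import_dom($dom);
}

// Usage with an inline sample; in the real import the string would come
// from file_get_contents($filePath).
$xml = loadSanitizedXml("<?xml version=\"1.0\"?><ROOT><PLU><NAME>Tea\x1F</NAME></PLU></ROOT>");
foreach ($xml->PLU as $item) {
    echo $item->NAME, PHP_EOL;
}
```

The previous libxml error setting is restored afterwards so the wrapper does not change behavior elsewhere in the application.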