
How to specify the default output encoding for libxml2 to prevent overzealous entity-escaping in attributes?


This problem has bitten me again. Some time ago I asked a similar question on dba, thinking it was only a PostgreSQL problem, but now it is bothering me in PHP as well. The common factor is the underlying libxml2 library.

In my experience, some operations convert all non-Latin characters in attribute values (and only in attribute values) into numeric character references, i.e. &#xHEX;. It looks as if, inside an attribute, the writer forgets that it should default to UTF-8 and assumes ASCII. Some workarounds mitigate the problem (as shown in the code below), but they are not always feasible, for example inside a PostgreSQL stored function.

The code showing the problem

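<?php
$xml = <<<'XML'
<?xml version='1.0' encoding='UTF-8'?>
<root><элемент атрибут="&quot;знач.&quot;">текст</элемент></root>
XML;
$r = new XMLReader();
$r->XML($xml);
// Advance to the first element node (<root>).
do {
    $r->read();
} while ($r->nodeType != XMLReader::ELEMENT);
// Step to the child element (<элемент>).
$r->read();
// 1. Serialize directly from the reader.
echo $r->readOuterXml(), "\n";
// 2. Expand into a DOMDocument with no encoding set.
$n = $r->expand(new DOMDocument());
echo $n->ownerDocument->saveXML($n), "\n";
// 3. Expand into a DOMDocument created with an explicit UTF-8 encoding.
$n = $r->expand(new DOMDocument('1.0', 'UTF-8'));
echo $n->ownerDocument->saveXML($n), "\n";
?>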

outputs

<элемент атрибут="&quot;&#x437;&#x43D;&#x430;&#x447;.&quot;">текст</элемент>
<элемент атрибут="&quot;&#x437;&#x43D;&#x430;&#x447;.&quot;">текст</элемент>
<элемент атрибут="&quot;знач.&quot;">текст</элемент>

The result I am after is the last one.

Thus the question: is there a setting in libxml2 that globally sets the default output encoding, regardless of the input encoding, even when the input does not declare one?


Solution

  • This is a bug in libxml2 which I just fixed.

    Note that you still have to provide an explicit UTF-8 encoding in the XML declaration.
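
    If you are stuck on a libxml2 build without the fix and cannot control how the output is serialized, one possible fallback is to post-process the string. The sketch below is only an illustration, not anything libxml2 or PHP provides: decodeNonAsciiCharRefs is a hypothetical helper name, and it assumes the mbstring extension is available and that the output is UTF-8. It turns numeric character references for code points at or above U+0080 back into literal characters and leaves ASCII references (quotes, ampersands, whitespace) escaped. Because it works on the raw string, it would also rewrite anything that merely looks like a reference inside a CDATA section or a comment.

```php
<?php
// Hypothetical helper, not part of PHP or libxml2. Requires ext-mbstring.
// Replaces numeric character references for non-ASCII code points with the
// literal UTF-8 character and leaves every other reference untouched.
function decodeNonAsciiCharRefs(string $xml): string
{
    return preg_replace_callback(
        '/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/',
        static function (array $m): string {
            // $m[1] is non-empty for a hex reference; otherwise $m[2] is decimal.
            $code = ($m[1] !== '') ? hexdec($m[1]) : (int) $m[2];
            // Keep ASCII references (e.g. &#x22;, &#10;) and out-of-range values.
            if ($code < 0x80 || $code > 0x10FFFF) {
                return $m[0];
            }
            // mb_chr() returns false for invalid code points such as surrogates.
            $char = mb_chr((int) $code, 'UTF-8');
            return $char === false ? $m[0] : $char;
        },
        $xml
    );
}

// Usage with the first line of output from the question.
$escaped = '<элемент атрибут="&quot;&#x437;&#x43D;&#x430;&#x447;.&quot;">текст</элемент>';
echo decodeNonAsciiCharRefs($escaped), "\n";
// prints: <элемент атрибут="&quot;знач.&quot;">текст</элемент>
```

    The named entity &quot; is not a numeric reference, so it is left alone, which is why the result matches the third output line in the question.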