I use a PHP XMLRPC server to provide an interface to my blogging app (PDO/SQLite backend). Sending data to the database works and the encoding stays intact; at least, strings with special characters such as umlauts (äöü) end up there correctly. Reading them back from the database through the server is the problem: the strings end up garbled.
Example of my setup:
function get_post($id) {
    // get the post from the database.
    $post = load_post($id);
    // if I output content here, all characters are intact, e.g. "test with ümlaut"
    return $post;
}
// set up a server
$server = xmlrpc_server_create();
xmlrpc_server_register_method($server, 'metaWeblog.getPost', 'get_post');
// fake a request
$request = xmlrpc_encode_request("metaWeblog.getPost", null, [
    'encoding' => 'utf-8'
]);
// call get_post()
$response = xmlrpc_server_call_method($server, $request, null, [
    'encoding' => 'utf-8'
]);
if ($response) {
    header('Content-Type: text/xml; charset=utf-8');
    echo($response); // has garbled umlauts
}
This produces the wrong string test with &#195;&#188;mlaut
instead of test with ümlaut:
<member>
    <name>description</name>
    <value>
        <string>test with &#195;&#188;mlaut</string>
    </value>
</member>
Is there some way I can make this work without resorting to a different XMLRPC library? Ideally, prevent the escaping of special characters entirely, if possible.
Any help is appreciated!
It actually works exactly as you ask; only the escaping output option is missing. You want markup here (it takes your Unicode UTF-8 string more or less verbatim), without any other value:
$response = xmlrpc_server_call_method($server, $request, null, [
    'encoding' => 'UTF-8',
    'escaping' => 'markup',
]);
The encoding option sets the document encoding to UTF-8, as it will end up in the XML declaration.
With markup, the plain UTF-8 string is taken as-is and only XML markup characters (<, >, &, etc.) are escaped. This is contrary to the default, which also escapes non-ascii and non-print(able) characters as numeric entities (so ü no longer arrives as a literal character). That is not helpful here, because you want the characters outside ASCII kept in their original encoding. UTF-8, albeit compatible with ASCII for the first 128 code points (0 to 127), encodes every other code point as a multi-byte sequence in which each byte has the highest bit set, so those bytes are always higher than 127 and the default escaping rewrites them.
<?xml version="1.0" encoding="UTF-8"?>
...
<member>
    <name>description</name>
    <value>
        <string>Äpfel wachsen überirdisch.</string>
    </value>
</member>
...
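To make the difference between the escaping modes more concrete, here is a small sketch in plain PHP that does not use the xmlrpc extension at all. The helper names markup_only() and utf8_bytes() are made up for this illustration, and htmlspecialchars() merely stands in for what 'markup' does; treat it as an approximation, not as the extension's own code.

```php
<?php
// Hypothetical helper: escape only XML markup characters and leave
// every other byte of the UTF-8 string untouched. htmlspecialchars()
// stands in for the 'markup' escaping here (an approximation).
function markup_only(string $text): string
{
    return htmlspecialchars($text, ENT_NOQUOTES | ENT_XML1, 'UTF-8');
}

// Hypothetical helper: list the byte values of a string.
function utf8_bytes(string $text): array
{
    return array_map('ord', str_split($text));
}

// "ü", written as an escape so the encoding of this file does not matter.
$umlaut = "\u{00FC}";

// UTF-8 stores ü as two bytes, both above 127, which is why an
// escaping that targets non-ASCII bytes touches them.
print_r(utf8_bytes($umlaut)); // 195 and 188

// Only <, > and & are replaced; the umlaut passes through verbatim.
echo markup_only("<b>test with {$umlaut}mlaut & more</b>"), "\n";
```

The point of the sketch is only that escaping markup characters and escaping non-ASCII bytes are two independent decisions; with 'escaping' => 'markup' you opt into the first and out of the second.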
The escaping output-option
You can find those options documented on the xmlrpc_encode_request(php) manual page. As it is a bit brief, here is some discussion in the context of the answer:
The escaping output-option can take a string with a single value or an array with multiple string values.
The default is ['non-ascii', 'non-print', 'markup']; a fourth value, 'cdata', is available as well:
'non-ascii': Every code point higher than 127 (127 itself excluded) is escaped as a numeric entity (XML: Character Reference); e.g. the UTF-8 ü (u-umlaut) is then written as a character reference instead of the literal character.
'non-print': Every non-printable character is escaped as a numeric entity. Compare RFC 20. Non-printable characters are all non-graphic characters; space (2/0, decimal 32) counts as printable and everything higher than 127 counts as not printable. Therefore, to preserve UTF-8 byte sequences, it must not be set, same as 'non-ascii'.
'markup': Suggested in the answer. The explanation above likely falls short, so for the details compare with XML: Markup.
'cdata': This puts strings into XML: CDATA Sections. Not suggested in the answer, but it preserves UTF-8 as well and can be a fine escaping when the data contains strings that are themselves XML, HTML or some other source code such as PHP, because the response data is then easier for humans to read.
Mind the NUL byte
In XML-RPC, because it is XML based (and likely implementation-defined in the underlying C library as well), NUL bytes terminate a string.
If you need to retain a NUL byte, encode the value as base64, since XML itself has no character reference for it (see RFC 4648, a proposed standard, and xmlrpc_set_type(php)).
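A short sketch of the base64 route follows. The sample payload and the variable names are only illustrative. The xmlrpc_set_type() call is guarded because the xmlrpc extension is not always loaded (it has been distributed through PECL rather than bundled since PHP 8.0), so the first half runs in plain PHP either way.

```php
<?php
// Illustrative payload containing a NUL byte between the two words.
$payload = "before\0after";

// Plain base64 (RFC 4648) turns the bytes into ASCII text, so the
// NUL byte survives the round trip.
$encoded  = base64_encode($payload);
$restored = base64_decode($encoded, true);
var_dump($restored === $payload); // bool(true)

// With the xmlrpc extension loaded, marking the value as base64 asks
// the encoder to emit a <base64> element instead of a <string>.
if (function_exists('xmlrpc_set_type')) {
    $value = $payload;
    xmlrpc_set_type($value, 'base64'); // $value is modified in place
}
```

The receiving side then has to decode the base64 value again; whether that is acceptable depends on the client you are talking to.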