I'm receiving Polish text from a SOAP action in which the Polish diacritics are encoded as XML entities. As far as I can tell, they are encoded not in UTF-8 but in ISO-8859-1, and I'm struggling to decode them properly in Node.js.
Example text: Borek Fa&#197;&#130;&#196;&#153;cki
Expected decoding result: Borek Fałęcki
Current result: Borek FaÅ‚Ä™cki
I achieved the correct result in PHP using the following code:
echo html_entity_decode('Borek Fa&#197;&#130;&#196;&#153;cki', ENT_QUOTES | ENT_SUBSTITUTE | ENT_XML1, 'ISO-8859-1');
However, I'm having no luck doing the same in Node.js. There aren't many complete packages for decoding HTML/XML entities. I have tried both entities and html-entities, but they give the same result, and neither seems to have any charset setting.
const { decode, encode } = require('html-entities');
const entities = require('entities');

// Text as received: every UTF-8 byte is written as its own decimal entity
const txt = 'Borek Fa&#197;&#130;&#196;&#153;cki';

console.log('html-entities decode', decode(txt));
console.log('utf8-encoding', encode('Borek Fałęcki', {
  mode: 'nonAsciiPrintable',
  numeric: 'decimal',
  level: 'xml',
}));
console.log('entities decode', entities.decodeXML(txt));
Output:
html-entities decode Borek FaÅ‚Ä™cki
utf8-encoding Borek Fa&#322;&#281;cki
entities decode Borek FaÅ‚Ä™cki
As we can see, when the text is encoded as UTF-8 there is a single entity for each character:
ł = &#322;
ę = &#281;
With ISO-8859-1, there are 2 entities per character. I have run out of ideas for achieving the same decoding result as in PHP. If there were no entities, I could simply convert the encoding to UTF-8, but with entities I don't know how to do it properly. I cannot get the other side to send me UTF-8, since this is a closed, old protocol that I have no control over.
The correct XML encoding of Borek Fałęcki is Borek Fa&#322;&#281;cki. The SOAP action XML that you receive is wrongly encoded.
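To make the difference concrete, here is a minimal sketch that produces both forms from the same string. The helper names `bytesAsEntities` and `codePointsAsEntities` are made up for this illustration, and neither helper escapes `&`, `<` or `>`, so they are not a general XML encoder.

```javascript
// Writes every non-ASCII UTF-8 byte as its own decimal entity
// (the form the SOAP service appears to send).
function bytesAsEntities(text) {
  return Array.from(Buffer.from(text, 'utf8'))
    .map(byte => (byte < 0x80 ? String.fromCharCode(byte) : `&#${byte};`))
    .join('');
}

// Writes every non-ASCII character as one entity holding its Unicode
// code point (the form an XML parser expects).
function codePointsAsEntities(text) {
  return Array.from(text)
    .map(ch => (ch.codePointAt(0) < 0x80 ? ch : `&#${ch.codePointAt(0)};`))
    .join('');
}

console.log(bytesAsEntities('Borek Fałęcki'));      // Borek Fa&#197;&#130;&#196;&#153;cki
console.log(codePointsAsEntities('Borek Fałęcki')); // Borek Fa&#322;&#281;cki
```

In other words, the received text carries the UTF-8 bytes of each character, one entity per byte, instead of one entity per character.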
However, the following expression converts it as needed:
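// Split into literal runs and decimal entities, turn each entity into one raw byte,
// then decode the joined bytes as UTF-8 (the default for toString()).
Buffer.concat(
  "Borek Fa&#197;&#130;&#196;&#153;cki"
    .match(/[^&]+|&#\d+;/g)
    .map(c => c[0] === "&"
      ? Buffer.of(Number(c.substring(2, c.length - 1)))
      : Buffer.from(c))
).toString()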
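If this is needed in more than one place, the same idea can be wrapped in a function. The following is only a sketch under a few assumptions that the question does not confirm: the name `decodeByteEntities` is invented here, hexadecimal references (`&#x..;`) are accepted in addition to decimal ones, references above 255 are treated as real Unicode code points, and named entities such as `&amp;` are left untouched for a later decoding step. Unlike the one-liner, it also keeps a stray `&` instead of dropping it.

```javascript
// Decodes text in which each UTF-8 byte was written as a numeric entity.
function decodeByteEntities(input) {
  const chunks = [];
  // Decimal reference | hex reference | run without '&' | a lone '&'
  const pattern = /&#(\d+);|&#x([0-9a-fA-F]+);|[^&]+|&/g;

  for (const m of input.matchAll(pattern)) {
    let value = null;
    if (m[1] !== undefined) value = parseInt(m[1], 10);
    else if (m[2] !== undefined) value = parseInt(m[2], 16);

    if (value !== null && value <= 0xff) {
      // Reference that stands for a single raw byte
      chunks.push(Buffer.of(value));
    } else if (value !== null && value <= 0x10ffff) {
      // Assumption: larger values are genuine code points
      chunks.push(Buffer.from(String.fromCodePoint(value), 'utf8'));
    } else {
      // Literal text, a lone '&', or an out-of-range reference: keep as-is
      chunks.push(Buffer.from(m[0], 'utf8'));
    }
  }

  // Reassemble the bytes and interpret them as UTF-8
  return Buffer.concat(chunks).toString('utf8');
}

console.log(decodeByteEntities('Borek Fa&#197;&#130;&#196;&#153;cki')); // Borek Fałęcki
```

Byte sequences that are not valid UTF-8 come out as U+FFFD replacement characters, because that is how `Buffer.prototype.toString` handles them.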